Statistics::ChiSquare(3)  User Contributed Perl Documentation  Statistics::ChiSquare(3)

NAME

       "Statistics::ChiSquare" - How well-distributed is your data?

SYNOPSIS

           use Statistics::ChiSquare;

           print chisquare(@array_of_numbers);

       Statistics::ChiSquare is available at a CPAN site near you.

DESCRIPTION

       Suppose you flip a coin 100 times, and it turns up heads 70 times.  Is
       the coin fair?

       Suppose you roll a die 100 times, and it shows 30 sixes.  Is the die
       loaded?

       In statistics, the chi-square test calculates how well a series of
       numbers fits a distribution.  This module only tests whether results
       fit an even distribution.  It doesn't simply say "yes" or "no".
       Instead, it gives you upper and lower bounds on the likelihood that
       the variation in your data is due to chance.  See the examples below.

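       As a rough illustration of what the test measures, the sketch below
       computes Pearson's chi-square statistic for a list of counts against
       an even distribution.  The helper name chi2_statistic() is
       hypothetical and is not part of this module; the module's own
       internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Statistics::ChiSquare): Pearson's
# chi-square statistic for observed counts against an even distribution.
sub chi2_statistic {
    my @observed = @_;
    my $total = 0;
    $total += $_ for @observed;
    die "need at least one non-zero count\n" unless @observed && $total > 0;

    my $expected = $total / @observed;    # same expected count in every slot
    my $chi2 = 0;
    $chi2 += ($_ - $expected) ** 2 / $expected for @observed;
    return $chi2;
}

# The coin from the opening question: 70 heads and 30 tails.
my @coin = (70, 30);
printf "chi-square = %.2f, degrees of freedom = %d\n",
    chi2_statistic(@coin), scalar(@coin) - 1;
```

       The larger the statistic is for a given number of degrees of freedom
       (slots minus one), the less plausible it is that the variation is due
       to chance.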
       If you've ever studied elementary genetics, you've probably heard about
       Gregor Mendel.  He was a wacky Austrian botanist who discovered (in
       1865) that traits could be inherited in a predictable fashion.  He did
       lots of experiments with cross-breeding peas: green peas, yellow peas,
       smooth peas, wrinkled peas.  A veritable Brave New World of legumes.

       But Mendel's data looks too good to be true.  A statistician by the
       name of R. A. Fisher used the chi-square test to show that the results
       fit Mendel's predictions more closely than chance alone would allow.

FUNCTIONS

   chisquare
       There's just one function in this module: chisquare().  Instead of
       returning the bounds in a tidy little two-element array, it returns an
       English string.  This was a deliberate design choice: many people
       misinterpret chi-square results, and the string helps clarify the
       meaning.

       The string returned by chisquare() will always match one of these
       patterns:

         ".*!" (i.e. strings ending in an exclamation mark)

       which covers the various error messages for when you supply Obviously
       Wrong data, or

         "There's a >\d+% chance, and a <\d+% chance, that this data is random."

       or

         "There's a <\d+% chance that this data is random."

       or

         "I can't handle \d+ choices without a better table."

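       If you need the bounds as numbers, the documented patterns can be
       matched with ordinary regular expressions.  This is a minimal sketch;
       parse_chisquare_result() and the hash keys it returns are hypothetical
       names chosen for illustration, not something the module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string returned by chisquare() using
# the patterns documented above.
sub parse_chisquare_result {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $text =~ s/\s+/ /g;         # tolerate line wrapping

    if ($text =~ /^There's a >(\d+)% chance, and a <(\d+)% chance, that this data is random\.$/) {
        return { kind => 'range', lower => $1, upper => $2 };
    }
    if ($text =~ /^There's a <(\d+)% chance that this data is random\.$/) {
        return { kind => 'below', upper => $1 };
    }
    if ($text =~ /^I can't handle (\d+) choices without a better table\.$/) {
        return { kind => 'no_table', choices => $1 };
    }
    # Anything else ending in "!" is one of the error messages.
    return { kind => 'error', message => $text } if $text =~ /!$/;
    return { kind => 'unknown', message => $text };
}

my $result = parse_chisquare_result(
    "There's a >1% chance, and a <5% chance, that this data is random.");
printf "%s: between %d%% and %d%%\n",
    $result->{kind}, $result->{lower}, $result->{upper};
```

       Checking the error pattern last keeps a stray "!" from hiding one of
       the more specific messages.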
       That last one deserves a bit more explanation.  The "modern" chi-square
       test uses a table of values (based on Pearson's approximation) to avoid
       expensive calculations.  Thanks to the table, the chisquare()
       calculation is very fast, but there are some collections of data it
       can't handle, including any collection with more than 250 slots.  So
       you can't calculate the randomness of a 500-sided die, should such an
       insane thing ever exist.

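       To show how a table lookup replaces the expensive calculation, here is
       a sketch for a single row: the standard chi-square critical values for
       5 degrees of freedom at the percentage points of Jon Orwant's table.
       The arrays and the bracket() helper are assumptions made for
       illustration; the module's real table layout and its handling of exact
       ties may differ.

```perl
use strict;
use warnings;

# Standard chi-square critical values for 5 degrees of freedom.
# Illustrative copy only; the module's own table is an internal detail.
my @percent  = (100, 99,    95,    90,    70,    50,    30,    10,    5,      1);
my @critical = (0,   0.554, 1.145, 1.610, 3.000, 4.351, 6.064, 9.236, 11.070, 15.086);

# Hypothetical helper: bracket a statistic between adjacent table columns.
sub bracket {
    my ($chi2) = @_;
    for my $i (reverse 0 .. $#critical) {
        next if $chi2 < $critical[$i];
        # Beyond the last column there is only an upper bound.
        return sprintf("There's a <%d%% chance that this data is random.",
                       $percent[$i])
            if $i == $#critical;
        return sprintf("There's a >%d%% chance, and a <%d%% chance, "
                     . "that this data is random.",
                       $percent[$i + 1], $percent[$i]);
    }
    die "chi-square statistic must not be negative\n";
}

# Six slots with counts (8, 7, 9, 8, 8, 20) give a statistic of 12.2.
print bracket(12.2), "\n";
```

       Since 12.2 falls between the 5% value (11.070) and the 1% value
       (15.086), the sketch reports a chance between 1% and 5%.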
       You will also notice that the tabulated percentage points differ
       depending on the number of data points, that is, on the degrees of
       freedom.  The table in Jon Orwant's original version has data
       tabulated for 100%, 99%, 95%, 90%, 70%, 50%, 30%, 10%, 5%, and 1%
       likelihoods.  Data added later by David Cantrell is tabulated for
       100%, 99%, 95%, 90%, 75%, 50%, 25%, 10%, 5%, and 1% likelihoods.

EXAMPLES

       Imagine a coin flipped 1000 times.  The expected outcome is 500 heads
       and 500 tails:

         @coin = (500, 500);
         print chisquare(@coin);

       prints "There's a >90% chance, and a <100% chance, that this data is
       random."

       Imagine a die rolled 60 times that shows sixes just a wee bit too
       often.

         @die1  = (8, 7, 9, 8, 8, 20);
         print chisquare(@die1);

       prints "There's a >1% chance, and a <5% chance, that this data is
       random."

       Imagine a die rolled 600 times that shows sixes way too often.

         @die2  = (80, 70, 90, 80, 80, 200);
         print chisquare(@die2);

       prints "There's a <1% chance that this data is random."

       How random is rand()?

         srand(time ^ $$);
         @rands = ();
         for ($i = 0; $i < 60000; $i++) {
             $slot = int(rand(6));
             $rands[$slot]++;
         }
         print "@rands\n";
         print chisquare(@rands);

       prints (on my machine)

         10156 10041 9991 9868 10034 9910
         There's a >10% chance, and a <50% chance, that this data is random.

       So much for pseudorandom number generation.

AUTHORS and LICENCE

       Jon Orwant, Readable Publications, Inc; orwant@oreilly.com

       Maintained and updated since October 2003 by David Cantrell,
       david@cantrell.org.uk

       This software is free-as-in-speech software, and may be used,
       distributed, and modified under the terms of either the GNU General
       Public Licence version 2 or the Artistic Licence.  It's up to you which
       one you use.  The full text of the licences can be found in the files
       GPL2.txt and ARTISTIC.txt, respectively.

DATA SOURCE

       Data for 31 to 250 degrees of freedom is from Wikibooks
       <https://en.wikibooks.org/wiki/Engineering_Tables/Chi-Squared_Distibution>
       and is covered by the Creative Commons attribution-sharealike licence
       <https://creativecommons.org/licenses/by-sa/3.0/>.

perl v5.36.0                      2023-01-20          Statistics::ChiSquare(3)