Statistics::Descriptive(3)   User Contributed Perl Documentation   Statistics::Descriptive(3)

NAME
    Statistics::Descriptive - Module of basic descriptive statistical
    functions.

SYNOPSIS
    use Statistics::Descriptive;
    $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1,2,3,4);
    $mean = $stat->mean();
    $var  = $stat->variance();
    $tm   = $stat->trimmed_mean(.25);
    $Statistics::Descriptive::Tolerance = 1e-10;

DESCRIPTION
    This module provides basic functions used in descriptive statistics.
    It has an object-oriented design and supports two different types of
    data storage and calculation objects: sparse and full. With the sparse
    method, none of the data is stored and only a few statistical measures
    are available. Using the full method, the entire data set is retained
    and additional functions are available.

    Whenever a division by zero may occur, the denominator is checked to be
    greater than the value $Statistics::Descriptive::Tolerance, which
    defaults to 0.0. You may want to change this value to some small
    positive value such as 1e-24 in order to obtain error messages in case
    of very small denominators.

    Many of the methods (both Sparse and Full) cache values so that
    subsequent calls with the same arguments are faster.

METHODS
  Sparse Methods
    $stat = Statistics::Descriptive::Sparse->new();
        Create a new sparse statistics object.

    $stat->clear();
        Effectively the same as

            my $class = ref($stat);
            undef $stat;
            $stat = $class->new();

        except more efficient.

    $stat->add_data(1,2,3);
        Adds data to the statistics variable. The cached statistical
        values are updated automatically.

    $stat->count();
        Returns the number of data items.

    $stat->mean();
        Returns the mean of the data.

    $stat->sum();
        Returns the sum of the data.

    $stat->variance();
        Returns the variance of the data. Division by n-1 is used.

    $stat->standard_deviation();
        Returns the standard deviation of the data. Division by n-1 is
        used.

    $stat->min();
        Returns the minimum value of the data set.

    $stat->mindex();
        Returns the index of the minimum value of the data set.

    $stat->max();
        Returns the maximum value of the data set.

    $stat->maxdex();
        Returns the index of the maximum value of the data set.

    $stat->sample_range();
        Returns the sample range (max - min) of the data set.
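
The sparse measures listed above can all be maintained without storing any data. The following stand-alone sketch shows one way to do that; it is not the module's implementation. The helper names (sparse_new, sparse_add, sparse_variance) are hypothetical, the variance uses Welford's running update, and the indices are assumed to be 0-based positions of the first occurrence.

```perl
use strict;
use warnings;

# Hypothetical helper: an empty running summary.
sub sparse_new { return { count => 0, sum => 0, mean => 0, m2 => 0 } }

# Hypothetical helper: fold one value into the summary.
# Only the summary is kept; the value itself is discarded.
sub sparse_add {
    my ($s, $x) = @_;
    my $i = $s->{count}++;               # 0-based position of this value
    $s->{sum} += $x;

    # Welford's update for the mean and the sum of squared deviations.
    my $delta = $x - $s->{mean};
    $s->{mean} += $delta / $s->{count};
    $s->{m2}   += $delta * ($x - $s->{mean});

    # Track the extremes and where they first occurred.
    if (!defined $s->{min} || $x < $s->{min}) { @$s{qw(min mindex)} = ($x, $i) }
    if (!defined $s->{max} || $x > $s->{max}) { @$s{qw(max maxdex)} = ($x, $i) }
    return $s;
}

# Sample variance: division by n-1, undefined for fewer than two values.
sub sparse_variance {
    my ($s) = @_;
    return $s->{count} > 1 ? $s->{m2} / ($s->{count} - 1) : undef;
}

my $s = sparse_new();
sparse_add($s, $_) for (1, 2, 3, 4);
printf "count=%d mean=%g variance=%g range=%g\n",
    $s->{count}, $s->{mean}, sparse_variance($s), $s->{max} - $s->{min};
# prints count=4 mean=2.5 variance=1.66667 range=3
```

Because nothing but the summary survives, order-dependent measures such as the median cannot be derived from it, which is why they belong to the Full methods.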

  Full Methods
    Similar to the Sparse Methods above, any Full Method that is called
    caches the current result so that it doesn't have to be recalculated.
    In some cases, several values can be cached at the same time.

    $stat = Statistics::Descriptive::Full->new();
        Create a new statistics object that inherits from
        Statistics::Descriptive::Sparse so that it contains all the
        methods described above.

    $stat->add_data(1,2,4,5);
        Adds data to the statistics variable. All of the sparse
        statistical values are updated and cached. Cached values from
        Full methods are deleted since they are no longer valid.

        Note: Calling add_data with an empty array will delete all of
        your Full method cached values! Cached values for the sparse
        methods are not changed.

    $stat->get_data();
        Returns a copy of the data array.

    $stat->sort_data();
        Sorts the stored data and updates the values returned by mindex
        and maxdex. This method uses perl's internal sort.

    $stat->presorted(1);
    $stat->presorted();
        If called with a non-zero argument, this method sets a flag that
        says the data is already sorted and need not be sorted again.
        Since some of the methods in this class require sorted data, this
        saves some time. If you supply sorted data to the object, call
        this method to prevent the data from being sorted again. The flag
        is cleared whenever add_data is called. Calling the method
        without an argument returns the value of the flag.

    $stat->skewness();
        Returns the skewness of the data. A value of zero means no skew,
        a negative value means a longer left tail, and a positive value
        means a longer right tail. This is consistent with Excel.
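
The sign convention can be checked with a small stand-alone sketch. It assumes the adjusted sample skewness formula documented for Excel's SKEW function; the text here does not state that the module computes it in exactly this way, and sample_skewness is a hypothetical name.

```perl
use strict;
use warnings;

# Sample skewness as defined for Excel's SKEW():
#   n / ((n-1)(n-2)) * sum(((x - mean) / s)**3)
# where s is the standard deviation computed with division by n-1.
# Returns nothing for fewer than three values or for constant data.
sub sample_skewness {
    my @x = @_;
    my $n = @x;
    return if $n < 3;

    my $mean = 0;
    $mean += $_ / $n for @x;

    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @x;
    my $sd = sqrt($ss / ($n - 1));
    return if $sd == 0;

    my $cubes = 0;
    $cubes += (($_ - $mean) / $sd) ** 3 for @x;
    return $n / (($n - 1) * ($n - 2)) * $cubes;
}

printf "%.4f\n", sample_skewness(1, 2, 3, 4, 10);   # long right tail: prints 1.6971
printf "%.4f\n", sample_skewness(1, 7, 8, 9, 10);   # long left tail: prints -1.6971
```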

    $stat->kurtosis();
        Returns the kurtosis of the data. Positive is peaked, negative is
        flattened.

    $x = $stat->percentile(25);
    ($x, $index) = $stat->percentile(25);
        Sorts the data and returns the value that corresponds to the
        percentile as defined in RFC2330:

        · For example, given the 6 measurements:

              -2, 7, 7, 4, 18, -5

          Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6,
          F(7) = 5/6, F(18) = 1, F(239) = 1.

          Note that we can recover the different measured values and how
          many times each occurred from F(x) -- no information regarding
          the range in values is lost. Summarizing measurements using
          histograms, on the other hand, in general loses information
          about the different values observed, so the EDF is preferred.

          Using either the EDF or a histogram, however, we do lose
          information regarding the order in which the values were
          observed. Whether this loss is potentially significant will
          depend on the metric being measured.

          We will use the term "percentile" to refer to the smallest
          value of x for which F(x) >= a given percentage. So the 50th
          percentile of the example above is 4, since F(4) = 3/6 = 50%;
          the 25th percentile is -2, since F(-5) = 1/6 < 25%, and F(-2)
          = 2/6 >= 25%; the 100th percentile is 18; and the 0th
          percentile is -infinity, as is the 15th percentile.

          Care must be taken when using percentiles to summarize a
          sample, because they can lend an unwarranted appearance of
          more precision than is really available. Any such summary
          must include the sample size N, because any percentile
          difference finer than 1/N is below the resolution of the
          sample.

          (Taken from: RFC2330 - Framework for IP Performance Metrics,
          Section 11.3. Defining Statistical Distributions. RFC2330 is
          available from: <http://www.ietf.org/rfc/rfc2330.txt> .)

        If the percentile method is called in a list context then it will
        also return the index of the percentile.
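
The "smallest value of x for which F(x) >= a given percentage" rule quoted above can be sketched in a few lines. This is an illustration of the definition, not the module's code: edf_percentile is a hypothetical name, and the index returned in list context is assumed to be the 0-based position in the sorted data.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Smallest data value x whose empirical distribution function F(x)
# reaches $p percent. Returns nothing when no data value qualifies
# (for example p = 0, or an empty data set). In list context the
# position of the value in the sorted data is returned as well.
sub edf_percentile {
    my ($p, @data) = @_;
    my @sorted = sort { $a <=> $b } @data;
    my $n = @sorted;

    # At least k+1 values are <= sorted[k], so the first position whose
    # F reaches p percent is ceil(p*n/100) - 1.
    my $index = ceil($p * $n / 100) - 1;
    return if $index < 0 || $index >= $n;
    return wantarray ? ($sorted[$index], $index) : $sorted[$index];
}

my @measurements = (-2, 7, 7, 4, 18, -5);
print scalar(edf_percentile( 50, @measurements)), "\n";   # prints 4
print scalar(edf_percentile( 25, @measurements)), "\n";   # prints -2
print scalar(edf_percentile(100, @measurements)), "\n";   # prints 18
```

The three printed values match the 50th, 25th and 100th percentiles worked out in the quoted passage.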

    $x = $stat->quantile($Type);
        Sorts the data and returns estimates of underlying distribution
        quantiles based on one or two order statistics from the supplied
        elements.

        This method uses the same algorithm as Excel and the R language
        (quantile type 7).

        The generic function quantile produces sample quantiles
        corresponding to the given probabilities.

        $Type is an integer value from 0 to 4:

            0 => zero quartile (Q0)   : minimal value
            1 => first quartile (Q1)  : lower quartile = lowest cut off (25%) of data = 25th percentile
            2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
            3 => third quartile (Q3)  : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
            4 => fourth quartile (Q4) : maximal value

        Example:

            my @data = (1..10);
            my $stat = Statistics::Descriptive::Full->new();
            $stat->add_data(@data);
            print $stat->quantile(0); # => 1
            print $stat->quantile(1); # => 3.25
            print $stat->quantile(2); # => 5.5
            print $stat->quantile(3); # => 7.75
            print $stat->quantile(4); # => 10
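
For readers who want to see where 3.25 and 7.75 come from, here is a stand-alone sketch of the type 7 rule: take the fractional position (n-1)*p in the sorted data and interpolate linearly between its two neighbours. The names quantile_type7 and quartile are hypothetical and the code is independent of the module.

```perl
use strict;
use warnings;

# Type 7 sample quantile for probability $prob (0 <= $prob <= 1):
# locate the fractional position h = (n-1)*prob in the sorted data and
# interpolate linearly between the two neighbouring order statistics.
sub quantile_type7 {
    my ($prob, @data) = @_;
    return unless @data;
    my @sorted = sort { $a <=> $b } @data;

    my $h  = $#sorted * $prob;
    my $lo = int($h);
    my $hi = $lo < $#sorted ? $lo + 1 : $lo;   # stay inside the array at prob = 1
    return $sorted[$lo] + ($h - $lo) * ($sorted[$hi] - $sorted[$lo]);
}

# Quartile number 0..4 maps to probability 0, 0.25, 0.5, 0.75, 1.
sub quartile {
    my ($type, @data) = @_;
    return quantile_type7($type / 4, @data);
}

print join(", ", map { quartile($_, 1 .. 10) } 0 .. 4), "\n";
# prints 1, 3.25, 5.5, 7.75, 10
```

For the first quartile of 1..10 the position is 9 * 0.25 = 2.25, which lies a quarter of the way from the third value (3) to the fourth (4), giving 3.25.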

    $stat->median();
        Sorts the data and returns the median value of the data.

    $stat->harmonic_mean();
        Returns the harmonic mean of the data. Since the mean is
        undefined if any of the data are zero or if the sum of the
        reciprocals is zero, it will return undef for both of those cases.

    $stat->geometric_mean();
        Returns the geometric mean of the data.
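
The two undefined cases of the harmonic mean are easiest to see in code. The sketch below is a plain-Perl illustration with hypothetical function names, not the module's implementation. The geometric mean is computed through logarithms and, as an assumption of this sketch only, is restricted to strictly positive data.

```perl
use strict;
use warnings;

# Harmonic mean: n / sum(1/x). Undefined when any value is zero or when
# the reciprocals sum to zero.
sub harmonic_mean_sketch {
    my @x = @_;
    return unless @x;
    my $recip_sum = 0;
    for my $v (@x) {
        return if $v == 0;
        $recip_sum += 1 / $v;
    }
    return if $recip_sum == 0;
    return @x / $recip_sum;
}

# Geometric mean: the n-th root of the product, computed as the exponential
# of the mean logarithm. This sketch gives up on non-positive values.
sub geometric_mean_sketch {
    my @x = @_;
    return unless @x;
    my $log_sum = 0;
    for my $v (@x) {
        return if $v <= 0;
        $log_sum += log($v);
    }
    return exp($log_sum / @x);
}

printf "%.4f %.4f\n", harmonic_mean_sketch(1, 2, 4), geometric_mean_sketch(1, 2, 4);
# prints 1.7143 2.0000
```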

    $stat->mode();
        Returns the mode of the data.

    $stat->trimmed_mean(ltrim[,utrim]);
        "trimmed_mean(ltrim)" returns the mean with a fraction "ltrim" of
        entries at each end dropped. "trimmed_mean(ltrim,utrim)" returns
        the mean after a fraction "ltrim" has been removed from the lower
        end of the data and a fraction "utrim" has been removed from the
        upper end of the data. This method sorts the data before
        beginning to analyze it.

        All calls to trimmed_mean() are cached so that they don't have to
        be calculated a second time.
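
A rough stand-alone sketch of the trimming step follows. The text above does not say how a fractional number of entries is rounded, so the sketch assumes truncation toward zero; trimmed_mean_sketch is a hypothetical name and the module may behave differently at the boundaries.

```perl
use strict;
use warnings;

# Mean of the data after dropping a fraction $ltrim from the low end and
# $utrim from the high end ($utrim defaults to $ltrim).
# Assumption: the number of dropped entries is truncated toward zero.
sub trimmed_mean_sketch {
    my ($data, $ltrim, $utrim) = @_;
    $utrim = $ltrim unless defined $utrim;

    my @sorted = sort { $a <=> $b } @$data;
    my $n      = @sorted;
    my $lower  = int($n * $ltrim);            # entries dropped at the low end
    my $upper  = $n - int($n * $utrim) - 1;   # index of the last entry kept
    return if $lower > $upper;                # nothing left to average

    my $sum = 0;
    $sum += $_ for @sorted[$lower .. $upper];
    return $sum / ($upper - $lower + 1);
}

print trimmed_mean_sketch([1 .. 8], 0.25), "\n";      # prints 4.5 (mean of 3..6)
print trimmed_mean_sketch([1 .. 8], 0.25, 0), "\n";   # prints 5.5 (mean of 3..8)
```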

    $stat->frequency_distribution_ref($partitions);
    $stat->frequency_distribution_ref(\@bins);
    $stat->frequency_distribution_ref();
        "frequency_distribution_ref($partitions)" slices the data into
        $partitions sets (where $partitions is greater than 1) and counts
        the number of items that fall into each partition. It returns a
        reference to a hash where the keys are the numerical values of the
        partitions used. The minimum value of the data set is not a key
        and the maximum value of the data set is always a key. The number
        of entries for a particular partition key is the number of items
        which are greater than the previous partition key and less than or
        equal to the current partition key. As an example,

            $stat->add_data(1,1.5,2,2.5,3,3.5,4);
            $f = $stat->frequency_distribution_ref(2);
            for (sort {$a <=> $b} keys %$f) {
                print "key = $_, count = $f->{$_}\n";
            }

        prints

            key = 2.5, count = 4
            key = 4, count = 3

        since there are four items less than or equal to 2.5, and three
        items greater than 2.5 and less than or equal to 4.

        "frequency_distribution_ref(\@bins)" provides the bins that are
        to be used for the distribution. This allows for non-uniform
        distributions as well as trimmed or sample distributions to be
        found. @bins must be monotonic and contain at least one element.
        Note that unless the bins cover the full range of the data, the
        total of the counts returned will be less than the sample size.

        Calling "frequency_distribution_ref()" with no arguments returns
        the last distribution calculated, if such exists.
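
The partition rule (keys are upper bin edges, the minimum is never a key, the maximum always is) can be reproduced with a short stand-alone sketch. It covers only the $partitions form and assumes equal-width bins over non-empty, non-constant data; frequency_sketch is a hypothetical name, not the module's code.

```perl
use strict;
use warnings;

# Split [min, max] into $partitions equal-width bins and count the items in
# each. Keys are the upper bin edges; the last key is set to the maximum
# itself so that rounding cannot exclude it.
sub frequency_sketch {
    my ($partitions, @data) = @_;
    my @sorted = sort { $a <=> $b } @data;
    my ($min, $max) = @sorted[0, -1];
    my $width = ($max - $min) / $partitions;

    my @edges = map { $min + $width * $_ } 1 .. $partitions - 1;
    push @edges, $max;

    my %count = map { $_ => 0 } @edges;
    my $bin = 0;
    for my $x (@sorted) {
        $bin++ while $x > $edges[$bin];   # advance to the first edge >= x
        $count{ $edges[$bin] }++;
    }
    return \%count;
}

my $f = frequency_sketch(2, 1, 1.5, 2, 2.5, 3, 3.5, 4);
print "key = $_, count = $f->{$_}\n" for sort { $a <=> $b } keys %$f;
# prints:
#   key = 2.5, count = 4
#   key = 4, count = 3
```

Walking the sorted data once keeps the counting linear after the sort, and the minimum always lands in the first bin because it can never exceed the first edge.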

    my %hash = $stat->frequency_distribution($partitions);
    my %hash = $stat->frequency_distribution(\@bins);
    my %hash = $stat->frequency_distribution();
        Same as "frequency_distribution_ref()" except that it returns the
        hash flattened into the return list. Kept for compatibility with
        previous versions of Statistics::Descriptive; using it is
        discouraged.

    $stat->least_squares_fit();
    $stat->least_squares_fit(@x);
        "least_squares_fit()" performs a least squares fit on the data,
        assuming a domain of @x or a default of 1..$stat->count(). It
        returns an array of four elements "($q, $m, $r, $rms)" where

        "$q and $m"
            satisfy the equation $y = $m*$x + $q.

        $r  is the Pearson linear correlation coefficient.

        $rms
            is the root-mean-square error.

        In case of error or division by zero, the empty list is returned.

        The array that is returned can be "coerced" into a hash structure
        by doing the following:

            my %hash = ();
            @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

        Because calling "least_squares_fit()" with no arguments defaults
        to using the current range, there is no caching of the results.
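
The four returned values can be reproduced with the textbook ordinary least squares formulas. The sketch below is independent of the module: least_squares_sketch is a hypothetical name, $rms is assumed to be the square root of the mean squared residual, and the empty list is returned whenever either variance term is not positive.

```perl
use strict;
use warnings;

# Fit y = m*x + q by ordinary least squares. $y and $x are array refs;
# $x defaults to 1..n. Returns ($q, $m, $r, $rms), or the empty list when
# a denominator vanishes.
sub least_squares_sketch {
    my ($y, $x) = @_;
    my $n = @$y;
    $x ||= [1 .. $n];
    return () if $n < 2 || @$x != $n;

    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }

    my $dx = $n * $sxx - $sx ** 2;   # n**2 times the variance of x
    my $dy = $n * $syy - $sy ** 2;   # n**2 times the variance of y
    return () if $dx <= 0 || $dy <= 0;

    my $m = ($n * $sxy - $sx * $sy) / $dx;
    my $q = ($sy - $m * $sx) / $n;
    my $r = ($n * $sxy - $sx * $sy) / sqrt($dx * $dy);

    my $sq_err = 0;
    $sq_err += ($y->[$_] - ($m * $x->[$_] + $q)) ** 2 for 0 .. $n - 1;
    return ($q, $m, $r, sqrt($sq_err / $n));
}

my %fit;
@fit{qw(q m r rms)} = least_squares_sketch([3, 5, 7, 9]);
printf "y = %g*x + %g (r = %g, rms = %g)\n", @fit{qw(m q r rms)};
# prints y = 2*x + 1 (r = 1, rms = 0)
```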

REPORTING ERRORS
    I read my email frequently, but since adopting this module I've added 2
    children and 1 dog to my family, so please be patient about my response
    times. When reporting errors, please include the following to help me
    out:

    · Your version of perl. This can be obtained by typing "perl -v" at
      the command line.

    · Which version of Statistics::Descriptive you're using. As you can
      see below, I do make mistakes. Unfortunately for me, right now
      there are thousands of CDs with the version of this module with
      the bugs in it. Fortunately for you, I'm a very patient module
      maintainer.

    · Details about what the error is. Try to narrow down the scope of
      the problem and send me code that I can run to verify and track it
      down.

AUTHOR
    Current maintainer:

    Shlomi Fish, <http://www.shlomifish.org/> , "shlomif@cpan.org"

    Previously:

    Colin Kuskie

    My email address can be found at http://www.perl.com under Who's Who or
    at: http://search.cpan.org/author/COLINK/.

REFERENCES
    RFC2330, Framework for IP Performance Metrics

    The Art of Computer Programming, Volume 2, Donald Knuth.

    Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.

    Probability and Statistics for Engineering and the Sciences, Jay
    Devore.

COPYRIGHT
    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

LICENSE
    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

perl v5.12.1                      2010-06-23         Statistics::Descriptive(3)