Descriptive(3)        User Contributed Perl Documentation       Descriptive(3)

NAME
    Statistics::Descriptive - Module of basic descriptive statistical
    functions.

SYNOPSIS
    use Statistics::Descriptive;
    $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1,2,3,4);
    $mean = $stat->mean();
    $var  = $stat->variance();
    $tm   = $stat->trimmed_mean(.25);
    $Statistics::Descriptive::Tolerance = 1e-10;

DESCRIPTION
    This module provides basic functions used in descriptive statistics.
    It has an object-oriented design and supports two different types of
    data storage and calculation objects: sparse and full. With the sparse
    method, none of the data is stored and only a few statistical measures
    are available. Using the full method, the entire data set is retained
    and additional functions are available.

    Whenever a division by zero may occur, the denominator is checked to be
    greater than the value $Statistics::Descriptive::Tolerance, which
    defaults to 0.0. You may want to change this value to some small
    positive value such as 1e-24 in order to obtain error messages in case
    of very small denominators.
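
The tolerance check described above can be pictured with a small
stand-alone sketch. It does not use the module: the package variable
$Tolerance and the helper checked_divide() are hypothetical stand-ins,
and the sketch assumes denominators are non-negative (counts, sums of
squares and similar).

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Statistics::Descriptive::Tolerance.
our $Tolerance = 0.0;

# Hypothetical helper: divide only when the denominator is strictly
# greater than the tolerance; otherwise signal failure with undef.
sub checked_divide {
    my ($numerator, $denominator) = @_;
    return undef unless $denominator > $Tolerance;
    return $numerator / $denominator;
}

# With the default tolerance a tiny denominator is still accepted.
print defined(checked_divide(1, 1e-30)) ? "accepted\n" : "rejected\n";

# After raising the tolerance the same denominator is rejected.
$Tolerance = 1e-24;
print defined(checked_divide(1, 1e-30)) ? "accepted\n" : "rejected\n";
```

Returning undef rather than dying is a choice made for this sketch only.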

    Many of the methods (both Sparse and Full) cache values so that
    subsequent calls with the same arguments are faster.

METHODS
  Sparse Methods

    $stat = Statistics::Descriptive::Sparse->new();
        Create a new sparse statistics object.

    $stat->add_data(1,2,3);
        Adds data to the statistics variable. The cached statistical
        values are updated automatically.

    $stat->count();
        Returns the number of data items.

    $stat->mean();
        Returns the mean of the data.

    $stat->sum();
        Returns the sum of the data.

    $stat->variance();
        Returns the variance of the data. Division by n-1 is used.

    $stat->standard_deviation();
        Returns the standard deviation of the data. Division by n-1 is
        used.

    $stat->min();
        Returns the minimum value of the data set.

    $stat->mindex();
        Returns the index of the minimum value of the data set.

    $stat->max();
        Returns the maximum value of the data set.

    $stat->maxdex();
        Returns the index of the maximum value of the data set.

    $stat->sample_range();
        Returns the sample range (max - min) of the data set.
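
To illustrate how the sparse measures can be maintained without storing
any data, here is a minimal stand-alone sketch. The %acc hash and the
helper names are hypothetical, and the sum-of-squares variance formula is
an assumption chosen for brevity, not necessarily what the module does
internally.

```perl
use strict;
use warnings;

# Hypothetical accumulator: keeps running totals, never the data itself.
my %acc = (count => 0, sum => 0, sumsq => 0);

sub add_values {
    for my $x (@_) {
        # Track the extremes and the 0-based position where each occurred.
        if (!defined $acc{min} || $x < $acc{min}) {
            @acc{qw(min mindex)} = ($x, $acc{count});
        }
        if (!defined $acc{max} || $x > $acc{max}) {
            @acc{qw(max maxdex)} = ($x, $acc{count});
        }
        $acc{count}++;
        $acc{sum}   += $x;
        $acc{sumsq} += $x * $x;
    }
}

sub running_mean { $acc{count} ? $acc{sum} / $acc{count} : undef }

# Sample variance with division by n-1, computed from the running sums.
# Returning undef for fewer than two items is a choice of this sketch.
sub running_variance {
    my $n = $acc{count};
    return undef if $n < 2;
    return ($acc{sumsq} - $acc{sum} ** 2 / $n) / ($n - 1);
}

add_values(1, 2, 3);
add_values(4);    # later calls simply extend the running totals
printf "n=%d mean=%g variance=%g range=%g\n",
    $acc{count}, running_mean(), running_variance(), $acc{max} - $acc{min};
```

Because only a handful of scalars are kept, measures that need the whole
data set (median, percentiles, mode) cannot be derived from such an
accumulator, which is why they appear only among the Full methods.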

  Full Methods

    Similar to the Sparse Methods above, any Full Method that is called
    caches the current result so that it doesn't have to be recalculated.
    In some cases, several values can be cached at the same time.

    $stat = Statistics::Descriptive::Full->new();
        Create a new statistics object that inherits from
        Statistics::Descriptive::Sparse so that it contains all the
        methods described above.

    $stat->add_data(1,2,4,5);
        Adds data to the statistics variable. All of the sparse
        statistical values are updated and cached. Cached values from
        Full methods are deleted since they are no longer valid.

        Note: Calling add_data with an empty array will delete all of
        your Full method cached values! Cached values for the sparse
        methods are not changed.

    $stat->get_data();
        Returns a copy of the data array.

    $stat->sort_data();
        Sorts the stored data and updates the mindex and maxdex values.
        This method uses perl's internal sort.

    $stat->presorted(1);
    $stat->presorted();
        If called with a non-zero argument, this method sets a flag that
        says the data is already sorted and need not be sorted again.
        Since some of the methods in this class require sorted data, this
        saves some time. If you supply sorted data to the object, call
        this method to prevent the data from being sorted again. The flag
        is cleared whenever add_data is called. Calling the method
        without an argument returns the value of the flag.

    $x = $stat->percentile(25);
    ($x, $index) = $stat->percentile(25);
        Sorts the data and returns the value that corresponds to the
        percentile as defined in RFC2330:

        * For example, given the 6 measurements:

          -2, 7, 7, 4, 18, -5

          Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6,
          F(7) = 5/6, F(18) = 1, F(239) = 1.

          Note that we can recover the different measured values and how
          many times each occurred from F(x) -- no information regarding
          the range in values is lost. Summarizing measurements using
          histograms, on the other hand, in general loses information
          about the different values observed, so the EDF is preferred.

          Using either the EDF or a histogram, however, we do lose
          information regarding the order in which the values were
          observed. Whether this loss is potentially significant will
          depend on the metric being measured.

          We will use the term "percentile" to refer to the smallest
          value of x for which F(x) >= a given percentage. So the 50th
          percentile of the example above is 4, since F(4) = 3/6 = 50%;
          the 25th percentile is -2, since F(-5) = 1/6 < 25%, and F(-2)
          = 2/6 >= 25%; the 100th percentile is 18; and the 0th
          percentile is -infinity, as is the 15th percentile.

          Care must be taken when using percentiles to summarize a
          sample, because they can lend an unwarranted appearance of more
          precision than is really available. Any such summary must
          include the sample size N, because any percentile difference
          finer than 1/N is below the resolution of the sample.

        (Taken from: RFC2330 - Framework for IP Performance Metrics,
        Section 11.3. Defining Statistical Distributions. RFC2330 is
        available from:
        http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html.)

        If the percentile method is called in a list context then it will
        also return the index of the percentile.
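
The definition quoted above ("the smallest value of x for which F(x) >=
a given percentage") can be sketched in plain Perl as follows. The
function edf_percentile() is a hypothetical helper written for this
illustration, not the module's code, and it assumes a percentage between
0 and 100. Note that, read literally, the definition gives -5 for the
15th percentile of the example (F(-5) = 1/6, about 16.7%, which is >=
15%), so the sketch returns -5 there and -infinity only when no data
point qualifies, as for the 0th percentile.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: smallest data value x such that F(x) >= $p percent,
# where F is the empirical distribution function of @data.
sub edf_percentile {
    my ($p, @data) = @_;
    my @sorted = sort { $a <=> $b } @data;
    my $n = @sorted;

    # At least (i+1)/n of the samples are <= $sorted[i], so the answer is
    # the first index i with (i+1)/n >= p/100.
    my $index = ceil($p * $n / 100) - 1;

    # No qualifying value (for example $p == 0): report -infinity.
    my $value = $index < 0 ? -9**9**9 : $sorted[$index];
    $index = undef if $index < 0;

    # Mirror the scalar/list behaviour described for percentile().
    return wantarray ? ($value, $index) : $value;
}

my @measurements = (-2, 7, 7, 4, 18, -5);
for my $p (25, 50, 100) {
    my ($value, $index) = edf_percentile($p, @measurements);
    print "percentile $p: value $value at sorted index $index\n";
}
```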

    $stat->median();
        Sorts the data and returns the median value of the data.

    $stat->harmonic_mean();
        Returns the harmonic mean of the data. Since the mean is
        undefined if any of the data are zero or if the sum of the
        reciprocals is zero, it will return undef for both of those
        cases.

    $stat->geometric_mean();
        Returns the geometric mean of the data.
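
For readers who want the two means spelled out, here is a small
stand-alone sketch. The helper names are hypothetical. The undef cases
for the harmonic mean follow the description above; the use of logarithms
and the requirement of strictly positive data in the geometric mean are
assumptions of this sketch.

```perl
use strict;
use warnings;

# Harmonic mean: n / sum(1/x). Returns undef if any value is zero or if
# the reciprocals sum to zero.
sub harmonic_mean_of {
    my @data = @_;
    my $recip_sum = 0;
    for my $x (@data) {
        return undef if $x == 0;
        $recip_sum += 1 / $x;
    }
    return undef if $recip_sum == 0;
    return @data / $recip_sum;
}

# Geometric mean: the n-th root of the product, taken here via logs.
# Assumes every value is strictly positive.
sub geometric_mean_of {
    my @data = @_;
    return undef unless @data;
    my $log_sum = 0;
    $log_sum += log($_) for @data;
    return exp($log_sum / @data);
}

printf "harmonic = %.4f, geometric = %.4f\n",
    harmonic_mean_of(1, 2, 4), geometric_mean_of(1, 2, 4);
```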

    $stat->mode();
        Returns the mode of the data.

    $stat->trimmed_mean(ltrim[,utrim]);
        "trimmed_mean(ltrim)" returns the mean with a fraction "ltrim" of
        entries at each end dropped. "trimmed_mean(ltrim,utrim)" returns
        the mean after a fraction "ltrim" has been removed from the lower
        end of the data and a fraction "utrim" has been removed from the
        upper end of the data. This method sorts the data before
        beginning to analyze it.

        All calls to trimmed_mean() are cached so that they don't have to
        be calculated a second time.
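
A trimmed mean along the lines described above might be sketched like
this. The helper trimmed_mean_of() is hypothetical, and rounding the
number of dropped entries down with int() is an assumption of the sketch.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: mean of the data after dropping a fraction $ltrim
# from the low end and $utrim from the high end ($utrim defaults to $ltrim).
sub trimmed_mean_of {
    my ($data, $ltrim, $utrim) = @_;
    $utrim = $ltrim unless defined $utrim;

    my @sorted    = sort { $a <=> $b } @$data;
    my $drop_low  = int($ltrim * @sorted);   # assumption: round down
    my $drop_high = int($utrim * @sorted);

    my @kept = @sorted[$drop_low .. $#sorted - $drop_high];
    return undef unless @kept;               # everything was trimmed away
    return sum(@kept) / @kept;
}

# Drops 1 and 4, leaving the mean of (2, 3).
print trimmed_mean_of([1, 2, 3, 4], .25), "\n";
```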

    $stat->frequency_distribution($partitions);
    $stat->frequency_distribution(\@bins);
    $stat->frequency_distribution();
        "frequency_distribution($partitions)" slices the data into
        $partitions sets (where $partitions is greater than 1) and counts
        the number of items that fall into each partition. It returns an
        associative array where the keys are the numerical values of the
        partitions used. The minimum value of the data set is not a key
        and the maximum value of the data set is always a key. The number
        of entries for a particular partition key is the number of items
        which are greater than the previous partition key and less than
        or equal to the current partition key. As an example,

            $stat->add_data(1,1.5,2,2.5,3,3.5,4);
            %f = $stat->frequency_distribution(2);
            for (sort {$a <=> $b} keys %f) {
                print "key = $_, count = $f{$_}\n";
            }

        prints

            key = 2.5, count = 4
            key = 4, count = 3

        since there are four items less than or equal to 2.5, and three
        items greater than 2.5 and less than or equal to 4.

        "frequency_distribution(\@bins)" provides the bins that are to be
        used for the distribution. This allows for non-uniform
        distributions as well as trimmed or sample distributions to be
        found. @bins must be monotonic and contain at least one element.
        Note that unless the set of bins covers the full range of the
        data, the total of the counts returned will be less than the
        sample size.

        Calling "frequency_distribution()" with no arguments returns the
        last distribution calculated, if such exists.
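
The partitioning rule for "frequency_distribution($partitions)" can be
reproduced with a short stand-alone sketch. The function
frequency_sketch() is a hypothetical helper, and equal-width partitions
between the minimum and the maximum are an assumption that matches the
example above.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: count items per partition, keyed by each
# partition's upper edge.
sub frequency_sketch {
    my ($partitions, @data) = @_;
    my ($lo, $hi) = (min(@data), max(@data));
    my $width = ($hi - $lo) / $partitions;

    # Upper edges of the partitions; the last edge is the maximum itself,
    # so the maximum is always a key and the minimum never is.
    my @edges = map { $lo + $width * $_ } 1 .. $partitions - 1;
    push @edges, $hi;

    my %count = map { $_ => 0 } @edges;
    for my $x (@data) {
        # An item belongs to the first edge that is >= the item.
        for my $edge (@edges) {
            if ($x <= $edge) { $count{$edge}++; last }
        }
    }
    return %count;
}

my %f = frequency_sketch(2, 1, 1.5, 2, 2.5, 3, 3.5, 4);
for my $key (sort { $a <=> $b } keys %f) {
    print "key = $key, count = $f{$key}\n";
}
```

Run on the data from the example, this gives the same two keys (2.5 and
4) with counts 4 and 3.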

    $stat->least_squares_fit();
    $stat->least_squares_fit(@x);
        "least_squares_fit()" performs a least squares fit on the data,
        assuming a domain of @x or a default of 1..$stat->count(). It
        returns an array of four elements "($q, $m, $r, $rms)" where

        $q and $m
            satisfy the equation "$y = $m*$x + $q".

        $r  is the Pearson linear correlation coefficient.

        $rms
            is the root-mean-square error.

        In case of error or division by zero, the empty list is returned.

        The array that is returned can be "coerced" into a hash structure
        by doing the following:

            my %hash = ();
            @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

        Because calling "least_squares_fit()" with no arguments defaults
        to using the current range, there is no caching of the results.
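
The four returned quantities can be illustrated with the textbook
formulas for a straight-line fit. This is a sketch only: fit_line() is a
hypothetical helper, dividing the squared error by n for the RMS value is
an assumption, and so is returning the empty list whenever either
denominator is not positive.

```perl
use strict;
use warnings;

# Hypothetical helper: fit y = m*x + q and return ($q, $m, $r, $rms).
sub fit_line {
    my ($x, $y) = @_;    # array references of equal length
    my $n = @$y;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $den_x = $n * $sxx - $sx ** 2;
    my $den_y = $n * $syy - $sy ** 2;
    return () unless $den_x > 0 && $den_y > 0;    # avoid division by zero

    my $m = ($n * $sxy - $sx * $sy) / $den_x;                   # slope
    my $q = ($sy - $m * $sx) / $n;                              # intercept
    my $r = ($n * $sxy - $sx * $sy) / sqrt($den_x * $den_y);    # Pearson r

    # Root-mean-square of the residuals (assumption: divide by n).
    my $err2 = 0;
    $err2 += ($y->[$_] - ($m * $x->[$_] + $q)) ** 2 for 0 .. $n - 1;
    return ($q, $m, $r, sqrt($err2 / $n));
}

my @y = (3, 5, 7, 9);              # lies exactly on y = 2*x + 1
my @x = (1 .. scalar @y);          # default domain 1..count
my ($q, $m, $r, $rms) = fit_line(\@x, \@y);
print "q=$q m=$m r=$r rms=$rms\n";
```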

REPORTING ERRORS
    I read my email frequently, but since adopting this module I've added
    2 children and 1 dog to my family, so please be patient about my
    response times. When reporting errors, please include the following
    to help me out:

    ·   Your version of perl. This can be obtained by typing "perl -v" at
        the command line.

    ·   Which version of Statistics::Descriptive you're using. As you can
        see below, I do make mistakes. Unfortunately for me, right now
        there are thousands of CDs with the version of this module with
        the bugs in it. Fortunately for you, I'm a very patient module
        maintainer.

    ·   Details about what the error is. Try to narrow down the scope of
        the problem and send me code that I can run to verify and track
        it down.

AUTHOR
    Colin Kuskie

    My email address can be found at http://www.perl.com under Who's Who
    or at: http://search.cpan.org/author/COLINK/.

REFERENCES
    RFC2330, Framework for IP Performance Metrics

    The Art of Computer Programming, Volume 2, Donald Knuth.

    Handbook of Mathematical Functions, Milton Abramowitz and Irene
    Stegun.

    Probability and Statistics for Engineering and the Sciences, Jay
    Devore.

COPYRIGHT
    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

REVISION HISTORY
    v2.3 Rolled into November 1998

        Code provided by Andrea Spinelli to prevent division by zero and
        to make consistent return values for undefined behavior. Andrea
        also provided a test bench for the module.

        A bug fix for the calculation of frequency distributions. Thanks
        to Nick Tolli for alerting me to this.

        Added 4 lines of code to Makefile.PL to make it easier for the
        ActiveState installation tool to use. Changes work fine in
        perl5.004_04, haven't tested them under perl5.005xx yet.

    v2.2 Rolled into March 1998.

        Fixed problem with sending 0's and -1's as data. The old
        0 ? true : false thing. Use defined to fix.

        Provided a fix for AUTOLOAD/DESTROY/Carp bug. Very strange.

    v2.1 August 1997

        Fixed errors in statistics algorithms caused by changing the
        interface.

    v2.0 August 1997

        Fixed errors in removing cached values (they weren't being
        removed!) and added sort_data and presorted methods.

    June 1997

        Transferred ownership of the module from Jason to Colin.

        Rewrote OO interface, modified function distribution, added
        mindex, maxdex.

    v1.1 April 1995

        Added LeastSquaresFit and FrequencyDistribution.

    v1.0 March 1995

        Released to comp.lang.perl and placed on archive sites.

    v.20 December 1994

        Complete rewrite after extensive and invaluable e-mail
        correspondence with Anno Siegel.

    v.10 December 1994

        Initial concept, released to perl5-porters list.

perl v5.8.8                       2002-10-10                    Descriptive(3)