Statistics::Descriptive(3pm)

Descriptive(3)        User Contributed Perl Documentation       Descriptive(3)

NAME

       Statistics::Descriptive - Module of basic descriptive statistical
       functions.

SYNOPSIS

         use Statistics::Descriptive;
         $stat = Statistics::Descriptive::Full->new();
         $stat->add_data(1,2,3,4);
         $mean = $stat->mean();
         $var  = $stat->variance();
         $tm   = $stat->trimmed_mean(.25);
         $Statistics::Descriptive::Tolerance = 1e-10;

DESCRIPTION

       This module provides basic functions used in descriptive statistics.
       It has an object-oriented design and supports two different types of
       data storage and calculation objects: sparse and full. With the sparse
       method, none of the data is stored and only a few statistical measures
       are available. Using the full method, the entire data set is retained
       and additional functions are available.

       Whenever a division by zero may occur, the denominator is checked to be
       greater than the value $Statistics::Descriptive::Tolerance, which
       defaults to 0.0. You may want to change this value to some small
       positive value such as 1e-24 in order to obtain error messages in case
       of very small denominators.

       Many of the methods (both Sparse and Full) cache values so that
       subsequent calls with the same arguments are faster.

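       The tolerance check described above can be pictured with a short
       sketch. The helper checked_divide and the package variable $Tolerance
       are hypothetical stand-ins, not part of the module; the sketch only
       assumes the stated rule that a denominator must be strictly greater
       than the tolerance, and returning undef on failure is an assumption
       of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Statistics::Descriptive::Tolerance.
our $Tolerance = 0.0;

# Hypothetical helper: divide only when the denominator exceeds the
# tolerance; otherwise return undef instead of dividing.
sub checked_divide {
    my ($numerator, $denominator) = @_;
    return undef unless $denominator > $Tolerance;
    return $numerator / $denominator;
}

my $ok = checked_divide(10, 4);        # denominator passes the check
$Tolerance = 1e-10;
my $tiny = checked_divide(1, 1e-12);   # rejected: 1e-12 is not > 1e-10
```

       With the default tolerance of 0.0 only zero (or negative)
       denominators are rejected; raising the tolerance also rejects very
       small ones.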

METHODS

       Sparse Methods

       $stat = Statistics::Descriptive::Sparse->new();
            Create a new sparse statistics object.

       $stat->add_data(1,2,3);
            Adds data to the statistics variable. The cached statistical
            values are updated automatically.

       $stat->count();
            Returns the number of data items.

       $stat->mean();
            Returns the mean of the data.

       $stat->sum();
            Returns the sum of the data.

       $stat->variance();
            Returns the variance of the data.  Division by n-1 is used.

       $stat->standard_deviation();
            Returns the standard deviation of the data. Division by n-1 is
            used.

       $stat->min();
            Returns the minimum value of the data set.

       $stat->mindex();
            Returns the index of the minimum value of the data set.

       $stat->max();
            Returns the maximum value of the data set.

       $stat->maxdex();
            Returns the index of the maximum value of the data set.

       $stat->sample_range();
            Returns the sample range (max - min) of the data set.

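       The sparse measures listed above can all be maintained without
       storing the data. The following is a minimal sketch of that idea, not
       the module's own code: the hash layout and the helper names
       add_values and sample_variance are hypothetical, and keeping the
       first occurrence as mindex/maxdex on ties is an assumption.

```perl
use strict;
use warnings;

# Hypothetical running summary: only totals and extremes are kept,
# never the data values themselves.
my %s = (count => 0, sum => 0, sumsq => 0);

# Fold a list of values into the running summary.
sub add_values {
    my ($s, @values) = @_;
    for my $x (@values) {
        my $i = $s->{count}++;            # zero-based index of this value
        $s->{sum}   += $x;
        $s->{sumsq} += $x * $x;
        if (!defined $s->{min} || $x < $s->{min}) {
            @{$s}{qw(min mindex)} = ($x, $i);
        }
        if (!defined $s->{max} || $x > $s->{max}) {
            @{$s}{qw(max maxdex)} = ($x, $i);
        }
    }
    return $s;
}

# Sample variance with division by n-1; undefined for fewer than 2 values.
sub sample_variance {
    my ($s) = @_;
    return undef if $s->{count} < 2;
    my $mean = $s->{sum} / $s->{count};
    return ($s->{sumsq} - $s->{count} * $mean**2) / ($s->{count} - 1);
}

add_values(\%s, 1, 2, 3);
add_values(\%s, 4);                        # a later call extends the totals
my $mean  = $s{sum} / $s{count};           # 2.5
my $var   = sample_variance(\%s);          # 5/3
my $sd    = sqrt($var);
my $range = $s{max} - $s{min};             # 3
```

       The sum-of-squares formula is the simplest to show but can lose
       precision on data with a large mean; it is used here only for
       brevity.
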
       Full Methods

       Similar to the Sparse Methods above, any Full Method that is called
       caches the current result so that it doesn't have to be recalculated.
       In some cases, several values can be cached at the same time.

       $stat = Statistics::Descriptive::Full->new();
            Create a new statistics object that inherits from
            Statistics::Descriptive::Sparse so that it contains all the
            methods described above.

       $stat->add_data(1,2,4,5);
            Adds data to the statistics variable.  All of the sparse
            statistical values are updated and cached.  Cached values from
            Full methods are deleted since they are no longer valid.

            Note:  Calling add_data with an empty array will delete all of
            your Full method cached values!  Cached values for the sparse
            methods are not changed.

       $stat->get_data();
            Returns a copy of the data array.

       $stat->sort_data();
            Sorts the stored data and updates the values returned by the
            mindex and maxdex methods.  This method uses Perl's internal
            sort.

       $stat->presorted(1);
       $stat->presorted();
            If called with a non-zero argument, this method sets a flag that
            says the data is already sorted and need not be sorted again.
            Since some of the methods in this class require sorted data, this
            saves some time.  If you supply sorted data to the object, call
            this method to prevent the data from being sorted again. The flag
            is cleared whenever add_data is called.  Calling the method
            without an argument returns the value of the flag.

       $x = $stat->percentile(25);
       ($x, $index) = $stat->percentile(25);
            Sorts the data and returns the value that corresponds to the
            percentile as defined in RFC2330:

            *   For example, given the 6 measurements:

                -2, 7, 7, 4, 18, -5

                Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6,
                F(7) = 5/6, F(18) = 1, F(239) = 1.

                Note that we can recover the different measured values and how
                many times each occurred from F(x) -- no information regarding
                the range in values is lost.  Summarizing measurements using
                histograms, on the other hand, in general loses information
                about the different values observed, so the EDF is preferred.

                Using either the EDF or a histogram, however, we do lose
                information regarding the order in which the values were
                observed.  Whether this loss is potentially significant will
                depend on the metric being measured.

                We will use the term "percentile" to refer to the smallest
                value of x for which F(x) >= a given percentage.  So the 50th
                percentile of the example above is 4, since F(4) = 3/6 = 50%;
                the 25th percentile is -2, since F(-5) = 1/6 < 25%, and F(-2)
                = 2/6 >= 25%; the 100th percentile is 18; and the 0th
                percentile is -infinity, as is the 15th percentile.

                Care must be taken when using percentiles to summarize a
                sample, because they can lend an unwarranted appearance of
                more precision than is really available.  Any such summary
                must include the sample size N, because any percentile
                difference finer than 1/N is below the resolution of the
                sample.

            (Taken from: RFC2330 - Framework for IP Performance Metrics,
            Section 11.3.  Defining Statistical Distributions.  RFC2330 is
            available from:
            http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html.)

            If the percentile method is called in a list context then it will
            also return the index of the percentile.

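            The quoted definition ("the smallest value of x for which F(x)
            >= a given percentage") can be sketched directly. The helper
            edf_percentile below is hypothetical and is not the module's
            implementation; it handles only percentages greater than 0 and
            at most 100, and returning the index into the sorted data in
            list context is an assumption modelled on the description above.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the quoted definition: the smallest
# data value x whose empirical distribution F(x) reaches $p percent.
# Only 0 < $p <= 100 is handled here.
sub edf_percentile {
    my ($p, @data) = @_;
    return unless @data && $p > 0 && $p <= 100;
    my @sorted = sort { $a <=> $b } @data;
    my $n = @sorted;
    for my $i (0 .. $n - 1) {
        # F($sorted[$i]) is at least ($i + 1) / $n; compare without dividing.
        if (($i + 1) * 100 >= $p * $n) {
            return wantarray ? ($sorted[$i], $i) : $sorted[$i];
        }
    }
    return;
}

my @measurements = (-2, 7, 7, 4, 18, -5);
my $p50  = edf_percentile(50,  @measurements);   # 4
my $p25  = edf_percentile(25,  @measurements);   # -2
my $p100 = edf_percentile(100, @measurements);   # 18
my ($value, $index) = edf_percentile(25, @measurements);
```

            The three results agree with the 50th, 25th and 100th
            percentiles worked out in the quoted example.
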
       $stat->median();
            Sorts the data and returns the median value of the data.

       $stat->harmonic_mean();
            Returns the harmonic mean of the data.  Since the mean is
            undefined if any of the data are zero or if the sum of the
            reciprocals is zero, it will return undef for both of those
            cases.

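            The two undefined cases are easy to see in a small sketch. The
            helper harmonic_mean_sketch is hypothetical and shows only the
            textbook formula (count divided by the sum of reciprocals), not
            the module's code; returning undef for empty input is an
            assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: n divided by the sum of reciprocals, with undef
# for the two undefined cases named above.
sub harmonic_mean_sketch {
    my @data = @_;
    return undef unless @data;
    my $sum_of_reciprocals = 0;
    for my $x (@data) {
        return undef if $x == 0;               # 1/0 is undefined
        $sum_of_reciprocals += 1 / $x;
    }
    return undef if $sum_of_reciprocals == 0;  # e.g. the data (2, -2)
    return scalar(@data) / $sum_of_reciprocals;
}

my $hm        = harmonic_mean_sketch(1, 2, 4);   # 3 / 1.75
my $with_zero = harmonic_mean_sketch(1, 0, 3);   # undef
my $cancels   = harmonic_mean_sketch(2, -2);     # undef
```
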
       $stat->geometric_mean();
            Returns the geometric mean of the data.

       $stat->mode();
            Returns the mode of the data.

       $stat->trimmed_mean(ltrim[,utrim]);
            "trimmed_mean(ltrim)" returns the mean with a fraction "ltrim" of
            entries at each end dropped. "trimmed_mean(ltrim,utrim)" returns
            the mean after a fraction "ltrim" has been removed from the lower
            end of the data and a fraction "utrim" has been removed from the
            upper end of the data.  This method sorts the data before
            beginning to analyze it.

            All calls to trimmed_mean() are cached so that they don't have to
            be calculated a second time.

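            A minimal sketch of the trimming described above follows. The
            helper trimmed_mean_sketch is hypothetical, and the rounding
            rule is an assumption: the number of entries dropped at each end
            is taken to be int(count * fraction), which the text above does
            not specify.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper. Assumption: the number of entries dropped at each
# end is int(count * fraction), i.e. fractions of an entry are truncated.
sub trimmed_mean_sketch {
    my ($data, $ltrim, $utrim) = @_;
    $utrim = $ltrim unless defined $utrim;   # one argument trims both ends
    my @sorted = sort { $a <=> $b } @$data;
    my $lower = int(scalar(@sorted) * $ltrim);             # entries dropped below
    my $upper = $#sorted - int(scalar(@sorted) * $utrim);  # last index kept
    return undef if $lower > $upper;                       # nothing left
    my @kept = @sorted[$lower .. $upper];
    return sum(@kept) / scalar(@kept);
}

my $tm_both  = trimmed_mean_sketch([4, 1, 3, 2], .25);      # mean of (2, 3)
my $tm_lower = trimmed_mean_sketch([4, 1, 3, 2], .25, 0);   # mean of (2, 3, 4)
```
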
       $stat->frequency_distribution($partitions);
       $stat->frequency_distribution(\@bins);
       $stat->frequency_distribution();
            "frequency_distribution($partitions)" slices the data into
            $partitions sets (where $partitions is greater than 1) and counts
            the number of items that fall into each partition. It returns an
            associative array where the keys are the numerical values of the
            partitions used. The minimum value of the data set is not a key
            and the maximum value of the data set is always a key. The number
            of entries for a particular partition key is the number of items
            which are greater than the previous partition key and less than
            or equal to the current partition key. As an example,

               $stat->add_data(1,1.5,2,2.5,3,3.5,4);
               %f = $stat->frequency_distribution(2);
               for (sort {$a <=> $b} keys %f) {
                  print "key = $_, count = $f{$_}\n";
               }

            prints

               key = 2.5, count = 4
               key = 4, count = 3

            since there are four items less than or equal to 2.5, and three
            items greater than 2.5 and less than or equal to 4.

            "frequency_distribution(\@bins)" provides the bins that are to be
            used for the distribution.  This allows for non-uniform
            distributions as well as trimmed or sample distributions to be
            found.  @bins must be monotonic and contain at least one element.
            Note that unless the set of bins covers the full range of the
            data, the total of the counts returned will be less than the
            sample size.

            Calling "frequency_distribution()" with no arguments returns the
            last distribution calculated, if such exists.

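            The counting rule can be sketched in plain Perl. Both helpers
            below are hypothetical and are not the module's code. The evenly
            spaced keys in partition_keys are an assumption inferred from
            the example above (keys 2.5 and 4 for the data 1..4), and
            counting items below the first key under the first key is
            likewise an assumption consistent with that example. Bins are
            assumed to be in ascending order.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper. Assumption inferred from the example above: keys
# are evenly spaced, min + i*(max - min)/partitions for i = 1..partitions.
sub partition_keys {
    my ($partitions, @data) = @_;
    my ($lo, $hi) = (min(@data), max(@data));
    my $width = ($hi - $lo) / $partitions;
    my @keys = map { $lo + $_ * $width } 1 .. $partitions - 1;
    return (@keys, $hi);     # the maximum is always the last key
}

# Count each item under the first key that is >= the item. Items above
# the last key are not counted at all.
sub count_into_bins {
    my ($bins, @data) = @_;
    my %count = map { $_ => 0 } @$bins;
    for my $x (@data) {
        for my $key (@$bins) {
            if ($x <= $key) { $count{$key}++; last }
        }
    }
    return %count;
}

my @data = (1, 1.5, 2, 2.5, 3, 3.5, 4);
my %f = count_into_bins([ partition_keys(2, @data) ], @data);
# %f is (2.5 => 4, 4 => 3), matching the example above
my %g = count_into_bins([ 2, 3 ], @data);
# %g is (2 => 3, 3 => 2); the values 3.5 and 4 fall outside the bins
```

            The second call shows the note about bins: because the bins stop
            at 3, only five of the seven items are counted.
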
       $stat->least_squares_fit();
       $stat->least_squares_fit(@x);
            "least_squares_fit()" performs a least squares fit on the data,
            assuming a domain of @x or a default of 1..$stat->count().  It
            returns an array of four elements "($q, $m, $r, $rms)" where

            $q and $m
                satisfy the equation $y = $m*$x + $q.

            $r  is the Pearson linear correlation coefficient.

            $rms
                is the root-mean-square error.

            In case of error or division by zero, the empty list is returned.

            The array that is returned can be "coerced" into a hash structure
            by doing the following:

              my %hash = ();
              @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

            Because calling "least_squares_fit()" with no arguments defaults
            to using the current range, there is no caching of the results.

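            The four returned values can be illustrated with the standard
            least squares formulas. This is a sketch under stated
            assumptions, not the module's implementation: the helper
            least_squares_sketch is hypothetical, the RMS error is assumed
            to divide the summed squared residuals by the number of points,
            and the empty list is returned whenever a denominator would be
            zero.

```perl
use strict;
use warnings;

# Hypothetical helper: ordinary least squares of @$y against @$x, where
# @$x defaults to 1..n. Returns ($q, $m, $r, $rms) or the empty list.
sub least_squares_sketch {
    my ($y, $x) = @_;
    my $n = @$y;
    $x = [ 1 .. $n ] unless defined $x;
    return () if $n < 2 || @$x != $n;

    my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }

    my $x_spread = $n * $sxx - $sx ** 2;
    my $y_spread = $n * $syy - $sy ** 2;
    return () if $x_spread <= 0 || $y_spread <= 0;   # avoid division by zero

    my $m = ($n * $sxy - $sx * $sy) / $x_spread;     # slope
    my $q = ($sy - $m * $sx) / $n;                   # intercept
    my $r = ($n * $sxy - $sx * $sy) / sqrt($x_spread * $y_spread);

    # Assumption: the RMS error divides the squared residuals by n.
    my $sq_err = 0;
    $sq_err += ($y->[$_] - ($m * $x->[$_] + $q)) ** 2 for 0 .. $n - 1;
    my $rms = sqrt($sq_err / $n);

    return ($q, $m, $r, $rms);
}

my ($q, $m, $r, $rms) = least_squares_sketch([3, 5, 7, 9]);
# $q = 1, $m = 2, $r = 1, $rms = 0: the points lie exactly on y = 2x + 1
```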

REPORTING ERRORS

       I read my email frequently, but since adopting this module I've added 2
       children and 1 dog to my family, so please be patient about my response
       times.  When reporting errors, please include the following to help me
       out:

       ·   Your version of perl.  This can be obtained by typing "perl -v" at
           the command line.

       ·   Which version of Statistics::Descriptive you're using.  As you can
           see below, I do make mistakes.  Unfortunately for me, right now
           there are thousands of CDs with the version of this module with
           the bugs in it.  Fortunately for you, I'm a very patient module
           maintainer.

       ·   Details about what the error is.  Try to narrow down the scope of
           the problem and send me code that I can run to verify and track it
           down.

AUTHOR

       Colin Kuskie

       My email address can be found at http://www.perl.com under Who's Who or
       at: http://search.cpan.org/author/COLINK/.

REFERENCES

       RFC2330, Framework for IP Performance Metrics.

       The Art of Computer Programming, Volume 2, Donald Knuth.

       Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.

       Probability and Statistics for Engineering and the Sciences, Jay
       Devore.

COPYRIGHT

       Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
       program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
       is free software; you can redistribute it and/or modify it under the
       same terms as Perl itself.

       Copyright (c) 1994,1995 Jason Kastner. All rights reserved.  This
       program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

REVISION HISTORY

       v2.3 Rolled into November 1998

       Code provided by Andrea Spinelli to prevent division by zero and to
       make consistent return values for undefined behavior.  Andrea also
       provided a test bench for the module.

       A bug fix for the calculation of frequency distributions.  Thanks to
       Nick Tolli for alerting me to this.

       Added 4 lines of code to Makefile.PL to make it easier for the
       ActiveState installation tool to use.  Changes work fine in
       perl5.004_04, haven't tested them under perl5.005xx yet.

       v2.2 Rolled into March 1998.

       Fixed problem with sending 0's and -1's as data.  The old 0 ? true :
       false thing.  Use defined to fix.

       Provided a fix for AUTOLOAD/DESTROY/Carp bug.  Very strange.

       v2.1 August 1997

       Fixed errors in statistics algorithms caused by changing the interface.

       v2.0 August 1997

       Fixed errors in removing cached values (they weren't being removed!)
       and added sort_data and presorted methods.

       June 1997

       Transferred ownership of the module from Jason to Colin.

       Rewrote OO interface, modified function distribution, added mindex,
       maxdex.

       v1.1 April 1995

       Added LeastSquaresFit and FrequencyDistribution.

       v1.0 March 1995

       Released to comp.lang.perl and placed on archive sites.

       v.20 December 1994

       Complete rewrite after extensive and invaluable e-mail correspondence
       with Anno Siegel.

       v.10 December 1994

       Initial concept, released to perl5-porters list.



perl v5.8.8                       2002-10-10                    Descriptive(3)