Statistics::Descriptive(3)  User Contributed Perl Documentation  Statistics::Descriptive(3)

NAME

       Statistics::Descriptive - Module of basic descriptive statistical
       functions.

SYNOPSIS

         use Statistics::Descriptive;
         $stat = Statistics::Descriptive::Full->new();
         $stat->add_data(1,2,3,4);
         $mean = $stat->mean();
         $var  = $stat->variance();
         $tm   = $stat->trimmed_mean(.25);
         $Statistics::Descriptive::Tolerance = 1e-10;

DESCRIPTION

       This module provides basic functions used in descriptive statistics.
       It has an object oriented design and supports two different types of
       data storage and calculation objects: sparse and full. With the sparse
       method, none of the data is stored and only a few statistical measures
       are available. Using the full method, the entire data set is retained
       and additional functions are available.

       Whenever a division by zero may occur, the denominator is checked to be
       greater than the value $Statistics::Descriptive::Tolerance, which
       defaults to 0.0. You may want to change this value to some small
       positive value such as 1e-24 in order to obtain error messages in case
       of very small denominators.
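
       As a minimal sketch of the kind of check described above (not the
       module's own code), a guarded division could look like this. The
       helper name guarded_divide is hypothetical, and returning undef for a
       rejected denominator is an assumption of the sketch; this page only
       says that the denominator is compared against the tolerance.

```perl
use strict;
use warnings;

# Hypothetical helper: divide only when the denominator exceeds a tolerance.
# Returning undef on rejection is an assumption made for this sketch.
sub guarded_divide {
    my ($numerator, $denominator, $tolerance) = @_;
    $tolerance = 0.0 unless defined $tolerance;    # default stated on this page
    return unless $denominator > $tolerance;       # undef in scalar context
    return $numerator / $denominator;
}

my $ok  = guarded_divide(1, 4);                # accepted
my $bad = guarded_divide(1, 1e-30, 1e-24);     # rejected: denominator too small
print defined $ok  ? $ok  : 'undef', "\n";
print defined $bad ? $bad : 'undef', "\n";
```

       With a tolerance of 0.0 only denominators that are not greater than
       zero are rejected; a small positive tolerance also rejects values such
       as 1e-30.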

       Many of the methods (both Sparse and Full) cache values so that
       subsequent calls with the same arguments are faster.

METHODS

   Sparse Methods
       $stat = Statistics::Descriptive::Sparse->new();
            Create a new sparse statistics object.

       $stat->clear();
            Effectively the same as

              my $class = ref($stat);
              undef $stat;
              $stat = $class->new();

            except more efficient.

       $stat->add_data(1,2,3);
            Adds data to the statistics variable. The cached statistical
            values are updated automatically.

       $stat->count();
            Returns the number of data items.

       $stat->mean();
            Returns the mean of the data.

       $stat->sum();
            Returns the sum of the data.

       $stat->variance();
            Returns the variance of the data.  Division by n-1 is used.

       $stat->standard_deviation();
            Returns the standard deviation of the data. Division by n-1 is
            used.
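
            To make the n-1 convention concrete, here is a small standalone
            sketch of the sample variance and standard deviation. It does
            not use the module; the function names are hypothetical, and
            returning nothing for fewer than two values is a choice made for
            the sketch only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sample variance: sum of squared deviations from the mean, divided by n-1.
sub sample_variance {
    my @x = @_;
    my $n = @x;
    return if $n < 2;                  # n-1 would be zero (sketch's choice)
    my $mean = sum(@x) / $n;
    return sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);
}

# Sample standard deviation: square root of the sample variance.
sub sample_std_dev {
    my $var = sample_variance(@_);
    return defined $var ? sqrt($var) : undef;
}

printf "%.4f %.4f\n", sample_variance(1, 2, 3, 4), sample_std_dev(1, 2, 3, 4);
```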

       $stat->min();
            Returns the minimum value of the data set.

       $stat->mindex();
            Returns the index of the minimum value of the data set.

       $stat->max();
            Returns the maximum value of the data set.

       $stat->maxdex();
            Returns the index of the maximum value of the data set.

       $stat->sample_range();
            Returns the sample range (max - min) of the data set.
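
            The sparse idea (keeping summary values without storing the
            data) can be sketched in plain Perl as follows. This is an
            illustration rather than the module's implementation: the
            function names and hash keys are hypothetical, and zero-based
            indices are an assumption.

```perl
use strict;
use warnings;

# Running summary kept without storing the data itself.
sub new_running { return { count => 0, sum => 0 } }

sub add_values {
    my ($s, @values) = @_;
    for my $v (@values) {
        # Record a new extreme together with its position (zero-based).
        if (!defined $s->{min} || $v < $s->{min}) {
            @$s{qw(min mindex)} = ($v, $s->{count});
        }
        if (!defined $s->{max} || $v > $s->{max}) {
            @$s{qw(max maxdex)} = ($v, $s->{count});
        }
        $s->{sum} += $v;
        $s->{count}++;
    }
    return $s;
}

my $s = add_values(new_running(), 4, 1, 7, 3);
print "count=$s->{count} mean=", $s->{sum} / $s->{count},
      " min=$s->{min} (index $s->{mindex})",
      " max=$s->{max} (index $s->{maxdex})",
      " range=", $s->{max} - $s->{min}, "\n";
```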

   Full Methods
       Similar to the Sparse Methods above, any Full Method that is called
       caches the current result so that it doesn't have to be recalculated.
       In some cases, several values can be cached at the same time.

       $stat = Statistics::Descriptive::Full->new();
            Create a new statistics object that inherits from
            Statistics::Descriptive::Sparse so that it contains all the
            methods described above.

       $stat->add_data(1,2,4,5);
            Adds data to the statistics variable.  All of the sparse
            statistical values are updated and cached.  Cached values from
            Full methods are deleted since they are no longer valid.

            Note:  Calling add_data with an empty array will delete all of
            your Full method cached values!  Cached values for the sparse
            methods are not changed.

       $stat->get_data();
            Returns a copy of the data array.

       $stat->sort_data();
            Sorts the stored data and updates the values returned by mindex
            and maxdex.  This method uses Perl's internal sort.

       $stat->presorted(1);
       $stat->presorted();
            If called with a non-zero argument, this method sets a flag that
            says the data is already sorted and need not be sorted again.
            Since some of the methods in this class require sorted data, this
            saves some time.  If you supply sorted data to the object, call
            this method to prevent the data from being sorted again. The flag
            is cleared whenever add_data is called.  Calling the method
            without an argument returns the value of the flag.

       $stat->skewness();
            Returns the skewness of the data.  A value of zero is no skew,
            negative is a left skewed tail, positive is a right skewed tail.
            This is consistent with Excel.
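
            Assuming that "consistent with Excel" refers to the estimator
            used by Excel's SKEW function, the calculation can be sketched
            as below. The function name excel_skew is hypothetical, and the
            code is an illustration, not the module's implementation.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sample skewness in the form used by Excel's SKEW():
#   n / ((n-1)(n-2)) * sum( ((x - mean) / s)^3 ),  s = sample std dev (n-1)
sub excel_skew {
    my @x = @_;
    my $n = @x;
    return if $n < 3;                  # the formula needs at least 3 values
    my $mean = sum(@x) / $n;
    my $s = sqrt(sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1));
    return if $s == 0;                 # all values equal
    return $n / (($n - 1) * ($n - 2))
         * sum(map { (($_ - $mean) / $s) ** 3 } @x);
}

printf "%.4f\n", excel_skew(1, 2, 3, 4, 5);     # symmetric data
printf "%.4f\n", excel_skew(1, 2, 3, 4, 20);    # long right tail
```

            Symmetric data gives a value of (numerically) zero, and a long
            right tail gives a positive value, matching the sign convention
            described above.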

       $stat->kurtosis();
            Returns the kurtosis of the data.  Positive is peaked, negative is
            flattened.

       $x = $stat->percentile(25);
       ($x, $index) = $stat->percentile(25);
            Sorts the data and returns the value that corresponds to the
            percentile as defined in RFC2330:

            ·   For example, given the 6 measurements:

                -2, 7, 7, 4, 18, -5

                Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6,
                F(7) = 5/6, F(18) = 1, F(239) = 1.

                Note that we can recover the different measured values and how
                many times each occurred from F(x) -- no information regarding
                the range in values is lost.  Summarizing measurements using
                histograms, on the other hand, in general loses information
                about the different values observed, so the EDF is preferred.

                Using either the EDF or a histogram, however, we do lose
                information regarding the order in which the values were
                observed.  Whether this loss is potentially significant will
                depend on the metric being measured.

                We will use the term "percentile" to refer to the smallest
                value of x for which F(x) >= a given percentage.  So the 50th
                percentile of the example above is 4, since F(4) = 3/6 = 50%;
                the 25th percentile is -2, since F(-5) = 1/6 < 25%, and F(-2)
                = 2/6 >= 25%; the 100th percentile is 18; and the 0th
                percentile is -infinity, as is the 15th percentile.

                Care must be taken when using percentiles to summarize a
                sample, because they can lend an unwarranted appearance of
                more precision than is really available.  Any such summary
                must include the sample size N, because any percentile
                difference finer than 1/N is below the resolution of the
                sample.

            (Taken from: RFC2330 - Framework for IP Performance Metrics,
            Section 11.3.  Defining Statistical Distributions.  RFC2330 is
            available from: <http://www.ietf.org/rfc/rfc2330.txt> .)

            If the percentile method is called in a list context then it will
            also return the index of the percentile.
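
            The rule quoted above ("the smallest value of x for which F(x)
            >= a given percentage") can be sketched in standalone Perl as
            follows. The function name edf_percentile is hypothetical;
            returning nothing for the "-infinity" case and reporting the
            index into the sorted data are assumptions of the sketch, not
            statements about the module.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Smallest sample value x with F(x) >= $percent, where F is the empirical
# distribution function. Returns nothing when no sample value qualifies
# (the "-infinity" case of the 0th percentile).
sub edf_percentile {
    my ($percent, @data) = @_;
    my @sorted = sort { $a <=> $b } @data;
    my $index  = ceil(@sorted * $percent / 100) - 1;    # zero-based
    return if $index < 0 || $index > $#sorted;
    return wantarray ? ($sorted[$index], $index) : $sorted[$index];
}

my @measurements = (-2, 7, 7, 4, 18, -5);
print scalar(edf_percentile(50,  @measurements)), "\n";    # 4
print scalar(edf_percentile(25,  @measurements)), "\n";    # -2
print scalar(edf_percentile(100, @measurements)), "\n";    # 18
```

            The three printed values match the 50th, 25th and 100th
            percentiles worked out in the RFC2330 example.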

       $x = $stat->quantile($Type);
            Sorts the data and returns estimates of underlying distribution
            quantiles based on one or two order statistics from the supplied
            elements.

            This method uses the same algorithm as Excel and the R language
            (quantile type 7).

            The generic function quantile produces sample quantiles
            corresponding to the given probabilities.

            $Type is an integer value between 0 and 4:

              0 => zero quartile (Q0) : minimal value
              1 => first quartile (Q1) : lower quartile = lowest cut off (25%) of data = 25th percentile
              2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
              3 => third quartile (Q3) : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
              4 => fourth quartile (Q4) : maximal value

            Example:

              my @data = (1..10);
              my $stat = Statistics::Descriptive::Full->new();
              $stat->add_data(@data);
              print $stat->quantile(0); # => 1
              print $stat->quantile(1); # => 3.25
              print $stat->quantile(2); # => 5.5
              print $stat->quantile(3); # => 7.75
              print $stat->quantile(4); # => 10
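
            For readers who want to see where values such as 3.25 come from,
            here is a standalone sketch of a type 7 quantile: linear
            interpolation at position (n-1)*p in the sorted data. The
            function name quantile_type7 is hypothetical and the code is an
            illustration, not the module's source.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Type 7 quantile: interpolate linearly between the two order statistics
# that bracket position (n-1)*p in the sorted data, with p = quartile/4.
sub quantile_type7 {
    my ($quartile, @data) = @_;            # $quartile is 0..4, like $Type
    my @sorted = sort { $a <=> $b } @data;
    return unless @sorted;
    my $h  = $#sorted * $quartile / 4;     # (n-1) * p
    my $lo = floor($h);
    my $hi = $lo < $#sorted ? $lo + 1 : $lo;
    return $sorted[$lo] + ($h - $lo) * ($sorted[$hi] - $sorted[$lo]);
}

print join(' ', map { quantile_type7($_, 1 .. 10) } 0 .. 4), "\n";
# prints: 1 3.25 5.5 7.75 10
```

            For the first quartile of 1..10, the position is 9 * 0.25 =
            2.25, so the result lies a quarter of the way from the third
            sorted value (3) to the fourth (4).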

       $stat->median();
            Sorts the data and returns the median value of the data.

       $stat->harmonic_mean();
            Returns the harmonic mean of the data.  Since the mean is
            undefined if any of the data are zero or if the sum of the
            reciprocals is zero, it will return undef for both of those cases.
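
            The two undefined cases are easy to see in a standalone sketch
            of the calculation (n divided by the sum of the reciprocals).
            The function below is an illustration written for this page, not
            the module's code.

```perl
use strict;
use warnings;

# Harmonic mean: n divided by the sum of reciprocals.
# Returns undef when a value is zero or the reciprocals sum to zero.
sub harmonic_mean {
    my @x = @_;
    my $sum_recip = 0;
    for my $v (@x) {
        return undef if $v == 0;       # 1/0 is undefined
        $sum_recip += 1 / $v;
    }
    return undef if $sum_recip == 0;   # e.g. the data (2, -2)
    return scalar(@x) / $sum_recip;
}

printf "%.4f\n", harmonic_mean(1, 2, 4);                     # 3 / 1.75
print defined harmonic_mean(1, 0)  ? "defined" : "undef", "\n";
print defined harmonic_mean(2, -2) ? "defined" : "undef", "\n";
```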

       $stat->geometric_mean();
            Returns the geometric mean of the data.

       $stat->mode();
            Returns the mode of the data.

       $stat->trimmed_mean(ltrim[,utrim]);
            "trimmed_mean(ltrim)" returns the mean with a fraction "ltrim" of
            entries at each end dropped. "trimmed_mean(ltrim,utrim)" returns
            the mean after a fraction "ltrim" has been removed from the lower
            end of the data and a fraction "utrim" has been removed from the
            upper end of the data.  This method sorts the data before
            beginning to analyze it.

            All calls to trimmed_mean() are cached so that they don't have to
            be calculated a second time.
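
            A standalone sketch of the trimming step follows. It assumes
            that the number of entries dropped at an end is the fraction
            times the count, truncated to an integer; this page does not
            state the rounding rule, so treat that as an assumption. The
            function takes an array reference, which is a convenience of the
            sketch rather than the module's calling convention.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean after dropping a fraction of the sorted values from each end.
# Assumption: the number dropped at an end is int(fraction * n).
sub trimmed_mean {
    my ($data, $ltrim, $utrim) = @_;
    $utrim = $ltrim unless defined $utrim;    # one argument trims both ends
    my @sorted = sort { $a <=> $b } @$data;
    my $lower  = int($ltrim * @sorted);       # first index kept
    my $upper  = $#sorted - int($utrim * @sorted);    # last index kept
    return if $lower > $upper;                # nothing left to average
    my @kept = @sorted[$lower .. $upper];
    return sum(@kept) / scalar(@kept);
}

print trimmed_mean([1 .. 8], 0.25), "\n";       # mean of 3,4,5,6 = 4.5
print trimmed_mean([1 .. 8], 0.25, 0), "\n";    # mean of 3..8    = 5.5
```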

       $stat->frequency_distribution_ref($partitions);
       $stat->frequency_distribution_ref(\@bins);
       $stat->frequency_distribution_ref();
            "frequency_distribution_ref($partitions)" slices the data into
            $partitions sets (where $partitions is greater than 1) and counts
            the number of items that fall into each partition. It returns a
            reference to a hash where the keys are the numerical values of the
            partitions used. The minimum value of the data set is not a key
            and the maximum value of the data set is always a key. The number
            of entries for a particular partition key is the number of items
            which are greater than the previous partition key and less than or
            equal to the current partition key. As an example,

               $stat->add_data(1,1.5,2,2.5,3,3.5,4);
               $f = $stat->frequency_distribution_ref(2);
               for (sort {$a <=> $b} keys %$f) {
                  print "key = $_, count = $f->{$_}\n";
               }

            prints

               key = 2.5, count = 4
               key = 4, count = 3

            since there are four items less than or equal to 2.5, and three
            items greater than 2.5 and less than or equal to 4.

            "frequency_distribution_ref(\@bins)" provides the bins that are
            to be used for the distribution.  This allows for non-uniform
            distributions as well as trimmed or sample distributions to be
            found.  @bins must be monotonic and contain at least one element.
            Note that unless the set of bins covers the full range of the
            data, the total counts returned will be less than the sample size.

            Calling "frequency_distribution_ref()" with no arguments returns
            the last distribution calculated, if such exists.
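
            The partitioning rule described above can be sketched in
            standalone Perl. The sketch assumes equal-width partitions
            between the minimum and maximum of the data, which is consistent
            with the keys 2.5 and 4 in the example but is not spelled out on
            this page. The function is an illustration and takes an array
            reference instead of using a statistics object.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Count items per partition; each key is the upper edge of its partition.
# An item is counted under the first key that is >= the item.
sub frequency_distribution {
    my ($data, $partitions) = @_;
    my ($lo, $hi) = (min(@$data), max(@$data));
    my $width = ($hi - $lo) / $partitions;    # assumed: equal-width partitions
    my @keys  = map { $lo + $width * $_ } 1 .. $partitions - 1;
    push @keys, $hi;                          # the maximum is always a key
    my %count = map { $_ => 0 } @keys;
    for my $item (@$data) {
        for my $key (@keys) {
            if ($item <= $key) { $count{$key}++; last }
        }
    }
    return \%count;
}

my $f = frequency_distribution([1, 1.5, 2, 2.5, 3, 3.5, 4], 2);
print "key = $_, count = $f->{$_}\n" for sort { $a <=> $b } keys %$f;
```

            Run on the data from the example, the sketch reports a count of
            4 for key 2.5 and a count of 3 for key 4.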

       my %hash = $stat->frequency_distribution($partitions);
       my %hash = $stat->frequency_distribution(\@bins);
       my %hash = $stat->frequency_distribution();
            Same as "frequency_distribution_ref()" except that it returns the
            hash flattened into the return list. Kept for compatibility
            with previous versions of Statistics::Descriptive; using it is
            discouraged.

       $stat->least_squares_fit();
       $stat->least_squares_fit(@x);
            "least_squares_fit()" performs a least squares fit on the data,
            assuming a domain of @x or a default of 1..$stat->count().  It
            returns an array of four elements "($q, $m, $r, $rms)" where

            "$q and $m"
                satisfy the equation $y = $m*$x + $q.

            $r  is the Pearson linear correlation coefficient.

            $rms
                is the root-mean-square error.

            In case of error or division by zero, the empty list is returned.

            The array that is returned can be "coerced" into a hash structure
            by doing the following:

              my %hash = ();
              @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

            Because calling "least_squares_fit()" with no arguments defaults
            to using the current range, there is no caching of the results.
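
            The four returned quantities can be illustrated with a
            standalone ordinary least squares sketch. It is not the module's
            code: the function takes array references, defines $rms as the
            square root of the mean squared residual, and returns the empty
            list when a denominator would be zero, all of which are
            assumptions made for the illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Fit y = m*x + q; returns (q, m, r, rms), or the empty list on failure.
# Assumption: rms = sqrt(sum of squared residuals / n).
sub least_squares_fit {
    my ($y, $x) = @_;
    my $n = @$y;
    $x = [1 .. $n] unless defined $x;          # default domain 1..count
    return () if $n < 2 || @$x != $n;
    my ($mean_x, $mean_y) = (sum(@$x) / $n, sum(@$y) / $n);
    my $sxx = sum(map { ($_ - $mean_x) ** 2 } @$x);
    my $syy = sum(map { ($_ - $mean_y) ** 2 } @$y);
    my $sxy = sum(map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $n - 1);
    return () if $sxx == 0 || $syy == 0;       # would divide by zero
    my $m   = $sxy / $sxx;                     # slope
    my $q   = $mean_y - $m * $mean_x;          # intercept
    my $r   = $sxy / sqrt($sxx * $syy);        # Pearson correlation
    my $rms = sqrt(sum(map { ($y->[$_] - ($m * $x->[$_] + $q)) ** 2 } 0 .. $n - 1) / $n);
    return ($q, $m, $r, $rms);
}

my %fit;
@fit{qw(q m r rms)} = least_squares_fit([3, 5, 7, 9]);
print "y = $fit{m}*x + $fit{q}, r = $fit{r}, rms = $fit{rms}\n";
```

            With the data 3, 5, 7, 9 and the default domain 1..4, the points
            lie exactly on y = 2*x + 1, so the sketch reports a correlation
            of 1 and an rms error of 0.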

REPORTING ERRORS

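       I read my email frequently, but since adopting this module I've added 2
       children and 1 dog to my family, so please be patient about my response
       times.  When reporting errors, please include the following to help me
       out:

       ·   Your version of perl.  This can be obtained by typing "perl -v" at
           the command line.

       ·   Which version of Statistics::Descriptive you're using.  As you can
           see below, I do make mistakes.  Unfortunately for me, right now
           there are thousands of CDs with the version of this module with
           the bugs in it.  Fortunately for you, I'm a very patient module
           maintainer.

       ·   Details about what the error is.  Try to narrow down the scope of
           the problem and send me code that I can run to verify and track it
           down.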
306       ·   Which version of Statistics::Descriptive you're using.  As you can
307           see below, I do make mistakes.  Unfortunately for me, right now
308           there are thousands of CD's with the version of this module with
309           the bugs in it.  Fortunately for you, I'm a very patient module
310           maintainer.
311
312       ·   Details about what the error is.  Try to narrow down the scope of
313           the problem and send me code that I can run to verify and track it
314           down.
315

AUTHOR

       Current maintainer:

       Shlomi Fish, <http://www.shlomifish.org/> , "shlomif@cpan.org"

       Previously:

       Colin Kuskie

       My email address can be found at http://www.perl.com under Who's Who or
       at: http://search.cpan.org/author/COLINK/.

REFERENCES

       RFC2330, Framework for IP Performance Metrics

       The Art of Computer Programming, Volume 2, Donald Knuth.

       Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.

       Probability and Statistics for Engineering and the Sciences, Jay
       Devore.

COPYRIGHT

       Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
       program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
       is free software; you can redistribute it and/or modify it under the
       same terms as Perl itself.

       Copyright (c) 1994,1995 Jason Kastner. All rights reserved.  This
       program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

LICENSE

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.12.1                      2010-06-23        Statistics::Descriptive(3)