Statistics::Basic(3)  User Contributed Perl Documentation  Statistics::Basic(3)

NAME

       Statistics::Basic - A collection of very basic statistics modules

SYNOPSIS

           use Statistics::Basic qw(:all);

       These actually return objects, not numbers.  The objects will
       interpolate as nicely formatted numbers (using Number::Format), or
       return the actual number when the object is used as a number.

           my $median = median( 1,2,3 );
           my $mean   = mean(  [1,2,3]); # array refs are ok too

           my $variance = variance( 1,2,3 );
           my $stddev   = stddev(   1,2,3 );

       Although passing unblessed numbers and array refs to these functions
       works, it's sometimes better to pass vector objects so the objects can
       reuse calculated values.

           my $v1       = $mean->query_vector;
           my $variance = variance( $v1 );
           my $stddev   = stddev(   $v1 );

       Here, the mean used by the variance and the variance used by the
       standard deviation will not need to be recalculated.  Now consider
       these two calculations.

           my $covariance  = covariance(  [1 .. 3], [1 .. 3] );
           my $correlation = correlation( [1 .. 3], [1 .. 3] );

       The covariance above would need to be recalculated by the correlation
       when these functions are called this way.  But, if we instead built
       vectors first, that wouldn't happen:

           # $v1 is defined above
           my $v2  = vector(1,2,3);
           my $cov = covariance(  $v1, $v2 );
           my $cor = correlation( $v1, $v2 );

       Now $cor can reuse the covariance calculated in $cov.

       All of the functions above return objects that interpolate or evaluate
       as a single string or as a number.  Statistics::Basic::LeastSquareFit
       and Statistics::Basic::Mode are different:

           my $unimodal   = mode(1,2,3,3);
           my $multimodal = mode(1,2,3);

           print "The modes are: $unimodal and $multimodal.\n";
           print "The first is multimodal... " if $unimodal->is_multimodal;
           print "The second is multimodal.\n" if $multimodal->is_multimodal;

       In the first case, $unimodal will interpolate as a string and function
       correctly as a number.  However, in the second case, trying to use
       $multimodal as a number will "croak" an error -- it still interpolates
       fine though.

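       To see why a mode can be a single number or a whole list, here is a
       minimal pure-Perl sketch of a mode calculation.  It is only an
       illustration; the helper name modes() is hypothetical and this is not
       how Statistics::Basic::Mode is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: return every value that occurs with the highest frequency.
sub modes {
    my @values = @_;

    my %count;
    $count{$_}++ for @values;

    my $max = 0;
    for my $c (values %count) {
        $max = $c if $c > $max;
    }

    return sort { $a <=> $b } grep { $count{$_} == $max } keys %count;
}

my @unimodal   = modes(1, 2, 3, 3);   # a single winner
my @multimodal = modes(1, 2, 3);      # a three-way tie

print "unimodal: @unimodal\n";        # unimodal: 3
print "multimodal: @multimodal\n";    # multimodal: 1 2 3
```

       A tie leaves more than one value, which is why a multimodal result
       cannot sensibly stand in for a single number.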
           my $lsf = leastsquarefit($v1, $v2);

       This $lsf will interpolate fine, showing "LSF( alpha: $alpha, beta:
       $beta )", but it will "croak" if you try to use the object as a number.

           my $v3             = $multimodal->query;
           my ($alpha, $beta) = $lsf->query;
           my $average        = $mean->query;

       All of the objects allow you to explicitly query, if you're not in the
       mood to use overload.

           my @answers = (
               $unimodal->query,
               $median->query,
               $stddev->query,
           );

SHORTCUTS

       The following shortcut functions can be used in place of calling the
       module's "new()" method directly.

       They all take either array refs or lists as arguments, with the
       exception of the shortcuts that need two vectors to process (e.g.
       Statistics::Basic::Correlation).

       vector()
           Returns a Statistics::Basic::Vector object.  Arguments to
           "vector()" can be any of: an array ref, a list of numbers, or a
           blessed vector object.  If passed a blessed vector object, vector
           will just return the vector passed in.

       mean() average() avg()
           Returns a Statistics::Basic::Mean object.  You can choose to call
           "mean()" as "average()" or "avg()".  Arguments can be any of: an
           array ref, a list of numbers, or a blessed vector object.

       median()
           Returns a Statistics::Basic::Median object.  Arguments can be any
           of: an array ref, a list of numbers, or a blessed vector object.

       mode()
           Returns a Statistics::Basic::Mode object.  Arguments can be any of:
           an array ref, a list of numbers, or a blessed vector object.

       variance() var()
           Returns a Statistics::Basic::Variance object.  You can choose to
           call "variance()" as "var()".  Arguments can be any of: an array
           ref, a list of numbers, or a blessed vector object.  If you will
           also be calculating the mean of the same list of numbers, it's
           recommended to do this:

               my $vec  = vector(1,2,3);
               my $mean = mean($vec);
               my $var  = variance($vec);

           This would also work:

               my $mean = mean(1,2,3);
               my $var  = variance($mean->query_vector);

           This will calculate the same mean twice:

               my $mean = mean(1,2,3);
               my $var  = variance(1,2,3);

           If you really only need the variance, ignore the above; this is
           fine:

               my $variance = variance(1,2,3,4,5);

       stddev()
           Returns a Statistics::Basic::StdDev object.  Arguments can be any
           of: an array ref, a list of numbers, or a blessed vector object.
           Pass a vector object to "stddev()" to avoid recalculating the
           variance and mean if applicable (see "variance()").

       covariance() cov()
           Returns a Statistics::Basic::Covariance object.  Arguments to
           "covariance()" or "cov()" must be array refs or vector objects.
           There must be precisely two arguments (or none, setting the vectors
           to two empty ones), and they must be the same length.

       correlation() cor() corr()
           Returns a Statistics::Basic::Correlation object.  Arguments to
           "correlation()" or "cor()"/"corr()" must be array refs or vector
           objects.  There must be precisely two arguments (or none, setting
           the vectors to two empty ones), and they must be the same length.

       leastsquarefit() LSF() lsf()
           Returns a Statistics::Basic::LeastSquareFit object.  Arguments to
           "leastsquarefit()" or "lsf()"/"LSF()" must be array refs or vector
           objects.  There must be precisely two arguments (or none, setting
           the vectors to two empty ones), and they must be the same length.

       computed()
           Returns a Statistics::Basic::ComputedVector object.  Argument must
           be a blessed vector object.  See the section on "COMPUTED VECTORS"
           for more information on this.

       handle_missing_values() handle_missing()
           Returns two Statistics::Basic::ComputedVector objects.  Arguments
           to this function should be two vector arguments.  See the section
           on "MISSING VALUES" for further information on this function.
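       The two-vector shortcuts need equal-length inputs because they pair
       elements up by position.  The following pure-Perl sketch shows the
       textbook population covariance and Pearson correlation under that
       assumption.  The helper names (mean_of, covariance_of, correlation_of)
       are hypothetical, and the sketch does not claim to mirror the module's
       internals.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean_of { return sum(@_) / @_ }

# Population covariance of two equal-length array refs.
sub covariance_of {
    my ($x, $y) = @_;
    die "vectors must be the same length\n" unless @$x == @$y;

    my $mean_x = mean_of(@$x);
    my $mean_y = mean_of(@$y);

    return sum(map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $#$x) / @$x;
}

# Pearson correlation: covariance divided by the product of the
# two standard deviations (the variance is the covariance of a
# vector with itself).
sub correlation_of {
    my ($x, $y) = @_;
    my $sd_x = sqrt(covariance_of($x, $x));
    my $sd_y = sqrt(covariance_of($y, $y));
    return covariance_of($x, $y) / ($sd_x * $sd_y);
}

my $r = correlation_of([1, 2, 3, 7], [3, 4, 5, 7]);
printf "r=%.2f\n", $r;   # r=0.98
```

       Note that the sketch divides by zero when either input is constant; a
       real implementation has to decide what to return in that case.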

COMPUTED VECTORS

       Sometimes it will be handy to have a vector computed from another (or
       at least one that updates based on the first).  Consider the case of
       outliers:

           my @a = ( (1,2,3) x 7, 15 );
           my @b = ( (1,2,3) x 7 );

           my $v1 = vector(@a);
           my $v2 = vector(@b);
           my $v3 = computed($v1);
              $v3->set_filter(sub {
                  my $m = mean($v1);
                  my $s = stddev($v1);

                  grep { abs($_-$m) <= $s } @_;
              });

       This filter keeps $v3 equal to $v1 with every element that differs
       from the mean by more than one standard deviation removed.  As such,
       "$v2" eq "$v3" since 15 is clearly an outlier by inspection.

           print "$v1\n";
           print "$v3\n";

       ... prints:

           [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 15]
           [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
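       The arithmetic behind that filter can be checked by hand.  This
       pure-Perl sketch recomputes the mean and the population standard
       deviation for the same data and applies the same grep condition; it
       uses plain arrays rather than vector objects, so it only illustrates
       the filter logic.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @a = ( (1, 2, 3) x 7, 15 );

my $mean   = sum(@a) / @a;
my $stddev = sqrt( sum(map { ($_ - $mean) ** 2 } @a) / @a );   # population form

# Keep only the values within one standard deviation of the mean.
my @kept = grep { abs($_ - $mean) <= $stddev } @a;

printf "kept %d of %d values\n", scalar @kept, scalar @a;   # kept 21 of 22 values
```

       The mean is about 2.59 and the standard deviation about 2.82, so 15
       is the only value outside the band.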

MISSING VALUES

       Something I get asked about quite a lot is, "can S::B handle missing
       values?"  The answer used to be, "that really depends on your data set,
       use grep," but I recently decided (5/29/09) that it was time to just go
       ahead and add this feature.

       Strictly speaking, the feature was already there.  You simply need to
       add a couple of filters to your data.  See "t/75_filtered_missings.t"
       for the test example.

       This is what people usually mean when they ask if S::B can "handle"
       missing data:

           my $v1 = vector(1,2,3,undef,4);
           my $v2 = vector(1,2,3,4, undef);
           my $v3 = computed($v1);
           my $v4 = computed($v2);

           $v3->set_filter(sub {
               my @v = $v2->query;
               map {$_[$_]} grep { defined $v[$_] and defined $_[$_] } 0 .. $#_;
           });

           $v4->set_filter(sub {
               my @v = $v1->query;
               map {$_[$_]} grep { defined $v[$_] and defined $_[$_] } 0 .. $#_;
           });

           print "$v1 $v2\n"; # prints: [1, 2, 3, _, 4] [1, 2, 3, 4, _]
           print "$v3 $v4\n"; # prints: [1, 2, 3] [1, 2, 3]

       But I've made it even simpler.  Since this is such a common request, I
       have provided a helper function to build the filters automatically:

           my $v1 = vector(1,2,3,undef,4);
           my $v2 = vector(1,2,3,4, undef);

           my ($f1, $f2) = handle_missing_values($v1, $v2);

           print "$f1 $f2\n"; # prints: [1, 2, 3] [1, 2, 3]

       Note that in practice, you would still manipulate (insert and shift)
       $v1 and $v2, not the computed vectors.  But for correlations and the
       like, you would use $f1 and $f2.

           $v1->insert(5);
           $v2->insert(6);

           my $correlation = correlation($f1, $f2);

       You can still insert on $f1 and $f2, but it updates the input vector
       rather than the computed one (which is just a filter handler).
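       The filters above amount to pairwise deletion: a position is kept only
       when both vectors have a defined value there.  Here is the same idea
       as a minimal pure-Perl sketch on plain arrays, with no vector objects
       involved.

```perl
use strict;
use warnings;

my @x = (1, 2, 3, undef, 4);
my @y = (1, 2, 3, 4, undef);

# Indices where both series have a defined value.
my @keep = grep { defined $x[$_] and defined $y[$_] } 0 .. $#x;

# Array slices pick out the surviving pairs, still aligned by position.
my @fx = @x[@keep];
my @fy = @y[@keep];

print "[@fx] [@fy]\n";   # prints: [1 2 3] [1 2 3]
```

       Filtering both sides by the same index list is what keeps the pairs
       aligned, which matters for covariance and correlation.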

REUSE DETAILS

       Most of the objects have a variety of query functions that allow you to
       extract the objects used within.  The objects are smart enough to
       prevent needless duplication, though.  That is, the following test
       would pass:

           use Statistics::Basic qw(:all);

           my $v1 = vector(1,2,3,4,5);
           my $v2 = vector($v1);
           my $sd = stddev( $v1 );
           my $v3 = $sd->query_vector;
           my $m1 = mean( $v1 );
           my $m2 = $sd->query_mean;
           my $m3 = Statistics::Basic::Mean->new( $v1 );
           my $v4 = $m3->query_vector;

           use Scalar::Util qw(refaddr);
           use Test; plan tests => 5;

           ok( refaddr($v1), refaddr($v2) );
           ok( refaddr($v2), refaddr($v3) );
           ok( refaddr($m1), refaddr($m2) );
           ok( refaddr($m2), refaddr($m3) );
           ok( refaddr($v3), refaddr($v4) );

           # this is t/54_* in the distribution

       Also, note that the mean is only calculated once even though we've
       calculated a variance and a standard deviation above.

       Suppose you'd like a copy of the Statistics::Basic::Variance object
       that the Statistics::Basic::StdDev object is using.  All of the objects
       within should be accessible with query functions as follows.
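       One way to picture this kind of reuse is a cache keyed on the address
       of the underlying vector.  The sketch below is an assumption made for
       illustration only: the names %mean_for and cached_mean() are
       hypothetical, a plain array ref stands in for a vector object, and the
       module's real bookkeeping may differ.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);
use List::Util qw(sum);

my %mean_for;            # hypothetical cache: vector address => computed mean
my $computations = 0;    # counts how often the mean is really calculated

sub cached_mean {
    my ($vector) = @_;   # an array ref standing in for a vector object
    my $key = refaddr($vector);

    unless (exists $mean_for{$key}) {
        $computations++;
        $mean_for{$key} = sum(@$vector) / @$vector;
    }

    return $mean_for{$key};
}

my $v  = [1, 2, 3, 4, 5];
my $m1 = cached_mean($v);
my $m2 = cached_mean($v);   # served from the cache

print "mean=$m1, computed $computations time(s)\n";   # mean=3, computed 1 time(s)
```

       A cache like this must be invalidated whenever the vector changes; the
       sketch leaves that out.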

QUERY FUNCTIONS

       query()
           This method exists in all of the objects.
           Statistics::Basic::LeastSquareFit is the only one that returns two
           values (alpha and beta) as a list.  Statistics::Basic::Vector
           returns either the list of elements in the vector, or a reference
           to that array (depending on the context).  All of the other
           "query()" methods return a single number, the number the module
           purports to calculate.

       query_mean()
           Returns the Statistics::Basic::Mean object used by
           Statistics::Basic::Variance and Statistics::Basic::StdDev.

       query_mean1()
           Returns the first Statistics::Basic::Mean object used by
           Statistics::Basic::Covariance, Statistics::Basic::Correlation and
           Statistics::Basic::LeastSquareFit.

       query_mean2()
           Returns the second Statistics::Basic::Mean object used by
           Statistics::Basic::Covariance and Statistics::Basic::Correlation.

       query_covariance()
           Returns the Statistics::Basic::Covariance object used by
           Statistics::Basic::Correlation and
           Statistics::Basic::LeastSquareFit.

       query_variance()
           Returns the Statistics::Basic::Variance object used by
           Statistics::Basic::StdDev.

       query_variance1()
           Returns the first Statistics::Basic::Variance object used by
           Statistics::Basic::LeastSquareFit.

       query_vector()
           Returns the Statistics::Basic::Vector object used by any of the
           single vector modules.

       query_vector1()
           Returns the first Statistics::Basic::Vector object used by any of
           the two vector modules.

       query_vector2()
           Returns the second Statistics::Basic::Vector object used by any of
           the two vector modules.

       is_multimodal()
           Statistics::Basic::Mode objects sometimes return
           Statistics::Basic::Vector objects instead of numbers.  When
           "is_multimodal()" is true, the mode is a vector, not a scalar.

       y_given_x()
           Statistics::Basic::LeastSquareFit is meant for finding a line of
           best fit.  This function can be used to find the "y" for a given
           "x" based on the calculated $beta (slope) and $alpha (y-offset).

       x_given_y()
           Statistics::Basic::LeastSquareFit is meant for finding a line of
           best fit.  This function can be used to find the "x" for a given
           "y" based on the calculated $beta (slope) and $alpha (y-offset).

           This function can produce divide-by-zero errors since it must
           divide by the slope to find the "x" value.  (The slope should
           rarely be zero though; that's a horizontal line and would
           represent very odd data points.)
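       For readers who want to see where the divide-by-zero in "x_given_y()"
       comes from, here is a pure-Perl sketch of an ordinary least-squares
       line y = alpha + beta * x.  The function names fit_line(), y_for_x()
       and x_for_y() are hypothetical stand-ins, and the formulas are the
       textbook ones rather than a copy of the module's code.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Return (alpha, beta) for the best-fit line through the paired points.
sub fit_line {
    my ($x, $y) = @_;
    my $n      = @$x;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;

    my $sxy = sum(map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $n - 1);
    my $sxx = sum(map { ($_ - $mean_x) ** 2 } @$x);

    my $beta  = $sxy / $sxx;                  # slope
    my $alpha = $mean_y - $beta * $mean_x;    # y-offset
    return ($alpha, $beta);
}

sub y_for_x {
    my ($alpha, $beta, $x) = @_;
    return $alpha + $beta * $x;
}

sub x_for_y {
    my ($alpha, $beta, $y) = @_;
    die "slope is zero; x is undetermined\n" if $beta == 0;
    return ($y - $alpha) / $beta;             # the division that can fail
}

my ($alpha, $beta) = fit_line([1, 2, 3, 4], [3, 5, 7, 9]);   # points on y = 1 + 2x

printf "alpha=%.2f beta=%.2f\n", $alpha, $beta;              # alpha=1.00 beta=2.00
printf "y(5)=%.2f x(11)=%.2f\n",
    y_for_x($alpha, $beta, 5), x_for_y($alpha, $beta, 11);   # y(5)=11.00 x(11)=5.00
```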

INSERT and SET FUNCTIONS

       These objects are all intended to be useful while processing long
       columns of data, like data you'd find in a database.

       insert()
           Vectors try to stay the same size when they accept new elements,
           FIFO style.

               my $v1 = vector(1,2,3); # a 3-tuple
                  $v1->insert(4); # still a 3-tuple

               print "$v1\n"; # prints: [2, 3, 4]

               $v1->insert(7); # still a 3-tuple
               print "$v1\n"; # prints: [3, 4, 7]

           All of the other Statistics::Basic modules have this function too.
           The modules that track two vectors will need two arguments to
           insert though.

               my $mean = mean([1,2,3]);
                  $mean->insert(4);

               print "mean: $mean\n"; # prints 3 ... (2+3+4)/3

               my $correlation = correlation($mean->query_vector,
                   $mean->query_vector->copy);

               print "correlation: $correlation\n"; # 1

               $correlation->insert(3,4);
               print "correlation: $correlation\n"; # 0.5

           Also, note that the underlying vectors keep track of recalculating
           automatically.

               my $v = vector(1,2,3);
               my $m = mean($v);
               my $s = stddev($v);

           The mean has not been calculated yet.

               print "$s; $m\n"; # 0.82; 2

           The mean has been calculated once (even though the
           Statistics::Basic::StdDev uses it).

               $v->insert(4); print "$s; $m\n"; # 0.82; 3
               $m->insert(5); print "$s; $m\n"; # 0.82; 4
               $s->insert(6); print "$s; $m\n"; # 0.82; 5

           The mean has been calculated thrice more and only thrice more.

       append() ginsert()
           You can grow the vectors instead of sliding them (FIFO). For this,
           use "append()" (or "ginsert()", same thing).

               my $v = vector(1,2,3);
               my $m = mean($v);
               my $s = stddev($v);

               $v->append(4); print "$s; $m\n"; # 1.12; 2.5
               $m->append(5); print "$s; $m\n"; # 1.41; 3
               $s->append(6); print "$s; $m\n"; # 1.71; 3.5

               print "$v\n"; # [1, 2, 3, 4, 5, 6]
               print "$s\n"; # 1.71

           Of course, with a correlation, or a covariance, it'd look more like
           this:

               my $c = correlation([1,2,3], [3,4,5]);
                  $c->append(7,7);

               print "c=$c\n"; # c=0.98

       set_vector()
           This allows you to set the vector to a known state.  It takes
           either array refs or vector objects.

               my $v1 = vector(1,2,3);
               my $v2 = $v1->copy;
                  $v2->set_vector([4,5,6]);

               my $m = mean();

               $m->set_vector([1,2,3]);
               $m->set_vector($v2);

               my $c = correlation();

               $c->set_vector($v1,$v2);
               $c->set_vector([1,2,3], [4,5,6]);

       set_size()
           This sets the size of the vector.  When the vector is made bigger,
           the vector is filled to the new length with leading zeros (i.e.,
           they are the first to be kicked out after new "insert()"s).

               my $v = vector(1,2,3);
                  $v->set_size(7);

               print "$v\n"; # [0, 0, 0, 0, 1, 2, 3]

               my $m = mean();
                  $m->set_size(7);

               print "", $m->query_vector, "\n";
                # [0, 0, 0, 0, 0, 0, 0]

               my $c = correlation([3],[3]);
                  $c->set_size(7);

               print "", $c->query_vector1, "\n";
               print "", $c->query_vector2, "\n";
                # [0, 0, 0, 0, 0, 0, 3]
                # [0, 0, 0, 0, 0, 0, 3]
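       The difference between sliding, growing and resizing is easy to model
       with a plain array.  This sketch assumes the behaviour described above
       (FIFO insert, growing append, zero fill on the first-out side); the
       helper names fifo_insert(), grow_append() and grow_to() are
       hypothetical and are not methods of the module.

```perl
use strict;
use warnings;

# Sliding insert: add each new value, then drop the oldest so the size is unchanged.
sub fifo_insert {
    my ($v, @new) = @_;
    for my $e (@new) {
        push @$v, $e;
        shift @$v;
    }
    return;
}

# Growing append: just add the new values.
sub grow_append {
    my ($v, @new) = @_;
    push @$v, @new;
    return;
}

# Grow to a given size, padding with zeros on the first-out (left) side.
sub grow_to {
    my ($v, $size) = @_;
    unshift @$v, 0 while @$v < $size;
    return;
}

my @v = (1, 2, 3);

fifo_insert(\@v, 4);  print "[@v]\n";   # [2 3 4]
grow_append(\@v, 7);  print "[@v]\n";   # [2 3 4 7]
grow_to(\@v, 6);      print "[@v]\n";   # [0 0 2 3 4 7]
```

       Because the zeros sit at the front, they are the first values pushed
       out by later sliding inserts.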

OPTIONS

       Each of the following options can be specified on package import like
       this.

           use Statistics::Basic qw(unbias=0); # start with unbias disabled
           use Statistics::Basic qw(unbias=1); # start with unbias enabled

       When specified on import without a value, each option has a default.

           use Statistics::Basic qw(unbias); # start with unbias enabled
           use Statistics::Basic qw(nofill); # start with nofill enabled
           use Statistics::Basic qw(toler);  # start with toler disabled
           use Statistics::Basic qw(ipres);  # start with ipres=2

       Additionally, with the exception of "ignore_env", they can all be
       accessed via package variables of the same name in all upper case.
       Example:

           # code code code

           $Statistics::Basic::UNBIAS = 0; # turn UNBIAS off

           # code code code

           $Statistics::Basic::UNBIAS = 1; # turn it back on

           # code code code

           {
               local $Statistics::Basic::DEBUG_STATS_B = 1; # debug, this block only
           }

       Special caveat: "toler" can in fact be changed via the package var
       (e.g., "$Statistics::Basic::TOLER=0.0001").  But, for speed reasons, it
       must be defined before any other packages are imported or it will not
       actually do anything when changed.

       unbias
           This module uses the sum((X - mean(X))^2)/N definition of
           variance.

           If you wish to use the unbiased, sum((X - mean(X))^2)/(N-1)
           definition, then set $Statistics::Basic::UNBIAS true (possibly
           with "use Statistics::Basic qw(unbias)").

           This can be changed at any time with the package variable or at
           compile time.

           This feature was requested by "Robert McGehee
           <xxxxxxxx@wso.williams.edu>".

           [NOTE 2008-11-06:
           <http://cpanratings.perl.org/dist/Statistics-Basic>, this can also
           be called "population (n)" vs "sample (n-1)" and is indeed fully
           addressed right here!]

       ipres
           "ipres" defaults to 2.  It is passed to Number::Format as the
           second argument to format_number() during string interpolation
           (see: overload).

       toler
           When set, $Statistics::Basic::TOLER (which is not enabled by
           default) instructs the stats objects to test true when within some
           tolerable range, pretty much like this:

               sub is_equal {
                   return abs($_[0]-$_[1])<$Statistics::Basic::TOLER
                       if defined($Statistics::Basic::TOLER);

                   return $_[0] == $_[1];
               }

           For performance reasons, this must be defined before the import of
           any other Statistics::Basic modules or the modules will fail to
           overload the "==" operator.

           $Statistics::Basic::TOLER totally disabled:

               use Statistics::Basic qw(:all toler);

           $Statistics::Basic::TOLER disabled, but changeable:

               use Statistics::Basic qw(:all toler=0);

               $Statistics::Basic::TOLER = 0.000_001;

           You can change the tolerance at runtime, but it must be set (or
           unset) at compile time before the packages load.

       nofill
           Normally when you set the size of a vector it automatically fills
           with zeros on the first-out side of the vector.  You can disable
           the autofilling with this option.  It can be changed at any time.

       debug
           Enable debugging with "use Statistics::Basic qw(debug)" or set a
           specific level (including 0 to disable) with "use
           Statistics::Basic qw(debug=2)".

           This is also accessible at runtime using
           $Statistics::Basic::DEBUG_STATS_B and can be switched on and off at
           any time.

       ignore_env
           Normally the defaults for these options can be changed in the
           environment of the program.  Example:

               UNBIAS=1 perl ./myprog.pl

           This does the same thing as "$Statistics::Basic::UNBIAS=1" or "use
           Statistics::Basic qw(unbias)" unless you disable the %ENV checking
           with this option.

               use Statistics::Basic qw(ignore_env);
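       The effect of "unbias" is just a change of divisor.  This pure-Perl
       sketch computes both forms for the same data; variance_of() and its
       $unbias flag are hypothetical names used for illustration and do not
       touch $Statistics::Basic::UNBIAS.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Variance with a switchable divisor: N (population) or N-1 (sample).
sub variance_of {
    my ($values, $unbias) = @_;
    my $n    = @$values;
    my $mean = sum(@$values) / $n;
    my $ss   = sum(map { ($_ - $mean) ** 2 } @$values);   # sum of squared deviations

    return $ss / ($unbias ? $n - 1 : $n);
}

my @data = (1, 2, 3, 4, 5);

printf "population: %.2f\n", variance_of(\@data, 0);   # population: 2.00
printf "sample:     %.2f\n", variance_of(\@data, 1);   # sample:     2.50
```

       The two results converge as N grows, so the choice matters most for
       short vectors.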

ENVIRONMENT VARIABLES

       You can change the defaults (assuming ignore_env is not used) from your
       bash prompt.  Example:

           DEBUG_STATS_B=1 perl ./myprog.pl

       $ENV{DEBUG_STATS_B}
           Sets the default value of "debug".

       $ENV{UNBIAS}
           Sets the default value of "unbias".

       $ENV{NOFILL}
           Sets the default value of "nofill".

       $ENV{IPRES}
           Sets the default value of "ipres".

       $ENV{TOLER}
           Sets the default value of "toler".

OVERLOADS

       All of the objects are true in numeric context.  All of the objects
       print useful strings when evaluated as a string.  Most of the objects
       evaluate usefully as numbers, although Statistics::Basic::Vector
       objects, Statistics::Basic::ComputedVector objects, and
       Statistics::Basic::LeastSquareFit objects do not -- they instead raise
       an error.
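       For readers unfamiliar with the overload pragma, here is a minimal
       sketch of an object that behaves as a string, as a number, and with a
       tolerant "==".  The class My::Stat and its $TOLER variable are
       hypothetical, and sprintf stands in for Number::Format; none of this
       is taken from the module's source.

```perl
package My::Stat;   # hypothetical class, not part of Statistics::Basic
use strict;
use warnings;

our $TOLER = 0.01;  # hypothetical tolerance used by the == overload

use overload
    '""' => sub { sprintf '%.2f', $_[0]{value} },              # string form
    '0+' => sub { $_[0]{value} },                               # numeric form
    '==' => sub { abs($_[0]{value} - (0 + $_[1])) < $TOLER },   # tolerant equality
    fallback => 1;                                              # derive other operators

sub new {
    my ($class, $value) = @_;
    return bless { value => $value }, $class;
}

package main;

my $stat = My::Stat->new(2 / 3);

print "as a string: $stat\n";               # as a string: 0.67
print "as a number: ", $stat + 1, "\n";     # numeric value plus one
print "close enough\n" if $stat == 0.667;   # within the tolerance
```

       An object that should refuse numeric use could instead die (or croak)
       inside its '0+' handler.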

Author's note on Statistics::Descriptive

       I've been asked a couple times now why I don't link to
       Statistics::Descriptive in my see also section.  As a rule, I only link
       to packages there that I think are related or that I actually used in
       the package construction.  I've never personally used Descriptive, but
       it surely seems to do quite a lot more.  In a sense, this package
       really doesn't do statistics, not like a scientist would think about it
       anyway.  So I always figured people could find their own way to
       Descriptive anyway.

       The one thing this package does do, that I don't think Descriptive does
       (correct me if I'm wrong) is time difference computations.  If there
       are say, 200 things in the mean object, then after inserting (using
       this package) there'll still be 200 things, allowing the computation of
       a moving average, moving stddev, moving correlation, etc.  You might
       argue that this is rarely needed, but it is really the only time I need
       to compute these things.

           while( $data = $fetch_sth->fetchrow_arrayref ) {
               $mean->insert($data);
               $moving_avg_sth->execute(0 + $mean);
           }

       Since I opened the topic I'd also like to mention that I find this
       package easier to use.  That is a matter of taste and since I wrote
       this, you might say I'm a little biased.  Your mileage may vary.

AUTHOR

       Paul Miller "<jettero@cpan.org>"

       I am using this software in my own projects...  If you find bugs,
       please please please let me know. :) Actually, let me know if you find
       it handy at all.  Half the fun of releasing this stuff is knowing that
       people use it.

       Copyright 2012 Paul Miller -- Licensed under the LGPL version 2.

SEE ALSO

       perl(1), Number::Format, overload, Statistics::Basic::Vector,
       Statistics::Basic::ComputedVector, Statistics::Basic::_OneVectorBase,
       Statistics::Basic::Mean, Statistics::Basic::Median,
       Statistics::Basic::Mode, Statistics::Basic::Variance,
       Statistics::Basic::StdDev, Statistics::Basic::_TwoVectorBase,
       Statistics::Basic::Correlation, Statistics::Basic::Covariance,
       Statistics::Basic::LeastSquareFit

perl v5.32.0                      2020-07-28              Statistics::Basic(3)