Statistics::Basic(3)      User Contributed Perl Documentation      Statistics::Basic(3)
2
3
4
6 Statistics::Basic - A collection of very basic statistics modules
7
9 use Statistics::Basic qw(:all);
10
The imported functions return objects, not plain numbers. The objects
interpolate as nicely formatted numbers (using Number::Format), and
yield the actual number when used in numeric context.
14
15 my $median = median( 1,2,3 );
16 my $mean = mean( [1,2,3]); # array refs are ok too
17
18 my $variance = variance( 1,2,3 );
19 my $stddev = stddev( 1,2,3 );
20
21 Although passing unblessed numbers and array refs to these functions
22 works, it's sometimes better to pass vector objects so the objects can
23 reuse calculated values.
24
25 my $v1 = $mean->query_vector;
26 my $variance = variance( $v1 );
27 my $stddev = stddev( $v1 );
28
29 Here, the mean used by the variance and the variance used by the
30 standard deviation will not need to be recalculated. Now consider
31 these two calculations.
32
33 my $covariance = covariance( [1 .. 3], [1 .. 3] );
34 my $correlation = correlation( [1 .. 3], [1 .. 3] );
35
36 The covariance above would need to be recalculated by the correlation
37 when these functions are called this way. But, if we instead built
38 vectors first, that wouldn't happen:
39
40 # $v1 is defined above
41 my $v2 = vector(1,2,3);
42 my $cov = covariance( $v1, $v2 );
43 my $cor = correlation( $v1, $v2 );
44
Now $cor can reuse the covariance calculated in $cov.
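To see why that reuse is possible, here is a minimal pure-Perl sketch of how a correlation can be derived from a covariance that has already been computed. It is an illustration only, not the module's implementation; the helper names mean_of, covariance_of and correlation_from are hypothetical, and the population (divide by N) form is assumed.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean of a list of numbers.
sub mean_of { return sum(@_) / @_ }

# Population covariance of two equal-length array refs.
sub covariance_of {
    my ($x, $y) = @_;
    my ($mean_x, $mean_y) = (mean_of(@$x), mean_of(@$y));
    return sum( map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $#$x ) / @$x;
}

# Correlation built on top of an already computed covariance.
sub correlation_from {
    my ($cov, $x, $y) = @_;
    my $sd_x = sqrt( covariance_of($x, $x) );   # stddev of x
    my $sd_y = sqrt( covariance_of($y, $y) );   # stddev of y
    return $cov / ($sd_x * $sd_y);
}

my @x = (1, 2, 3);
my @y = (1, 2, 3);

my $cov = covariance_of(\@x, \@y);            # computed once
my $cor = correlation_from($cov, \@x, \@y);   # reuses $cov

printf "cov=%.2f cor=%.2f\n", $cov, $cor;     # cov=0.67 cor=1.00
```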
46
47 All of the functions above return objects that interpolate or evaluate
48 as a single string or as a number. Statistics::Basic::LeastSquareFit
49 and Statistics::Basic::Mode are different:
50
51 my $unimodal = mode(1,2,3,3);
52 my $multimodal = mode(1,2,3);
53
54 print "The modes are: $unimodal and $multimodal.\n";
55 print "The first is multimodal... " if $unimodal->is_multimodal;
56 print "The second is multimodal.\n" if $multimodal->is_multimodal;
57
58 In the first case, $unimodal will interpolate as a string and function
59 correctly as a number. However, in the second case, trying to use
60 $multimodal as a number will "croak" an error -- it still interpolates
61 fine though.
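The difference between the two cases can be sketched in plain Perl. This is only an illustration of what "multimodal" means here, not the module's code; modes_of is a hypothetical helper.

```perl
use strict;
use warnings;

# Return every value that occurs the maximum number of times.
sub modes_of {
    my %count;
    $count{$_}++ for @_;
    my ($max) = sort { $b <=> $a } values %count;
    return sort { $a <=> $b } grep { $count{$_} == $max } keys %count;
}

my @unimodal   = modes_of(1, 2, 3, 3);   # one winner: 3
my @multimodal = modes_of(1, 2, 3);      # three-way tie: 1, 2, 3

print "unimodal: @unimodal\n";
print "multimodal: @multimodal\n";
print "the second list has more than one mode\n" if @multimodal > 1;
```

A single mode is a sensible number; a tie is a list, which is why using it as a number cannot work.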
62
63 my $lsf = leastsquarefit($v1, $v2);
64
65 This $lsf will interpolate fine, showing "LSF( alpha: $alpha, beta:
66 $beta )", but it will "croak" if you try to use the object as a number.
67
68 my $v3 = $multimodal->query;
69 my ($alpha, $beta) = $lsf->query;
70 my $average = $mean->query;
71
72 All of the objects allow you to explicitly query, if you're not in the
73 mood to use overload.
74
75 my @answers = (
76 $mode->query,
77 $median->query,
78 $stddev->query,
79 );
80
82 The following shortcut functions can be used in place of calling the
83 module's "new()" method directly.
84
85 They all take either array refs or lists as arguments, with the
86 exception of the shortcuts that need two vectors to process (e.g.
87 Statistics::Basic::Correlation).
88
89 vector()
90 Returns a Statistics::Basic::Vector object. Arguments to
91 "vector()" can be any of: an array ref, a list of numbers, or a
92 blessed vector object. If passed a blessed vector object, vector
93 will just return the vector passed in.
94
95 mean() average() avg()
96 Returns a Statistics::Basic::Mean object. You can choose to call
97 "mean()" as "average()" or "avg()". Arguments can be any of: an
98 array ref, a list of numbers, or a blessed vector object.
99
100 median()
101 Returns a Statistics::Basic::Median object. Arguments can be any
102 of: an array ref, a list of numbers, or a blessed vector object.
103
104 mode()
105 Returns a Statistics::Basic::Mode object. Arguments can be any of:
106 an array ref, a list of numbers, or a blessed vector object.
107
108 variance() var()
109 Returns a Statistics::Basic::Variance object. You can choose to
110 call "variance()" as "var()". Arguments can be any of: an array
111 ref, a list of numbers, or a blessed vector object. If you will
112 also be calculating the mean of the same list of numbers it's
113 recommended to do this:
114
115 my $vec = vector(1,2,3);
116 my $mean = mean($vec);
117 my $var = variance($vec);
118
119 This would also work:
120
121 my $mean = mean(1,2,3);
122 my $var = variance($mean->query_vector);
123
124 This will calculate the same mean twice:
125
126 my $mean = mean(1,2,3);
127 my $var = variance(1,2,3);
128
129 If you really only need the variance, ignore the above and this is
130 fine:
131
132 my $variance = variance(1,2,3,4,5);
133
134 stddev()
135 Returns a Statistics::Basic::StdDev object. Arguments can be any
136 of: an array ref, a list of numbers, or a blessed vector object.
137 Pass a vector object to "stddev()" to avoid recalculating the
138 variance and mean if applicable (see "variance()").
139
covariance() cov()
    Returns a Statistics::Basic::Covariance object. Arguments to
    "covariance()" or "cov()" must be array refs or vector objects.
    There must be precisely two arguments (or none, setting the vectors
    to two empty ones), and they must be the same length.

correlation() cor() corr()
    Returns a Statistics::Basic::Correlation object. Arguments to
    "correlation()" or "cor()"/"corr()" must be array refs or vector
    objects. There must be precisely two arguments (or none, setting
    the vectors to two empty ones), and they must be the same length.

leastsquarefit() LSF() lsf()
    Returns a Statistics::Basic::LeastSquareFit object. Arguments to
    "leastsquarefit()" or "lsf()"/"LSF()" must be array refs or vector
    objects. There must be precisely two arguments (or none, setting
    the vectors to two empty ones), and they must be the same length.
157
158 computed()
159 Returns a Statistics::Basic::ComputedVector object. Argument must
160 be a blessed vector object. See the section on "COMPUTED VECTORS"
161 for more information on this.
162
163 handle_missing_values() handle_missing()
164 Returns two Statistics::Basic::ComputedVector objects. Arguments
165 to this function should be two vector arguments. See the section
166 on "MISSING VALUES" for further information on this function.
167
COMPUTED VECTORS
169 Sometimes it will be handy to have a vector computed from another (or
170 at least that updates based on the first). Consider the case of
171 outliers:
172
    my @a = ( (1,2,3) x 7, 15 );
    my @b = ( (1,2,3) x 7 );

    my $v1 = vector(@a);
    my $v2 = vector(@b);
    my $v3 = computed($v1);
       $v3->set_filter(sub {
           my $m = mean($v1);
           my $s = stddev($v1);

           grep { abs($_-$m) <= $s } @_;
       });
185
This filter keeps $v3 equal to $v1 minus every element that differs
from the mean by more than one standard deviation. As such,
"$v2" eq "$v3", since 15 is clearly an outlier by inspection.
190
191 print "$v1\n";
192 print "$v3\n";
193
194 ... prints:
195
196 [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 15]
197 [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
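The arithmetic behind that filter can be checked without the module. The following is a sketch in plain Perl, assuming the default population (divide by N) standard deviation; the variable names are illustrative only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @a = ( (1, 2, 3) x 7, 15 );

my $mean   = sum(@a) / @a;
my $stddev = sqrt( sum( map { ($_ - $mean) ** 2 } @a ) / @a );   # population form

# Keep only the elements within one standard deviation of the mean.
my @kept = grep { abs($_ - $mean) <= $stddev } @a;

# prints: mean=2.59 stddev=2.82 kept=21 of 22
printf "mean=%.2f stddev=%.2f kept=%d of %d\n",
    $mean, $stddev, scalar @kept, scalar @a;
```

Only 15 lies more than 2.82 away from 2.59, so it is the single element dropped.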
198
MISSING VALUES
200 Something I get asked about quite a lot is, "can S::B handle missing
201 values?" The answer used to be, "that really depends on your data set,
202 use grep," but I recently decided (5/29/09) that it was time to just go
203 ahead and add this feature.
204
205 Strictly speaking, the feature was already there. You simply need to
206 add a couple filters to your data. See "t/75_filtered_missings.t" for
207 the test example.
208
209 This is what people usually mean when they ask if S::B can "handle"
210 missing data:
211
212 my $v1 = vector(1,2,3,undef,4);
213 my $v2 = vector(1,2,3,4, undef);
214 my $v3 = computed($v1);
215 my $v4 = computed($v2);
216
    $v3->set_filter(sub {
        my @v = $v2->query;
        map {$_[$_]} grep { defined $v[$_] and defined $_[$_] } 0 .. $#_;
    });

    $v4->set_filter(sub {
        my @v = $v1->query;
        map {$_[$_]} grep { defined $v[$_] and defined $_[$_] } 0 .. $#_;
    });
226
227 print "$v1 $v2\n"; # prints: [1, 2, 3, _, 4] [1, 2, 3, 4, _]
228 print "$v3 $v4\n"; # prints: [1, 2, 3] [1, 2, 3]
229
230 But I've made it even simpler. Since this is such a common request, I
231 have provided a helper function to build the filters automatically:
232
233 my $v1 = vector(1,2,3,undef,4);
234 my $v2 = vector(1,2,3,4, undef);
235
236 my ($f1, $f2) = handle_missing_values($v1, $v2);
237
238 print "$f1 $f2\n"; # prints: [1, 2, 3] [1, 2, 3]
239
240 Note that in practice, you would still manipulate (insert, and shift)
241 $v1 and $v2, not the computed vectors. But for correlations and the
242 like, you would use $f1 and $f2.
243
244 $v1->insert(5);
245 $v2->insert(6);
246
247 my $correlation = correlation($f1, $f2);
248
249 You can still insert on $f1 and $f2, but it updates the input vector
250 rather than the computed one (which is just a filter handler).
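The pairing rule itself (keep a position only when both inputs are defined there) can be sketched on plain arrays. This is an illustration of the idea, not the helper's actual code; drop_missing_pairs is a hypothetical name.

```perl
use strict;
use warnings;

# Keep only the positions where both inputs are defined.
sub drop_missing_pairs {
    my ($x, $y) = @_;
    my @keep = grep { defined $x->[$_] && defined $y->[$_] } 0 .. $#$x;
    return ( [ @{$x}[@keep] ], [ @{$y}[@keep] ] );
}

my ($p1, $p2) = drop_missing_pairs( [1, 2, 3, undef, 4], [1, 2, 3, 4, undef] );

# prints: [1, 2, 3] [1, 2, 3]
print "[", join(", ", @$p1), "] [", join(", ", @$p2), "]\n";
```

Dropping whole positions rather than individual values keeps the two lists the same length, which the two-vector functions require.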
251
Most of the objects have a variety of query functions that allow you to
extract the objects used within. The objects are also smart enough to
prevent needless duplication. That is, the following test would pass:
257
258 use Statistics::Basic qw(:all);
259
260 my $v1 = vector(1,2,3,4,5);
261 my $v2 = vector($v1);
262 my $sd = stddev( $v1 );
263 my $v3 = $sd->query_vector;
264 my $m1 = mean( $v1 );
265 my $m2 = $sd->query_mean;
266 my $m3 = Statistics::Basic::Mean->new( $v1 );
267 my $v4 = $m3->query_vector;
268
269 use Scalar::Util qw(refaddr);
270 use Test; plan tests => 5;
271
272 ok( refaddr($v1), refaddr($v2) );
273 ok( refaddr($v2), refaddr($v3) );
274 ok( refaddr($m1), refaddr($m2) );
275 ok( refaddr($m2), refaddr($m3) );
276 ok( refaddr($v3), refaddr($v4) );
277
278 # this is t/54_* in the distribution
279
280 Also, note that the mean is only calculated once even though we've
281 calculated a variance and a standard deviation above.
282
283 Suppose you'd like a copy of the Statistics::Basic::Variance object
284 that the Statistics::Basic::StdDev object is using. All of the objects
285 within should be accessible with query functions as follows.
286
query()
    This method exists in all of the objects.
    Statistics::Basic::LeastSquareFit is the only one that returns two
    values (alpha and beta) as a list. Statistics::Basic::Vector
    returns either the list of elements in the vector or a reference to
    that array (depending on the context). All of the other "query()"
    methods return a single number, the number the module purports to
    calculate.
296
297 query_mean()
298 Returns the Statistics::Basic::Mean object used by
299 Statistics::Basic::Variance and Statistics::Basic::StdDev.
300
301 query_mean1()
302 Returns the first Statistics::Basic::Mean object used by
303 Statistics::Basic::Covariance, Statistics::Basic::Correlation and
304 Statistics::Basic::LeastSquareFit.
305
306 query_mean2()
307 Returns the second Statistics::Basic::Mean object used by
308 Statistics::Basic::Covariance, and Statistics::Basic::Correlation.
309
310 query_covariance()
311 Returns the Statistics::Basic::Covariance object used by
312 Statistics::Basic::Correlation and
313 Statistics::Basic::LeastSquareFit.
314
315 query_variance()
316 Returns the Statistics::Basic::Variance object used by
317 Statistics::Basic::StdDev.
318
319 query_variance1()
320 Returns the first Statistics::Basic::Variance object used by
321 Statistics::Basic::LeastSquareFit.
322
323 query_vector()
324 Returns the Statistics::Basic::Vector object used by any of the
325 single vector modules.
326
327 query_vector1()
328 Returns the first Statistics::Basic::Vector object used by any of
329 the two vector modules.
330
331 query_vector2()
332 Returns the second Statistics::Basic::Vector object used by any of
333 the two vector modules.
334
335 is_multimodal()
336 Statistics::Basic::Mode objects sometimes return
337 Statistics::Basic::Vector objects instead of numbers. When
338 "is_multimodal()" is true, the mode is a vector, not a scalar.
339
340 y_given_x()
341 Statistics::Basic::LeastSquareFit is meant for finding a line of
342 best fit. This function can be used to find the "y" for a given
343 "x" based on the calculated $beta (slope) and $alpha (y-offset).
344
345 x_given_y()
346 Statistics::Basic::LeastSquareFit is meant for finding a line of
347 best fit. This function can be used to find the "x" for a given
348 "y" based on the calculated $beta (slope) and $alpha (y-offset).
349
    This function can produce divide-by-zero errors since it must
    divide by the slope to find the "x" value. (The slope should
    rarely be zero though; that's a horizontal line and would represent
    very odd data points.)
354
356 These objects are all intended to be useful while processing long
357 columns of data, like data you'd find in a database.
358
359 insert()
360 Vectors try to stay the same size when they accept new elements,
361 FIFO style.
362
    my $v1 = vector(1,2,3); # a 3-tuple
       $v1->insert(4);      # still a 3-tuple

    print "$v1\n"; # prints: [2, 3, 4]

    $v1->insert(7); # still a 3-tuple
    print "$v1\n"; # prints: [3, 4, 7]
370
371 All of the other Statistics::Basic modules have this function too.
372 The modules that track two vectors will need two arguments to
373 insert though.
374
375 my $mean = mean([1,2,3]);
376 $mean->insert(4);
377
378 print "mean: $mean\n"; # prints 3 ... (2+3+4)/3
379
380 my $correlation = correlation($mean->query_vector,
381 $mean->query_vector->copy);
382
383 print "correlation: $correlation\n"; # 1
384
385 $correlation->insert(3,4);
386 print "correlation: $correlation\n"; # 0.5
387
388 Also, note that the underlying vectors keep track of recalculating
389 automatically.
390
391 my $v = vector(1,2,3);
392 my $m = mean($v);
393 my $s = stddev($v);
394
395 The mean has not been calculated yet.
396
397 print "$s; $m\n"; # 0.82; 2
398
399 The mean has been calculated once (even though the
400 Statistics::Basic::StdDev uses it).
401
    $v->insert(4); print "$s; $m\n"; # 0.82; 3
    $m->insert(5); print "$s; $m\n"; # 0.82; 4
    $s->insert(6); print "$s; $m\n"; # 0.82; 5
405
406 The mean has been calculated thrice more and only thrice more.
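The FIFO behaviour described for "insert()" amounts to a fixed-size sliding window. A minimal plain-Perl sketch of that idea follows; fifo_insert is a hypothetical helper, not the module's code.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Push each value onto a fixed-size window, dropping the oldest element.
sub fifo_insert {
    my ($window, @new) = @_;
    for my $value (@new) {
        push @$window, $value;
        shift @$window;
    }
    return $window;
}

my @window = (1, 2, 3);
for my $value (4, 5, 6) {
    fifo_insert(\@window, $value);
    printf "[%s] mean=%g\n", join(", ", @window), sum(@window) / @window;
}
# prints:
# [2, 3, 4] mean=3
# [3, 4, 5] mean=4
# [4, 5, 6] mean=5
```

The window length never changes, so each new value yields one moving-average step.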
407
408 append() ginsert()
409 You can grow the vectors instead of sliding them (FIFO). For this,
410 use "append()" (or "ginsert()", same thing).
411
412 my $v = vector(1,2,3);
413 my $m = mean($v);
414 my $s = stddev($v);
415
    $v->append(4); print "$s; $m\n"; # 1.12; 2.5
    $m->append(5); print "$s; $m\n"; # 1.41; 3
    $s->append(6); print "$s; $m\n"; # 1.71; 3.5
419
420 print "$v\n"; # [1, 2, 3, 4, 5, 6]
421 print "$s\n"; # 1.71
422
423 Of course, with a correlation, or a covariance, it'd look more like
424 this:
425
426 my $c = correlation([1,2,3], [3,4,5]);
427 $c->append(7,7);
428
429 print "c=$c\n"; # c=0.98
430
431 set_vector()
    This allows you to set the vector to a known state. It takes
    either array refs or vector objects.
434
435 my $v1 = vector(1,2,3);
436 my $v2 = $v1->copy;
437 $v2->set_vector([4,5,6]);
438
439 my $m = mean();
440
441 $m->set_vector([1,2,3]);
442 $m->set_vector($v2);
443
444 my $c = correlation();
445
446 $c->set_vector($v1,$v2);
447 $c->set_vector([1,2,3], [4,5,6]);
448
449 set_size()
    This sets the size of the vector. When the vector is made bigger,
    the vector is filled to the new length with leading zeros (i.e.,
    they are the first to be kicked out after new "insert()"s).
453
454 my $v = vector(1,2,3);
455 $v->set_size(7);
456
457 print "$v\n"; # [0, 0, 0, 0, 1, 2, 3]
458
459 my $m = mean();
460 $m->set_size(7);
461
462 print "", $m->query_vector, "\n";
463 # [0, 0, 0, 0, 0, 0, 0]
464
465 my $c = correlation([3],[3]);
466 $c->set_size(7);
467
468 print "", $c->query_vector1, "\n";
469 print "", $c->query_vector2, "\n";
470 # [0, 0, 0, 0, 0, 0, 3]
471 # [0, 0, 0, 0, 0, 0, 3]
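The zero-filling on growth can be pictured with a plain array. This sketch covers only the growing case described above; grow_window is a hypothetical helper and not part of the module.

```perl
use strict;
use warnings;

# Grow a window to $size by padding zeros on the first-out (front) side.
sub grow_window {
    my ($window, $size) = @_;
    unshift @$window, (0) x ($size - @$window) if $size > @$window;
    return $window;
}

my @w = (1, 2, 3);
grow_window(\@w, 7);

print "[", join(", ", @w), "]\n";   # prints: [0, 0, 0, 0, 1, 2, 3]
```

Because the padding sits at the front, FIFO-style inserts push the zeros out before any real data.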
472
474 Each of the following options can be specified on package import like
475 this.
476
477 use Statistics::Basic qw(unbias=0); # start with unbias disabled
478 use Statistics::Basic qw(unbias=1); # start with unbias enabled
479
480 When specified on import, each option has certain defaults.
481
482 use Statistics::Basic qw(unbias); # start with unbias enabled
483 use Statistics::Basic qw(nofill); # start with nofill enabled
484 use Statistics::Basic qw(toler); # start with toler disabled
485 use Statistics::Basic qw(ipres); # start with ipres=2
486
487 Additionally, with the exception of "ignore_env", they can all be
488 accessed via package variables of the same name in all upper case.
489 Example:
490
491 # code code code
492
493 $Statistics::Basic::UNBIAS = 0; # turn UNBIAS off
494
495 # code code code
496
497 $Statistics::Basic::UNBIAS = 1; # turn it back on
498
499 # code code code
500
501 {
502 local $Statistics::Basic::DEBUG_STATS_B = 1; # debug, this block only
503 }
504
505 Special caveat: "toler" can in fact be changed via the package var
506 (e.g., "$Statistics::Basic::TOLER=0.0001"). But, for speed reasons, it
507 must be defined before any other packages are imported or it will not
508 actually do anything when changed.
509
510 unbias
    This module uses the sum((X - mean(X))**2)/N definition of variance.

    If you wish to use the unbiased sum((X - mean(X))**2)/(N-1)
    definition, then set $Statistics::Basic::UNBIAS true (possibly with
    "use Statistics::Basic qw(unbias)").
516
517 This can be changed at any time with the package variable or at
518 compile time.
519
520 This feature was requested by "Robert McGehee
521 <xxxxxxxx@wso.williams.edu>".
522
523 [NOTE 2008-11-06:
524 <http://cpanratings.perl.org/dist/Statistics-Basic>, this can also
525 be called "population (n)" vs "sample (n-1)" and is indeed fully
526 addressed right here!]
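The two definitions differ only in the divisor. A small plain-Perl sketch makes the effect visible; variance_of and its first argument are hypothetical, standing in for the role the unbias setting plays.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Variance with a selectable divisor: N (population) or N-1 (sample).
sub variance_of {
    my ($unbias, @x) = @_;
    my $mean    = sum(@x) / @x;
    my $divisor = $unbias ? @x - 1 : @x;
    return sum( map { ($_ - $mean) ** 2 } @x ) / $divisor;
}

printf "population (N):   %.2f\n", variance_of(0, 1, 2, 3);   # 0.67
printf "sample     (N-1): %.2f\n", variance_of(1, 1, 2, 3);   # 1.00
```

For short lists the gap is large (0.67 versus 1.00 here); it shrinks as N grows.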
527
528 ipres
529 "ipres" defaults to 2. It is passed to Number::Format as the
530 second argument to format_number() during string interpolation
531 (see: overload).
532
533 toler
534 When set, $Statistics::Basic::TOLER (which is not enabled by
535 default), instructs the stats objects to test true when within some
536 tolerable range, pretty much like this:
537
    sub is_equal {
        return abs($_[0] - $_[1]) < $Statistics::Basic::TOLER
            if defined($Statistics::Basic::TOLER);

        return $_[0] == $_[1];
    }
544
545 For performance reasons, this must be defined before the import of
546 any other Statistics::Basic modules or the modules will fail to
547 overload the "==" operator.
548
549 $Statistics::Basic::TOLER totally disabled:
550
551 use Statistics::Basic qw(:all toler);
552
553 $Statistics::Basic::TOLER disabled, but changeable:
554
555 use Statistics::Basic qw(:all toler=0);
556
557 $Statistics::Basic::TOLER = 0.000_001;
558
559 You can change the tolerance at runtime, but it must be set (or
560 unset) at compile time before the packages load.
561
562 nofill
563 Normally when you set the size of a vector it automatically fills
564 with zeros on the first-out side of the vector. You can disable
565 the autofilling with this option. It can be changed at any time.
566
567 debug
    Enable debugging with "use Statistics::Basic qw(debug)" or set
    a specific level (including 0 to disable) with "use
    Statistics::Basic qw(debug=2)".
571
572 This is also accessible at runtime using
573 $Statistics::Basic::DEBUG_STATS_B and can be switched on and off at
574 any time.
575
576 ignore_env
577 Normally the defaults for these options can be changed in the
578 environment of the program. Example:
579
580 UNBIAS=1 perl ./myprog.pl
581
582 This does the same thing as "$Statistics::Basic::UNBIAS=1" or "use
583 Statistics::Basic qw(unbias)" unless you disable the %ENV checking
584 with this option.
585
586 use Statistics::Basic qw(ignore_env);
587
589 You can change the defaults (assuming ignore_env is not used) from your
590 bash prompt. Example:
591
592 DEBUG_STATS_B=1 perl ./myprog.pl
593
594 $ENV{DEBUG_STATS_B}
595 Sets the default value of "debug".
596
597 $ENV{UNBIAS}
598 Sets the default value of "unbias".
599
600 $ENV{NOFILL}
601 Sets the default value of "nofill".
602
603 $ENV{IPRES}
604 Sets the default value of "ipres".
605
606 $ENV{TOLER}
607 Sets the default value of "toler".
608
610 All of the objects are true in numeric context. All of the objects
611 print useful strings when evaluated as a string. Most of the objects
612 evaluate usefully as numbers, although Statistics::Basic::Vector
613 objects, Statistics::Basic::ComputedVector objects, and
614 Statistics::Basic::LeastSquareFit objects do not -- they instead raise
615 an error.
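For readers unfamiliar with the overload pragma, here is a minimal sketch of the general technique: one object that prints as a rounded string yet behaves as the raw value in arithmetic. My::Stat is a hypothetical class invented for this example, and sprintf stands in for Number::Format; none of this is the module's own code.

```perl
use strict;
use warnings;

package My::Stat;   # hypothetical class, not part of Statistics::Basic

use overload
    '""' => sub { sprintf '%.2f', $_[0]{value} },   # string context: rounded
    '0+' => sub { $_[0]{value} },                   # numeric context: raw value
    fallback => 1;

sub new { my ($class, $value) = @_; return bless { value => $value }, $class }

package main;

my $third = My::Stat->new(1 / 3);

print  "as a string: $third\n";               # as a string: 0.33
printf "as a number: %.4f\n", $third + 1;     # as a number: 1.3333
```

An object that has no sensible single number (a vector, or an alpha/beta pair) can instead die from its numeric handler, which matches the errors described above.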
616
618 I've been asked a couple times now why I don't link to
619 Statistics::Descriptive in my see also section. As a rule, I only link
620 to packages there that I think are related or that I actually used in
621 the package construction. I've never personally used Descriptive, but
622 it surely seems to do quite a lot more. In a sense, this package
623 really doesn't do statistics, not like a scientist would think about it
624 anyway. So I always figured people could find their own way to
625 Descriptive anyway.
626
627 The one thing this package does do, that I don't think Descriptive does
628 (correct me if I'm wrong) is time difference computations. If there
629 are say, 200 things in the mean object, then after inserting (using
630 this package) there'll still be 200 things, allowing the computation of
631 a moving average, moving stddev, moving correlation, etc. You might
632 argue that this is rarely needed, but it is really the only time I need
633 to compute these things.
634
635 while( $data = $fetch_sth->fetchrow_arrayref ) {
636 $mean->insert($data);
637 $moving_avg_sth->execute(0 + $mean);
638 }
639
640 Since I opened the topic I'd also like to mention that I find this
641 package easier to use. That is a matter of taste and since I wrote
642 this, you might say I'm a little biased. Your mileage may vary.
643
645 Paul Miller "<jettero@cpan.org>"
646
647 I am using this software in my own projects... If you find bugs,
648 please please please let me know. :) Actually, let me know if you find
649 it handy at all. Half the fun of releasing this stuff is knowing that
650 people use it.
651
653 Copyright 2012 Paul Miller -- Licensed under the LGPL version 2.
654
656 perl(1), Number::Format, overload, Statistics::Basic::Vector,
657 Statistics::Basic::ComputedVector, Statistics::Basic::_OneVectorBase,
658 Statistics::Basic::Mean, Statistics::Basic::Median,
659 Statistics::Basic::Mode, Statistics::Basic::Variance,
660 Statistics::Basic::StdDev, Statistics::Basic::_TwoVectorBase,
661 Statistics::Basic::Correlation, Statistics::Basic::Covariance,
662 Statistics::Basic::LeastSquareFit
663
664
665
perl v5.32.0                      2020-07-28                      Statistics::Basic(3)