math::statistics(n)            Tcl Math Library            math::statistics(n)

______________________________________________________________________________

math::statistics - Basic statistical functions and procedures

package require Tcl 8

package require math::statistics 0.2

::math::statistics::mean data

::math::statistics::min data

::math::statistics::max data

::math::statistics::number data

::math::statistics::stdev data

::math::statistics::var data

::math::statistics::median data

::math::statistics::basic-stats data

::math::statistics::histogram limits values

::math::statistics::corr data1 data2

::math::statistics::interval-mean-stdev data confidence

::math::statistics::t-test-mean data est_mean est_stdev confidence

::math::statistics::test-normal data confidence

::math::statistics::lillieforsFit data

::math::statistics::quantiles data confidence

::math::statistics::quantiles limits counts confidence

::math::statistics::autocorr data

::math::statistics::crosscorr data1 data2

::math::statistics::mean-histogram-limits mean stdev number

::math::statistics::minmax-histogram-limits min max number

::math::statistics::linear-model xdata ydata intercept

::math::statistics::linear-residuals xdata ydata intercept

::math::statistics::test-2x2 n11 n21 n12 n22

::math::statistics::print-2x2 n11 n21 n12 n22

::math::statistics::control-xbar data ?nsamples?

::math::statistics::control-Rchart data ?nsamples?

::math::statistics::test-xbar control data

::math::statistics::test-Rchart control data

::math::statistics::pdf-normal mean stdev value

::math::statistics::pdf-exponential mean value

::math::statistics::pdf-uniform xmin xmax value

::math::statistics::cdf-normal mean stdev value

::math::statistics::cdf-exponential mean value

::math::statistics::cdf-uniform xmin xmax value

::math::statistics::cdf-students-t degrees value

::math::statistics::random-normal mean stdev number

::math::statistics::random-exponential mean number

::math::statistics::random-uniform xmin xmax number

::math::statistics::histogram-uniform xmin xmax limits number

::math::statistics::filter varname data expression

::math::statistics::map varname data expression

::math::statistics::samplescount varname list expression

::math::statistics::subdivide

::math::statistics::plot-scale canvas xmin xmax ymin ymax

::math::statistics::plot-xydata canvas xdata ydata tag

::math::statistics::plot-xyline canvas xdata ydata tag

::math::statistics::plot-tdata canvas tdata tag

::math::statistics::plot-tline canvas tdata tag

::math::statistics::plot-histogram canvas counts limits tag

_________________________________________________________________

The math::statistics package contains functions and procedures for
basic statistical data analysis, such as:

·      Descriptive statistical parameters (mean, minimum, maximum,
       standard deviation)

·      Estimates of the distribution in the form of histograms and
       quantiles

·      Basic testing of hypotheses

·      Probability and cumulative density functions

It is meant to help in developing data analysis applications or doing
ad hoc data analysis. It is not in itself a full application, nor is it
intended to rival full (non-)commercial statistical packages.

The purpose of this document is to describe the implemented procedures
and provide some examples of their usage. As there is ample literature
on the algorithms involved, we refer to relevant textbooks for more
explanations.

The package contains a fairly large number of public procedures. They
can be divided into four sets: general procedures, procedures that deal
with specific statistical distributions, list procedures to select or
transform data, and simple plotting procedures (these require Tk).

Note: The data to be analyzed are always contained in a simple list.
Missing values are represented as empty list elements.

The general statistical procedures are:

::math::statistics::mean data
       Determine the mean value of the given list of data.

       data - List of data

::math::statistics::min data
       Determine the minimum value of the given list of data.

       data - List of data

::math::statistics::max data
       Determine the maximum value of the given list of data.

       data - List of data

::math::statistics::number data
       Determine the number of non-missing data in the given list.

       data - List of data

::math::statistics::stdev data
       Determine the standard deviation of the data in the given list.

       data - List of data

::math::statistics::var data
       Determine the variance of the data in the given list.

       data - List of data

::math::statistics::median data
       Determine the median of the data in the given list. (Note that
       this requires sorting the data, which may be a costly operation.)

       data - List of data

::math::statistics::basic-stats data
       Determine a list of all the descriptive parameters: mean,
       minimum, maximum, number of data, standard deviation and
       variance.

       (This routine is called whenever any or all of the basic
       statistical parameters are required. Hence all calculations are
       done and the relevant values are returned.)

       data - List of data

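As a brief illustration of the descriptive procedures above, the following sketch computes a few parameters for a small list that contains one missing value. The data values are invented for the example, and the comments state only what follows from the descriptions above.

```tcl
package require math::statistics

# Hypothetical sample; the empty element {} marks a missing value
set data {2.0 4.0 {} 4.0 5.0 7.0}

puts "number: [::math::statistics::number $data]"   ;# counts non-missing values only
puts "mean:   [::math::statistics::mean   $data]"
puts "min:    [::math::statistics::min    $data]"
puts "max:    [::math::statistics::max    $data]"

# basic-stats returns all parameters at once, in the order
# mean, minimum, maximum, number, standard deviation, variance
foreach {mean min max number stdev var} [::math::statistics::basic-stats $data] break
puts "stdev:  $stdev (variance $var)"
```

When several parameters are needed, a single call to basic-stats avoids repeating the same pass over the data.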
::math::statistics::histogram limits values
       Determine histogram information for the given list of data.
       Returns a list consisting of the number of values that fall into
       each interval. (The first interval consists of all values lower
       than the first limit, the last interval consists of all values
       greater than the last limit. There is one more interval than
       there are limits.)

       limits - List of upper limits (in ascending order) for the
       intervals of the histogram.

       values - List of data

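The interval rule described above can be tried out with a handful of values. This is a minimal sketch with invented limits and data; the values are chosen so that none of them coincides with a limit.

```tcl
package require math::statistics

# Three upper limits define four intervals
set limits {1.0 2.0 3.0}
set values {0.5 1.5 1.7 2.5 3.5 4.0}

set counts [::math::statistics::histogram $limits $values]

# One count per interval; for these data the counts are 1, 2, 1 and 2
puts $counts
```

The last count collects both values above the final limit, which is why the result is one element longer than the list of limits.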
::math::statistics::corr data1 data2
       Determine the correlation coefficient between two sets of data.

       data1 - First list of data

       data2 - Second list of data

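A short usage sketch for the correlation coefficient, with two invented series in which the second is roughly twice the first:

```tcl
package require math::statistics

set x {1.0 2.0 3.0 4.0 5.0}
set y {2.1 3.9 6.2 7.8 10.1}   ;# hypothetical measurements, roughly 2*x

set r [::math::statistics::corr $x $y]

# For nearly linear data the coefficient is close to 1
puts [format "correlation coefficient: %.4f" $r]
```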
::math::statistics::interval-mean-stdev data confidence
       Return the interval containing the mean value and one containing
       the standard deviation with a certain level of confidence
       (assuming a normal distribution).

       data - List of raw data values (small sample)

       confidence - Confidence level (0.95 or 0.99 for instance)

::math::statistics::t-test-mean data est_mean est_stdev confidence
       Test whether the mean value of a sample is in accordance with
       the estimated normal distribution with a certain level of
       confidence. Returns 1 if the test succeeds or 0 if the mean is
       unlikely to fit the given distribution.

       data - List of raw data values (small sample)

       est_mean - Estimated mean of the distribution

       est_stdev - Estimated stdev of the distribution

       confidence - Confidence level (0.95 or 0.99 for instance)

::math::statistics::test-normal data confidence
       Test whether the given data follow a normal distribution with a
       certain level of confidence. Returns 1 if the data are normally
       distributed within the level of confidence, returns 0 if not.
       The underlying test is the Lilliefors test.

       data - List of raw data values

       confidence - Confidence level (one of 0.80, 0.90, 0.95 or 0.99)

::math::statistics::lillieforsFit data
       Returns the goodness of fit to a normal distribution according
       to Lilliefors. The higher the number, the more likely the data
       are indeed normally distributed. The test requires at least five
       data points.

       data - List of raw data values

::math::statistics::quantiles data confidence
       Return the quantiles for a given set of data.

       data - List of raw data values

       confidence - Confidence level (0.95 or 0.99 for instance)

::math::statistics::quantiles limits counts confidence
       Return the quantiles based on histogram information (alternative
       to the call with two arguments).

       limits - List of upper limits from histogram

       counts - List of counts for each interval in histogram

       confidence - Confidence level (0.95 or 0.99 for instance)

::math::statistics::autocorr data
       Return the autocorrelation function as a list of values. The
       samples are assumed to be equidistant, and the list holds about
       half as many values as there are raw data.

       The correlation is determined in such a way that the first value
       is always 1 and all others are equal to or smaller than 1. The
       number of values involved will diminish as the "time" (the index
       in the list of returned values) increases.

       data - Raw data for which the autocorrelation must be determined

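The normalisation described above can be checked on an artificial series. The sketch below builds a periodic signal (the period of 8 samples and the length of 64 are arbitrary choices for illustration) and inspects the first value of the autocorrelation function.

```tcl
package require math::statistics

# Hypothetical periodic series with a period of 8 samples
set data {}
for {set i 0} {$i < 64} {incr i} {
    lappend data [expr {sin(2.0*3.1415926*$i/8.0)}]
}

set autoc [::math::statistics::autocorr $data]

puts "number of values returned: [llength $autoc]"
# The value at index 0 is 1 by construction
puts [format "value at index 0: %.3f" [lindex $autoc 0]]
```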
::math::statistics::crosscorr data1 data2
       Return the cross-correlation function as a list of values. The
       samples are assumed to be equidistant, and the list holds about
       half as many values as there are raw data.

       The correlation is determined in such a way that the values can
       never exceed 1 in magnitude. The number of values involved will
       diminish as the "time" (the index in the list of returned
       values) increases.

       data1 - First list of data

       data2 - Second list of data

::math::statistics::mean-histogram-limits mean stdev number
       Determine reasonable limits based on mean and standard deviation
       for a histogram.

       Convenience function - the result is suitable for the histogram
       function.

       mean - Mean of the data

       stdev - Standard deviation

       number - Number of limits to generate (defaults to 8)

::math::statistics::minmax-histogram-limits min max number
       Determine reasonable limits based on a minimum and maximum for a
       histogram.

       Convenience function - the result is suitable for the histogram
       function.

       min - Expected minimum

       max - Expected maximum

       number - Number of limits to generate (defaults to 8)

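The two convenience procedures are meant to feed the histogram procedure. A minimal sketch of that combination, using invented data and an arbitrary choice of five limits:

```tcl
package require math::statistics

# Hypothetical data values between 7 and 16
set data {7.2 8.1 9.4 10.3 10.9 11.5 12.2 13.8 14.1 15.6}

# Five limits spread between the expected extremes
set limits [::math::statistics::minmax-histogram-limits 7.0 16.0 5]
set counts [::math::statistics::histogram $limits $data]

puts "limits: $limits"
puts "counts: $counts"   ;# one more count than there are limits
```

The exact position of the generated limits is left to the procedure, so the sketch prints them rather than assuming particular values.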
::math::statistics::linear-model xdata ydata intercept
       Determine the coefficients for a linear regression between two
       series of data (the model: Y = A + B*X). Returns a list of
       parameters describing the fit.

       xdata - List of independent data

       ydata - List of dependent data to be fitted

       intercept - (Optional) compute the intercept (1, default) or fit
       to a line through the origin (0)

       The result consists of the following list:

       ·      (Estimate of) Intercept A

       ·      (Estimate of) Slope B

       ·      Standard deviation of Y relative to fit

       ·      Correlation coefficient R2

       ·      Number of degrees of freedom df

       ·      Standard error of the intercept A

       ·      Significance level of A

       ·      Standard error of the slope B

       ·      Significance level of B

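To show how the result list of linear-model might be unpacked, here is a small sketch. The data are invented and lie close to the line Y = 1 + 2*X; the variable names on the left of the foreach are illustrative and simply follow the order of the list above.

```tcl
package require math::statistics

set xdata {1.0 2.0 3.0 4.0 5.0}
set ydata {3.1 4.9 7.2 8.8 11.1}   ;# hypothetical, roughly Y = 1 + 2*X

set fit [::math::statistics::linear-model $xdata $ydata]

# Pick the first five elements of the result list
foreach {A B stdevY R2 df} $fit break

puts [format "Y = %.3f + %.3f * X" $A $B]
puts [format "R2 = %.4f, degrees of freedom = %s" $R2 $df]
```

For these data an ordinary least-squares fit gives an intercept of 1.05 and a slope of 1.99. The data deliberately contain some scatter, since a perfect fit leaves no residual variation to estimate the standard errors from.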
::math::statistics::linear-residuals xdata ydata intercept
       Determine the difference between the actual data and the values
       predicted from the linear model.

       Returns a list of the differences between the actual data and
       the predicted values.

       xdata - List of independent data

       ydata - List of dependent data to be fitted

       intercept - (Optional) compute the intercept (1, default) or fit
       to a line through the origin (0)

::math::statistics::test-2x2 n11 n21 n12 n22
       Determine if two sets of samples, each from a binomial
       distribution, differ significantly or not (implying a different
       parameter).

       Returns the "chi-square" value, which can be used to determine
       the significance.

       n11 - Number of outcomes with the first value from the first
       sample.

       n21 - Number of outcomes with the first value from the second
       sample.

       n12 - Number of outcomes with the second value from the first
       sample.

       n22 - Number of outcomes with the second value from the second
       sample.

::math::statistics::print-2x2 n11 n21 n12 n22
       Determine if two sets of samples, each from a binomial
       distribution, differ significantly or not (implying a different
       parameter).

       Returns a short report, useful in an interactive session.

       n11 - Number of outcomes with the first value from the first
       sample.

       n21 - Number of outcomes with the first value from the second
       sample.

       n12 - Number of outcomes with the second value from the first
       sample.

       n22 - Number of outcomes with the second value from the second
       sample.

::math::statistics::control-xbar data ?nsamples?
       Determine the control limits for an xbar chart. The number of
       data in each subsample defaults to 4. At least 20 subsamples are
       required.

       Returns the mean, the lower limit, the upper limit and the
       number of data per subsample.

       data - List of observed data

       nsamples - Number of data per subsample

::math::statistics::control-Rchart data ?nsamples?
       Determine the control limits for an R chart. The number of data
       in each subsample defaults to 4. At least 20 subsamples are
       required.

       Returns the mean range, the lower limit, the upper limit and the
       number of data per subsample.

       data - List of observed data

       nsamples - Number of data per subsample

::math::statistics::test-xbar control data
       Determine if the data exceed the control limits for the xbar
       chart.

       Returns a list of subsamples (their indices) that indeed violate
       the limits.

       control - Control limits as returned by the "control-xbar"
       procedure

       data - List of observed data

::math::statistics::test-Rchart control data
       Determine if the data exceed the control limits for the R chart.

       Returns a list of subsamples (their indices) that indeed violate
       the limits.

       control - Control limits as returned by the "control-Rchart"
       procedure

       data - List of observed data

In the literature a large number of probability distributions can be
found. The statistics package supports:

·      The normal or Gaussian distribution

·      The uniform distribution - equal probability for all data within
       a given interval

·      The exponential distribution - useful as a model for certain
       extreme-value distributions.

·      PM - binomial, Poisson, chi-squared, student's T, F.

In principle for each distribution one has procedures for:

·      The probability density (pdf-*)

·      The cumulative density (cdf-*)

·      Quantiles for the given distribution (quantiles-*)

·      Histograms for the given distribution (histogram-*)

·      List of random values with the given distribution (random-*)

The following procedures have been implemented:

::math::statistics::pdf-normal mean stdev value
       Return the probability of a given value for a normal
       distribution with given mean and standard deviation.

       mean - Mean value of the distribution

       stdev - Standard deviation of the distribution

       value - Value for which the probability is required

::math::statistics::pdf-exponential mean value
       Return the probability of a given value for an exponential
       distribution with given mean.

       mean - Mean value of the distribution

       value - Value for which the probability is required

::math::statistics::pdf-uniform xmin xmax value
       Return the probability of a given value for a uniform
       distribution with given extremes.

       xmin - Minimum value of the distribution

       xmax - Maximum value of the distribution

       value - Value for which the probability is required

::math::statistics::cdf-normal mean stdev value
       Return the cumulative probability of a given value for a normal
       distribution with given mean and standard deviation, that is the
       probability for values up to the given one.

       mean - Mean value of the distribution

       stdev - Standard deviation of the distribution

       value - Value for which the probability is required

::math::statistics::cdf-exponential mean value
       Return the cumulative probability of a given value for an
       exponential distribution with given mean.

       mean - Mean value of the distribution

       value - Value for which the probability is required

::math::statistics::cdf-uniform xmin xmax value
       Return the cumulative probability of a given value for a uniform
       distribution with given extremes.

       xmin - Minimum value of the distribution

       xmax - Maximum value of the distribution

       value - Value for which the probability is required

::math::statistics::cdf-students-t degrees value
       Return the cumulative probability of a given value for a
       Student's t distribution with given number of degrees of
       freedom.

       degrees - Number of degrees of freedom

       value - Value for which the probability is required

::math::statistics::random-normal mean stdev number
       Return a list of "number" random values satisfying a normal
       distribution with given mean and standard deviation.

       mean - Mean value of the distribution

       stdev - Standard deviation of the distribution

       number - Number of values to be returned

::math::statistics::random-exponential mean number
       Return a list of "number" random values satisfying an
       exponential distribution with given mean.

       mean - Mean value of the distribution

       number - Number of values to be returned

::math::statistics::random-uniform xmin xmax number
       Return a list of "number" random values satisfying a uniform
       distribution with given extremes.

       xmin - Minimum value of the distribution

       xmax - Maximum value of the distribution

       number - Number of values to be returned

::math::statistics::histogram-uniform xmin xmax limits number
       Return the expected histogram for a uniform distribution.

       xmin - Minimum value of the distribution

       xmax - Maximum value of the distribution

       limits - Upper limits for the buckets in the histogram

       number - Total number of "observations" in the histogram

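A few calls to the distribution procedures, as a sketch of how the arguments line up. The parameter values are chosen only because the mathematical results are easy to check by hand.

```tcl
package require math::statistics

# Standard normal distribution: mean 0, standard deviation 1
set pdfn [::math::statistics::pdf-normal 0.0 1.0 0.0]
set cdfn [::math::statistics::cdf-normal 0.0 1.0 0.0]

# Uniform distribution on the interval 0 to 10
set cdfu [::math::statistics::cdf-uniform 0.0 10.0 2.5]

# Exponential distribution with mean 2
set cdfe [::math::statistics::cdf-exponential 2.0 2.0]

puts [format "pdf-normal at 0:      %.4f" $pdfn]
puts [format "cdf-normal at 0:      %.4f" $cdfn]
puts [format "cdf-uniform at 2.5:   %.4f" $cdfu]
puts [format "cdf-exponential at 2: %.4f" $cdfe]

# One hundred random values drawn from a normal distribution
set sample [::math::statistics::random-normal 10.0 2.0 100]
puts "sample size: [llength $sample]"
puts "sample mean: [::math::statistics::mean $sample]"
```

Mathematically the four values are 1/sqrt(2*pi) (about 0.3989), 0.5, 0.25 and 1 - exp(-1) (about 0.6321); the printed numbers should agree with these up to the accuracy of the implementation. The sample mean varies from run to run because the values are random.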
TO DO: more function descriptions to be added

The data manipulation procedures act on lists or lists of lists:

::math::statistics::filter varname data expression
       Return a list consisting of the data for which the logical
       expression is true (this command works analogously to the
       command foreach).

       varname - Name of the variable used in the expression

       data - List of data

       expression - Logical expression using the variable name

::math::statistics::map varname data expression
       Return a list consisting of the data that are transformed via
       the expression.

       varname - Name of the variable used in the expression

       data - List of data

       expression - Expression to be used to transform (map) the data

::math::statistics::samplescount varname list expression
       Return a list consisting of the counts of all data in the
       sublists of the "list" argument for which the expression is true.

       varname - Name of the variable used in the expression

       list - List of sublists, each containing the data

       expression - Logical expression to test the data (defaults to
       "true").

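The three list procedures share the same calling pattern: a variable name, the data, and an expression that refers to that variable. A minimal sketch with invented data, using x as the variable name:

```tcl
package require math::statistics

set data {1 2 3 4 5 6}

# Keep only the values greater than 3
set large [::math::statistics::filter x $data {$x > 3}]
puts $large      ;# the values 4, 5 and 6

# Square every value
set squares [::math::statistics::map x $data {$x*$x}]
puts $squares    ;# the values 1, 4, 9, 16, 25 and 36

# Count the values greater than 3 in each sublist
set counts [::math::statistics::samplescount x {{1 2 5} {4 5 6} {0 1}} {$x > 3}]
puts $counts     ;# one count per sublist: 1, 3 and 0
```

The expression is written in braces so that $x is substituted for each element in turn rather than once at the call site.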
::math::statistics::subdivide
       Routine PM - not implemented yet

The following simple plotting procedures are available:

::math::statistics::plot-scale canvas xmin xmax ymin ymax
       Set the scale for a plot in the given canvas. All plot routines
       expect this function to be called first. There is no automatic
       scaling provided.

       canvas - Canvas widget to use

       xmin - Minimum x value

       xmax - Maximum x value

       ymin - Minimum y value

       ymax - Maximum y value

::math::statistics::plot-xydata canvas xdata ydata tag
       Create a simple XY plot in the given canvas - the data are shown
       as a collection of dots. The tag can be used to manipulate the
       appearance.

       canvas - Canvas widget to use

       xdata - Series of independent data

       ydata - Series of dependent data

       tag - Tag to give to the plotted data (defaults to xyplot)

::math::statistics::plot-xyline canvas xdata ydata tag
       Create a simple XY plot in the given canvas - the data are shown
       as a line through the data points. The tag can be used to
       manipulate the appearance.

       canvas - Canvas widget to use

       xdata - Series of independent data

       ydata - Series of dependent data

       tag - Tag to give to the plotted data (defaults to xyplot)

::math::statistics::plot-tdata canvas tdata tag
       Create a simple XY plot in the given canvas - the data are shown
       as a collection of dots. The horizontal coordinate is equal to
       the index. The tag can be used to manipulate the appearance.
       This type of presentation is suitable for autocorrelation
       functions for instance or for inspecting the time-dependent
       behaviour.

       canvas - Canvas widget to use

       tdata - Series of dependent data

       tag - Tag to give to the plotted data (defaults to xyplot)

::math::statistics::plot-tline canvas tdata tag
       Create a simple XY plot in the given canvas - the data are shown
       as a line. See plot-tdata for an explanation.

       canvas - Canvas widget to use

       tdata - Series of dependent data

       tag - Tag to give to the plotted data (defaults to xyplot)

::math::statistics::plot-histogram canvas counts limits tag
       Create a simple histogram in the given canvas.

       canvas - Canvas widget to use

       counts - Series of bucket counts

       limits - Series of upper limits for the buckets

       tag - Tag to give to the plotted data (defaults to xyplot)

The following procedures are yet to be implemented:

·      F-test-stdev

·      interval-mean-stdev

·      histogram-normal

·      histogram-exponential

·      test-histogram

·      test-corr

·      quantiles-*

·      fourier-coeffs

·      fourier-residuals

·      onepar-function-fit

·      onepar-function-residuals

·      plot-linear-model

·      subdivide

The code below is a small example of how you can examine a set of data:

    # Simple example:
    # - Generate data (as a cheap way of getting some)
    # - Perform statistical analysis to describe the data
    #
    package require math::statistics

    #
    # Two auxiliary procs
    #
    proc pause {time} {
        # Wait on a global variable so that the event loop keeps running
        set ::wait 0
        after [expr {$time*1000}] {set ::wait 1}
        vwait ::wait
    }

    proc print-histogram {counts limits} {
        # There is one more count than there are limits, so the last
        # limit is empty and marks the open-ended final interval
        foreach count $counts limit $limits {
            if { $limit != {} } {
                puts [format "<%12.4g\t%d" $limit $count]
                set prev_limit $limit
            } else {
                puts [format ">%12.4g\t%d" $prev_limit $count]
            }
        }
    }

    #
    # Our source of arbitrary data
    #
    proc generateData { data1 data2 } {
        upvar 1 $data1 _data1
        upvar 1 $data2 _data2

        set d1 0.0
        set d2 0.0
        for { set i 0 } { $i < 100 } { incr i } {
            set d1 [expr {10.0-2.0*cos(2.0*3.1415926*$i/24.0)+3.5*rand()}]
            set d2 [expr {0.7*$d2+0.3*$d1+0.7*rand()}]
            lappend _data1 $d1
            lappend _data2 $d2
        }
        return {}
    }

    #
    # The analysis session
    #
    package require Tk
    catch {console show}   ;# the console is not available on all platforms
    canvas .plot1
    canvas .plot2
    pack .plot1 .plot2 -fill both -side top

    generateData data1 data2

    puts "Basic statistics:"
    set b1 [::math::statistics::basic-stats $data1]
    set b2 [::math::statistics::basic-stats $data2]
    foreach label {mean min max number stdev var} v1 $b1 v2 $b2 {
        puts "$label\t$v1\t$v2"
    }
    puts "Plot the data as function of \"time\" and against each other"
    ::math::statistics::plot-scale .plot1 0 100 0 20
    ::math::statistics::plot-scale .plot2 0 20 0 20
    ::math::statistics::plot-tline .plot1 $data1
    ::math::statistics::plot-tline .plot1 $data2
    ::math::statistics::plot-xydata .plot2 $data1 $data2

    puts "Correlation coefficient:"
    puts [::math::statistics::corr $data1 $data2]

    pause 2
    puts "Plot histograms"
    ::math::statistics::plot-scale .plot2 0 20 0 100
    set limits [::math::statistics::minmax-histogram-limits 7 16]
    set histogram_data [::math::statistics::histogram $limits $data1]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits

    puts "First series:"
    print-histogram $histogram_data $limits

    pause 2
    set limits [::math::statistics::minmax-histogram-limits 0 15 10]
    set histogram_data [::math::statistics::histogram $limits $data2]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits d2

    puts "Second series:"
    print-histogram $histogram_data $limits

    # map takes the name of the loop variable as its first argument
    puts "Autocorrelation function:"
    set autoc [::math::statistics::autocorr $data1]
    puts [::math::statistics::map x $autoc {[format "%.2f" $x]}]
    puts "Cross-correlation function:"
    set crossc [::math::statistics::crosscorr $data1 $data2]
    puts [::math::statistics::map x $crossc {[format "%.2f" $x]}]

    ::math::statistics::plot-scale .plot1 0 100 -1 4
    ::math::statistics::plot-tline .plot1 $autoc "autoc"
    ::math::statistics::plot-tline .plot1 $crossc "crossc"

    puts "Quantiles: 0.1, 0.2, 0.5, 0.8, 0.9"
    puts "First: [::math::statistics::quantiles $data1 {0.1 0.2 0.5 0.8 0.9}]"
    puts "Second: [::math::statistics::quantiles $data2 {0.1 0.2 0.5 0.8 0.9}]"

If you run this example, then the following should be clear:

·      There is a strong correlation between the two time series, as
       displayed by the raw data and especially by the correlation
       functions.

·      Both time series show a significant periodic component.

·      The histograms are not very useful in identifying the nature of
       the time series - they do not show the periodic nature.

data analysis, mathematics, statistics

math                                  0.2                  math::statistics(n)