1Statistics::CaseResamplUisnegr(3C)ontributed Perl DocumeSnttaattiisotnics::CaseResampling(3)
2
3
4

NAME

6       Statistics::CaseResampling - Efficient resampling and calculation of
7       medians with confidence intervals
8

SYNOPSIS

10         use Statistics::CaseResampling ':all';
11
12         my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
13         my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
14         #my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
15
16         # calculate the median of the sample with lower and upper confidence
17         # limits using resampling/bootstrapping:
18         my ($lower_cl, $median, $upper_cl)
19           = median_simple_confidence_limits($sample, $confidence);
20
21         # There are many auxiliary functions:
22
23         my $resampled = resample($sample);
24         # $resampled is now a random set of measurements from $sample,
25         # including potential duplicates
26
27         my $medians = resample_medians($sample, $n_resamples);
28         # $medians is now an array reference containing the medians
29         # of $n_resamples resample runs
30         # This is vastly more efficient that doing the same thing with
31         # repeated resample() calls!
32         # Analogously:
33         my $means = resample_means($sample, $n_resamples);
34
35         # you can get the cl's from a set of separately resampled medians, too:
36         my ($lower_cl, $median, $upper_cl)
37           = simple_confidence_limits_from_samples($median, $medians, $confidence);
38
39         # utility functions:
40         print median([1..5]), "\n"; # prints 3
41         print mean([1..5]), "\n"; # prints 3, too, surprise!
42         print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum
43

DESCRIPTION

45       The purpose of this (XS) module is to calculate the median (or in
46       principle also other statistics) with confidence intervals on a sample.
47       To do that, it uses a technique called bootstrapping. In a nutshell, it
48       resamples the sample a lot of times and for each resample, it
49       calculates the median.  From the distribution of medians, it then
50       calculates the confidence limits.
51
52       In order to implement the confidence limit calculation, various other
53       functions had to be implemented efficiently (both algorithmically
54       efficient and done in C). These functions may be useful in their own
55       right and are thus exposed to Perl. Most notably, this exposes a median
56       (and general selection) algorithm that works in linear time as opposed
57       to the trivial implementation that requires "O(n*log(n))".
58
59   Random numbers
60       The resampling involves drawing many random numbers. Therefore, the
61       module comes with an embedded Mersenne twister (taken from
62       "Math::Random::MT").
63
64       If you want to change the seed of the RNG, do this:
65
66         $Statistics::CaseResampling::Rnd
67           = Statistics::CaseResampling::RdGen::setup($seed);
68
69       or
70
71         $Statistics::CaseResampling::Rnd
72           = Statistics::CaseResampling::RdGen::setup(@seed);
73
74       Do not use the embedded random number generator for other purposes.
75       Use "Math::Random::MT" instead! At this point, you cannot change the
76       type of RNG.
77
78   EXPORT
79       None by default.
80
81       Can export any of the functions that are documented below using
82       standard "Exporter" semantics, including the customary ":all" group.
83

FUNCTIONS

85       This list of functions is loosely sorted from basic to comprehensive
86       because the more complicated functions are usually (under the hood, in
87       C) implemented using the basic building blocks. Unfortunately, that
88       means you may want to read the documentation backwards. :)
89
90       Additionally, there is a whole set of general purpose, fast (XS)
91       functions for calculating statistical metrics. They're useful without
92       the bootstrapping related stuff, so they're listed in the "OTHER
93       FUNCTIONS" section below.
94
95       All of these functions are written in C for speed.
96
97   resample(ARRAYREF)
98       Returns a reference to an array containing N random elements from the
99       input array, where N is the length of the original array.
100
101       This is different from shuffling in that it's random drawing without
102       removing the drawn elements from the sample.
103
104   resample_medians(ARRAYREF, NMEDIANS)
105       Returns a reference to an array containing the medians of "NMEDIANS"
106       resamples of the original input sample.
107
108   resample_means(ARRAYREF, NMEANS)
109       Returns a reference to an array containing the means of "NMEANS"
110       resamples of the original input sample.
111
112   simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES,
113       CONFIDENCE)
114       Calculates the confidence limits for STATISTIC. Returns the lower
115       confidence limit, the statistic, and the upper confidence limit.
116       STATISTIC_SAMPLES is the output of, for example,
117       "resample_means(\@sample)".  CONFIDENCE indicates the fraction of data
118       within the confidence limits (assuming a normal, symmetric distribution
119       of the statistic => simple confidence limits).
120
121       For example, to get the 90% confidence (~2 sigma) interval for the mean
122       of your sample, you can do the following:
123
124         my $sample = [<numbers>];
125         my $means = resample_means($sample, $n_resamples);
126         my ($lower_cl, $mean, $upper_cl)
127           = simple_confidence_limits_from_samples(mean($sample), $means, 0.90);
128
129       If you want to apply this logic to other statistics such as the first
130       or third quartile, you have to manually resample and lose the advantage
131       of doing it in C:
132
133         my $sample = [<numbers>];
134         my $quartiles = [];
135         foreach (1..1000) {
136           push @$quartiles, first_quartile(resample($sample));
137         }
138         my ($lower_cl, $firstq, $upper_cl)
139           = simple_confidence_limits_from_samples(
140               first_quartile($sample), $quartiles, 0.90
141             );
142
143       For a reliable calculation of the confidence limits, you should use at
144       least 1000 samples if not more. Yes. This whole procedure is expensive.
145
146       For medians, however, use the following function:
147
148   median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
149       Calculates the confidence limits for the "CONFIDENCE" level for the
150       median of SAMPLE. Returns the lower confidence limit, the median, and
151       the upper confidence limit.
152
153       In order to calculate the limits, a lot of resampling has to be done.
154       NSAMPLES defaults to 1000. Try running this a couple of times on your
155       data interactively to see how the limits still vary a little bit at
156       this setting.
157

OTHER FUNCTIONS

159   approx_erf($x)
160       Calculates an approximatation of the error function of x.  Implemented
161       after
162
163         Winitzki, Sergei (6 February 2008).
164         "A handy approximation for the error function and its inverse" (PDF).
165         http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
166
167       Quoting: Relative precision better than "1.3e-4".
168
169   approx_erf_inv($x)
170       Calculates an approximation of the inverse of the error function of x.
171
172       Algorithm from the same source as "approx_erf".
173
174       Quoting: Relative precision better than "2e-3".
175
176   nsigma_to_alpha($nsigma)
177       Calculates the probability that a measurement from a normal
178       distribution is further away from the mean than $nsigma standard
179       deviations.
180
181       The confidence level (what you pass as the "CONFIDENCE" parameter to
182       some functions in this module) is "1 - nsigma_to_alpha($nsigma)".
183
184   alpha_to_nsigma($alpha)
185       Inverse of "nsigma_to_alpha()".
186
187   mean(ARRAYREF)
188       Calculates the mean of a sample.
189
190   median(ARRAYREF)
191       Calculates the median (second quartile) of a sample. Works in linear
192       time thanks to using a selection instead of a sort.
193
194       Unfortunately, the way this is implemented, the median of an even
195       number of parameters is, here, defined as the "n/2-1"th largest number
196       and not the average of the "n/2-1"th and the "n/2"th number. This
197       shouldn't matter for nontrivial sample sizes!
198
199   median_absolute_deviation(ARRAYREF)
200       Calculates the median absolute deviation (MAD) in what I believe is
201       O(n).  Take care to rescale the MAD before using it in place of a
202       standard deviation.
203
204   first_quartile(ARRAYREF)
205       Calculates the first quartile of the sample.
206
207   third_quartile(ARRAYREF)
208       Calculates the third quartile of the sample.
209
210   select_kth(ARRAYREF, K)
211       Selects the kth smallest element from the sample.
212
213       This is the general function that implements the median/quartile
214       calculation. You get the median with this definition of K:
215
216         my $k = int(@$sample/2) + (@$sample & 1);
217         my $median = select_kth($sample, $k);
218
219   sample_standard_deviation
220       Given the sample mean and an anonymous array of numbers (the sample),
221       calculates the sample standard deviation.
222
223   population_standard_deviation
224       Same as sample_standard_deviation, but without the correction to "N".
225

TODO

227       One could calculate more statistics in C for performance.
228

SEE ALSO

230       Math::Random::MT
231
232       On the approximation of the error function:
233
234         Winitzki, Sergei (6 February 2008).
235         "A handy approximation for the error function and its inverse" (PDF).
236         http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
237
238       The ~O(n) median implementation is based on C.A.R. Hoare's quickselect
239       algorithm. See
240       <http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm>.
241       Right now, it does not implement the Median of Medians algorithm that
242       would guarantee linearity.
243

AUTHOR

245       Steffen Mueller, <smueller@cpan.org>
246
247       Daniel Dragan, <bulk88@hotmail.com>, who supplied MSVC compatibility
248       patches.
249
251       Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller
252
253       This library is free software; you can redistribute it and/or modify it
254       under the same terms as Perl itself, either Perl version 5.8.0 or, at
255       your option, any later version of Perl 5 you may have available.
256
257
258
259perl v5.28.1                      2013-05-14     Statistics::CaseResampling(3)
Impressum