1Statistics::CaseResamplUisnegr(3C)ontributed Perl DocumeSnttaattiisotnics::CaseResampling(3)
2
3
4
6 Statistics::CaseResampling - Efficient resampling and calculation of
7 medians with confidence intervals
8
10 use Statistics::CaseResampling ':all';
11
12 my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
13 my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
14 #my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
15
16 # calculate the median of the sample with lower and upper confidence
17 # limits using resampling/bootstrapping:
18 my ($lower_cl, $median, $upper_cl)
19 = median_simple_confidence_limits($sample, $confidence);
20
21 # There are many auxiliary functions:
22
23 my $resampled = resample($sample);
24 # $resampled is now a random set of measurements from $sample,
25 # including potential duplicates
26
27 my $medians = resample_medians($sample, $n_resamples);
28 # $medians is now an array reference containing the medians
29 # of $n_resamples resample runs
30 # This is vastly more efficient that doing the same thing with
31 # repeated resample() calls!
32 # Analogously:
33 my $means = resample_means($sample, $n_resamples);
34
35 # you can get the cl's from a set of separately resampled medians, too:
36 my ($lower_cl, $median, $upper_cl)
37 = simple_confidence_limits_from_samples($median, $medians, $confidence);
38
39 # utility functions:
40 print median([1..5]), "\n"; # prints 3
41 print mean([1..5]), "\n"; # prints 3, too, surprise!
42 print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum
43
45 The purpose of this (XS) module is to calculate the median (or in
46 principle also other statistics) with confidence intervals on a sample.
47 To do that, it uses a technique called bootstrapping. In a nutshell, it
48 resamples the sample a lot of times and for each resample, it
49 calculates the median. From the distribution of medians, it then
50 calculates the confidence limits.
51
52 In order to implement the confidence limit calculation, various other
53 functions had to be implemented efficiently (both algorithmically
54 efficient and done in C). These functions may be useful in their own
55 right and are thus exposed to Perl. Most notably, this exposes a median
56 (and general selection) algorithm that works in linear time as opposed
57 to the trivial implementation that requires "O(n*log(n))".
58
59 Random numbers
60 The resampling involves drawing many random numbers. Therefore, the
61 module comes with an embedded Mersenne twister (taken from
62 "Math::Random::MT").
63
64 If you want to change the seed of the RNG, do this:
65
66 $Statistics::CaseResampling::Rnd
67 = Statistics::CaseResampling::RdGen::setup($seed);
68
69 or
70
71 $Statistics::CaseResampling::Rnd
72 = Statistics::CaseResampling::RdGen::setup(@seed);
73
74 Do not use the embedded random number generator for other purposes.
75 Use "Math::Random::MT" instead! At this point, you cannot change the
76 type of RNG.
77
78 EXPORT
79 None by default.
80
81 Can export any of the functions that are documented below using
82 standard "Exporter" semantics, including the customary ":all" group.
83
85 This list of functions is loosely sorted from basic to comprehensive
86 because the more complicated functions are usually (under the hood, in
87 C) implemented using the basic building blocks. Unfortunately, that
88 means you may want to read the documentation backwards. :)
89
90 Additionally, there is a whole set of general purpose, fast (XS)
91 functions for calculating statistical metrics. They're useful without
92 the bootstrapping related stuff, so they're listed in the "OTHER
93 FUNCTIONS" section below.
94
95 All of these functions are written in C for speed.
96
97 resample(ARRAYREF)
98 Returns a reference to an array containing N random elements from the
99 input array, where N is the length of the original array.
100
101 This is different from shuffling in that it's random drawing without
102 removing the drawn elements from the sample.
103
104 resample_medians(ARRAYREF, NMEDIANS)
105 Returns a reference to an array containing the medians of "NMEDIANS"
106 resamples of the original input sample.
107
108 resample_means(ARRAYREF, NMEANS)
109 Returns a reference to an array containing the means of "NMEANS"
110 resamples of the original input sample.
111
112 simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES,
113 CONFIDENCE)
114 Calculates the confidence limits for STATISTIC. Returns the lower
115 confidence limit, the statistic, and the upper confidence limit.
116 STATISTIC_SAMPLES is the output of, for example,
117 "resample_means(\@sample)". CONFIDENCE indicates the fraction of data
118 within the confidence limits (assuming a normal, symmetric distribution
119 of the statistic => simple confidence limits).
120
121 For example, to get the 90% confidence (~2 sigma) interval for the mean
122 of your sample, you can do the following:
123
124 my $sample = [<numbers>];
125 my $means = resample_means($sample, $n_resamples);
126 my ($lower_cl, $mean, $upper_cl)
127 = simple_confidence_limits_from_samples(mean($sample), $means, 0.90);
128
129 If you want to apply this logic to other statistics such as the first
130 or third quartile, you have to manually resample and lose the advantage
131 of doing it in C:
132
133 my $sample = [<numbers>];
134 my $quartiles = [];
135 foreach (1..1000) {
136 push @$quartiles, first_quartile(resample($sample));
137 }
138 my ($lower_cl, $firstq, $upper_cl)
139 = simple_confidence_limits_from_samples(
140 first_quartile($sample), $quartiles, 0.90
141 );
142
143 For a reliable calculation of the confidence limits, you should use at
144 least 1000 samples if not more. Yes. This whole procedure is expensive.
145
146 For medians, however, use the following function:
147
148 median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
149 Calculates the confidence limits for the "CONFIDENCE" level for the
150 median of SAMPLE. Returns the lower confidence limit, the median, and
151 the upper confidence limit.
152
153 In order to calculate the limits, a lot of resampling has to be done.
154 NSAMPLES defaults to 1000. Try running this a couple of times on your
155 data interactively to see how the limits still vary a little bit at
156 this setting.
157
159 approx_erf($x)
160 Calculates an approximatation of the error function of x. Implemented
161 after
162
163 Winitzki, Sergei (6 February 2008).
164 "A handy approximation for the error function and its inverse" (PDF).
165 http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
166
167 Quoting: Relative precision better than "1.3e-4".
168
169 approx_erf_inv($x)
170 Calculates an approximation of the inverse of the error function of x.
171
172 Algorithm from the same source as "approx_erf".
173
174 Quoting: Relative precision better than "2e-3".
175
176 nsigma_to_alpha($nsigma)
177 Calculates the probability that a measurement from a normal
178 distribution is further away from the mean than $nsigma standard
179 deviations.
180
181 The confidence level (what you pass as the "CONFIDENCE" parameter to
182 some functions in this module) is "1 - nsigma_to_alpha($nsigma)".
183
184 alpha_to_nsigma($alpha)
185 Inverse of "nsigma_to_alpha()".
186
187 mean(ARRAYREF)
188 Calculates the mean of a sample.
189
190 median(ARRAYREF)
191 Calculates the median (second quartile) of a sample. Works in linear
192 time thanks to using a selection instead of a sort.
193
194 Unfortunately, the way this is implemented, the median of an even
195 number of parameters is, here, defined as the "n/2-1"th largest number
196 and not the average of the "n/2-1"th and the "n/2"th number. This
197 shouldn't matter for nontrivial sample sizes!
198
199 median_absolute_deviation(ARRAYREF)
200 Calculates the median absolute deviation (MAD) in what I believe is
201 O(n). Take care to rescale the MAD before using it in place of a
202 standard deviation.
203
204 first_quartile(ARRAYREF)
205 Calculates the first quartile of the sample.
206
207 third_quartile(ARRAYREF)
208 Calculates the third quartile of the sample.
209
210 select_kth(ARRAYREF, K)
211 Selects the kth smallest element from the sample.
212
213 This is the general function that implements the median/quartile
214 calculation. You get the median with this definition of K:
215
216 my $k = int(@$sample/2) + (@$sample & 1);
217 my $median = select_kth($sample, $k);
218
219 sample_standard_deviation
220 Given the sample mean and an anonymous array of numbers (the sample),
221 calculates the sample standard deviation.
222
223 population_standard_deviation
224 Same as sample_standard_deviation, but without the correction to "N".
225
227 One could calculate more statistics in C for performance.
228
230 Math::Random::MT
231
232 On the approximation of the error function:
233
234 Winitzki, Sergei (6 February 2008).
235 "A handy approximation for the error function and its inverse" (PDF).
236 http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
237
238 The ~O(n) median implementation is based on C.A.R. Hoare's quickselect
239 algorithm. See
240 <http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm>.
241 Right now, it does not implement the Median of Medians algorithm that
242 would guarantee linearity.
243
245 Steffen Mueller, <smueller@cpan.org>
246
247 Daniel Dragan, <bulk88@hotmail.com>, who supplied MSVC compatibility
248 patches.
249
251 Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller
252
253 This library is free software; you can redistribute it and/or modify it
254 under the same terms as Perl itself, either Perl version 5.8.0 or, at
255 your option, any later version of Perl 5 you may have available.
256
257
258
259perl v5.28.0 2013-05-14 Statistics::CaseResampling(3)