1Fsdb::Filter::dbmapreduUcsee(r3)Contributed Perl DocumenFtsadtbi:o:nFilter::dbmapreduce(3)
2
3
4
6 dbmapreduce - reduce all input rows with the same key
7
9 dbmapreduce [-dMS] [-k KeyField] [-f CodeFile] [-C Filtercode] [--] [ReduceCommand [ReduceArguments...]]
10
12 Group input data by KeyField, then apply a function (the "reducer") to
13 each group. The reduce function can be an external program given by
14 ReduceCommand and ReduceArguments, or an Perl subroutine given in
15 CodeFile or FilterCode.
16
17 If a "--" appears before reduce command, arguments after the -- passed
18 the the command.
19
20 Grouping (The Mapper)
21 By default the KeyField is the first field in the row. Unlike Hadoop
22 streaming, the -k KeyField option can explicitly name where the key is
23 in any column of each input row.
24
25 By default, we sort the data to make sure data is grouped by key. If
26 the input is already grouped, the "-S" option avoids this cost.
27
28 The Reducer
29 Reduce functions default to be shell commands. However, with "-C", one
30 can use arbitrary Perl code
31
32 (see the "-C" option below for details). the "-f" option is useful to
33 specify complex Perl code somewhere other than the command line.
34
35 Finally, as a special case, if there are no rows of input, the reducer
36 will be invoked once with the empty value (if it's an external reducer)
37 or with undef (if it's a subroutine). It is expected to generate the
38 output header, and it may generate no data rows itself, or a null data
39 row of its choosing.
40
41 Output
42 For non-multi-key-aware reducers, we add the KeyField use for each
43 Reduce is in the output stream. (If the reducer passes the key we
44 trust that it gives a correct value.) We also insure that the output
45 field separator is the same as the input field separator.
46
47 Adding the key and adjusting the output field separator is not possible
48 for non-multi-key-aware reducers.
49
50 Comparison to Related Work
51 This program thus implements Google-style map/reduce, but executed
52 sequentially.
53
54 For input, these systems include a map function and apply it to input
55 data to generate the key. We assume this key generation (the map
56 function) has occurred head of time.
57
58 We also allow the grouping key to be in any column. Hadoop Streaming
59 requires it to be in the first column.
60
61 By default, the reducer gets exactly (and only) one key. This
62 invariant is stronger than Google and Hadoop. They both pass multiple
63 keys to the reducer, insuring that each key is grouped together. With
64 the "-M" option, we also pass multiple multiple groups to the reducer.
65
66 Unlike those systems, with the "-S" option we do not require the groups
67 arrive in any particular order, just that they be grouped together.
68 (They guarantees they arrive in lexically sorted order). However, with
69 "-S" we create lexical ordering.
70
71 With "--prepend-key" we insure that the KeyField is in the output
72 stream; other systems do not enforce this.
73
74 Assumptions and requirements
75 By default, data can be provided in arbitrary order and the program
76 consumes O(number of unique tags) memory, and O(size of data) disk
77 space.
78
79 With the "-S" option, data must arrive group by tags (not necessarily
80 sorted), and the program consumes O(number of tags) memory and no disk
81 space. The program will check and abort if this precondition is not
82 met.
83
84 With two "-S"'s, program consumes O(1) memory, but doesn't verify that
85 the data-arrival precondition is met.
86
87 The field separators of the input and the output can now be different
88 (early versions of this tool prohibited such variation.) With
89 "--copy-fs" we copy the input field separator to the output, but only
90 for non-multi-key-aware reducers. (this used to be done
91 automatically). Alternatively, one can specify the output field
92 separator with "--fieldseparator", in which case the output had better
93 generate that format. An explicit "--fieldseparator" takes priority
94 over "--copy-fs".
95
96 Known bugs
97 As of 2013-09-21, we don't verify key order with options "-M -S".
98
100 -k or --key KeyField
101 specify which column is the key for grouping (default: the first
102 column)
103
104 -S or --pre-sorted
105 Assume data is already grouped by tag. Provided twice, it removes
106 the validation of this assertion.
107
108 -M or --multiple-ok
109 Assume the ReduceCommand can handle multiple grouped keys, and the
110 ReduceCommand is responsible for outputting the with each output
111 row. (By default, a separate ReduceCommand is run for each key,
112 and dbmapreduce adds the key to each output row.)
113
114 -K or --pass-current-key
115 Pass the current key as an argument to the external, non-map-aware
116 ReduceCommand. This is only done optionally since some external
117 commands do not expect an extra argument. (Internal, non-map-aware
118 Perl reducers are always given the current key as an argument.)
119
120 --prepend-key
121 Add the current key into the reducer output for non-multi-key-aware
122 reducers only. Not done by default.
123
124 --copy-fs or --copy-fieldseparator
125 Change the field separator of a non-multi-key-aware reducers to
126 match the input's field separator. Not done by default.
127
128 --parallelism=N or -j N
129 Allow up to N reducers to run in parallel. Default is the number
130 of CPUs in the machine.
131
132 -F or --fs or --fieldseparator S
133 Specify the field (column) separator as "S". See dbfilealter for
134 valid field separators.
135
136 -C FILTER-CODE or --filter-code=FILTER-CODE
137 Provide FILTER-CODE, Perl code that generates and returns a
138 Fsdb::Filter object that implements the reduce function. The
139 provided code should be an anonymous sub that creates a Fsdb Filter
140 that implements the reduce object.
141
142 The reduce object will then be called with --input and --output
143 parameters that hook it into a the reduce with queues.
144
145 One sample fragment that works is just:
146
147 dbcolstats(qw(--nolog duration))
148
149 So this command:
150
151 cat DATA/stats.fsdb | \
152 dbmapreduce -k experiment -C 'dbcolstats(qw(--nolog duration))'
153
154 is the same as the example
155
156 cat DATA/stats.fsdb | \
157 dbmapreduce -k experiment -- dbcolstats duration
158
159 except that with "-C" there is no forking and so things run faster.
160
161 If "dbmapreduce" is invoked from within Perl, then one can use a
162 code SUB as well:
163 dbmapreduce(-k => 'experiment', -C => sub {
164 dbcolstats(qw(--nolong duration)) });
165
166 The reduce object must consume all input as a Fsdb stream, and
167 close the output Fsdb stream. (If this assumption is not met the
168 map/reduce will be aborted.)
169
170 For non-map-reduce-aware filters, when the filter-generator code
171 runs, $_[0] will be the current key.
172
173 -f CODE-FILE or --code-file=CODE-FILE
174 Includes CODE-FILE in the program. This option is useful for more
175 complicated perl reducer functions.
176
177 Thus, if reducer.pl has the code.
178
179 sub make_reducer {
180 my($current_key) = @_;
181 dbcolstats(qw(--nolog duration));
182 }
183
184 Then the command
185
186 cat DATA/stats.fsdb | \
187 dbmapreduce -k experiment -f reducer.pl -C make_reducer
188
189 does the same thing as the example.
190
191 -w or --warnings
192 Enable warnings in user supplied code. Warnings are issued if an
193 external reducer fails to consume all input. (Default to include
194 warnings.)
195
196 -T TmpDir
197 where to put tmp files. Also uses environment variable TMPDIR, if
198 -T is not specified. Default is /tmp.
199
200 This module also supports the standard fsdb options:
201
202 -d Enable debugging output.
203
204 -i or --input InputSource
205 Read from InputSource, typically a file name, or "-" for standard
206 input, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue
207 objects.
208
209 -o or --output OutputDestination
210 Write to OutputDestination, typically a file name, or "-" for
211 standard output, or (if in Perl) a IO::Handle, Fsdb::IO or
212 Fsdb::BoundedQueue objects.
213
214 --autorun or --noautorun
215 By default, programs process automatically, but Fsdb::Filter
216 objects in Perl do not run until you invoke the run() method. The
217 "--(no)autorun" option controls that behavior within Perl.
218
219 --header H
220 Use H as the full Fsdb header, rather than reading a header from
221 then input.
222
223 --help
224 Show help.
225
226 --man
227 Show full manual.
228
230 Input:
231 #fsdb experiment duration
232 ufs_mab_sys 37.2
233 ufs_mab_sys 37.3
234 ufs_rcp_real 264.5
235 ufs_rcp_real 277.9
236
237 Command:
238 cat DATA/stats.fsdb | \
239 dbmapreduce --prepend-key -k experiment -- dbcolstats duration
240
241 Output:
242 #fsdb experiment mean stddev pct_rsd conf_range conf_low conf_high conf_pct sum sum_squared min max n
243 ufs_mab_sys 37.25 0.070711 0.18983 0.6353 36.615 37.885 0.95 74.5 2775.1 37.2 37.3 2
244 ufs_rcp_real 271.2 9.4752 3.4938 85.13 186.07 356.33 0.95 542.4 1.4719e+05 264.5 277.9 2
245 # | dbmapreduce -k experiment dbstats duration
246
248 Fsdb. dbmultistats dbrowsplituniq
249
251 OLD TEXT: A few notes about the internal structure: dbmapreduce uses
252 two to four threads (actually Freds) to run. An optional thread
253 "$self-"{_in_fred}> sorts the input. The main process reads input and
254 groups input by key. Each group is passed to a secondary fred
255 "$self-"{_reducer_thread}> that invokes the reducer on each group and
256 does any output. If the reducer is not map-aware, then we create a
257 final postprocessor thread that adds the key back to the output.
258 Either the reducer or the postprocessor thread do output.
259
260 NEW VERSION with Freds:
261
262 A few notes about parallelism, since we have fairly different structure
263 depending on what we're doing:
264
265 1. for multi-key aware reducers, there is no output post-processing.
266
267 1a. if input is sorted and there is no input checking (-S -S), we run
268 the reducer in our own process.
269 (TEST/dbmapreduce_multiple_aware_sub.cmd)
270
271 1b. with grouped input and input checking (-S), we fork off an input
272 process that checks grouping, then run the reducer in our process.
273 (TEST/dbmapreduce_multiple_aware_sub_checked.cmd) xxx: case 1b not yet
274 done
275
276 1c. with ungrouped input, we invoke an input process to do sorting,
277 then run the reducer in our process.
278 (TEST/dbmapreduce_multiple_aware_sub_ungrouped.cmd)
279
280 2. for non-multi-key aware. A sorter thread groups content, if
281 necessary. We breaks stuff into groups and feeds them to a reducer
282 Fred, one per group. A dedicated additional Fred merges output and
283 addes the missing key, if necessary. Either way, output ends up in a
284 file. A finally postprocessor thread merges all the output files.
285
286 new
287 $filter = new Fsdb::Filter::dbmapreduce(@arguments);
288
289 Create a new dbmapreduce object, taking command-line arguments.
290
291 set_defaults
292 $filter->set_defaults();
293
294 Internal: set up defaults.
295
296 parse_options
297 $filter->parse_options(@ARGV);
298
299 Internal: parse command-line arguments.
300
301 setup
302 $filter->setup();
303
304 Internal: setup, parse headers.
305
306 _setup_reducer
307 _setup_reducer
308
309 (internal) One Fred that runs the reducer and produces output.
310 "_reducer_queue" is sends the new key, then a Fsdb stream, then EOF
311 (undef) for each group. We setup the output, suppress all but the
312 first header, and add in the keys if necessary.
313
314 _key_to_string
315 $self->_key_to_string($key)
316
317 Convert a key (maybe undef) to a string for status messages.
318
319 _open_new_key
320 _open_new_key
321
322 (internal)
323
324 Note that new_key can be undef if there was no input.
325
326 _close_old_key
327 _close_old_key
328
329 Internal, finish a key.
330
331 _check_finished_reducers
332 $self->_check_finished_reducers($force);
333
334 Internal: see if any reducer freds finished, optionally $FORCE-ing all
335 to finish.
336
337 This routine also enforces a maximum amount of parallelism, blocking us
338 when we have too many reducers running.
339
340 _mapper_run
341 $filter->_mapper_run();
342
343 Internal: run over each rows, grouping them. Fork off reducer as
344 necessary.
345
346 run
347 $filter->run();
348
349 Internal: run over each rows.
350
351 finish
352 $filter->finish();
353
354 Internal: write trailer.
355
357 Copyright (C) 1991-2018 by John Heidemann <johnh@isi.edu>
358
359 This program is distributed under terms of the GNU general public
360 license, version 2. See the file COPYING with the distribution for
361 details.
362
363
364
365perl v5.30.1 2020-01-30 Fsdb::Filter::dbmapreduce(3)