Fsdb::Filter::dbmapreduce(3pm)

Fsdb::Filter::dbmapreduce(3)  User Contributed Perl Documentation  Fsdb::Filter::dbmapreduce(3)

NAME

       dbmapreduce - reduce all input rows with the same key

SYNOPSIS

           dbmapreduce [-dMS] [-k KeyField] [-f CodeFile] [-C FilterCode] [--] [ReduceCommand [ReduceArguments...]]

DESCRIPTION

       Group input data by KeyField, then apply a function (the "reducer") to
       each group.  The reduce function can be an external program given by
       ReduceCommand and ReduceArguments, or a Perl subroutine given in
       CodeFile or FilterCode.

       If a "--" appears before the reduce command, arguments after the "--"
       are passed to the command.

   Grouping (The Mapper)
       By default the KeyField is the first field in the row.  Unlike Hadoop
       streaming, the -k KeyField option can name any column of the input row
       as the key.

       By default, we sort the data to make sure data is grouped by key.  If
       the input is already grouped, the "-S" option avoids this cost.

   The Reducer
       By default, reduce functions are shell commands.  However, with "-C",
       one can use arbitrary Perl code (see the "-C" option below for
       details).  The "-f" option is useful to specify complex Perl code
       somewhere other than the command line.

       Finally, as a special case, if there are no rows of input, the reducer
       will be invoked once with the empty value (if it's an external reducer)
       or with undef (if it's a subroutine).  It is expected to generate the
       output header, and it may generate no data rows itself, or a null data
       row of its choosing.

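       The grouping and reducing steps described above can be pictured with a
       small sketch.  This is a Python model of the idea only, not the Fsdb
       implementation: the names map_reduce, key_index, reducer and presorted
       are hypothetical, rows are plain tuples rather than Fsdb rows, and the
       empty-input special case is not modeled.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(rows, key_index=0, reducer=len, presorted=False):
    """Group rows by one column, then call reducer once per group."""
    if not presorted:
        # Models the default behavior: sort so that equal keys are adjacent.
        rows = sorted(rows, key=itemgetter(key_index))
    out = []
    for key, group in groupby(rows, key=itemgetter(key_index)):
        # Each reducer call sees exactly one key's rows.
        out.append((key, reducer(list(group))))
    return out


rows = [("ufs_rcp_real", 264.5), ("ufs_mab_sys", 37.2),
        ("ufs_mab_sys", 37.3), ("ufs_rcp_real", 277.9)]
counts = map_reduce(rows)  # default reducer just counts rows per key
print(counts)
```

       Passing presorted=True in this sketch plays the role of "-S": the sort
       is skipped and the caller promises that equal keys are already
       adjacent.
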
   Output
       For non-multi-key-aware reducers, we can add the KeyField used for
       each reduce to the output stream (see "--prepend-key").  (If the
       reducer passes the key we trust that it gives a correct value.)  We
       can also ensure that the output field separator is the same as the
       input field separator (see "--copy-fs").

       Adding the key and adjusting the output field separator is not
       possible for multi-key-aware reducers.

   Comparison to Related Work
       This program thus implements Google-style map/reduce, but executed
       sequentially.

       For input, these systems include a map function and apply it to input
       data to generate the key.  We assume this key generation (the map
       function) has occurred ahead of time.

       We also allow the grouping key to be in any column.  Hadoop Streaming
       requires it to be in the first column.

       By default, the reducer gets exactly (and only) one key.  This
       invariant is stronger than that of Google and Hadoop.  They both pass
       multiple keys to the reducer, ensuring that each key is grouped
       together.  With the "-M" option, we also pass multiple groups to the
       reducer.

       Unlike those systems, with the "-S" option we do not require that the
       groups arrive in any particular order, just that they be grouped
       together.  (Those systems guarantee that groups arrive in lexically
       sorted order.)  However, without "-S" we sort the input and so create
       lexical ordering.

       With "--prepend-key" we ensure that the KeyField is in the output
       stream; other systems do not enforce this.

   Assumptions and requirements
       By default, data can be provided in arbitrary order and the program
       consumes O(number of unique tags) memory, and O(size of data) disk
       space.

       With the "-S" option, data must arrive grouped by tag (not necessarily
       sorted), and the program consumes O(number of tags) memory and no disk
       space.  The program will check and abort if this precondition is not
       met.

       With two "-S" options, the program consumes O(1) memory, but doesn't
       verify that the data-arrival precondition is met.

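       One way to see why checking the grouping precondition costs memory
       proportional to the number of tags is the sketch below.  It is a
       Python illustration under assumptions, not the Fsdb code: iter_groups
       and its check flag are hypothetical names, and for simplicity the
       sketch buffers the rows of the current group in memory.

```python
def iter_groups(rows, key_index=0, check=True):
    """Yield (key, rows) for already-grouped input, in the spirit of "-S".

    With check=True every key seen so far is remembered, so memory grows
    with the number of distinct keys, and a key that returns after its
    group has closed raises an error.  With check=False (in the spirit of
    "-S -S") no keys are remembered and nothing is verified.
    """
    seen = set()
    current_key, current_rows, started = None, [], False
    for row in rows:
        key = row[key_index]
        if started and key == current_key:
            current_rows.append(row)
            continue
        if started:
            yield current_key, current_rows  # previous group is complete
        if check:
            if key in seen:
                raise ValueError(f"key {key!r} is not grouped")
            seen.add(key)
        current_key, current_rows, started = key, [row], True
    if started:
        yield current_key, current_rows


grouped = [("b", 1), ("b", 2), ("a", 3)]  # grouped, but not sorted
print(list(iter_groups(grouped)))
```

       Input such as keys a, b, a raises an error in checked mode, while the
       unchecked mode silently treats the second "a" as a new group.
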
       The field separators of the input and the output can now be different.
       (Early versions of this tool prohibited such variation.)  With
       "--copy-fs" we copy the input field separator to the output, but only
       for non-multi-key-aware reducers.  (This used to be done
       automatically.)  Alternatively, one can specify the output field
       separator with "--fieldseparator", in which case the reducer must
       generate output in that format.  An explicit "--fieldseparator" takes
       priority over "--copy-fs".

   Known bugs
       As of 2013-09-21, we don't verify key order with options "-M -S".

OPTIONS

       -k or --key KeyField
           Specify which column is the key for grouping (default: the first
           column).

           Note that dbmapreduce can only operate on one column as the key.
           To group on the combination of multiple columns, one must merge
           them, perhaps with dbcolmerge.

       -S or --pre-sorted
           Assume data is already grouped by tag.  Provided twice, it removes
           the validation of this assertion.

       -M or --multiple-ok
           Assume the ReduceCommand can handle multiple grouped keys, and the
           ReduceCommand is responsible for outputting the key with each
           output row.  (By default, a separate ReduceCommand is run for each
           key, and dbmapreduce adds the key to each output row.)

       -K or --pass-current-key
           Pass the current key as an argument to the external, non-map-aware
           ReduceCommand.  This is optional because some external commands do
           not expect an extra argument.  (Internal, non-map-aware Perl
           reducers are always given the current key as an argument.)

       --prepend-key
           Add the current key into the reducer output for non-multi-key-aware
           reducers only.  Not done by default.

       --copy-fs or --copy-fieldseparator
           Change the field separator of a non-multi-key-aware reducer's
           output to match the input's field separator.  Not done by default.

       --parallelism=N or -j N
           Allow up to N reducers to run in parallel.  Default is the number
           of CPUs in the machine.

       -F or --fs or --fieldseparator S
           Specify the field (column) separator as "S".  See dbfilealter for
           valid field separators.

       -C FILTER-CODE or --filter-code=FILTER-CODE
           Provide FILTER-CODE, Perl code that generates and returns a
           Fsdb::Filter object that implements the reduce function.  The
           provided code should be an anonymous sub that creates a Fsdb Filter
           that implements the reduce object.

           The reduce object will then be called with --input and --output
           parameters that hook it into the reduce with queues.

           One sample fragment that works is just:

               dbcolstats(qw(--nolog duration))

           So this command:

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -C 'dbcolstats(qw(--nolog duration))'

           is the same as the example

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -- dbcolstats duration

           except that with "-C" there is no forking and so things run faster.

           If "dbmapreduce" is invoked from within Perl, then one can use a
           code SUB as well:

               dbmapreduce(-k => 'experiment',
                           -C => sub { dbcolstats(qw(--nolog duration)) });

           The reduce object must consume all input as a Fsdb stream, and
           close the output Fsdb stream.  (If this assumption is not met the
           map/reduce will be aborted.)

           For non-map-reduce-aware filters, when the filter-generator code
           runs, $_[0] will be the current key.

       -f CODE-FILE or --code-file=CODE-FILE
           Includes CODE-FILE in the program.  This option is useful for more
           complicated Perl reducer functions.

           Thus, if reducer.pl has the code:

               sub make_reducer {
                   my($current_key) = @_;
                   dbcolstats(qw(--nolog duration));
               }

           Then the command

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -f reducer.pl -C make_reducer

           does the same thing as the example.

       -w or --warnings
           Enable warnings in user-supplied code.  Warnings are issued if an
           external reducer fails to consume all input.  (Warnings are
           enabled by default.)

       -T TmpDir
           Where to put temporary files.  If -T is not specified, the
           environment variable TMPDIR is used.  Default is /tmp.

       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
           #fsdb experiment duration
           ufs_mab_sys 37.2
           ufs_mab_sys 37.3
           ufs_rcp_real 264.5
           ufs_rcp_real 277.9

   Command:
           cat DATA/stats.fsdb | \
               dbmapreduce --prepend-key -k experiment -- dbcolstats duration

   Output:
           #fsdb      experiment      mean    stddev  pct_rsd conf_range      conf_low       conf_high        conf_pct        sum     sum_squared     min     max     n
           ufs_mab_sys     37.25 0.070711 0.18983 0.6353 36.615 37.885 0.95 74.5 2775.1 37.2 37.3 2
           ufs_rcp_real    271.2 9.4752 3.4938 85.13 186.07 356.33 0.95 542.4 1.4719e+05 264.5 277.9 2
           #  | dbmapreduce -k experiment dbstats duration
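       As a cross-check of the sample above, the mean, stddev and n columns
       can be recomputed per key with a few lines of Python.  This is an
       independent sketch, not part of Fsdb: the summarize function is a
       hypothetical name, it assumes the stddev column is the sample standard
       deviation, and it covers only three of the dbcolstats columns.

```python
from itertools import groupby
from statistics import mean, stdev

# The sample input rows, already grouped by the "experiment" key.
rows = [("ufs_mab_sys", 37.2), ("ufs_mab_sys", 37.3),
        ("ufs_rcp_real", 264.5), ("ufs_rcp_real", 277.9)]


def summarize(rows):
    """Return {key: {"mean", "stddev", "n"}} for rows grouped by key."""
    result = {}
    for key, group in groupby(rows, key=lambda r: r[0]):
        values = [value for _, value in group]
        result[key] = {
            "mean": mean(values),
            "stddev": stdev(values),  # sample standard deviation (n - 1)
            "n": len(values),
        }
    return result


summary = summarize(rows)
for key, s in summary.items():
    print(key, f"{s['mean']:.5g}", f"{s['stddev']:.5g}", s["n"])
```
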

CLASS FUNCTIONS

       OLD TEXT: A few notes about the internal structure: dbmapreduce uses
       two to four threads (actually Freds) to run.  An optional thread
       "$self->{_in_fred}" sorts the input.  The main process reads input and
       groups input by key.  Each group is passed to a secondary fred
       "$self->{_reducer_thread}" that invokes the reducer on each group and
       does any output.  If the reducer is not map-aware, then we create a
       final postprocessor thread that adds the key back to the output.
       Either the reducer or the postprocessor thread does output.

       NEW VERSION with Freds:

       A few notes about parallelism, since we have fairly different structure
       depending on what we're doing:

       1. for multi-key-aware reducers, there is no output post-processing.

       1a. if input is sorted and there is no input checking (-S -S), we run
       the reducer in our own process.
       (TEST/dbmapreduce_multiple_aware_sub.cmd)

       1b. with grouped input and input checking (-S), we fork off an input
       process that checks grouping, then run the reducer in our process.
       (TEST/dbmapreduce_multiple_aware_sub_checked.cmd) xxx: case 1b not yet
       done

       1c. with ungrouped input, we invoke an input process to do sorting,
       then run the reducer in our process.
       (TEST/dbmapreduce_multiple_aware_sub_ungrouped.cmd)

       2. for non-multi-key-aware reducers, a sorter thread groups content,
       if necessary.  We break the input into groups and feed them to a
       reducer Fred, one per group.  A dedicated additional Fred merges
       output and adds the missing key, if necessary.  Either way, output
       ends up in a file.  A final postprocessor thread merges all the output
       files.

   new
           $filter = new Fsdb::Filter::dbmapreduce(@arguments);

       Create a new dbmapreduce object, taking command-line arguments.

   set_defaults
           $filter->set_defaults();

       Internal: set up defaults.

   parse_options
           $filter->parse_options(@ARGV);

       Internal: parse command-line arguments.

   setup
           $filter->setup();

       Internal: setup, parse headers.

   _setup_reducer
           _setup_reducer

       (internal) One Fred runs the reducer and produces output.
       "_reducer_queue" is sent the new key, then a Fsdb stream, then EOF
       (undef) for each group.  We set up the output, suppress all but the
       first header, and add in the keys if necessary.

   _key_to_string
           $self->_key_to_string($key)

       Convert a key (maybe undef) to a string for status messages.

   _open_new_key
           _open_new_key

       (internal)

       Note that new_key can be undef if there was no input.

   _close_old_key
           _close_old_key

       Internal, finish a key.

   _check_finished_reducers
           $self->_check_finished_reducers($force);

       Internal: see if any reducer freds finished, optionally $FORCE-ing all
       to finish.

       This routine also enforces a maximum amount of parallelism, blocking us
       when we have too many reducers running.

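       The pattern described here (reap finished reducers, block while too
       many are running, and finally force all to finish) can be sketched in
       Python with a thread pool.  This is an illustration under assumptions,
       not a translation of the Perl code: run_bounded and max_parallel are
       hypothetical names, and threads stand in for the forked Freds.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(groups, reducer, max_parallel=2):
    """Run reducer(key, rows) per group with at most max_parallel in flight."""
    results = {}
    running = {}  # maps each in-flight future to its key
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for key, rows in groups:
            # Block while at the limit, collecting whatever has finished.
            while len(running) >= max_parallel:
                done, _ = wait(list(running), return_when=FIRST_COMPLETED)
                for future in done:
                    results[running.pop(future)] = future.result()
            running[pool.submit(reducer, key, rows)] = key
        # Counterpart of forcing all remaining reducers to finish.
        for future, key in running.items():
            results[key] = future.result()
    return results


totals = run_bounded([("a", [1, 2]), ("b", [3]), ("c", [4, 5])],
                     lambda key, rows: sum(rows))
print(totals)
```
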
   _mapper_run
           $filter->_mapper_run();

       Internal: run over each row, grouping them.  Fork off reducers as
       necessary.

   run
           $filter->run();

       Internal: run over each row.

   finish
           $filter->finish();

       Internal: write trailer.

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2018 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU General Public
       License, version 2.  See the file COPYING with the distribution for
       details.

perl v5.36.0                      2022-11-22      Fsdb::Filter::dbmapreduce(3)