DBMAPREDUCE(1)        User Contributed Perl Documentation       DBMAPREDUCE(1)

NAME

       dbmapreduce - reduce all input rows with the same key

SYNOPSIS

           dbmapreduce [-dMS] [-k KeyField] [-f CodeFile] [-C FilterCode] [--] [ReduceCommand [ReduceArguments...]]

DESCRIPTION

       Group input data by KeyField, then apply a function (the "reducer") to
       each group.  The reduce function can be an external program given by
       ReduceCommand and ReduceArguments, or a Perl subroutine given in
       CodeFile or FilterCode.

       If a "--" appears before the reduce command, arguments after the "--"
       are passed to the command.

   Grouping (The Mapper)
       By default the KeyField is the first field in the row.  Unlike Hadoop
       streaming, the "-k KeyField" option can name any column of the input
       row as the key.

       By default, we sort the data to make sure it is grouped by key.  If
       the input is already grouped, the "-S" option avoids this cost.

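       The default behavior described above (sort by key, then hand each
       group to a reducer) can be approximated with standard tools.  The
       following is only a rough sketch and not how dbmapreduce is
       implemented.  It assumes headerless, whitespace-separated rows with
       the key in the first column, and the function name group_and_count is
       made up for illustration.  The "reducer" here simply counts the rows
       in each group.

```shell
# Sketch only: emulate "group by key, then reduce" with sort and awk.
# Assumes headerless, whitespace-separated rows with the key in column 1;
# group_and_count is a hypothetical name, not part of Fsdb.
group_and_count() {
    # Sorting brings rows with the same key together (the grouping step).
    sort -k1,1 |
    awk '
        # A change of key marks the end of the previous group.
        NR > 1 && ($1 "") != key { print key, n; n = 0 }
        { key = $1 ""; n++ }
        END { if (NR > 0) print key, n }
    '
}

printf '%s\n' 'b 2' 'a 1' 'b 3' 'a 4' 'c 5' | group_and_count
```

       Each output row carries one key and the number of input rows that
       shared it; with no input rows the sketch prints nothing.
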
   The Reducer
       Reduce functions default to being shell commands.  However, with "-C",
       one can use arbitrary Perl code (see the "-C" option below for
       details).  The "-f" option is useful to specify complex Perl code
       somewhere other than the command line.

       Finally, as a special case, if there are no rows of input, the reducer
       will be invoked once with the empty value (if it's an external reducer)
       or with undef (if it's a subroutine).  It is expected to generate the
       output header, and it may generate no data rows itself, or a null data
       row of its choosing.

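       To make the empty-input case concrete, here is a minimal sketch of an
       external reducer that always emits a header.  The function name
       count_rows and the output column name "n" are hypothetical, and the
       header line is modeled only on the "#fsdb experiment duration" sample
       in this page; real Fsdb headers may carry more information.  With no
       data rows it still prints the header, followed by a null row of its
       choosing (a count of 0).

```shell
# Sketch only: a hypothetical external reducer that counts data rows.
# It always prints an output header, even when it receives no rows.
count_rows() {
    awk '
        /^#/ { next }          # skip the header and comment lines
        { n++ }                # count data rows
        END {
            print "#fsdb n"    # header modeled on the sample in this page
            print n + 0        # prints 0 when there were no data rows
        }
    '
}

printf '%s\n' '#fsdb experiment duration' 'ufs_mab_sys 37.2' 'ufs_mab_sys 37.3' | count_rows
printf '' | count_rows
```
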
   Output
       For non-multi-key-aware reducers, we can add the KeyField used for
       each reduce to the output stream (see "--prepend-key").  (If the
       reducer passes the key we trust that it gives a correct value.)  We
       can also ensure that the output field separator is the same as the
       input field separator (see "--copy-fs").

       Adding the key and adjusting the output field separator is not possible
       for multi-key-aware reducers.

   Comparison to Related Work
       This program thus implements Google-style map/reduce, but executed
       sequentially.

       For input, these systems include a map function and apply it to input
       data to generate the key.  We assume this key generation (the map
       function) has occurred ahead of time.

       We also allow the grouping key to be in any column.  Hadoop Streaming
       requires it to be in the first column.

       By default, the reducer gets exactly (and only) one key.  This
       invariant is stronger than that of Google and Hadoop.  They both pass
       multiple keys to the reducer, ensuring that each key is grouped
       together.  With the "-M" option, we also pass multiple groups to the
       reducer.

       Unlike those systems, with the "-S" option we do not require that the
       groups arrive in any particular order, just that they be grouped
       together.  (Those systems guarantee that groups arrive in lexically
       sorted order.)  However, without "-S" we create lexical ordering.

       With "--prepend-key" we ensure that the KeyField is in the output
       stream; other systems do not enforce this.

   Assumptions and requirements
       By default, data can be provided in arbitrary order and the program
       consumes O(number of unique tags) memory, and O(size of data) disk
       space.

       With the "-S" option, data must arrive grouped by tag (not necessarily
       sorted), and the program consumes O(number of tags) memory and no disk
       space.  The program will check and abort if this precondition is not
       met.

       With two "-S" options, the program consumes O(1) memory, but doesn't
       verify that the data-arrival precondition is met.

       The field separators of the input and the output can now be different
       (early versions of this tool prohibited such variation).  With
       "--copy-fs" we copy the input field separator to the output, but only
       for non-multi-key-aware reducers.  (This used to be done
       automatically.)  Alternatively, one can specify the output field
       separator with "--fieldseparator", in which case the reducer had
       better generate that format.  An explicit "--fieldseparator" takes
       priority over "--copy-fs".

   Known bugs
       As of 2013-09-21, we don't verify key order with options "-M -S".

OPTIONS

       -k or --key KeyField
           Specify which column is the key for grouping (default: the first
           column).

       -S or --pre-sorted
           Assume data is already grouped by tag.  Provided twice, it removes
           the validation of this assertion.

       -M or --multiple-ok
           Assume the ReduceCommand can handle multiple grouped keys, and the
           ReduceCommand is responsible for outputting the key with each
           output row.  (By default, a separate ReduceCommand is run for each
           key, and dbmapreduce adds the key to each output row.)

       -K or --pass-current-key
           Pass the current key as an argument to the external, non-map-aware
           ReduceCommand.  This is only done optionally since some external
           commands do not expect an extra argument.  (Internal, non-map-aware
           Perl reducers are always given the current key as an argument.)

       --prepend-key
           Add the current key into the reducer output for non-multi-key-aware
           reducers only.  Not done by default.

       --copy-fs or --copy-fieldseparator
           Change the field separator of a non-multi-key-aware reducer's
           output to match the input's field separator.  Not done by default.

       --parallelism=N or -j N
           Allow up to N reducers to run in parallel.  Default is the number
           of CPUs in the machine.

       -F or --fs or --fieldseparator S
           Specify the field (column) separator as "S".  See dbfilealter for
           valid field separators.

       -C FILTER-CODE or --filter-code=FILTER-CODE
           Provide FILTER-CODE, Perl code that generates and returns a
           Fsdb::Filter object that implements the reduce function.  The
           provided code should be an anonymous sub that creates a Fsdb Filter
           that implements the reduce object.

           The reduce object will then be called with --input and --output
           parameters that hook it into the reduce with queues.

           One sample fragment that works is just:

               dbcolstats(qw(--nolog duration))

           So this command:

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -C 'dbcolstats(qw(--nolog duration))'

           is the same as the example

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -- dbcolstats duration

           except that with "-C" there is no forking and so things run faster.

           If "dbmapreduce" is invoked from within Perl, then one can use a
           code SUB as well:

               dbmapreduce(-k => 'experiment',
                           -C => sub { dbcolstats(qw(--nolog duration)) });

           The reduce object must consume all input as a Fsdb stream, and
           close the output Fsdb stream.  (If this assumption is not met the
           map/reduce will be aborted.)

           For non-map-reduce-aware filters, when the filter-generator code
           runs, $_[0] will be the current key.

       -f CODE-FILE or --code-file=CODE-FILE
           Includes CODE-FILE in the program.  This option is useful for more
           complicated Perl reducer functions.

           Thus, if reducer.pl has the code

               sub make_reducer {
                   my($current_key) = @_;
                   dbcolstats(qw(--nolog duration));
               }

           then the command

               cat DATA/stats.fsdb | \
                   dbmapreduce -k experiment -f reducer.pl -C make_reducer

           does the same thing as the "-C" example above.

       -w or --warnings
           Enable warnings in user-supplied code.  Warnings are issued if an
           external reducer fails to consume all input.  (Warnings are
           enabled by default.)

       -T TmpDir
           Where to put temporary files.  If -T is not specified, the
           environment variable TMPDIR is used.  Default is /tmp.

       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
           #fsdb experiment duration
           ufs_mab_sys 37.2
           ufs_mab_sys 37.3
           ufs_rcp_real 264.5
           ufs_rcp_real 277.9

   Command:
           cat DATA/stats.fsdb | \
               dbmapreduce --prepend-key -k experiment -- dbcolstats duration

   Output:
           #fsdb      experiment      mean    stddev  pct_rsd conf_range      conf_low       conf_high        conf_pct        sum     sum_squared     min     max     n
           ufs_mab_sys     37.25 0.070711 0.18983 0.6353 36.615 37.885 0.95 74.5 2775.1 37.2 37.3 2
           ufs_rcp_real    271.2 9.4752 3.4938 85.13 186.07 356.33 0.95 542.4 1.4719e+05 264.5 277.9 2
           #  | dbmapreduce -k experiment dbstats duration
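
       As a cross-check of the mean, stddev, and n columns in the output
       above, the per-key arithmetic can be redone with awk.  This is a
       sketch, not dbcolstats: per_key_stats is a hypothetical helper that
       assumes the input is already grouped by key, skips lines starting
       with "#", and computes only three of the columns, using the sample
       standard deviation.

```shell
# Sketch only: recompute mean, sample standard deviation, and n for each
# already-grouped key.  per_key_stats is a hypothetical helper; it is not
# dbcolstats and it ignores Fsdb headers and comments.
per_key_stats() {
    awk '
        function emit(    mean, v) {
            mean = sum / n
            # Sample variance (n - 1 in the denominator), clamped at zero.
            v = (n > 1) ? (sumsq - sum * sum / n) / (n - 1) : 0
            if (v < 0) v = 0
            printf "%s %.5g %.5g %d\n", key, mean, sqrt(v), n
        }
        /^#/ { next }
        n > 0 && ($1 "") != key { emit(); sum = sumsq = n = 0 }
        { key = $1 ""; sum += $2; sumsq += $2 * $2; n++ }
        END { if (n > 0) emit() }
    '
}

per_key_stats <<'EOF'
#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9
EOF
```

       For the sample input this prints 37.25 and 0.070711 for ufs_mab_sys
       and 271.2 and 9.4752 for ufs_rcp_real, each with n = 2, which agrees
       with the corresponding columns shown above.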

SEE ALSO

       Fsdb, dbmultistats, dbrowsplituniq

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2018 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU General Public
       License, version 2.  See the file COPYING with the distribution for
       details.

perl v5.30.1                      2020-01-30                    DBMAPREDUCE(1)