Fsdb::Filter::dbmerge(3pm)

Fsdb::Filter::dbmerge(3)  User Contributed Perl Documentation  Fsdb::Filter::dbmerge(3)

NAME

       dbmerge - merge all inputs in sorted order based on the specified
       columns

SYNOPSIS

           dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

DESCRIPTION

       Merge all provided, pre-sorted input files, producing one sorted
       result.  Inputs can both be specified with "--input", or one can come
       from standard input and the other from "--input".  With "--xargs", each
       line of standard input is a filename for input.

       Inputs must have identical schemas (columns, column order, and field
       separators).

       Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

       Because this program is intended to merge multiple sources, it does not
       default to reading from standard input.  If you wish to read from
       standard input, list "-" as an explicit input source.

       Also, because we deal with multiple input files, this module doesn't
       output anything until it's run.

       dbmerge consumes a fixed amount of memory regardless of input size.  It
       therefore buffers output on disk as necessary.  (Merging is implemented
       as a series of two-way merges, so disk space is O(number of records).)

       dbmerge will merge data in parallel, if possible.  The "--parallelism"
       option can control the degree of parallelism, if desired.

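       The idea of combining any number of pre-sorted inputs through
       repeated two-way merges can be sketched in a few lines of Python.
       This is an illustration only, not dbmerge's implementation: the
       names merge2 and merge_all are hypothetical, rows are plain values
       rather than Fsdb records, and intermediate results are chained
       lazily in memory instead of being buffered on disk.

```python
from functools import reduce


def merge2(left, right, key):
    """Lazily merge two iterables that are each already sorted by key."""
    left, right = iter(left), iter(right)
    sentinel = object()  # marks an exhausted input
    a, b = next(left, sentinel), next(right, sentinel)
    while a is not sentinel and b is not sentinel:
        # On ties the left row goes first, which keeps the merge stable.
        if key(b) < key(a):
            yield b
            b = next(right, sentinel)
        else:
            yield a
            a = next(left, sentinel)
    # One side is exhausted; drain whatever remains of the other.
    while a is not sentinel:
        yield a
        a = next(left, sentinel)
    while b is not sentinel:
        yield b
        b = next(right, sentinel)


def merge_all(inputs, key):
    """Reduce any number of sorted inputs with repeated two-way merges."""
    return reduce(lambda acc, nxt: merge2(acc, nxt, key), inputs, iter(()))


runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
merged = list(merge_all(runs, key=lambda row: row))
```

       Each merge2 call holds only one pending row per input, which is
       the property that lets a merge run in fixed memory.  Note that
       this sketch folds the inputs left to right rather than arranging
       them as a balanced tree.
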

OPTIONS

       General options:

       --xargs
           Expect that input filenames are given, one per line, on standard
           input.  (In this case, merging can start incrementally.)

       --removeinputs
           Delete the source files after they have been consumed.  (Defaults
           to off, leaving the inputs in place.)

       -T TmpDir
           Where to put temporary files.  Uses the environment variable
           TMPDIR if -T is not specified.  Default is /tmp.

       --parallelism N or -j N
           Allow up to N merges to happen in parallel.  Default is the number
           of CPUs in the machine.

       --endgame (or --noendgame)
           Enable endgame mode, extra parallelism when finishing up.  (On by
           default.)

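       The fallback order for -T and the default for --parallelism can be
       written out as a small Python sketch.  The function names
       resolve_tmpdir and resolve_parallelism are hypothetical and only
       restate the rules given in the option descriptions; they are not
       part of Fsdb.

```python
import os


def resolve_tmpdir(cli_value=None, environ=None):
    """Pick the temporary directory: -T first, then TMPDIR, then /tmp."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    return environ.get("TMPDIR") or "/tmp"


def resolve_parallelism(cli_value=None):
    """Pick the merge parallelism: --parallelism N, else the CPU count."""
    if cli_value is not None:
        return int(cli_value)
    # os.cpu_count() may return None on unusual platforms; fall back to 1
    # (that fallback is an assumption of this sketch).
    return os.cpu_count() or 1
```

       For example, resolve_tmpdir(None, {"TMPDIR": "/var/tmp"}) yields
       "/var/tmp", while resolve_tmpdir(None, {}) yields "/tmp".
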
       Sort specification options (can be interspersed with column names):

       -r or --descending
           Sort in reverse order (high to low).

       -R or --ascending
           Sort in normal order (low to high).

       -n or --numeric
           Sort numerically.

       -N or --lexical
           Sort lexicographically.

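       Because sort options may be interspersed with column names, a
       reader of the command line has to track the current mode as it
       goes.  The Python sketch below shows one plausible reading, under
       two assumptions that this page does not state: each flag applies
       to the column names that follow it, and the starting mode is
       lexical, ascending.  The name parse_sort_spec is hypothetical.

```python
# Short flags map to (setting, value); long flags map to a short letter.
FLAGS = {
    "n": ("numeric", True),
    "N": ("numeric", False),
    "r": ("descending", True),
    "R": ("descending", False),
}
LONG = {"--numeric": "n", "--lexical": "N",
        "--descending": "r", "--ascending": "R"}


def parse_sort_spec(args):
    """Return [(column, numeric, descending), ...] for a sort spec."""
    # Assumed starting mode: lexical, ascending.
    state = {"numeric": False, "descending": False}
    keys = []
    for arg in args:
        if arg in LONG:
            letters = LONG[arg]
        elif (arg.startswith("-") and len(arg) > 1
              and all(c in FLAGS for c in arg[1:])):
            letters = arg[1:]  # bundled short flags such as -nr
        else:
            # Anything else is a column name; record the current mode.
            keys.append((arg, state["numeric"], state["descending"]))
            continue
        for letter in letters:
            name, value = FLAGS[letter]
            state[name] = value
    return keys


spec = parse_sort_spec(["-n", "cid", "-Nr", "cname"])
```

       Here spec describes "cid" as numeric ascending and "cname" as
       lexical descending.
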
       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
       File a.fsdb:

           #fsdb cid cname
           11 numanal
           10 pascal

       File b.fsdb:

           #fsdb cid cname
           12 os
           13 statistics

       These two files are both sorted by "cname", and they have identical
       schemas.

   Command:
           dbmerge --input a.fsdb --input b.fsdb cname

       or

           cat a.fsdb | dbmerge --input - --input b.fsdb cname

   Output:
           #fsdb cid cname
           11 numanal
           12 os
           10 pascal
           13 statistics
           #  | dbmerge --input a.fsdb --input b.fsdb cname

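       The sample can be reproduced without Fsdb to check the expected
       row order.  The Python sketch below uses the standard library's
       heapq.merge in place of dbmerge, and a deliberately simplified
       parser (the hypothetical parse_fsdb) that assumes a bare
       "#fsdb" header and whitespace-separated fields.

```python
import heapq


def parse_fsdb(text):
    """Simplified reader: header is '#fsdb col col ...', fields split on
    whitespace, trailing '#' comment lines are skipped."""
    lines = text.strip().splitlines()
    columns = lines[0].split()[1:]  # drop the leading "#fsdb"
    rows = [dict(zip(columns, line.split()))
            for line in lines[1:] if not line.startswith("#")]
    return columns, rows


A = """#fsdb cid cname
11 numanal
10 pascal
"""
B = """#fsdb cid cname
12 os
13 statistics
"""

cols_a, rows_a = parse_fsdb(A)
cols_b, rows_b = parse_fsdb(B)
# Inputs must have identical schemas before they can be merged.
schemas_match = cols_a == cols_b

# Both inputs are already sorted by cname, so a merge is sufficient.
merged = heapq.merge(rows_a, rows_b, key=lambda row: row["cname"])
output = [" ".join(row[c] for c in cols_a) for row in merged]
```

       The resulting output list holds the four data rows in the same
       order as the Output block above.
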

CLASS FUNCTIONS

   new
           $filter = new Fsdb::Filter::dbmerge(@arguments);

       Create a new object, taking command-line arguments.

   set_defaults
           $filter->set_defaults();

       Internal: set up defaults.

   parse_options
           $filter->parse_options(@ARGV);

       Internal: parse command-line arguments.

   _pretty_fn
           _pretty_fn($fn)

       Internal: pretty-print a filename or Fsdb::BoundedQueue.

   segment_next_output
           $out = $self->segment_next_output($output_type)

       Internal: return an Fsdb::IO::Writer as $OUT that points either to
       our output or to a temporary file, depending on how things are going.

       The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

   segment_cleanup
           $out = $self->segment_cleanup($file);

       Internal: Clean up a file, if necessary.  (Sigh, used to be function
       pointers, but not clear how they would interact with threads.)

   _unique_id
           $id = $self->_unique_id()

       Generate a sequence number for debugging.

   segments_merge2_run
           $out = $self->segments_merge2_run($out_fn, $is_final_output,
                               $in0, $in1);

       Internal: do the actual merge2 work (maybe our parent put us in a
       thread, maybe not).

   enqueue_work
           $self->enqueue_work($depth, $work);

       Internal: put $WORK on the queue at $DEPTH, updating the max count.

   segments_merge_one_depth
           $self->segments_merge_one_depth($depth);

       Merge queued files, if any.

       Also release any queued threads.

   segments_xargs
           $self->segments_xargs();

       Internal: read new filenames to process (from stdin) and send them to
       the work queue.

       Making a separate Fred to handle xargs is a lot of work, but it
       guarantees it comes in on an IO::Handle that is selectable.

   segments_merge_all
           $self->segments_merge_all()

       Internal: Merge queued files, if any.  Iterates over all depths of the
       merge tree, and handles any forked threads.

       Merging Strategy

       Merging is done in a binary tree that is managed through the "_work"
       queue.  The queue has an array of "depth" entries, one for each level
       of the tree.

       Items are processed in order at each level of the tree, and only
       level-by-level, so the sort is stable.

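       A level-by-level merge tree, and why it keeps ties in input order,
       can be illustrated with a short Python sketch.  It is not the
       dbmerge code: segments are in-memory lists rather than files or
       streams, and merge2 and merge_tree are hypothetical names.  The
       work dictionary stands in for the per-depth queue described
       above.

```python
def merge2(left, right, key):
    """Two-way merge of sorted lists; ties go to the left (earlier) input."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_tree(inputs, key):
    """Merge sorted segments as a binary tree, one depth at a time."""
    work = {0: list(inputs)}  # depth -> segments queued at that depth
    depth = 0
    while len(work[depth]) > 1:
        level = work[depth]
        # Merge neighbours in order so earlier inputs stay on the left.
        nxt = [merge2(level[k], level[k + 1], key)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd segment moves up unmerged
        work[depth + 1] = nxt
        depth += 1
    return work[depth][0] if work[depth] else []


# Rows are (key, source index); three rows share the key "k".
segments = [[("k", 0), ("m", 0)], [("k", 1)], [("k", 2), ("l", 2)]]
result = merge_tree(segments, key=lambda row: row[0])
```

       In result the three "k" rows appear in the order of their source
       segments (0, 1, 2).  This holds because neighbours are merged in
       order at every depth and merge2 prefers the left input on ties.
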
       Parallelism Model

       Parallelism is also managed through the "_work" queue, each element of
       which consists of one file or stream suitable for merging.  The work
       queue contains both ready output (files or BoundedQueue streams) that
       can be immediately handled, and pairs of semaphore/pending output for
       work that is not yet started.  All manipulation of the work queue
       happens in the main thread (with "segments_merge_all" and
       "segments_merge_one_depth").

       We start a thread to handle each item in the work queue, and limit
       parallelism to "_max_parallelism", which defaults to the number of
       available processors.

       There are two kinds of parallelism, regular and endgame.  For regular
       parallelism we pick two items off the work queue, merge them, and put
       the result back on the queue as a new file.  Items in the work queue
       may not be ready.  For in-progress items we wait until they are done.
       For not-yet-started items we start them, then wait until they are done.

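       Regular parallelism can be approximated in Python with a bounded
       thread pool.  This is a simplified sketch, not dbmerge's
       scheduler: it submits every pair at one depth, has the main thread
       wait for those merges, and then moves to the next depth.  The
       names merge_pair and parallel_merge are hypothetical, and
       segments are in-memory lists rather than files.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_pair(left, right, key):
    """Merge two sorted lists into a new sorted list."""
    return list(heapq.merge(left, right, key=key))


def parallel_merge(segments, key, max_parallelism):
    """Merge sorted segments pairwise, at most max_parallelism at a time."""
    level = [list(seg) for seg in segments]
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        while len(level) > 1:
            # Queue one merge job per neighbouring pair at this depth.
            jobs = [pool.submit(merge_pair, level[i], level[i + 1], key)
                    for i in range(0, len(level) - 1, 2)]
            # Only the main thread touches the work list; it waits here
            # for in-progress jobs before building the next depth.
            nxt = [job.result() for job in jobs]
            if len(level) % 2:
                nxt.append(level[-1])  # odd segment moves up unmerged
            level = nxt
    return level[0] if level else []


merged = parallel_merge([[1, 5], [2, 6], [0, 9], [3, 4], [7, 8]],
                        key=lambda row: row, max_parallelism=2)
```

       In CPython the global interpreter lock keeps these threads from
       speeding up CPU-bound merging, so the sketch shows the structure
       of the scheduling rather than a performance gain.
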
       Endgame parallelism handles the final stages of a large merge.  It
       begins when there are enough processors that we can start a merge job
       for every remaining level of the merge tree.  At this point we switch
       from merging to files to merging into "Fsdb::BoundedQueue" pipelines
       that connect merge processes which start and run concurrently.

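       The pipeline idea behind endgame mode can be sketched in Python
       with bounded queues standing in for Fsdb::BoundedQueue.  This is
       an analogy under stated assumptions, not the Fsdb implementation:
       every stage is a thread that starts immediately, stages exchange
       rows through queue.Queue objects of a small fixed size, and the
       last two streams are merged in the calling thread.  All function
       names are hypothetical.

```python
import heapq
import queue
import threading

DONE = object()  # end-of-stream marker placed on every queue


def drain(q):
    """Yield rows from a queue until the end-of-stream marker."""
    while (row := q.get()) is not DONE:
        yield row


def pump(rows, out):
    """Copy rows into a bounded queue, then signal end of stream."""
    for row in rows:
        out.put(row)  # blocks while the queue is full (backpressure)
    out.put(DONE)


def merge_queues(left, right, key):
    """Lazily merge two sorted queue streams."""
    return heapq.merge(drain(left), drain(right), key=key)


def endgame_merge(segments, key, bound=4):
    threads = []

    def start_stage(rows):
        out = queue.Queue(maxsize=bound)
        thread = threading.Thread(target=pump, args=(rows, out))
        thread.start()
        threads.append(thread)
        return out

    # One feeder stage per input, all running concurrently.
    level = [start_stage(iter(seg)) for seg in segments]
    # Start a merge stage for every tree level except the last.
    while len(level) > 2:
        nxt = [start_stage(merge_queues(level[i], level[i + 1], key))
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    # The final merge runs in the calling thread.
    if len(level) == 2:
        result = list(merge_queues(level[0], level[1], key))
    elif len(level) == 1:
        result = list(drain(level[0]))
    else:
        result = []
    for thread in threads:
        thread.join()
    return result


merged = endgame_merge([[1, 5], [2, 6], [0, 9], [3, 4], [7, 8]],
                       key=lambda row: row, bound=1)
```

       No stage writes an intermediate file; each row flows through the
       queues from its input to the final merge, and the small queue
       bound limits how far any producer can run ahead of its consumer.
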
       The final merge is done in the main thread so that the main thread
       can handle the output stream and record the merge action.

   setup
           $filter->setup();

       Internal: setup, parse headers.

   run
           $filter->run();

       Internal: run over each row.

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2018 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU General Public
       License, version 2.  See the file COPYING with the distribution for
       details.

perl v5.28.1                      2018-02-17          Fsdb::Filter::dbmerge(3)