Fsdb::Filter::dbmerge(3pm)

Fsdb::Filter::dbmerge(3)   User Contributed Perl Documentation   Fsdb::Filter::dbmerge(3)

NAME

       dbmerge - merge all inputs in sorted order based on the specified
       columns

SYNOPSIS

           dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

       or
           { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

DESCRIPTION

       Merge all provided, pre-sorted input files, producing one sorted
       result.  All inputs can be specified with "--input", or one can come
       from standard input (given as "--input -") and the others from
       "--input".  With "--xargs", each line of standard input is a filename
       for input.

       Inputs must have identical schemas (columns, column order, and field
       separators).
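
       The schema requirement can be illustrated with a minimal Python
       sketch.  It is not part of Fsdb: the function names are hypothetical,
       and it assumes that each input starts with a "#fsdb" header line and
       that comparing those header lines verbatim is a reasonable stand-in
       for comparing schemas.

```python
import io

def read_header(stream):
    """Return the first line of a stream, assumed to be its Fsdb header."""
    header = stream.readline().rstrip("\n")
    if not header.startswith("#fsdb"):
        raise ValueError(f"not an Fsdb header: {header!r}")
    return header

def check_same_schema(streams):
    """Raise ValueError unless every stream starts with the same header line."""
    headers = [read_header(s) for s in streams]
    for h in headers[1:]:
        if h != headers[0]:
            raise ValueError(f"schema mismatch: {headers[0]!r} vs {h!r}")
    return headers[0]

# In-memory stand-ins for a.fsdb and b.fsdb.
a = io.StringIO("#fsdb cid cname\n11 numanal\n10 pascal\n")
b = io.StringIO("#fsdb cid cname\n12 os\n13 statistics\n")
print(check_same_schema([a, b]))  # prints "#fsdb cid cname"
```

       A mismatch in column names or column order changes the header line,
       so the sketch rejects it before any merging would start.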

       Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

       Because this program is intended to merge multiple sources, it does not
       default to reading from standard input.  If you wish to read from
       standard input, list - as an explicit input source.

       Also, because we deal with multiple input files, this module doesn't
       output anything until it's run.

       dbmerge consumes a fixed amount of memory regardless of input size.  It
       therefore buffers output on disk as necessary.  (Merging is implemented
       as a series of two-way merges, so disk space is O(number of records).)
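
       The idea of combining many inputs using only two-way merges can be
       sketched in Python as follows.  This is an illustration under stated
       assumptions, not dbmerge's implementation: the names "merge2" and
       "merge_all" are hypothetical, the merges are chained rather than
       arranged in a tree, and intermediate results stay in lazy generators
       rather than in temporary files on disk.

```python
def merge2(left, right, key=lambda rec: rec):
    """Lazily merge two sorted iterables, holding one record from each."""
    left, right = iter(left), iter(right)
    end = object()  # marks an exhausted input
    a, b = next(left, end), next(right, end)
    while a is not end and b is not end:
        if key(b) < key(a):
            yield b
            b = next(right, end)
        else:
            # Ties go to the left input.
            yield a
            a = next(left, end)
    while a is not end:
        yield a
        a = next(left, end)
    while b is not end:
        yield b
        b = next(right, end)

def merge_all(inputs, key=lambda rec: rec):
    """Combine any number of sorted inputs using only two-way merges."""
    streams = list(inputs)
    if not streams:
        return iter(())
    merged = streams[0]
    for nxt in streams[1:]:
        merged = merge2(merged, nxt, key)
    return merged

print(list(merge_all([[1, 4, 7], [2, 5], [3, 6, 8]])))
# prints [1, 2, 3, 4, 5, 6, 7, 8]
```

       Each two-way merge keeps only one pending record per input, which is
       why memory use in this sketch does not grow with the number of
       records.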

       dbmerge will merge data in parallel, if possible.  The "--parallelism"
       option can control the degree of parallelism, if desired.

OPTIONS

       General options:

       --xargs
           Expect that input filenames are given, one-per-line, on standard
           input.  (In this case, merging can start incrementally.)

       --removeinputs
           Delete the source files after they have been consumed.  (Off by
           default, leaving the inputs in place.)

       -T TmpDir
           Where to put temporary files.  Uses the environment variable
           TMPDIR if -T is not specified.  Default is /tmp.

       --parallelism N or -j N
           Allow up to N merges to happen in parallel.  Default is the number
           of CPUs in the machine.

       --endgame (or --noendgame)
           Enable endgame mode, extra parallelism when finishing up.  (On by
           default.)

       Sort specification options (can be interspersed with column names):

       -r or --descending
           Sort in reverse order (high to low).

       -R or --ascending
           Sort in normal order (low to high).

       -n or --numeric
           Sort numerically.

       -N or --lexical
           Sort lexicographically.
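
       One way to read "interspersed with column names" is sketched below
       in Python.  It rests on assumptions this page does not state: that
       each sort option applies to the columns that follow it until it is
       overridden, and that the defaults are lexical and ascending.  The
       function name is hypothetical, and bundled short flags such as "-nr"
       are not handled.

```python
def parse_sort_spec(args):
    """Turn ['-n', 'cid', '-r', 'cname'] into per-column sort settings."""
    numeric, descending = False, False  # assumed defaults: lexical, ascending
    spec = []
    for arg in args:
        if arg in ("-n", "--numeric"):
            numeric = True
        elif arg in ("-N", "--lexical"):
            numeric = False
        elif arg in ("-r", "--descending"):
            descending = True
        elif arg in ("-R", "--ascending"):
            descending = False
        else:
            # Anything else is taken to be a column name.
            spec.append({"column": arg, "numeric": numeric,
                         "descending": descending})
    return spec

for entry in parse_sort_spec(["-n", "cid", "-N", "-r", "cname"]):
    print(entry)
```

       Under these assumptions, "cid" would be compared numerically in
       ascending order and "cname" lexically in descending order.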

       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
       File a.fsdb:

           #fsdb cid cname
           11 numanal
           10 pascal

       File b.fsdb:

           #fsdb cid cname
           12 os
           13 statistics

       These two files are both sorted by "cname", and they have identical
       schemas.

   Command:
           dbmerge --input a.fsdb --input b.fsdb cname

       or

           cat a.fsdb | dbmerge --input - --input b.fsdb cname

   Output:
           #fsdb cid cname
           11 numanal
           12 os
           10 pascal
           13 statistics
           #  | dbmerge --input a.fsdb --input b.fsdb cname
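
       The row order in this output can be reproduced with a short Python
       sketch that merges the two samples on their second column.  It uses
       the standard library's heapq.merge rather than dbmerge, and the rows
       are typed in by hand instead of being parsed from Fsdb files; the
       function name is hypothetical.

```python
import heapq

# Rows of a.fsdb and b.fsdb as (cid, cname), each already sorted by cname.
a = [("11", "numanal"), ("10", "pascal")]
b = [("12", "os"), ("13", "statistics")]

def merge_by_cname(*inputs):
    """Merge pre-sorted row lists on the cname field (index 1)."""
    return list(heapq.merge(*inputs, key=lambda row: row[1]))

for cid, cname in merge_by_cname(a, b):
    print(cid, cname)
# prints:
# 11 numanal
# 12 os
# 10 pascal
# 13 statistics
```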

CLASS FUNCTIONS

   new
           $filter = new Fsdb::Filter::dbmerge(@arguments);

       Create a new object, taking command-line arguments.

   set_defaults
           $filter->set_defaults();

       Internal: set up defaults.

   parse_options
           $filter->parse_options(@ARGV);

       Internal: parse command-line arguments.

   _pretty_fn
           _pretty_fn($fn)

       Internal: pretty-print a filename or Fsdb::BoundedQueue.

   segment_next_output
           $out = $self->segment_next_output($output_type)

       Internal: return a Fsdb::IO::Writer as $OUT that either points to our
       output or a temporary file, depending on how things are going.

       The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

   segment_cleanup
           $out = $self->segment_cleanup($file);

       Internal: Clean up a file, if necessary.  (Sigh, used to be function
       pointers, but not clear how they would interact with threads.)

   _unique_id
           $id = $self->_unique_id()

       Generate a sequence number for debugging.

   segments_merge2_run
           $out = $self->segments_merge2_run($out_fn, $is_final_output,
                               $in0, $in1);

       Internal: do the actual merge2 work (maybe our parent put us in a
       thread, maybe not).

   enqueue_work
           $self->enqueue_work($depth, $work);

       Internal: put $WORK on the queue at $DEPTH, updating the max count.

   segments_merge_one_depth
           $self->segments_merge_one_depth($depth);

       Merge queued files, if any.

       Also release any queued threads.

   segments_xargs
           $self->segments_xargs();

       Internal: read new filenames to process (from stdin) and send them to
       the work queue.

       Making a separate Fred to handle xargs is a lot of work, but it
       guarantees it comes in on an IO::Handle that is selectable.

   segments_merge_all
           $self->segments_merge_all()

       Internal: Merge queued files, if any.  Iterates over all depths of the
       merge tree, and handles any forked threads.

       Merging Strategy

       Merging is done in a binary tree that is managed through the "_work"
       queue.  The queue has an array of "depth" entries, one for each level
       of the tree.

       Items are processed in order at each level of the tree, and only
       level-by-level, so the sort is stable.
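
       The effect of merging neighbours in order, one level at a time, can
       be seen in the Python sketch below.  It is a simplified in-memory
       model, not the "_work" queue itself; the function names are
       hypothetical and the inputs are plain lists rather than files.

```python
def merge2_stable(left, right, key):
    """Two-way merge of sorted lists; on equal keys the left input goes first."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_tree(inputs, key):
    """Merge adjacent pairs level by level until one list remains."""
    level = [list(x) for x in inputs]
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            nxt.append(merge2_stable(level[k], level[k + 1], key))
        if len(level) % 2:
            nxt.append(level[-1])  # the odd one out moves up unchanged
        level = nxt
    return level[0] if level else []

# Records are (key, source); equal keys should keep input order A, B, C.
inputs = [[(1, "A")], [(1, "B"), (2, "B")], [(1, "C")]]
print(merge_tree(inputs, key=lambda rec: rec[0]))
# prints [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'B')]
```

       Because only neighbouring inputs are paired and ties favour the left
       input, records with equal keys leave in the order their inputs were
       listed.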

       Parallelism Model

       Parallelism is also managed through the "_work" queue, each element of
       which consists of one file or stream suitable for merging.  The work
       queue contains both ready output (files or BoundedQueue streams) that
       can be immediately handled, and pairs of semaphore/pending output for
       work that is not yet started.  All manipulation of the work queue
       happens in the main thread (with "segments_merge_all" and
       "segments_merge_one_depth").

       We start a thread to handle each item in the work queue, and limit
       parallelism to the "_max_parallelism", defaulting to the number of
       available processors.

       There are two kinds of parallelism, regular and endgame.  For regular
       parallelism we pick two items off the work queue, merge them, and put
       the result back on the queue as a new file.  Items in the work queue
       may not be ready.  For in-progress items we wait until they are done.
       For not-yet-started items we start them, then wait until they are done.
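
       Regular parallelism can be approximated in Python with a bounded
       worker pool, as sketched here.  This is a loose analogy only: it uses
       concurrent.futures threads and in-memory lists instead of Fsdb's own
       threads, semaphores and temporary files, the names are hypothetical,
       and Python threads show the structure rather than a real speed-up.

```python
import heapq
import os
from concurrent.futures import ThreadPoolExecutor

def merge_pair(left, right):
    """One unit of work: a two-way merge of two sorted lists."""
    return list(heapq.merge(left, right))

def parallel_merge(inputs, max_parallelism=None):
    """Merge pairs from the work queue with at most max_parallelism workers."""
    workers = max_parallelism or os.cpu_count() or 1
    work = [list(x) for x in inputs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(work) > 1:
            # Take items two at a time and start a merge for each pair.
            futures = [pool.submit(merge_pair, work[k], work[k + 1])
                       for k in range(0, len(work) - 1, 2)]
            leftover = [work[-1]] if len(work) % 2 else []
            # Wait for each merge, then put its result back on the queue.
            work = [f.result() for f in futures] + leftover
    return work[0] if work else []

print(parallel_merge([[1, 5], [2, 6], [3, 7], [4, 8]], max_parallelism=2))
# prints [1, 2, 3, 4, 5, 6, 7, 8]
```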

       Endgame parallelism handles the final stages of a large merge.  It
       begins when there are enough processors that we can start merge jobs
       for all remaining levels of the merge tree.  At that point we switch
       from merging to files to merging into "Fsdb::BoundedQueue" pipelines
       that connect merge processes which start and run concurrently.

       The final merge is done in the main thread so that the main thread
       can handle the output stream and record the merge action.
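
       The endgame idea, with every remaining merge running at once and
       linked by bounded queues, might look like the following Python
       sketch.  It stands in queue.Queue for "Fsdb::BoundedQueue" and
       threading.Thread for the merge processes; all names, the DONE marker
       and the queue bound are hypothetical choices for illustration.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (hypothetical)

def feed(records, out_q):
    """Push one sorted input into its queue, then signal the end."""
    for rec in records:
        out_q.put(rec)
    out_q.put(DONE)

def merge_streams(left_q, right_q, emit):
    """Two-way merge of two queue streams, passing each record to emit()."""
    a, b = left_q.get(), right_q.get()
    while a is not DONE and b is not DONE:
        if b < a:
            emit(b)
            b = right_q.get()
        else:
            emit(a)
            a = left_q.get()
    while a is not DONE:
        emit(a)
        a = left_q.get()
    while b is not DONE:
        emit(b)
        b = right_q.get()

def merge_worker(left_q, right_q, out_q):
    merge_streams(left_q, right_q, out_q.put)
    out_q.put(DONE)

def endgame_merge(inputs, bound=4):
    """Start all intermediate merges at once; do the last one in this thread."""
    threads, level = [], []
    for records in inputs:
        q = queue.Queue(maxsize=bound)
        threads.append(threading.Thread(target=feed, args=(records, q)))
        level.append(q)
    while len(level) > 2:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            q = queue.Queue(maxsize=bound)
            threads.append(threading.Thread(
                target=merge_worker, args=(level[k], level[k + 1], q)))
            nxt.append(q)
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    for t in threads:
        t.start()
    result = []
    if len(level) == 2:
        merge_streams(level[0], level[1], result.append)  # final merge
    elif len(level) == 1:
        rec = level[0].get()
        while rec is not DONE:
            result.append(rec)
            rec = level[0].get()
    for t in threads:
        t.join()
    return result

print(endgame_merge([[1, 5, 9], [2, 6], [3, 7], [4, 8]]))
# prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

       The bounded queues keep any stage from running far ahead of its
       consumer, so no intermediate result has to be stored in full.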

   setup
           $filter->setup();

       Internal: setup, parse headers.

   run
           $filter->run();

       Internal: run over each row.

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2019 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU general public
       license, version 2.  See the file COPYING with the distribution for
       details.

perl v5.30.1                      2020-01-30          Fsdb::Filter::dbmerge(3)