Fsdb::Filter::dbmerge(3pm)

Fsdb::Filter::dbmerge(3)   User Contributed Perl Documentation   Fsdb::Filter::dbmerge(3)

NAME

       dbmerge - merge all inputs in sorted order based on the specified
       columns

SYNOPSIS

           dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

       or
           { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

DESCRIPTION

       Merge all provided, pre-sorted input files, producing one sorted
       result.  Inputs can both be specified with "--input", or one can come
       from standard input and the other from "--input".  With "--xargs", each
       line of standard input is a filename for input.

       Inputs must have identical schemas (columns, column order, and field
       separators).

       Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

       Because this program is intended to merge multiple sources, it does not
       default to reading from standard input.  If you wish to read from
       standard input, list - as an explicit input source.

       Also, because we deal with multiple input files, this module doesn't
       output anything until it's run.

       dbmerge consumes a fixed amount of memory regardless of input size.  It
       therefore buffers output on disk as necessary.  (Merging is implemented
       as a series of two-way merges, so disk space is O(number of records).)

       dbmerge will merge data in parallel, if possible.  The "--parallelism"
       option can control the degree of parallelism, if desired.

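       The idea of combining many pre-sorted inputs with nothing but two-way
       merges can be sketched in Python.  This is an illustration only, not
       dbmerge's Perl implementation: merge2 and merge_all are hypothetical
       names, and in-memory iterators stand in for the files that dbmerge
       buffers on disk.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List


def merge2(a: Iterable, b: Iterable, key: Callable = lambda r: r) -> Iterator:
    """Stream-merge two pre-sorted inputs, holding one row from each in memory."""
    ia, ib = iter(a), iter(b)
    sentinel = object()
    ra, rb = next(ia, sentinel), next(ib, sentinel)
    while ra is not sentinel and rb is not sentinel:
        # Take from the left input on ties so the merge stays stable.
        if key(rb) < key(ra):
            yield rb
            rb = next(ib, sentinel)
        else:
            yield ra
            ra = next(ia, sentinel)
    # Drain whichever input still has rows.
    while ra is not sentinel:
        yield ra
        ra = next(ia, sentinel)
    while rb is not sentinel:
        yield rb
        rb = next(ib, sentinel)


def merge_all(inputs: List[Iterable], key: Callable = lambda r: r) -> Iterator:
    """Combine any number of pre-sorted inputs using only two-way merges."""
    return reduce(lambda x, y: merge2(x, y, key), inputs, iter(()))


merged = list(merge_all([[1, 4, 9], [2, 3, 10], [5, 6, 7]]))
print(merged)
```

       Each merge2 step keeps only one pending row per input, which is why
       memory use does not grow with input size in this sketch.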

OPTIONS

       General options:

       --xargs
           Expect that input filenames are given, one-per-line, on standard
           input.  (In this case, merging can start incrementally.)

       --removeinputs
           Delete the source files after they have been consumed.  (Defaults
           off, leaving the inputs in place.)

       -T TmpDir
           Where to put temporary files.  Falls back to the environment
           variable TMPDIR if -T is not specified; otherwise defaults to /tmp.

       --parallelism N or -j N
           Allow up to N merges to happen in parallel.  Default is the number
           of CPUs in the machine.

       --endgame (or --noendgame)
           Enable endgame mode, extra parallelism when finishing up.  (On by
           default.)

       Sort specification options (can be interspersed with column names):

       -r or --descending
           Sort in reverse order (high to low).

       -R or --ascending
           Sort in normal order (low to high).

       -n or --numeric
           Sort numerically.

       -N or --lexical
           Sort lexicographically.

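       One way to read a sort specification whose options are interspersed
       with column names is sketched below in Python.  It rests on
       assumptions that this page does not state: that each option applies to
       the columns named after it until overridden, and that the defaults are
       lexical and ascending.  parse_sort_spec is a hypothetical helper, and
       bundled flags such as -nr are not handled.

```python
from typing import List, Tuple

# Map each documented option to the piece of state it changes.
FLAGS = {
    "-n": ("numeric", True), "--numeric": ("numeric", True),
    "-N": ("numeric", False), "--lexical": ("numeric", False),
    "-r": ("descending", True), "--descending": ("descending", True),
    "-R": ("descending", False), "--ascending": ("descending", False),
}


def parse_sort_spec(args: List[str]) -> List[Tuple[str, bool, bool]]:
    """Return (column, numeric, descending) for each column named in args."""
    state = {"numeric": False, "descending": False}  # assumed defaults
    spec = []
    for arg in args:
        if arg in FLAGS:
            name, value = FLAGS[arg]
            state[name] = value  # assumed to affect the columns that follow
        else:
            spec.append((arg, state["numeric"], state["descending"]))
    return spec


spec = parse_sort_spec(["cname", "-n", "-r", "cid"])
print(spec)
```
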
       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
       File a.fsdb:

           #fsdb cid cname
           11 numanal
           10 pascal

       File b.fsdb:

           #fsdb cid cname
           12 os
           13 statistics

       These two files are both sorted by "cname", and they have identical
       schemas.

   Command:
           dbmerge --input a.fsdb --input b.fsdb cname

       or

           cat a.fsdb | dbmerge --input - --input b.fsdb cname

   Output:
           #fsdb      cid     cname
           11 numanal
           12 os
           10 pascal
           13 statistics
           #  | dbmerge --input a.fsdb --input b.fsdb cname

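       The row order in this output can be checked with a short Python
       sketch.  It uses the standard library's heapq.merge in place of
       dbmerge, with the rows of the two sample files written out as tuples;
       header handling and the trailing comment line are left out.

```python
import heapq

a = [("11", "numanal"), ("10", "pascal")]   # rows of a.fsdb, sorted by cname
b = [("12", "os"), ("13", "statistics")]    # rows of b.fsdb, sorted by cname

# Merge on the second field (cname), compared as strings.
merged = list(heapq.merge(a, b, key=lambda row: row[1]))
for cid, cname in merged:
    print(cid, cname)
```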

CLASS FUNCTIONS

   new
           $filter = new Fsdb::Filter::dbmerge(@arguments);

       Create a new object, taking command-line arguments.

   set_defaults
           $filter->set_defaults();

       Internal: set up defaults.

   parse_options
           $filter->parse_options(@ARGV);

       Internal: parse command-line arguments.

   _pretty_fn
           _pretty_fn($fn)

       Internal: pretty-print a filename or Fsdb::BoundedQueue.

   segment_next_output
           $out = $self->segment_next_output($output_type)

       Internal: return an Fsdb::IO::Writer as $OUT that either points to our
       output or a temporary file, depending on how things are going.

       The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

   segment_cleanup
           $out = $self->segment_cleanup($file);

       Internal: Clean up a file, if necessary.  (Sigh, used to be function
       pointers, but not clear how they would interact with threads.)

   _unique_id
           $id = $self->_unique_id()

       Generate a sequence number for debugging.

   segments_merge2_run
           $out = $self->segments_merge2_run($out_fn, $is_final_output,
                               $in0, $in1, $id);

       Internal: do the actual merge2 work (maybe our parent put us in a
       thread, maybe not).

   segments_merge1_run
           $out = $self->segments_merge1_run($out_fn, $in0);

       Internal: a special case of merge1 when we have only one file.

   enqueue_work
           $self->enqueue_work($depth, $work);

       Internal: put $WORK on the queue at $DEPTH, updating the max count.

   segments_merge_one_depth
           $self->segments_merge_one_depth($depth);

       Merge queued files, if any.

       Also release any queued threads.

   segments_xargs
           $self->segments_xargs();

       Internal: read new filenames to process (from stdin) and send them to
       the work queue.

       Making a separate Fred to handle xargs is a lot of work, but it
       guarantees it comes in on an IO::Handle that is selectable.

   segments_merge_all
           $self->segments_merge_all()

       Internal: Merge queued files, if any.  Iterates over all depths of the
       merge tree, and handles any forked threads.

       Merging Strategy

       Merging is done in a binary tree that is managed through the "_work"
       queue.  It has an array of "depth" entries, one for each level of the
       tree.

       Items are processed in order at each level of the tree, and only
       level-by-level, so the sort is stable.

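       A level-by-level binary merge tree of this kind can be sketched in
       Python.  This is not the module's code: merge_tree is a hypothetical
       name, plain lists stand in for the "_work" queue entries, and the
       standard library's heapq.merge performs each two-way merge.  Rows with
       equal keys show that pairing neighbours in order keeps the result
       stable.

```python
import heapq
from typing import Callable, List, Sequence


def merge_tree(inputs: Sequence[list], key: Callable) -> list:
    """Merge pre-sorted lists pairwise, one tree level at a time."""
    level: List[list] = [list(x) for x in inputs]
    while len(level) > 1:
        next_level = []
        # Pair neighbours in order; an odd one out is carried up unchanged.
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            next_level.append(list(heapq.merge(*pair, key=key)))
        level = next_level
    return level[0] if level else []


# Rows are (key, source) pairs; ties on the key expose the ordering.
inputs = [[(1, "a"), (2, "a")], [(1, "b")], [(1, "c"), (3, "c")]]
result = merge_tree(inputs, key=lambda row: row[0])
print(result)
```
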
       Parallelism Model

       Parallelism is also managed through the "_work" queue, each element of
       which consists of one file or stream suitable for merging.  The work
       queue contains both ready output (files or BoundedQueue streams) that
       can be immediately handled, and pairs of semaphore/pending output for
       work that is not yet started.  All manipulation of the work queue
       happens in the main thread (with "segments_merge_all" and
       "segments_merge_one_depth").

       We start a thread to handle each item in the work queue, and limit
       parallelism to the "_max_parallelism", defaulting to the number of
       available processors.

       There are two kinds of parallelism, regular and endgame.  For regular
       parallelism we pick two items off the work queue, merge them, and put
       the result back on the queue as a new file.  Items in the work queue
       may not be ready.  For in-progress items we wait until they are done.
       For not-yet-started items we start them, then wait until they are done.

       Endgame parallelism handles the final stages of a large merge.  It
       begins when there are enough processors that we can start merge jobs
       for all remaining levels of the merge tree.  At this point we switch
       from merging into files to merging into "Fsdb::BoundedQueue" pipelines
       that connect merge processes which start and run concurrently.

       The final merge is done in the main thread so that the main thread can
       handle the output stream and record the merge action.

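       The endgame arrangement, with every remaining merge running at once
       and linked by bounded streams, can be sketched in Python.  This is a
       loose analogy rather than the module's code: Python threads and
       queue.Queue stand in for the merge processes and
       "Fsdb::BoundedQueue", the names feed, merge2 and pipeline_merge are
       hypothetical, and no limit on parallelism is enforced.

```python
import queue
import threading
from typing import Iterable, List

DONE = object()  # end-of-stream marker


def feed(rows: Iterable, out: queue.Queue) -> None:
    """Push pre-sorted rows into a bounded queue, then signal the end."""
    for row in rows:
        out.put(row)
    out.put(DONE)


def merge2(left: queue.Queue, right: queue.Queue, out: queue.Queue) -> None:
    """Merge two sorted queue streams into a third, one row at a time."""
    a, b = left.get(), right.get()
    while a is not DONE or b is not DONE:
        # Prefer the left stream on ties to keep the merge stable.
        if b is DONE or (a is not DONE and a <= b):
            out.put(a)
            a = left.get()
        else:
            out.put(b)
            b = right.get()
    out.put(DONE)


def pipeline_merge(inputs: List[list], depth: int = 4) -> list:
    """Run all merges concurrently; the last merge happens in the caller."""
    if not inputs:
        return []
    threads, level = [], []
    for rows in inputs:
        q = queue.Queue(maxsize=depth)
        threads.append(threading.Thread(target=feed, args=(rows, q)))
        level.append(q)
    while len(level) > 2:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            q = queue.Queue(maxsize=depth)
            threads.append(threading.Thread(
                target=merge2, args=(level[i], level[i + 1], q)))
            next_level.append(q)
        if len(level) % 2:
            next_level.append(level[-1])  # odd stream moves up unchanged
        level = next_level
    if len(level) == 1:
        empty = queue.Queue()
        empty.put(DONE)
        level.append(empty)  # pad so the final step is always a two-way merge
    for t in threads:
        t.start()
    final = queue.Queue()  # unbounded: filled and drained by the main thread
    merge2(level[0], level[1], final)
    for t in threads:
        t.join()
    result = []
    row = final.get()
    while row is not DONE:
        result.append(row)
        row = final.get()
    return result


print(pipeline_merge([[1, 5, 9], [2, 6], [3, 7, 8], [0, 4], [10]]))
```

       Because each intermediate queue is bounded, a fast producer blocks
       until its consumer catches up, so no stage buffers its whole output.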
   setup
           $filter->setup();

       Internal: setup, parse headers.

   run
           $filter->run();

       Internal: run over each row.

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU General Public
       License, version 2.  See the file COPYING with the distribution for
       details.

perl v5.34.1                      2022-04-04          Fsdb::Filter::dbmerge(3)