Fsdb::Filter::dbmerge(3pm)

Fsdb::Filter::dbmerge(3)    User Contributed Perl Documentation    Fsdb::Filter::dbmerge(3)

NAME

       dbmerge - merge all inputs in sorted order based on the specified
       columns

SYNOPSIS

           dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

       or
           dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

       or
           { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

DESCRIPTION

       Merge all provided, pre-sorted input files, producing one sorted
       result.  Inputs can all be specified with "--input", or with
       "--inputs", or one can come from standard input and the others from
       "--input".  With "--xargs", each line of standard input is a filename
       for input.

       Inputs must have identical schemas (columns, column order, and field
       separators).

       Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

       Because this program is intended to merge multiple sources, it does not
       default to reading from standard input.  If you wish to read standard
       input, give "-" as an explicit input source.

       Also, because we deal with multiple input files, this module doesn't
       output anything until it's run.

       dbmerge consumes a fixed amount of memory regardless of input size.  It
       therefore buffers output on disk as necessary.  (Merging is implemented
       as a series of two-way merges and possibly an n-way merge at the end,
       so disk space is O(number of records).)

       dbmerge will merge data in parallel, if possible.  The "--parallelism"
       option can control the degree of parallelism, if desired.

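       The fixed-memory, two-way merge described above can be illustrated
       with a minimal Python sketch.  This is not dbmerge's Perl code: the
       function name merge2 and the key argument are hypothetical, and plain
       Python iterables stand in for Fsdb streams.  The sketch holds only one
       pending record per input, so memory use does not grow with input size.

```python
from typing import Any, Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

_END = object()  # sentinel marking an exhausted input


def merge2(left: Iterable[T], right: Iterable[T],
           key: Callable[[T], Any]) -> Iterator[T]:
    """Lazily merge two pre-sorted streams (hypothetical helper).

    Only one pending record per input is held at any time.
    """
    li, ri = iter(left), iter(right)
    a, b = next(li, _END), next(ri, _END)
    while a is not _END and b is not _END:
        # Take from the right only when it is strictly smaller,
        # so that ties keep the left input first (a stable merge).
        if key(b) < key(a):
            yield b
            b = next(ri, _END)
        else:
            yield a
            a = next(li, _END)
    # One side is exhausted; pass the remainder of the other through.
    while a is not _END:
        yield a
        a = next(li, _END)
    while b is not _END:
        yield b
        b = next(ri, _END)


merged = list(merge2([1, 4, 9], [2, 4, 10], key=lambda x: x))
print(merged)  # prints [1, 2, 4, 4, 9, 10]
```

       Because the function is a generator, its output can feed another
       merge2 call directly, which is one way to chain two-way merges.
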

OPTIONS

       General options:

       --xargs
           Expect that input filenames are given, one per line, on standard
           input.  (In this case, merging can start incrementally.)

       --removeinputs
           Delete the source files after they have been consumed.  (Defaults
           to off, leaving the inputs in place.)

       -T TmpDir
           Where to put temporary files.  If -T is not specified, the
           environment variable TMPDIR is used; if that is also unset, the
           default is /tmp.

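       The lookup order for the temporary directory (-T, then TMPDIR, then
       /tmp) can be sketched in Python as follows.  The function name
       resolve_tmpdir and its parameters are hypothetical and are not part
       of Fsdb.

```python
import os
from typing import Mapping, Optional


def resolve_tmpdir(cli_value: Optional[str],
                   environ: Mapping[str, str] = os.environ) -> str:
    """Pick the temporary directory: -T value, else $TMPDIR, else /tmp."""
    if cli_value:
        return cli_value
    # An unset or empty TMPDIR falls through to the documented default.
    return environ.get("TMPDIR") or "/tmp"


print(resolve_tmpdir("/scratch", {"TMPDIR": "/var/tmp"}))  # prints /scratch
print(resolve_tmpdir(None, {"TMPDIR": "/var/tmp"}))        # prints /var/tmp
print(resolve_tmpdir(None, {}))                            # prints /tmp
```
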
       --parallelism N or -j N
           Allow up to N merges to happen in parallel.  Default is the number
           of CPUs in the machine.

       --endgame (or --noendgame)
           Enable endgame mode, extra parallelism when finishing up.  (On by
           default.)

       Sort specification options (can be interspersed with column names):

       -r or --descending
           sort in reverse order (high to low)

       -R or --ascending
           sort in normal order (low to high)

       -n or --numeric
           sort numerically

       -N or --lexical
           sort lexicographically

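       How interspersed sort options might translate into a comparison can
       be sketched in Python.  Two assumptions here are not stated in this
       manual: that each option applies to the column names that follow it,
       and that the starting state is lexical and ascending.  The names
       SortKey, parse_sort_spec and compare_rows are hypothetical, and
       combined flags such as "-nr" are not handled.

```python
from functools import cmp_to_key
from typing import Dict, List, NamedTuple


class SortKey(NamedTuple):
    column: str
    numeric: bool
    descending: bool


def parse_sort_spec(args: List[str]) -> List[SortKey]:
    """Turn options interspersed with column names into per-column keys.

    Assumption: an option affects every column named after it.
    """
    numeric, descending = False, False  # assumed defaults
    keys: List[SortKey] = []
    for arg in args:
        if arg in ("-n", "--numeric"):
            numeric = True
        elif arg in ("-N", "--lexical"):
            numeric = False
        elif arg in ("-r", "--descending"):
            descending = True
        elif arg in ("-R", "--ascending"):
            descending = False
        else:
            keys.append(SortKey(arg, numeric, descending))
    return keys


def compare_rows(keys: List[SortKey],
                 a: Dict[str, str], b: Dict[str, str]) -> int:
    """Return a negative, zero, or positive value, like a sort comparator."""
    for k in keys:
        x, y = a[k.column], b[k.column]
        if k.numeric:
            x, y = float(x), float(y)
        result = (x > y) - (x < y)
        if result:
            return -result if k.descending else result
    return 0


keys = parse_sort_spec(["-n", "cid", "-N", "-r", "cname"])
rows = [
    {"cid": "10", "cname": "pascal"},
    {"cid": "9", "cname": "os"},
    {"cid": "10", "cname": "stats"},
]
ordered = sorted(rows, key=cmp_to_key(lambda a, b: compare_rows(keys, a, b)))
print([r["cname"] for r in ordered])  # prints ['os', 'stats', 'pascal']
```

       The example shows why the numeric option matters: compared as
       strings, "10" sorts before "9", while compared numerically it sorts
       after.
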
       This module also supports the standard fsdb options:

       -d  Enable debugging output.

       -i or --input InputSource
           Read from InputSource, typically a file name, or "-" for standard
           input, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       -o or --output OutputDestination
           Write to OutputDestination, typically a file name, or "-" for
           standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
           Fsdb::BoundedQueue object.

       --autorun or --noautorun
           By default, programs process automatically, but Fsdb::Filter
           objects in Perl do not run until you invoke the run() method.  The
           "--(no)autorun" option controls that behavior within Perl.

       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.

       --help
           Show help.

       --man
           Show full manual.

SAMPLE USAGE

   Input:
       File a.fsdb:

           #fsdb cid cname
           11 numanal
           10 pascal

       File b.fsdb:

           #fsdb cid cname
           12 os
           13 statistics

       These two files are both sorted by "cname", and they have identical
       schemas.

   Command:
           dbmerge --input a.fsdb --input b.fsdb cname

       or

           cat a.fsdb | dbmerge --input - --input b.fsdb cname

   Output:
           #fsdb cid cname
           11 numanal
           12 os
           10 pascal
           13 statistics
           #  | dbmerge --input a.fsdb --input b.fsdb cname

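       The row order in the sample output can be reproduced with Python's
       standard heapq.merge, shown here only to illustrate what merging on
       "cname" means.  The tuples below are the sample rows typed in by
       hand; the sketch does not read Fsdb files or write the header and
       trailer lines.

```python
import heapq

# Rows of a.fsdb and b.fsdb as (cid, cname), each already sorted by cname.
a_rows = [("11", "numanal"), ("10", "pascal")]
b_rows = [("12", "os"), ("13", "statistics")]

# Merge the two pre-sorted inputs on the cname column (index 1).
merged_rows = list(heapq.merge(a_rows, b_rows, key=lambda row: row[1]))

for cid, cname in merged_rows:
    print(cid, cname)
# prints:
# 11 numanal
# 12 os
# 10 pascal
# 13 statistics
```
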

CLASS FUNCTIONS

   new
           $filter = new Fsdb::Filter::dbmerge(@arguments);

       Create a new object, taking command-line arguments.

   set_defaults
           $filter->set_defaults();

       Internal: set up defaults.

   parse_options
           $filter->parse_options(@ARGV);

       Internal: parse command-line arguments.

   _pretty_fn
           _pretty_fn($fn)

       Internal: pretty-print a filename or Fsdb::BoundedQueue.

   segment_next_output
           $out = $self->segment_next_output($output_type)

       Internal: return a Fsdb::IO::Writer as $OUT that either points to our
       output or a temporary file, depending on how things are going.

       The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

   segment_cleanup
           $out = $self->segment_cleanup($file);

       Internal: Clean up a file, if necessary.  (Sigh, used to be function
       pointers, but not clear how they would interact with threads.)

   _unique_id
           $id = $self->_unique_id()

       Generate a sequence number for debugging.

   segments_merge2_run
           $out = $self->segments_merge2_run($out_fn, $is_final_output,
                               $in0, $in1, $id);

       Internal: do the actual merge2 work (maybe our parent put us in a
       thread, maybe not).

   segments_merge1_run
           $out = $self->segments_merge1_run($out_fn, $in0);

       Internal: a special case of merge1 when we have only one file.

   enqueue_work
           $self->enqueue_work($depth, $work);

       Internal: put $WORK on the queue at $DEPTH, updating the max count.

   segments_merge_one_depth
           $self->segments_merge_one_depth($depth);

       Merge queued files, if any.

       Also release any queued threads.

   segments_xargs
           $self->segments_xargs();

       Internal: read new filenames to process (from stdin) and send them to
       the work queue.

       Making a separate Fred to handle xargs is a lot of work, but it
       guarantees it comes in on an IO::Handle that is selectable.

   segments_merge_all
           $self->segments_merge_all()

       Internal: Merge queued files, if any.  Iterates over all depths of the
       merge tree, and handles any forked threads.

       Merging Strategy

       Merging is done in a binary tree that is managed through the "_work"
       queue.  It has an array of "depth" entries, one for each level of the
       tree.

       Items are processed in order at each level of the tree, and only
       level-by-level, so the sort is stable.

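       The level-by-level binary merge tree can be sketched in Python as
       follows.  This is a simplified model, not the module's code: the
       function name merge_tree is hypothetical, in-memory lists stand in
       for files on disk, and an unpaired segment is simply carried up to
       the next level.  Pairing neighbours in order, and preferring the
       earlier input on ties, is what keeps the result stable.

```python
import heapq
from typing import Any, Callable, List, Sequence


def merge_tree(segments: Sequence[Sequence[Any]],
               key: Callable[[Any], Any]) -> List[Any]:
    """Merge pre-sorted segments pairwise, one tree level at a time."""
    level = [list(s) for s in segments]
    while len(level) > 1:
        next_level = []
        # Pair neighbours in order: (0,1), (2,3), ...
        for i in range(0, len(level) - 1, 2):
            # heapq.merge yields from the earlier input first on ties.
            next_level.append(
                list(heapq.merge(level[i], level[i + 1], key=key)))
        if len(level) % 2:
            # An unpaired last segment moves up unchanged, still last.
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else []


segments = [[("a", 1)], [("b", 1)], [("c", 0)], [("d", 1)], [("e", 1)]]
result = merge_tree(segments, key=lambda rec: rec[1])
print([name for name, _ in result])  # prints ['c', 'a', 'b', 'd', 'e']
```

       In the example, the four records with key 1 come out in their
       original input order (a, b, d, e), which is the stability property
       the text describes.
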
       Parallelism Model

       Parallelism is also managed through the "_work" queue, each element of
       which consists of one file or stream suitable for merging.  The work
       queue contains both ready output (files or BoundedQueue streams) that
       can be immediately handled, and pairs of semaphore/pending output for
       work that is not yet started.  All manipulation of the work queue
       happens in the main thread (with "segments_merge_all" and
       "segments_merge_one_depth").

       We start a thread to handle each item in the work queue, and limit
       parallelism to "_max_parallelism", defaulting to the number of
       available processors.

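       Capping the number of concurrent merges can be sketched with Python's
       concurrent.futures.  This is only an analogy for the limit described
       above: the function names are hypothetical, a thread pool replaces
       the module's own work queue and semaphores, and in standard CPython
       threads show the cap rather than a real CPU speedup.

```python
import heapq
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Optional, Sequence


def merge_pair(left: Sequence[Any], right: Sequence[Any]) -> List[Any]:
    """Merge two pre-sorted segments into a new segment."""
    return list(heapq.merge(left, right))


def merge_level_parallel(segments: Sequence[Sequence[Any]],
                         max_parallelism: Optional[int] = None) -> List[List[Any]]:
    """Merge neighbouring segments, running at most max_parallelism at once."""
    # Default to the number of processors, as the text above describes.
    workers = max_parallelism or os.cpu_count() or 1
    pairs = [(segments[i], segments[i + 1])
             for i in range(0, len(segments) - 1, 2)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, not completion order.
        merged = list(pool.map(lambda p: merge_pair(*p), pairs))
    if len(segments) % 2:
        merged.append(list(segments[-1]))  # unpaired segment carried over
    return merged


level = merge_level_parallel([[1, 5], [2, 3], [0, 9], [4, 4], [7]],
                             max_parallelism=2)
print(level)  # prints [[1, 2, 3, 5], [0, 4, 4, 9], [7]]
```
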
       There are two kinds of parallelism, regular and endgame.  For regular
       parallelism we pick two items off the work queue, merge them, and put
       the result back on the queue as a new file.  Items in the work queue
       may not be ready.  For in-progress items we wait until they are done.
       For not-yet-started items we start them, then wait until they are done.

       Endgame parallelism handles the final stages of a large merge.  It
       begins when there are enough processors that we can start merge jobs
       for all remaining levels of the merge tree.  At this point we switch
       from merging to files to merging into "Fsdb::BoundedQueue" pipelines
       that connect merge processes which start and run concurrently.

       The final merge is done in the main thread so that the main thread
       can handle the output stream and record the merge action.

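       The endgame idea, concurrent merges connected by bounded queues with
       the final merge in the main thread, can be sketched in Python.  Here
       queue.Queue with a maxsize stands in for Fsdb::BoundedQueue and
       threads stand in for the module's merge processes; all names are
       hypothetical and the tree is fixed at four inputs for brevity.

```python
import heapq
import queue
import threading
from typing import Any, Iterable, Iterator, List

_DONE = object()  # end-of-stream marker placed on each queue


def drain(q: "queue.Queue[Any]") -> Iterator[Any]:
    """Yield items from a bounded queue until the end marker arrives."""
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def merge_into_queue(left: Iterable[Any], right: Iterable[Any],
                     out: "queue.Queue[Any]") -> None:
    """Merge two sorted inputs, streaming the result into a bounded queue."""
    for item in heapq.merge(left, right):
        out.put(item)  # blocks while the queue is full
    out.put(_DONE)


def endgame_merge(a: List[Any], b: List[Any], c: List[Any], d: List[Any],
                  capacity: int = 4) -> List[Any]:
    """Run both lower-level merges concurrently; finish in the caller."""
    q_ab: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)
    q_cd: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)
    workers = [
        threading.Thread(target=merge_into_queue, args=(a, b, q_ab)),
        threading.Thread(target=merge_into_queue, args=(c, d, q_cd)),
    ]
    for w in workers:
        w.start()
    # The final merge consumes both pipelines in the main thread.
    result = list(heapq.merge(drain(q_ab), drain(q_cd)))
    for w in workers:
        w.join()
    return result


final = endgame_merge([1, 5], [2, 6], [3, 7], [4, 8])
print(final)  # prints [1, 2, 3, 4, 5, 6, 7, 8]
```

       No intermediate result is written out in full: each lower merge
       blocks when its queue is full, so the stages advance together.
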
   setup
           $filter->setup();

       Internal: setup, parse headers.

   run
           $filter->run();

       Internal: run over each row.

AUTHOR and COPYRIGHT

       Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>

       This program is distributed under terms of the GNU General Public
       License, version 2.  See the file COPYING with the distribution for
       details.

perl v5.38.0                      2023-07-20          Fsdb::Filter::dbmerge(3)