Fsdb::Filter::dbmerge(3)  User Contributed Perl Documentation  Fsdb::Filter::dbmerge(3)

dbmerge - merge all inputs in sorted order based on the specified
columns

    dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or
    cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or
    dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

or
    { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

Merge all provided, pre-sorted input files, producing one sorted
result. Inputs can all be specified with "--input", or one can come
from standard input (listed as "-") and the others from "--input". With
"--xargs", each line of standard input names an input file.

Inputs must have identical schemas (columns, column order, and field
separators).
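
A quick pre-flight check for the schema requirement can be sketched in
shell. This is only an illustration, not part of dbmerge: it assumes the
first line of each file (the "#fsdb" header) is a good enough proxy for
the schema, and the file names, contents, and the same_schema helper are
all made up for the example.

```sh
#!/bin/sh
# Sketch: treat inputs as mergeable only if their header lines match.
# Assumption: the first line of each file describes its schema.
tmp=$(mktemp -d) || exit 1
printf '#fsdb cid cname\n11\tnumanal\n10\tpascal\n' > "$tmp/a.fsdb"
printf '#fsdb cid cname\n12\tos\n13\tstatistics\n'  > "$tmp/b.fsdb"
printf '#fsdb cname cid\nos\t12\n'                  > "$tmp/c.fsdb"

# Hypothetical helper: succeed only if every file's first line
# equals the first line of the first file.
same_schema() {
    ref=$(head -n 1 "$1")
    for f in "$@"; do
        [ "$(head -n 1 "$f")" = "$ref" ] || return 1
    done
    return 0
}

if same_schema "$tmp/a.fsdb" "$tmp/b.fsdb"; then ab=compatible; else ab=mismatch; fi
if same_schema "$tmp/a.fsdb" "$tmp/c.fsdb"; then ac=compatible; else ac=mismatch; fi
echo "a+b: $ab"
echo "a+c: $ac"
rm -rf "$tmp"
```

Here a.fsdb and b.fsdb share a header, while c.fsdb lists the same
columns in a different order and so is reported as a mismatch.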

Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

Because this program is intended to merge multiple sources, it does not
default to reading from standard input. To read from standard input,
list "-" as an explicit input source.

Also, because we deal with multiple input files, this module doesn't
output anything until it's run.

dbmerge consumes a fixed amount of memory regardless of input size. It
therefore buffers output on disk as necessary. (Merging is implemented
as a series of two-way merges, so disk space is O(number of records).)

dbmerge will merge data in parallel, if possible. The "--parallelism"
option can control the degree of parallelism, if desired.
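
To illustrate why a two-way merge needs only a fixed amount of memory,
here is a minimal shell sketch that holds one row from each input at a
time. It is not dbmerge's code: the merge2 function, the space-separated
rows, and the integer key in the first column are assumptions chosen to
keep the example short.

```sh
#!/bin/sh
# Sketch: streaming two-way merge of two files that are already sorted
# by an integer key in column 1. Only one row per input is held at once.
tmp=$(mktemp -d) || exit 1
printf '1 a\n3 c\n'      > "$tmp/x"
printf '2 b\n3 d\n4 e\n' > "$tmp/y"

merge2() {
    exec 3<"$1" 4<"$2"
    have_a=1; IFS= read -r a <&3 || have_a=0
    have_b=1; IFS= read -r b <&4 || have_b=0
    while [ "$have_a" -eq 1 ] || [ "$have_b" -eq 1 ]; do
        # Take from the first input when the second is exhausted, or
        # when its key is <= the other key (ties favor the first input).
        if [ "$have_b" -eq 0 ] ||
           { [ "$have_a" -eq 1 ] && [ "${a%% *}" -le "${b%% *}" ]; }; then
            printf '%s\n' "$a"
            IFS= read -r a <&3 || have_a=0
        else
            printf '%s\n' "$b"
            IFS= read -r b <&4 || have_b=0
        fi
    done
    exec 3<&- 4<&-
}

merged=$(merge2 "$tmp/x" "$tmp/y")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

Because ties go to the first input, rows with equal keys keep their
input order ("3 c" before "3 d").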

General options:

--xargs
    Expect that input filenames are given, one per line, on standard
    input. (In this case, merging can start incrementally.)

--removeinputs
    Delete the source files after they have been consumed. (Defaults
    off, leaving the inputs in place.)

-T TmpDir
    Where to put temporary files. Uses the environment variable TMPDIR
    if -T is not specified. Default is /tmp.

--parallelism N or -j N
    Allow up to N merges to happen in parallel. Default is the number
    of CPUs in the machine.

--endgame (or --noendgame)
    Enable endgame mode, extra parallelism when finishing up. (On by
    default.)
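
The idea behind "--xargs", that work can begin as soon as the first
filenames arrive, can be sketched with the standard sort(1) utility.
This is an analogy only: "sort -m" stands in for dbmerge, the inputs are
bare integers without Fsdb headers, and the running accumulator used
here is simpler than dbmerge's merge tree.

```sh
#!/bin/sh
# Sketch: read input filenames one per line from standard input and
# fold each into a running merged result as soon as its name arrives.
tmp=$(mktemp -d) || exit 1
printf '1\n5\n' > "$tmp/f1"
printf '2\n6\n' > "$tmp/f2"
printf '3\n4\n' > "$tmp/f3"
: > "$tmp/acc"                      # running result starts empty

printf '%s\n' "$tmp/f1" "$tmp/f2" "$tmp/f3" |
while IFS= read -r fn; do
    # sort -m merges already-sorted files without re-sorting them
    sort -m -n "$tmp/acc" "$fn" > "$tmp/acc.new" &&
        mv "$tmp/acc.new" "$tmp/acc"
done

merged=$(paste -s -d ' ' "$tmp/acc")
echo "$merged"
rm -rf "$tmp"
```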

Sort specification options (can be interspersed with column names):

-r or --descending
    sort in reverse order (high to low)

-R or --ascending
    sort in normal order (low to high)

-n or --numeric
    sort numerically

-N or --lexical
    sort lexicographically
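
The merge comparison has to match the order in which the inputs were
sorted. The sketch below shows the same idea with the standard sort(1)
utility, whose "-n" and "-r" flags are used here as rough analogues of
dbmerge's numeric and descending options. It does not call dbmerge, and
the input values are invented.

```sh
#!/bin/sh
# Sketch: merging pre-sorted inputs under three different orderings.
# sort -m is a stand-in for dbmerge; inputs are bare keys, no headers.
tmp=$(mktemp -d) || exit 1
LC_ALL=C; export LC_ALL             # make lexical order byte-wise

# Ascending numeric (analogue of -n): 9 < 10, 8 < 100
printf '9\n10\n'  > "$tmp/x"
printf '8\n100\n' > "$tmp/y"
numeric=$(sort -m -n "$tmp/x" "$tmp/y" | paste -s -d ' ' -)

# Descending numeric (analogue of -n -r): inputs sorted high to low
printf '10\n9\n'  > "$tmp/xr"
printf '100\n8\n' > "$tmp/yr"
reverse=$(sort -m -n -r "$tmp/xr" "$tmp/yr" | paste -s -d ' ' -)

# Lexical (analogue of -N): as strings, "10" sorts before "9"
printf '10\n9\n'  > "$tmp/xl"
printf '100\n8\n' > "$tmp/yl"
lexical=$(sort -m "$tmp/xl" "$tmp/yl" | paste -s -d ' ' -)

echo "numeric: $numeric"
echo "reverse: $reverse"
echo "lexical: $lexical"
rm -rf "$tmp"
```

The same four values come out in three different orders, which is why
each input must be pre-sorted with the same specification that is given
to the merge.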

This module also supports the standard fsdb options:

-d  Enable debugging output.

-i or --input InputSource
    Read from InputSource, typically a file name, or "-" for standard
    input, or (if in Perl) an IO::Handle, Fsdb::IO, or
    Fsdb::BoundedQueue object.

-o or --output OutputDestination
    Write to OutputDestination, typically a file name, or "-" for
    standard output, or (if in Perl) an IO::Handle, Fsdb::IO, or
    Fsdb::BoundedQueue object.

--autorun or --noautorun
    By default, programs process automatically, but Fsdb::Filter
    objects in Perl do not run until you invoke the run() method. The
    "--(no)autorun" option controls that behavior within Perl.

--header H
    Use H as the full Fsdb header, rather than reading a header from
    the input.

--help
    Show help.

--man
    Show full manual.

Input:
File a.fsdb:

    #fsdb cid cname
    11 numanal
    10 pascal

File b.fsdb:

    #fsdb cid cname
    12 os
    13 statistics

These two files are both sorted by "cname", and they have identical
schemas.

Command:
    dbmerge --input a.fsdb --input b.fsdb cname

or

    cat a.fsdb | dbmerge --input - --input b.fsdb cname

Output:
    #fsdb cid cname
    11 numanal
    12 os
    10 pascal
    13 statistics
    # | dbmerge --input a.fsdb --input b.fsdb cname
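
For readers without Fsdb installed, the data rows of this example can
be reproduced with standard tools. This is a hedged stand-in rather than
dbmerge itself: it assumes tab-separated fields, uses "sort -c" to
confirm each input is already ordered by the second column and
"sort -m" to merge, and it copies the header by hand instead of writing
the trailing "# | dbmerge ..." comment.

```sh
#!/bin/sh
# Sketch: check that both inputs are sorted by cname (column 2),
# then merge their data rows. sort(1) stands in for dbmerge.
tmp=$(mktemp -d) || exit 1
LC_ALL=C; export LC_ALL
tab=$(printf '\t')
printf '#fsdb cid cname\n11\tnumanal\n10\tpascal\n' > "$tmp/a.fsdb"
printf '#fsdb cid cname\n12\tos\n13\tstatistics\n'  > "$tmp/b.fsdb"

for f in a b; do
    # Keep data rows only (drop the header and any comment lines)
    grep -v '^#' "$tmp/$f.fsdb" > "$tmp/$f.rows"
    # sort -c exits non-zero if the rows are not already in order
    sort -c -t "$tab" -k 2,2 "$tmp/$f.rows" || {
        echo "$f.fsdb is not sorted by cname" >&2
        exit 1
    }
done

merged=$( { head -n 1 "$tmp/a.fsdb"
            sort -m -t "$tab" -k 2,2 "$tmp/a.rows" "$tmp/b.rows"; } )
printf '%s\n' "$merged"
rm -rf "$tmp"
```

The rows come out in the same order as the Output above: numanal, os,
pascal, statistics.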

dbmerge2(1), dbsort(1), Fsdb(3)

new
    $filter = new Fsdb::Filter::dbmerge(@arguments);

Create a new object, taking command-line arguments.

set_defaults
    $filter->set_defaults();

Internal: set up defaults.

parse_options
    $filter->parse_options(@ARGV);

Internal: parse command-line arguments.

_pretty_fn
    _pretty_fn($fn)

Internal: pretty-print a filename or Fsdb::BoundedQueue.

segment_next_output
    $out = $self->segment_next_output($output_type)

Internal: return an Fsdb::IO::Writer as $OUT that points either to our
output or to a temporary file, depending on how things are going.

The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

segment_cleanup
    $out = $self->segment_cleanup($file);

Internal: Clean up a file, if necessary. (Sigh, used to be function
pointers, but not clear how they would interact with threads.)

_unique_id
    $id = $self->_unique_id()

Generate a sequence number for debugging.

segments_merge2_run
    $out = $self->segments_merge2_run($out_fn, $is_final_output,
                                      $in0, $in1, $id);

Internal: do the actual merge2 work (maybe our parent put us in a
thread, maybe not).

segments_merge1_run
    $out = $self->segments_merge1_run($out_fn, $in0);

Internal: a special case of merge1 when we have only one file.

enqueue_work
    $self->enqueue_work($depth, $work);

Internal: put $WORK on the queue at $DEPTH, updating the max count.

segments_merge_one_depth
    $self->segments_merge_one_depth($depth);

Merge queued files, if any.

Also release any queued threads.

segments_xargs
    $self->segments_xargs();

Internal: read new filenames to process (from stdin) and send them to
the work queue.

Making a separate Fred to handle xargs is a lot of work, but it
guarantees it comes in on an IO::Handle that is selectable.

segments_merge_all
    $self->segments_merge_all()

Internal: Merge queued files, if any. Iterates over all depths of the
merge tree, and handles any forked threads.

Merging Strategy

Merging is done in a binary tree that is managed through the "_work"
queue. The queue has an array of "depth" entries, one for each level of
the tree.

Items are processed in order at each level of the tree, and only
level by level, so the sort is stable.
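
The level-by-level tree can be sketched in shell, using the positional
parameters as the work queue. This is an illustration under stated
assumptions, not dbmerge's implementation: "sort -m" stands in for each
two-way merge, the inputs are bare integers, and carrying an unpaired
item up to the next level unchanged is a guess at how an odd count might
be handled.

```sh
#!/bin/sh
# Sketch: merge N pre-sorted files as a binary tree, one level at a
# time, always pairing adjacent items in their original order.
tmp=$(mktemp -d) || exit 1
printf '1\n9\n' > "$tmp/f1"
printf '2\n8\n' > "$tmp/f2"
printf '3\n7\n' > "$tmp/f3"
printf '4\n6\n' > "$tmp/f4"
printf '5\n'    > "$tmp/f5"

set -- "$tmp/f1" "$tmp/f2" "$tmp/f3" "$tmp/f4" "$tmp/f5"
depth=0
while [ "$#" -gt 1 ]; do
    remaining=$#                  # items queued at this depth
    i=0
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -ge 2 ]; then
            out="$tmp/depth$depth.$i"
            sort -m -n "$1" "$2" > "$out"    # one two-way merge
            shift 2
            remaining=$((remaining - 2))
        else
            out=$1                # unpaired item moves up unchanged
            shift
            remaining=$((remaining - 1))
        fi
        set -- "$@" "$out"        # queue the result for the next depth
        i=$((i + 1))
    done
    depth=$((depth + 1))
done

result=$(paste -s -d ' ' "$1")
echo "levels: $depth"
echo "result: $result"
rm -rf "$tmp"
```

Five inputs need three levels. Pairing neighbours in order keeps earlier
inputs ahead of later ones at every level, which preserves stability as
long as each two-way merge prefers its first input on ties. Plain
"sort -m" does not promise that for rows that differ outside the key, so
the sketch uses rows that consist of the key alone.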

Parallelism Model

Parallelism is also managed through the "_work" queue, each element of
which consists of one file or stream suitable for merging. The work
queue contains both ready output (files or BoundedQueue streams) that
can be immediately handled, and pairs of semaphore/pending output for
work that is not yet started. All manipulation of the work queue
happens in the main thread (with "segments_merge_all" and
"segments_merge_one_depth").

We start a thread to handle each item in the work queue, and limit
parallelism to "_max_parallelism", which defaults to the number of
available processors.

There are two kinds of parallelism, regular and endgame. For regular
parallelism we pick two items off the work queue, merge them, and put
the result back on the queue as a new file. Items in the work queue
may not be ready. For in-progress items we wait until they are done.
For not-yet-started items we start them, then wait until they are done.

Endgame parallelism handles the final stages of a large merge, when
there are enough processors that we can start merge jobs for all
remaining levels of the merge tree. At this point we switch from
merging to files to merging into "Fsdb::BoundedQueue" pipelines that
connect merge processes which start and run concurrently.

The final merge is done in the main thread so that the main thread
can handle the output stream and record the merge action.
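
The shape of endgame mode, concurrent merges connected by streams with
the last merge running in the foreground, can be approximated in shell.
This is an analogy, not the module's mechanism: named pipes made with
mkfifo play the role of Fsdb::BoundedQueue, background jobs play the
role of threads, "sort -m" does each merge, and nothing here caps the
number of concurrent jobs.

```sh
#!/bin/sh
# Sketch: a two-level merge tree whose leaf merges run concurrently
# and stream into the final merge through named pipes, so no
# intermediate files are written.
tmp=$(mktemp -d) || exit 1
printf '1\n8\n' > "$tmp/a"
printf '2\n7\n' > "$tmp/b"
printf '3\n6\n' > "$tmp/c"
printf '4\n5\n' > "$tmp/d"
mkfifo "$tmp/ab" "$tmp/cd"

# Leaf merges run in the background; each blocks until its pipe
# is opened by the reader below.
sort -m -n "$tmp/a" "$tmp/b" > "$tmp/ab" &
sort -m -n "$tmp/c" "$tmp/d" > "$tmp/cd" &

# The final merge runs in the foreground and owns the output stream.
result=$(sort -m -n "$tmp/ab" "$tmp/cd" | paste -s -d ' ' -)
wait
echo "$result"
rm -rf "$tmp"
```

All three merges are alive at once, and rows flow from the leaves to
the root as they are produced.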

setup
    $filter->setup();

Internal: setup, parse headers.

run
    $filter->run();

Internal: run over each row.

Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>

This program is distributed under terms of the GNU general public
license, version 2. See the file COPYING with the distribution for
details.

perl v5.32.0                      2020-11-16          Fsdb::Filter::dbmerge(3)