Fsdb::Filter::dbmerge(3)    User Contributed Perl Documentation    Fsdb::Filter::dbmerge(3)

NAME
dbmerge - merge all inputs in sorted order based on the specified
columns

SYNOPSIS
    dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or

    cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or

    dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

or

    { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

DESCRIPTION
Merge all provided, pre-sorted input files, producing one sorted
result. Inputs can all be specified with "--input", or one can come
from standard input (given as "--input -") and the others from
"--input". With "--xargs", each line of standard input is a filename
for input.

Inputs must have identical schemas (columns, column order, and field
separators).
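
Both requirements above (pre-sorted inputs and identical schemas) can be checked before merging. The following is a minimal sketch, not part of Fsdb: the function name check_inputs is hypothetical, and it assumes tab-separated fields, a key column given by number, and lexical (byte-order) sorting.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of Fsdb.
# Assumes tab-separated fields and lexical (byte-order) sort on one column.
# Usage: check_inputs KEY_COLUMN_NUMBER file1 [file2 ...]
check_inputs() {
    key=$1; shift
    first_header=$(head -n 1 "$1")
    for f in "$@"; do
        # The header line carries the columns, their order, and the field
        # separator, so every input must have exactly the same header.
        if [ "$(head -n 1 "$f")" != "$first_header" ]; then
            echo "schema mismatch: $f" >&2
            return 1
        fi
        # Drop header and comment lines, keep only the key column, and
        # let sort -c report whether that column is already in order.
        if ! grep -v '^#' "$f" | cut -f "$key" | LC_ALL=C sort -c 2>/dev/null; then
            echo "not sorted on column $key: $f" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `check_inputs 2 a.fsdb b.fsdb` returns success only if both files share a header and are ordered on their second column.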

Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

Because this program is intended to merge multiple sources, it does not
default to reading from standard input. If you wish to read from
standard input, list - as an explicit input source.

Also, because we deal with multiple input files, this module doesn't
output anything until it's run.

dbmerge consumes a fixed amount of memory regardless of input size. It
therefore buffers output on disk as necessary. (Merging is implemented
as a series of two-way merges, so disk space is O(number of records).)
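
The idea of a series of two-way merges buffered on disk can be sketched with the standard `sort -m` utility standing in for dbmerge. This is an illustration under assumptions, not dbmerge's implementation: the function name merge_pairwise is hypothetical, the inputs are plain sorted text without Fsdb headers, and the merges are arranged as a simple left-to-right chain.

```shell
#!/bin/sh
# Illustration only: merge pre-sorted plain-text files two at a time with
# sort -m, buffering each intermediate result in a temporary file.
# Usage: merge_pairwise OUTPUT input1 [input2 ...]
merge_pairwise() {
    out=$1; shift
    # Temporary files go under TMPDIR if set, otherwise /tmp.
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/merge.XXXXXX") || return 1
    acc=$1; shift
    n=0
    for f in "$@"; do
        n=$((n + 1))
        next="$tmpdir/step.$n"
        # One two-way merge: both inputs are read sequentially, so memory
        # use does not grow with the size of the files.
        LC_ALL=C sort -m "$acc" "$f" > "$next" || return 1
        # Remove the previous intermediate file (never an original input).
        case $acc in "$tmpdir"/*) rm -f "$acc" ;; esac
        acc=$next
    done
    # Copy rather than move when the result is still an original input.
    case $acc in
        "$tmpdir"/*) mv "$acc" "$out" ;;
        *) cp "$acc" "$out" ;;
    esac
    rmdir "$tmpdir"
}
```

Only one intermediate file exists at a time here, so the temporary disk space stays proportional to the number of records merged so far.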

dbmerge will merge data in parallel, if possible. The "--parallelism"
option can control the degree of parallelism, if desired.

OPTIONS
General options:

--xargs
    Expect that input filenames are given, one per line, on standard
    input. (In this case, merging can start incrementally.)

--removeinputs
    Delete the source files after they have been consumed. (Defaults
    off, leaving the inputs in place.)

-T TmpDir
    Where to put temporary files. Uses the environment variable TMPDIR
    if -T is not specified. Default is /tmp.

--parallelism N or -j N
    Allow up to N merges to happen in parallel. Default is the number
    of CPUs in the machine.

--endgame (or --noendgame)
    Enable endgame mode, extra parallelism when finishing up. (On by
    default.)

Sort specification options (can be interspersed with column names):

-r or --descending
    Sort in reverse order (high to low).

-R or --ascending
    Sort in normal order (low to high).

-n or --numeric
    Sort numerically.

-N or --lexical
    Sort lexicographically.

This module also supports the standard fsdb options:

-d
    Enable debugging output.

-i or --input InputSource
    Read from InputSource, typically a file name, or "-" for standard
    input, or (if in Perl) an IO::Handle, Fsdb::IO or Fsdb::BoundedQueue
    object.

-o or --output OutputDestination
    Write to OutputDestination, typically a file name, or "-" for
    standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
    Fsdb::BoundedQueue object.

--autorun or --noautorun
    By default, programs process automatically, but Fsdb::Filter
    objects in Perl do not run until you invoke the run() method. The
    "--(no)autorun" option controls that behavior within Perl.

--header H
    Use H as the full Fsdb header, rather than reading a header from
    the input.

--help
    Show help.

--man
    Show full manual.

SAMPLE USAGE
Input:

File a.fsdb:

    #fsdb cid cname
    11 numanal
    10 pascal

File b.fsdb:

    #fsdb cid cname
    12 os
    13 statistics

These two files are both sorted by "cname", and they have identical
schemas.

Command:

    dbmerge --input a.fsdb --input b.fsdb cname

or

    cat a.fsdb | dbmerge --input - --input b.fsdb cname

Output:

    #fsdb cid cname
    11 numanal
    12 os
    10 pascal
    13 statistics
    # | dbmerge --input a.fsdb --input b.fsdb cname

SEE ALSO
dbmerge2(1), dbsort(1), Fsdb(3)

CLASS FUNCTIONS
new
    $filter = new Fsdb::Filter::dbmerge(@arguments);

Create a new object, taking command-line arguments.

set_defaults
    $filter->set_defaults();

Internal: set up defaults.

parse_options
    $filter->parse_options(@ARGV);

Internal: parse command-line arguments.

_pretty_fn
    _pretty_fn($fn)

Internal: pretty-print a filename or Fsdb::BoundedQueue.

segment_next_output
    $out = $self->segment_next_output($output_type)

Internal: return a Fsdb::IO::Writer as $out that points either to our
output or to a temporary file, depending on how things are going.

The $output_type can be 'final' or 'ipc' or 'file'.

segment_cleanup
    $out = $self->segment_cleanup($file);

Internal: Clean up a file, if necessary. (Sigh, used to be function
pointers, but not clear how they would interact with threads.)

_unique_id
    $id = $self->_unique_id()

Generate a sequence number for debugging.

segments_merge2_run
    $out = $self->segments_merge2_run($out_fn, $is_final_output,
                                      $in0, $in1);

Internal: do the actual merge2 work (maybe our parent put us in a
thread, maybe not).

enqueue_work
    $self->enqueue_work($depth, $work);

Internal: put $work on the queue at $depth, updating the max count.

segments_merge_one_depth
    $self->segments_merge_one_depth($depth);

Merge queued files, if any.

Also release any queued threads.

segments_xargs
    $self->segments_xargs();

Internal: read new filenames to process (from stdin) and send them to
the work queue.

Making a separate Fred to handle xargs is a lot of work, but it
guarantees it comes in on an IO::Handle that is selectable.

segments_merge_all
    $self->segments_merge_all()

Internal: Merge queued files, if any. Iterates over all depths of the
merge tree, and handles any forked threads.

Merging Strategy

Merging is done in a binary tree that is managed through the "_work"
queue. It has an array of "depth" entries, one for each level of the
tree.

Items are processed in order at each level of the tree, and only
level-by-level, so the sort is stable.
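
The level-by-level strategy and its effect on stability can be sketched as follows, again with `sort -m` standing in for the two-way merge. This is an illustration, not dbmerge's code: the function name merge_tree is hypothetical, the inputs are plain whitespace-separated text without Fsdb headers, and the tie-breaking relies on the behavior of `sort -m -s` in GNU coreutils, where records with equal keys are taken from the earlier file first.

```shell
#!/bin/sh
# Illustration only (not dbmerge's code): merge pre-sorted files as a
# binary tree, one level at a time, pairing neighbors in input order.
# Usage: merge_tree KEY_FIELD OUTPUT input1 [input2 ...]
merge_tree() {
    key=$1; out=$2; shift 2
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/tree.XXXXXX") || return 1
    depth=0
    # The positional parameters act as the work queue for one level:
    # items are taken from the front and results appended to the back.
    while [ $# -gt 1 ]; do
        depth=$((depth + 1))
        count=$#
        i=0
        while [ "$i" -lt "$count" ]; do
            if [ $((count - i)) -ge 2 ]; then
                merged="$tmpdir/d$depth.$i"
                # -s disables the last-resort whole-line comparison, so
                # equal keys keep the order of the files on the command line.
                LC_ALL=C sort -m -s -k "$key,$key" "$1" "$2" > "$merged" || return 1
                shift 2
                i=$((i + 2))
                set -- "$@" "$merged"
            else
                # Odd item out: carry it to the next level unchanged.
                carry=$1
                shift
                i=$((i + 1))
                set -- "$@" "$carry"
            fi
        done
    done
    cat "$1" > "$out"
    rm -rf "$tmpdir"
}
```

Because neighbors are always merged in their original order and a level is finished before the next one starts, records with equal keys from an earlier input stay ahead of those from a later input.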

Parallelism Model

Parallelism is also managed through the "_work" queue, each element of
which consists of one file or stream suitable for merging. The work
queue contains both ready output (files or BoundedQueue streams) that
can be immediately handled, and pairs of semaphore/pending output for
work that is not yet started. All manipulation of the work queue
happens in the main thread (with "segments_merge_all" and
"segments_merge_one_depth").

We start a thread to handle each item in the work queue, and limit
parallelism to the "_max_parallelism", defaulting to the number of
available processors.

There are two kinds of parallelism, regular and endgame. For regular
parallelism we pick two items off the work queue, merge them, and put
the result back on the queue as a new file. Items in the work queue
may not be ready. For in-progress items we wait until they are done.
For not-yet-started items we start them, then wait until they are done.
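
Regular parallelism with a cap on concurrent merges can be approximated in shell with background jobs. This sketch is a deliberate simplification and not dbmerge's scheduler: the function name merge_level is hypothetical, it handles a single level of the tree, and it waits for a whole batch to finish instead of tracking individual items.

```shell
#!/bin/sh
# Illustration only: merge adjacent pairs of pre-sorted files, running at
# most MAX_JOBS merges at the same time. A leftover odd input is ignored.
# Usage: merge_level MAX_JOBS OUTDIR input1 input2 [input3 input4 ...]
merge_level() {
    max_jobs=$1; outdir=$2; shift 2
    running=0
    n=0
    while [ $# -ge 2 ]; do
        n=$((n + 1))
        # Each two-way merge runs as its own background process and
        # writes its result to a new file.
        LC_ALL=C sort -m "$1" "$2" > "$outdir/merged.$n" &
        shift 2
        running=$((running + 1))
        # Once the cap is reached, wait for the whole batch to finish.
        if [ "$running" -ge "$max_jobs" ]; then
            wait
            running=0
        fi
    done
    # Wait for any merges still in flight before returning.
    wait
}
```

The results merged.1, merged.2, and so on would then be the inputs for the next level.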

Endgame parallelism handles the final stages of a large merge, when
there are enough processors that we can start merge jobs for all
remaining levels of the merge tree. At this point we switch from
merging to files to merging into "Fsdb::BoundedQueue" pipelines that
connect merge processes which start and run concurrently.
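
A rough shell analogue of the endgame idea replaces intermediate files with named pipes, so that all remaining merges run at once and stream into each other. This is only a sketch: the function name merge_streaming is hypothetical, named pipes stand in for "Fsdb::BoundedQueue", `sort -m` stands in for the merge step, and the tree is fixed at four plain-text inputs.

```shell
#!/bin/sh
# Illustration only: a two-level merge of four pre-sorted files in which
# intermediate results flow through named pipes, not temporary files.
# Usage: merge_streaming OUTPUT in1 in2 in3 in4
merge_streaming() {
    out=$1; shift
    pipedir=$(mktemp -d "${TMPDIR:-/tmp}/endgame.XXXXXX") || return 1
    mkfifo "$pipedir/left" "$pipedir/right" || return 1
    # The two lower-level merges start immediately and run concurrently.
    LC_ALL=C sort -m "$1" "$2" > "$pipedir/left" &
    LC_ALL=C sort -m "$3" "$4" > "$pipedir/right" &
    # The top-level merge consumes their output as it is produced.
    LC_ALL=C sort -m "$pipedir/left" "$pipedir/right" > "$out"
    status=$?
    wait
    rm -rf "$pipedir"
    return "$status"
}
```

No intermediate result touches the disk here; the cost is that all three merge processes must be able to run at the same time.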

The final merge is done in the main thread so that the main thread
can handle the output stream and record the merge action.

setup
    $filter->setup();

Internal: setup, parse headers.

run
    $filter->run();

Internal: run over each row.

AUTHOR and COPYRIGHT
Copyright (C) 1991-2019 by John Heidemann <johnh@isi.edu>

This program is distributed under terms of the GNU General Public
License, version 2. See the file COPYING with the distribution for
details.

perl v5.30.1    2020-01-30    Fsdb::Filter::dbmerge(3)