1Fsdb::Filter::dbmerge(3) User Contributed Perl Documentation Fsdb::Filter::dbmerge(3)
2
3
4
6 dbmerge - merge all inputs in sorted order based on the specified
7 columns
8
10 dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]
11
12 or
13 cat A.fsdb | dbmerge --input - --input B.fsdb [-T
14 TemporaryDirectory] [-nNrR] column [column...]
15
16 or
17 dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs
18 A.fsdb [B.fsdb ...]
19
21 Merge all provided, pre-sorted input files, producing one sorted
22 result. Inputs can both be specified with "--input", or one can come
23 from standard input and the other from "--input". With "--xargs", each
24 line of standard input is a filename for input.
25
26 Inputs must have identical schemas (columns, column order, and field
27 separators).
28
29 Unlike dbmerge2, dbmerge supports an arbitrary number of input files.
30
31 Because this program is intended to merge multiple sources, it does not
32 default to reading from standard input. To read from standard input,
33 list - as an explicit input source.
34
35 Also, because we deal with multiple input files, this module doesn't
36 output anything until it's run.
37
38 dbmerge consumes a fixed amount of memory regardless of input size. It
39 therefore buffers output on disk as necessary. (Merging is implemented
40 as a series of two-way merges, so disk space is O(number of records).)
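
The idea of reducing many sorted inputs through repeated two-way merges, with intermediate results buffered on disk, can be sketched as follows. This is an illustrative Python sketch, not dbmerge's implementation: the names merge2_files and merge_files are hypothetical, inputs are assumed to be plain newline-terminated text files, and the merges are folded left to right rather than arranged as a balanced tree.

```python
import os
import tempfile


def merge2_files(path_a, path_b, out_path, key=lambda line: line):
    """Merge two sorted text files into out_path, one line of each in memory."""
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        a, b = fa.readline(), fb.readline()
        while a and b:
            # "<=" takes from the first file on ties.
            if key(a) <= key(b):
                out.write(a)
                a = fa.readline()
            else:
                out.write(b)
                b = fb.readline()
        # One input is exhausted; copy the rest of the other.
        while a:
            out.write(a)
            a = fa.readline()
        while b:
            out.write(b)
            b = fb.readline()


def merge_files(paths, tmpdir, key=lambda line: line):
    """Reduce many sorted files to one by repeated two-way merges on disk."""
    pending = list(paths)
    if not pending:
        raise ValueError("need at least one input file")
    temps = set()  # intermediate files created by this function
    while len(pending) > 1:
        fd, out_path = tempfile.mkstemp(dir=tmpdir, suffix=".merge")
        os.close(fd)
        first, second = pending[0], pending[1]
        merge2_files(first, second, out_path, key)
        # Remove consumed intermediates so disk use stays bounded.
        for used in (first, second):
            if used in temps:
                os.remove(used)
                temps.discard(used)
        temps.add(out_path)
        pending = [out_path] + pending[2:]
    return pending[0]
```

Memory use is independent of file size because only one line from each input is held at a time; the cost is moved to temporary files in tmpdir.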
41
42 dbmerge will merge data in parallel, if possible. The "--parallelism"
43 option can control the degree of parallelism, if desired.
44
46 General options:
47
48 --xargs
49 Expect that input filenames are given, one-per-line, on standard
50 input. (In this case, merging can start incrementally.)
51
52 --removeinputs
53 Delete the source files after they have been consumed. (Defaults
54 off, leaving the inputs in place.)
55
56 -T TmpDir
57 Where to put temporary files. If -T is not specified, the environment
58 variable TMPDIR is used; if that is also unset, the default is /tmp.
59
60 --parallelism N or -j N
61 Allow up to N merges to happen in parallel. Default is the number
62 of CPUs in the machine.
63
64 --endgame (or --noendgame)
65 Enable endgame mode, extra parallelism when finishing up. (On by
66 default.)
67
68 Sort specification options (can be interspersed with column names):
69
70 -r or --descending
71 sort in reverse order (high to low)
72
73 -R or --ascending
74 sort in normal order (low to high)
75
76 -n or --numeric
77 sort numerically
78
79 -N or --lexical
80 sort lexicographically
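
How interspersed sort options might translate into a comparison can be sketched as below. This is a hypothetical Python illustration, not dbmerge's parser: it assumes each option applies to the column names that follow it, that the defaults are lexical and ascending, and it does not handle bundled flags such as "-nr". The names parse_sort_spec and make_compare are invented for the example.

```python
from functools import cmp_to_key


def parse_sort_spec(args):
    """Turn e.g. ["-n", "cid", "-N", "-r", "cname"] into per-column rules."""
    numeric, descending = False, False  # assumed defaults: lexical, ascending
    spec = []
    for arg in args:
        if arg in ("-n", "--numeric"):
            numeric = True
        elif arg in ("-N", "--lexical"):
            numeric = False
        elif arg in ("-r", "--descending"):
            descending = True
        elif arg in ("-R", "--ascending"):
            descending = False
        else:
            # Anything else is treated as a column name.
            spec.append((arg, numeric, descending))
    return spec


def make_compare(spec):
    """Build a cmp-style function over rows represented as dicts."""
    def compare(row_a, row_b):
        for column, numeric, descending in spec:
            a, b = row_a[column], row_b[column]
            if numeric:
                a, b = float(a), float(b)
            result = (a > b) - (a < b)
            if result:
                return -result if descending else result
        return 0
    return compare


rows = [{"cid": "9"}, {"cid": "11"}, {"cid": "10"}]
compare = make_compare(parse_sort_spec(["-n", "-r", "cid"]))
ordered = sorted(rows, key=cmp_to_key(compare))
```

Here ordered lists the rows by cid as a number, high to low; without "-n" the same rows would compare as strings.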
81
82 This module also supports the standard fsdb options:
83
84 -d Enable debugging output.
85
86 -i or --input InputSource
87 Read from InputSource, typically a file name, or "-" for standard
88 input, or (if in Perl) an IO::Handle, Fsdb::IO or Fsdb::BoundedQueue
89 object.
90
91 -o or --output OutputDestination
92 Write to OutputDestination, typically a file name, or "-" for
93 standard output, or (if in Perl) an IO::Handle, Fsdb::IO or
94 Fsdb::BoundedQueue object.
95
96 --autorun or --noautorun
97 By default, programs process automatically, but Fsdb::Filter
98 objects in Perl do not run until you invoke the run() method. The
99 "--(no)autorun" option controls that behavior within Perl.
100
101 --help
102 Show help.
103
104 --man
105 Show full manual.
106
108 Input:
109 File a.fsdb:
110
111 #fsdb cid cname
112 11 numanal
113 10 pascal
114
115 File b.fsdb:
116
117 #fsdb cid cname
118 12 os
119 13 statistics
120
121 These two files are both sorted by "cname", and they have identical
122 schemas.
123
124 Command:
125 dbmerge --input a.fsdb --input b.fsdb cname
126
127 or
128
129 cat a.fsdb | dbmerge --input - --input b.fsdb cname
130
131 Output:
132 #fsdb cid cname
133 11 numanal
134 12 os
135 10 pascal
136 13 statistics
137 # | dbmerge --input a.fsdb --input b.fsdb cname
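
The sample above can be reproduced outside Fsdb with a short Python sketch. It is only an illustration under simplifying assumptions: parse_fsdb is a hypothetical helper that handles just a bare "#fsdb" header with whitespace-separated fields (no header options), and the merge uses Python's heapq.merge rather than dbmerge.

```python
import heapq

A = """#fsdb cid cname
11 numanal
10 pascal
"""

B = """#fsdb cid cname
12 os
13 statistics
"""


def parse_fsdb(text):
    """Parse a minimal fsdb table into (column names, list of row dicts)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    if header[0] != "#fsdb":
        raise ValueError("missing #fsdb header")
    columns = header[1:]
    rows = [dict(zip(columns, line.split()))
            for line in lines[1:] if not line.startswith("#")]
    return columns, rows


cols_a, rows_a = parse_fsdb(A)
cols_b, rows_b = parse_fsdb(B)
# Inputs must share a schema before they can be merged.
if cols_a != cols_b:
    raise ValueError("schemas differ")

# Both inputs are already sorted by cname, so a streaming merge suffices.
merged = heapq.merge(rows_a, rows_b, key=lambda row: row["cname"])
merged_lines = [" ".join(row[c] for c in cols_a) for row in merged]
```

The rows in merged_lines come out in the same order as the data rows of the Output block: numanal, os, pascal, statistics.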
138
140 dbmerge2(1), dbsort(1), Fsdb(3)
141
143 new
144 $filter = new Fsdb::Filter::dbmerge(@arguments);
145
146 Create a new object, taking command-line arguments.
147
148 set_defaults
149 $filter->set_defaults();
150
151 Internal: set up defaults.
152
153 parse_options
154 $filter->parse_options(@ARGV);
155
156 Internal: parse command-line arguments.
157
158 _pretty_fn
159 _pretty_fn($fn)
160
161 Internal: pretty-print a filename or Fsdb::BoundedQueue.
162
163 segment_next_output
164 $out = $self->segment_next_output($output_type)
165
166 Internal: return a Fsdb::IO::Writer as $OUT that either points to our
167 output or a temporary file, depending on how things are going.
168
169 The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.
170
171 segment_cleanup
172 $out = $self->segment_cleanup($file);
173
174 Internal: Clean up a file, if necessary. (Sigh, used to be function
175 pointers, but not clear how they would interact with threads.)
176
177 _unique_id
178 $id = $self->_unique_id()
179
180 Generate a sequence number for debugging.
181
182 segments_merge2_run
183 $out = $self->segments_merge2_run($out_fn, $is_final_output,
184 $in0, $in1);
185
186 Internal: do the actual merge2 work (maybe our parent put us in a
187 thread, maybe not).
188
189 enqueue_work
190 $self->enqueue_work($depth, $work);
191
192 Internal: put $WORK on the queue at $DEPTH, updating the max count.
193
194 segments_merge_one_depth
195 $self->segments_merge_one_depth($depth);
196
197 Merge queued files, if any.
198
199 Also release any queued threads.
200
201 segments_xargs
202 $self->segments_xargs();
203
204 Internal: read new filenames to process (from stdin) and send them to
205 the work queue.
206
207 Making a separate Fred to handle xargs is a lot of work, but it
208 guarantees it comes in on an IO::Handle that is selectable.
209
210 segments_merge_all
211 $self->segments_merge_all()
212
213 Internal: Merge queued files, if any. Iterates over all depths of the
214 merge tree, and handles any forked threads.
215
216 Merging Strategy
217
218 Merging is done in a binary tree that is managed through the "_work" queue.
219 It has an array of "depth" entries, one for each level of the tree.
220
221 Items are processed in order at each level of the tree, and only level-
222 by-level, so the sort is stable.
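
A level-by-level merge tree with one queue per depth can be sketched as follows. This is a hypothetical Python model of the strategy described above, not the module's code: the class MergeTree and its method names are invented, in-memory lists stand in for files, and heapq.merge stands in for the two-way merge.

```python
import heapq


class MergeTree:
    def __init__(self, key):
        self.key = key
        self.work = []  # work[d] holds segments waiting at depth d, in order

    def enqueue(self, depth, segment):
        """Put a sorted segment on the queue at the given depth."""
        while len(self.work) <= depth:
            self.work.append([])
        self.work[depth].append(segment)

    def merge_one_depth(self, depth):
        """Merge adjacent pairs at one depth, in arrival order."""
        queue = self.work[depth]
        while len(queue) >= 2:
            left, right = queue.pop(0), queue.pop(0)
            # On equal keys heapq.merge takes from the earlier input first.
            merged = list(heapq.merge(left, right, key=self.key))
            self.enqueue(depth + 1, merged)

    def merge_all(self):
        """Walk the depths from the leaves upward and return the final result."""
        depth = 0
        while depth < len(self.work):
            self.merge_one_depth(depth)
            leftover = self.work[depth]
            is_last = depth == len(self.work) - 1
            if leftover and not is_last:
                # An unpaired segment moves up, after its earlier siblings.
                self.enqueue(depth + 1, leftover.pop(0))
            depth += 1
        return self.work[-1][0] if self.work and self.work[-1] else []


tree = MergeTree(key=lambda record: record[0])
for segment in ([(1, "a"), (3, "a")], [(1, "b"), (2, "b")], [(1, "c")]):
    tree.enqueue(0, segment)
result = tree.merge_all()
```

Because pairs are taken in order and an unpaired segment is queued behind the merged results of its earlier siblings, records with equal keys keep the order of the inputs they came from.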
223
224 Parallelism Model
225
226 Parallelism is also managed through the "_work" queue, each element of
227 which consists of one file or stream suitable for merging. The work
228 queue contains both ready output (files or BoundedQueue streams) that
229 can be immediately handled, and pairs of semaphore/pending output for
230 work that is not yet started. All manipulation of the work queue
231 happens in the main thread (with "segments_merge_all" and
232 "segments_merge_one_depth").
233
234 We start a thread to handle each item in the work queue, and limit
235 parallelism to the "_max_parallelism", defaulting to the number of
236 available processors.
237
238 There are two kinds of parallelism, regular and endgame. For regular
239 parallelism we pick two items off the work queue, merge them, and put
240 the result back on the queue as a new file. Items in the work queue
241 may not be ready. For in-progress items we wait until they are done.
242 For not-yet-started items we start them, then wait until they are done.
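
Regular parallelism can be pictured with a small Python sketch in which a worker pool, capped at a maximum degree of parallelism, merges pairs of ready segments and the results go back on the work list. This is only a structural illustration: parallel_merge and merge_pair are hypothetical names, lists stand in for files, and Python threads show the scheduling pattern rather than real CPU parallelism.

```python
import heapq
import os
from concurrent.futures import ThreadPoolExecutor


def merge_pair(left, right):
    """Two-way merge of two sorted lists."""
    return list(heapq.merge(left, right))


def parallel_merge(segments, max_parallelism=None):
    """Merge sorted segments pairwise, running up to max_parallelism merges at once."""
    workers = max_parallelism or os.cpu_count() or 1
    work = [list(segment) for segment in segments]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(work) > 1:
            # Start one job per adjacent pair; the pool limits concurrency.
            futures = [pool.submit(merge_pair, work[i], work[i + 1])
                       for i in range(0, len(work) - 1, 2)]
            # Wait for each job in order, then requeue its result.
            merged = [future.result() for future in futures]
            if len(work) % 2:
                merged.append(work[-1])
            work = merged
    return work[0] if work else []
```

Waiting on each future corresponds to the case where an item in the work queue is not yet ready.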
243
244 Endgame parallelism handles the final stages of a large merge. It
245 begins when there are enough processors to start merge jobs for all
246 remaining levels of the merge tree. At this point we switch from
247 merging to files to merging into "Fsdb::BoundedQueue" pipelines that
248 connect merge processes which start and run concurrently.
249
250 The final merge is done in the main thread so that the main thread
251 can handle the output stream and record the merge action.
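
The endgame arrangement, in which every remaining level of the merge tree runs at once and stages are connected by bounded queues, can be sketched in Python as below. This is an analogy, not the module's implementation: queue.Queue stands in for Fsdb::BoundedQueue, threads stand in for the merge processes, and the names feed, merge_streams and endgame_merge are hypothetical.

```python
import queue
import threading

DONE = object()  # end-of-stream marker


def feed(items, out_q):
    """Push every item into a bounded queue, then signal end of stream."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)


def merge_streams(in0, in1):
    """Yield the merged records of two queue-backed sorted streams."""
    a, b = in0.get(), in1.get()
    while a is not DONE or b is not DONE:
        if b is DONE or (a is not DONE and a <= b):
            yield a
            a = in0.get()
        else:
            yield b
            b = in1.get()


def endgame_merge(sorted_inputs, queue_size=4):
    threads = []

    def spawn(target, *args):
        thread = threading.Thread(target=target, args=args, daemon=True)
        thread.start()
        threads.append(thread)

    # One feeder per input stream.
    level = []
    for items in sorted_inputs:
        q = queue.Queue(maxsize=queue_size)
        spawn(feed, items, q)
        level.append(q)
    # Start a merge stage for every interior node of the tree.
    while len(level) > 2:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            q = queue.Queue(maxsize=queue_size)
            spawn(feed, merge_streams(level[i], level[i + 1]), q)
            next_level.append(q)
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
    if not level:
        return []
    if len(level) == 1:
        # Pair a lone stream with an empty one.
        empty = queue.Queue()
        empty.put(DONE)
        level.append(empty)
    # The final merge runs in the calling (main) thread.
    result = list(merge_streams(level[0], level[1]))
    for thread in threads:
        thread.join()
    return result
```

The bounded queue size keeps each stage from running far ahead of its consumer, so no intermediate result needs to be written out in full.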
252
253 setup
254 $filter->setup();
255
256 Internal: setup, parse headers.
257
258 run
259 $filter->run();
260
261 Internal: run over each row.
262
264 Copyright (C) 1991-2018 by John Heidemann <johnh@isi.edu>
265
266 This program is distributed under terms of the GNU General Public
267 License, version 2. See the file COPYING with the distribution for
268 details.
269
270
271
272perl v5.28.1 2018-02-17 Fsdb::Filter::dbmerge(3)