Fsdb::Filter::dbmerge(3)    User Contributed Perl Documentation    Fsdb::Filter::dbmerge(3)

NAME
dbmerge - merge all inputs in sorted order based on the specified
columns

SYNOPSIS
    dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or

    cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]

or

    dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]

or

    { echo "A.fsdb"; echo "B.fsdb"; } | dbmerge --xargs column [column...]

DESCRIPTION
Merge all provided, pre-sorted input files, producing one sorted
result. Inputs can all be given with repeated "--input" options or
after "--inputs", or one can come from standard input ("--input -")
and the rest from "--input". With "--xargs", each line of standard
input is a filename for input.

Inputs must have identical schemas (columns, column order, and field
separators).

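A quick pre-flight check along the following lines can catch a schema mismatch before merging. This is only a sketch: it assumes the schema is fully described by the first line of each file (the "#fsdb" header), and the file names and the same_schema helper are hypothetical, not part of Fsdb.

```shell
#!/bin/sh
# Sketch: compare the first (header) line of two Fsdb files.
# Assumption: the header line fully describes the schema.
# same_schema is a hypothetical helper, not part of Fsdb.
same_schema() {
    [ "$(head -n 1 "$1")" = "$(head -n 1 "$2")" ]
}

tmp=$(mktemp -d)
printf '#fsdb cid cname\n11\tnumanal\n10\tpascal\n' > "$tmp/a.fsdb"
printf '#fsdb cid cname\n12\tos\n13\tstatistics\n'  > "$tmp/b.fsdb"
printf '#fsdb cname cid\nos\t12\n'                  > "$tmp/c.fsdb"

# a and b share a header; c has the columns in a different order.
if same_schema "$tmp/a.fsdb" "$tmp/b.fsdb"; then ab=match; else ab=differ; fi
if same_schema "$tmp/a.fsdb" "$tmp/c.fsdb"; then ac=match; else ac=differ; fi
echo "a vs b: $ab"
echo "a vs c: $ac"
rm -rf "$tmp"
```

A plain string comparison is deliberately strict: it treats any difference in column names, column order, or header options as a mismatch.
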
Unlike dbmerge2, dbmerge supports an arbitrary number of input files.

Because this program is intended to merge multiple sources, it does not
default to reading from standard input. If you wish to read standard
input, give - as an explicit input source.

Also, because we deal with multiple input files, this module doesn't
output anything until it's run.

dbmerge consumes a fixed amount of memory regardless of input size. It
therefore buffers output on disk as necessary. (Merging is implemented
as a series of two-way merges and possibly an n-way merge at the end, so
disk space is O(number of records).)

dbmerge will merge data in parallel, if possible. The "--parallelism"
option can control the degree of parallelism, if desired.

OPTIONS
General options:

--xargs
    Expect that input filenames are given, one per line, on standard
    input. (In this case, merging can start incrementally.)

--removeinputs
    Delete the source files after they have been consumed. (Defaults
    off, leaving the inputs in place.)

-T TmpDir
    Where to put temporary files. Also uses the environment variable
    TMPDIR, if -T is not specified. Default is /tmp.

--parallelism N or -j N
    Allow up to N merges to happen in parallel. Default is the number
    of CPUs in the machine.

--endgame (or --noendgame)
    Enable endgame mode, extra parallelism when finishing up. (On by
    default.)

Sort specification options (can be interspersed with column names):

-r or --descending
    sort in reverse order (high to low)

-R or --ascending
    sort in normal order (low to high)

-n or --numeric
    sort numerically

-N or --lexical
    sort lexicographically

This module also supports the standard fsdb options:

-d  Enable debugging output.

-i or --input InputSource
    Read from InputSource, typically a file name, or "-" for standard
    input, or (if in Perl) an IO::Handle, Fsdb::IO, or
    Fsdb::BoundedQueue object.

-o or --output OutputDestination
    Write to OutputDestination, typically a file name, or "-" for
    standard output, or (if in Perl) an IO::Handle, Fsdb::IO, or
    Fsdb::BoundedQueue object.

--autorun or --noautorun
    By default, programs process automatically, but Fsdb::Filter
    objects in Perl do not run until you invoke the run() method. The
    "--(no)autorun" option controls that behavior within Perl.

--header H
    Use H as the full Fsdb header, rather than reading a header from
    the input.

--help
    Show help.

--man
    Show full manual.

SAMPLE USAGE
Input:
File a.fsdb:

    #fsdb cid cname
    11 numanal
    10 pascal

File b.fsdb:

    #fsdb cid cname
    12 os
    13 statistics

These two files are both sorted by "cname", and they have identical
schemas.

Command:
    dbmerge --input a.fsdb --input b.fsdb cname

or

    cat a.fsdb | dbmerge --input - --input b.fsdb cname

Output:
    #fsdb cid cname
    11 numanal
    12 os
    10 pascal
    13 statistics
    # | dbmerge --input a.fsdb --input b.fsdb cname

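To try the example, the two inputs can be created and checked for sort order before merging. This sketch makes several assumptions: fields are tab-separated (written here with printf), POSIX "sort -c" in the C locale is an adequate stand-in for checking that each file is already ordered by cname, and the temporary paths are invented for the example. It only calls dbmerge when the command is found on PATH.

```shell
#!/bin/sh
# Sketch: build the sample inputs, check their order, then merge.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
tab=$(printf '\t')
printf '#fsdb cid cname\n11\tnumanal\n10\tpascal\n' > "$tmp/a.fsdb"
printf '#fsdb cid cname\n12\tos\n13\tstatistics\n'  > "$tmp/b.fsdb"

sorted=yes
for f in "$tmp/a.fsdb" "$tmp/b.fsdb"; do
    # Skip the header line, then check that column 2 (cname) is in order.
    tail -n +2 "$f" | sort -c -t "$tab" -k 2,2 || sorted=no
done
echo "inputs sorted by cname: $sorted"

# Only merge if the inputs are in order and dbmerge is installed.
if [ "$sorted" = yes ] && command -v dbmerge >/dev/null 2>&1; then
    dbmerge --input "$tmp/a.fsdb" --input "$tmp/b.fsdb" cname
fi
rm -rf "$tmp"
```
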
SEE ALSO
dbmerge2(1), dbsort(1), Fsdb(3)

CLASS FUNCTIONS
new
    $filter = new Fsdb::Filter::dbmerge(@arguments);

Create a new object, taking command-line arguments.

set_defaults
    $filter->set_defaults();

Internal: set up defaults.

parse_options
    $filter->parse_options(@ARGV);

Internal: parse command-line arguments.

_pretty_fn
    _pretty_fn($fn)

Internal: pretty-print a filename or Fsdb::BoundedQueue.

segment_next_output
    $out = $self->segment_next_output($output_type)

Internal: return a Fsdb::IO::Writer as $OUT that either points to our
output or a temporary file, depending on how things are going.

The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.

segment_cleanup
    $out = $self->segment_cleanup($file);

Internal: Clean up a file, if necessary. (Sigh, used to be function
pointers, but not clear how they would interact with threads.)

_unique_id
    $id = $self->_unique_id()

Generate a sequence number for debugging.

segments_merge2_run
    $out = $self->segments_merge2_run($out_fn, $is_final_output,
                                      $in0, $in1, $id);

Internal: do the actual merge2 work (maybe our parent put us in a
thread, maybe not).

segments_merge1_run
    $out = $self->segments_merge1_run($out_fn, $in0);

Internal: a special case of merge1 when we have only one file.

enqueue_work
    $self->enqueue_work($depth, $work);

Internal: put $WORK on the queue at $DEPTH, updating the max count.

segments_merge_one_depth
    $self->segments_merge_one_depth($depth);

Merge queued files, if any.

Also release any queued threads.

segments_xargs
    $self->segments_xargs();

Internal: read new filenames to process (from stdin) and send them to
the work queue.

Making a separate Fred to handle xargs is a lot of work, but it
guarantees it comes in on an IO::Handle that is selectable.

Merging Strategy

Merging is done in a binary tree that is managed through the "_work"
queue. The queue has an array of "depth" entries, one for each level of
the tree.

Items are processed in order at each level of the tree, and only
level-by-level, so the sort is stable.

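The level-by-level pairing can be illustrated outside dbmerge with POSIX "sort -m", which merges already-sorted files. The following is a sketch of the idea only, not dbmerge's implementation: it works on plain headerless text rather than Fsdb files, runs serially with no work queue, and all file names are made up for the example.

```shell
#!/bin/sh
# Sketch: merge pre-sorted files pairwise, one tree level at a time.
# Uses POSIX "sort -m" on headerless text as a stand-in for dbmerge.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
printf 'apple\nmango\n' > "$tmp/in0"
printf 'banana\npear\n' > "$tmp/in1"
printf 'cherry\nkiwi\n' > "$tmp/in2"

set -- "$tmp/in0" "$tmp/in1" "$tmp/in2"   # files at the current level
level=0
while [ "$#" -gt 1 ]; do
    level=$((level + 1))
    n=0
    next=""
    while [ "$#" -gt 0 ]; do
        out="$tmp/level$level.$n"
        if [ "$#" -ge 2 ]; then
            sort -m "$1" "$2" > "$out"    # two-way merge of neighbors
            shift 2
        else
            cp "$1" "$out"                # odd one out moves up unchanged
            shift
        fi
        next="$next $out"
        n=$((n + 1))
    done
    # Assumes the temporary paths contain no whitespace.
    set -- $next
done
merged=$(cat "$1")
echo "$merged"
rm -rf "$tmp"
```

Each pass halves the number of files, so three inputs need two levels; intermediate results live on disk, which mirrors why the disk usage described earlier grows with the number of records.
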
Parallelism Model

Parallelism is also managed through the "_work" queue, each element of
which consists of one file or stream suitable for merging. The work
queue contains both ready output (files or BoundedQueue streams) that
can be immediately handled, and pairs of semaphore/pending output for
work that is not yet started. All manipulation of the work queue
happens in the main thread (with "segments_merge_all" and
"segments_merge_one_depth").

We start a thread to handle each item in the work queue, and limit
parallelism to the "_max_parallelism", defaulting to the number of
available processors.

There are two kinds of parallelism, regular and endgame. For regular
parallelism we pick two items off the work queue, merge them, and put
the result back on the queue as a new file. Items in the work queue
may not be ready. For in-progress items we wait until they are done.
For not-yet-started items we start them, then wait until they are done.

Endgame parallelism handles the final stages of a large merge, when
there are enough processors that we can start merge jobs for all
remaining levels of the merge tree. At this point we switch from
merging into files to merging into "Fsdb::BoundedQueue" pipelines that
connect merge processes which start and run concurrently.

The final merge is done in the main thread so that the main thread
can handle the output stream and record the merge action.

setup
    $filter->setup();

Internal: setup, parse headers.

run
    $filter->run();

Internal: run over each row.

AUTHOR and COPYRIGHT
Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>

This program is distributed under terms of the GNU general public
license, version 2. See the file COPYING with the distribution for
details.

perl v5.38.0                      2023-07-20                      Fsdb::Filter::dbmerge(3)