Fsdb::Filter::dbjoin(3pm)

1Fsdb::Filter::dbjoin(3)User Contributed Perl DocumentatioFnsdb::Filter::dbjoin(3)
2
3
4

NAME

6       dbjoin - join two tables on common columns
7

SYNOPSIS

9           dbjoin [-Sid] --input table1.fsdb --input table2.fsdb [-nNrR] column [column...]
10
11       OR
12
13           cat table1.fsdb  | dbjoin [-Sid] --input table2.fsdb [-nNrR] column [column...]
14

DESCRIPTION

16       Does a natural, inner join on TABLE1 and TABLE2 the specified columns.
17       With the "-a" option, or with "-t outer" it will do a natural, full
18       outer join.
19
20       (Database review: inner joints output records only when there are
21       matches in both tables and will omit records that do not match.  Outer
22       joins output all records from both tables, filling with the empty value
23       as needed.  Right (left) outer joins keep all elements of the right
24       (left) table, even those that don't match in the other table.)
25
26       By default for non-hash joins, data will be sorted lexically, but the
27       usual sorting options can be mixed with the column specification.
28
29       Because two tables are required, input is typically in files.  Standard
30       input is accessible by the file "-".
31
32       If only one input is given, the first (left) input is taken from stdin.
33

RESOURCE REQUIREMENTS AND PERFORMANCE

35       Joins can be expensive.  Most databases have a query optimizer that
36       knows something about the data and so can select algorithms for
37       efficent operation, in Fsdb, you are that optimizer.
38
39       For non-hash joins: If data is already sorted, dbjoin will run more
40       efficiently by telling dbjoin the data is sorted with the "-S".
41
42       The resource requirements dbjoin vary.  If input data is sorted and
43       "-S" is given, then memory consumption is bounded by the the sum of the
44       largest number of records in either dataset with the same value in the
45       join column, and there is no disk consumption.  If data is not sorted,
46       then dbjoin requires disk storage the size of both input files.
47
48       One can minimize memory consumption by making sure each record of
49       table1 matches relatively few records in table2.  Typically this means
50       that table2 should be the smaller.  For example, given two files:
51       people.fsdb (schema: name iso_country_code) and countries.fsdb (schema:
52       iso_country_code full_country_name), then
53
54           dbjoin -i people.fsdb -i countries.fsdb iso_country_code
55
56       will require less memory than
57
58           dbjoin -i countries.fsdb -i people.fsdb iso_country_code
59
60       if there are many people per country (as one would expect).  If warning
61       "lots of matching rows accumulating in memory" appears, this is the
62       cause and try swapping join order.
63
64       For hash joins (that is, with "-m righthash" or "-m lefthash"): all of
65       the right table (the second input) or the left (the first) is loaded
66       into memory (and "hashed").  The other table need not be sorted.
67       Runtime is O(n), but memory is O(size of hashed table).
68

OPTIONS

70       -a or --all
71           Perform a full outer join, include non-matches (each record which
72           doesn't match at all will appear once).  Default is an inner join.
73
74       -t TYPE or --type TYPE
75           Explicitly specify the join type.  TYPE must be inner, outer, left
76           (outer), right (outer).  (Recall tha inner join requires data on
77           both sides, outer joins keep all records from both sides for outer,
78           or all of the first or second input for left and right outer
79           joins.)  Default: inner.
80
81       -m METHOD or --method METHOD
82           Select join method (algorithm).  Choices are merge, righthash, and
83           lefthash.  Default: merge.
84
85       -S or --pre-sorted
86           assume (and verify) data is already sorted
87
88       -e E or --empty E
89           give value E as the value for empty (null) records
90
91       -T TmpDir
92           where to put tmp files.  Also uses environment variable TMPDIR, if
93           -T is not specified.  Default is /tmp.
94
95       Sort specification options (can be interspersed with column names):
96
97       -r or --descending
98           sort in reverse order (high to low)
99
100       -R or --ascending
101           sort in normal order (low to high)
102
103       -n or --numeric
104           sort numerically
105
106       -N or --lexical
107           sort lexicographically
108
109       This module also supports the standard fsdb options:
110
111       -d  Enable debugging output.
112
113       -i or --input InputSource
114           Read from InputSource, typically a file name, or "-" for standard
115           input, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue
116           objects.
117
118       -o or --output OutputDestination
119           Write to OutputDestination, typically a file name, or "-" for
120           standard output, or (if in Perl) a IO::Handle, Fsdb::IO or
121           Fsdb::BoundedQueue objects.
122
123       --autorun or --noautorun
124           By default, programs process automatically, but Fsdb::Filter
125           objects in Perl do not run until you invoke the run() method.  The
126           "--(no)autorun" option controls that behavior within Perl.
127
128       --help
129           Show help.
130
131       --man
132           Show full manual.
133

SAMPLE USAGE

135   Input:
136           #fsdb sid cid
137           1 10
138           2 11
139           1 12
140           2 12
141
142       And in the file DATA/classes:
143
144           #fsdb cid cname
145           10 pascal
146           11 numanal
147           12 os
148
149   Command:
150           cat DATA/reg.fsdb | dbsort -n cid | dbjoin -i - -i DATA/classes -n cid
151
152   Output:
153           #fsdb      cid     sid     cname
154           10      1       pascal
155           11      2       numanal
156           12      1       os
157           12      2       os
158           # - COMMENTS:
159           #  | /home/johnh/BIN/DB/dbsort -n cid
160           # DATA/classes COMMENTS:
161           # joined comments:
162           #  | /home/johnh/BIN/DB/dbjoin - DATA/classes cid
163

CLASS FUNCTIONS

168   new
169           $filter = new Fsdb::Filter::dbjoin(@arguments);
170
171       Create a new dbjoin object, taking command-line arguments.
172
173   set_defaults
174           $filter->set_defaults();
175
176       Internal: set up defaults.
177
178   parse_options
179           $filter->parse_options(@ARGV);
180
181       Internal: parse command-line arguments.
182
183   setup
184           $filter->setup();
185
186       Internal: setup, parse headers.
187
188   run_merge_join
189           $filter->run_merge_join();
190
191       Internal: run over each rows.
192
193   run_hash_join
194           $filter->run_hash_join();
195
196       Internal: run over each rows, doing a hash join.
197
198   run
199           $filter->run();
200
201       Internal: run over each rows.
202

AUTHOR and COPYRIGHT

204       Copyright (C) 1991-2022 by John Heidemann <johnh@isi.edu>
205
206       This program is distributed under terms of the GNU general public
207       license, version 2.  See the file COPYING with the distribution for
208       details.
209
210
211
212perl v5.38.0                      2023-07-20           Fsdb::Filter::dbjoin(3)