CDB_File(3)           User Contributed Perl Documentation          CDB_File(3)

NAME
CDB_File - Perl extension for access to cdb databases

SYNOPSIS
    use CDB_File;
    $c = tie(%h, 'CDB_File', 'file.cdb') or die "tie failed: $!\n";

    # If accessing a utf8 stored CDB_File
    $c = tie(%h, 'CDB_File', 'file.cdb', utf8 => 1) or die "tie failed: $!\n";

    $fh = $c->handle;
    sysseek $fh, $c->datapos, 0 or die ...;
    sysread $fh, $x, $c->datalen;
    undef $c;
    untie %h;

    $t = CDB_File->new('t.cdb', "t.$$") or die ...;
    $t->insert('key', 'value');
    $t->finish;

    CDB_File::create %t, $file, "$file.$$";

or

    use CDB_File 'create';
    create %t, $file, "$file.$$";

    # If you want to store the data in utf8 mode.
    create %t, $file, "$file.$$", utf8 => 1;

DESCRIPTION
CDB_File is a module which provides a Perl interface to Dan Bernstein's
cdb package:

    cdb is a fast, reliable, lightweight package for creating and
    reading constant databases.

Reading from a cdb
After the "tie" shown above, accesses to %h will refer to the cdb file
"file.cdb", as described in "tie" in perlfunc.

Low level access to the database is provided by the three methods
"handle", "datapos", and "datalen". To use them, you must remember the
"CDB_File" object returned by the "tie" call: $c in the example above.
The "datapos" and "datalen" methods return the file offset position and
length respectively of the most recently visited key (for example, via
"exists").
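
As a minimal sketch of the low level interface, the following builds a
one-record database and then reads the value straight from the file.
The temporary directory, the file name, and the key "big" are
illustrative assumptions; the CDB_File calls are the ones named in this
document.

```perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

# Hypothetical scratch location and key, chosen for illustration only.
my $dir   = tempdir(CLEANUP => 1);
my $file  = "$dir/demo.cdb";
my $value = 'x' x 100;

my $maker = CDB_File->new($file, "$file.$$") or die "new failed: $!\n";
$maker->insert('big', $value);
$maker->finish or die "finish failed: $!\n";

my %h;
my $c = tie(%h, 'CDB_File', $file) or die "tie failed: $!\n";

# Visiting the key (here via exists) sets datapos and datalen.
exists $h{big} or die "key not found\n";

my $fh = $c->handle;
sysseek $fh, $c->datapos, 0 or die "sysseek failed: $!\n";
my $got = sysread $fh, my $buf, $c->datalen;
defined $got or die "sysread failed: $!\n";

print length($buf), " bytes read\n";

# Drop the extra reference to the object before calling untie.
undef $c;
untie %h;
```

For a large value the single "sysread" would normally be replaced by a
loop over fixed-size chunks, so that the whole value never sits in
memory at once.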

Beware that if you create an extra reference to the "CDB_File" object
(like $c in the example above) you must destroy it (with "undef")
before calling "untie" on the hash. This ensures that the object's
"DESTROY" method is called. Note that "perl -w" will check this for
you; see perltie for further details.

Creating a cdb
A cdb file is created in three steps. First call
"CDB_File->new($final, $tmp)", where $final is the name of the database
to be created, and $tmp is the name of a temporary file which can be
atomically renamed to $final. Secondly, call the "insert" method once
for each (key, value) pair. Finally, call the "finish" method to
complete the creation and renaming of the cdb file.

Alternatively, call the "insert()" method once with multiple key/value
pairs. This can be significantly faster because there are fewer
crossings between Perl and C code. One simple way to do this is to
pass in an entire hash, as in: "$cdbmaker->insert(%hash);".
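
The three steps, with a single bulk "insert", might look like the
sketch below. The scratch directory, the file name, and the contents of
%hash are assumptions made up for the example.

```perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # hypothetical scratch directory
my $file = "$dir/colours.cdb";      # hypothetical file name

my %hash = (red => '#f00', green => '#0f0', blue => '#00f');

# Step 1: name the final file and a temporary file beside it.
my $cdbmaker = CDB_File->new($file, "$file.$$") or die "new failed: $!\n";

# Step 2: one call inserts every pair in the hash.
$cdbmaker->insert(%hash);

# Step 3: write the hash tables and rename the temporary file.
$cdbmaker->finish or die "finish failed: $!\n";

# Read the result back to confirm what was stored.
tie my %colour, 'CDB_File', $file or die "tie failed: $!\n";
my @keys  = keys %colour;
my $green = $colour{green};
untie %colour;

print scalar(@keys), " records, green is $green\n";
```

Keeping the temporary file in the same directory as the final file is
what makes the rename in "finish" atomic on most filesystems.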

A simpler interface to cdb file creation is provided by
"CDB_File::create %t, $final, $tmp". This creates a cdb file named
$final containing the contents of %t. As before, $tmp must name a
temporary file which can be atomically renamed to $final.
"CDB_File::create" may be imported.

UTF8 support
When CDB_File was created in 1997 (prior even to Perl 5.6), Perl SVs
didn't really deal with UTF8. In order to properly store mixed bytes
and utf8 data in the file, we would normally need to store a bit for
each string which records the encoding of the keys and values. This
would be useful since Perl hash keys are downgraded to bytes when
possible so as to normalize the hash key access regardless of encoding.

The CDB_File format is used outside of Perl and so must maintain file
format compatibility with those systems. As a result this module
provides a utf8 mode which must be enabled when the database is
generated and again when it is read. Keys will always be stored as UTF8
strings, which is the opposite of how Perl stores the strings. This
approach had to be taken to ensure that no data corruption occurs due
to SVs being accidentally downgraded before they are stored or on
retrieval.

You can enable utf8 mode by passing "utf8 => 1" to "new", "tie", or
"create". All returned SVs while in this mode will be encoded in utf8.
This feature is not available below Perl 5.14 due to lack of Perl macro
support.

NOTE: reading and writing databases that are not stored in utf8 mode
will often be incompatible with any non-ASCII data.
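
A hedged sketch of a round trip in utf8 mode follows. It assumes Perl
5.14 or later and a CDB_File build with utf8 support; the scratch
directory, file name, and sample strings are invented for the example.

```perl
use strict;
use warnings;
use utf8;                           # the literals below are character strings
use CDB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # hypothetical scratch directory
my $file = "$dir/utf8.cdb";         # hypothetical file name

# Enable utf8 mode when the database is generated...
my %t = ('café' => 'naïve');
CDB_File::create(%t, $file, "$file.$$", utf8 => 1)
    or die "create failed: $!\n";

# ...and again when it is read.
tie my %h, 'CDB_File', $file, utf8 => 1 or die "tie failed: $!\n";
my $value = $h{'café'};
untie %h;

binmode STDOUT, ':encoding(UTF-8)';
print defined $value ? "found: $value\n" : "not found\n";
```

Both calls carry "utf8 => 1"; opening the same file without it would
hand back raw bytes rather than character strings.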

EXAMPLES
These are all complete programs.

1. Convert a Berkeley DB (B-tree) database to cdb format.

    use CDB_File;
    use DB_File;

    tie %h, 'DB_File', $ARGV[0], O_RDONLY, undef, $DB_BTREE or
        die "$0: can't tie to $ARGV[0]: $!\n";

    CDB_File::create %h, $ARGV[1], "$ARGV[1].$$" or
        die "$0: can't create cdb: $!\n";

2. Convert a flat file to cdb format. In this example, the flat file
consists of one key per line, separated by a colon from the value.
Blank lines and lines beginning with # are skipped.

    use CDB_File;

    $cdb = CDB_File->new("data.cdb", "data.$$") or
        die "$0: new CDB_File failed: $!\n";
    while (<>) {
        next if /^$/ or /^#/;
        chomp;    # remove the newline only, even on an unterminated last line
        ($k, $v) = split /:/, $_, 2;
        if (defined $v) {
            $cdb->insert($k, $v);
        } else {
            warn "bogus line: $_\n";
        }
    }
    $cdb->finish or die "$0: CDB_File finish failed: $!\n";

3. Perl version of cdbdump.

    use CDB_File;

    tie %data, 'CDB_File', $ARGV[0] or
        die "$0: can't tie to $ARGV[0]: $!\n";
    while (($k, $v) = each %data) {
        print '+', length($k), ',', length($v), ":$k->$v\n";
    }
    print "\n";

4. For really enormous data values, you can use "handle", "datapos",
and "datalen", in combination with "sysseek" and "sysread", to avoid
reading the values into memory. Here is the script bun-x.pl, which can
extract uncompressed files and directories from a bun file.

    use CDB_File;

    # Split a string of concatenated netstrings ("3:abc,") into a list.
    sub unnetstrings {
        my ($netstrings) = @_;
        my @result;
        while ($netstrings =~ s/^([0-9]+)://) {
            push @result, substr($netstrings, 0, $1, '');
            $netstrings =~ s/^,//;
        }
        return @result;
    }

    my $chunk = 8192;

    sub extract {
        my ($file, $t, $bun) = @_;
        my $head = $$bun{"H$file"};
        my ($code, $type) = $head =~ m/^([0-9]+)(.)/;
        if ($type eq "/") {
            mkdir $file, 0777;
        } elsif ($type eq "_") {
            my ($total, $now, $got, $x);
            open my $out, '>', $file or die "open for output: $!\n";
            # "exists" visits the key, which sets datapos and datalen.
            exists $$bun{"D$code"} or die "corrupt bun file\n";
            my $fh = $t->handle;
            sysseek $fh, $t->datapos, 0 or die "seek error: $!\n";
            $total = $t->datalen;
            while ($total) {
                $now = ($total > $chunk) ? $chunk : $total;
                $got = sysread $fh, $x, $now;
                if (not $got) { die "read error\n"; }
                $total -= $got;
                print $out $x;
            }
            close $out;
        } else {
            print STDERR "warning: skipping unknown file type\n";
        }
    }

    die "usage\n" if @ARGV != 1;

    my (%b, $t);
    $t = tie %b, 'CDB_File', $ARGV[0] or die "tie: $!\n";
    extract($_, $t, \%b) for unnetstrings($b{""});

5. Although a cdb file is constant, you can simulate updating it in
Perl. This is an expensive operation, as you have to create a new
database, and copy into it everything that's unchanged from the old
database. (As compensation, the update does not affect database
readers. The old database is available for them until the moment the
new one is "finish"ed.)

    use CDB_File;

    $file = 'data.cdb';
    $new = CDB_File->new($file, "$file.$$") or
        die "$0: new CDB_File failed: $!\n";

    # Add the new values; remember which keys we've seen.
    while (<>) {
        chomp;
        ($k, $v) = split;
        $new->insert($k, $v);
        $seen{$k} = 1;
    }

    # Add any old values that haven't been replaced.
    tie %old, 'CDB_File', $file or die "$0: can't tie to $file: $!\n";
    while (($k, $v) = each %old) {
        $new->insert($k, $v) unless $seen{$k};
    }

    $new->finish or die "$0: CDB_File finish failed: $!\n";

REPEATED KEYS
Most users can ignore this section.

A cdb file can contain repeated keys. If the "insert" method is called
more than once with the same key during the creation of a cdb file,
that key will be repeated.

Here's an example.

    $cdb = CDB_File->new("$file.cdb", "$file.$$") or die ...;
    $cdb->insert('cat', 'gato');
    $cdb->insert('cat', 'chat');
    $cdb->finish;

Normally, any attempt to access a key retrieves the first value stored
under that key. This code snippet always prints gato.

    $catref = tie %catalogue, 'CDB_File', "$file.cdb" or die ...;
    print "$catalogue{cat}";

However, all the usual ways of iterating over a hash ("keys",
"values", and "each") do the Right Thing, even in the presence of
repeated keys. This code snippet prints cat cat gato chat.

    print join(' ', keys %catalogue, values %catalogue);

And these two both print cat:gato cat:chat, although the second is more
efficient.

    foreach $key (keys %catalogue) {
        print "$key:$catalogue{$key} ";
    }

    while (($key, $val) = each %catalogue) {
        print "$key:$val ";
    }

The "multi_get" method retrieves all the values associated with a key.
It returns a reference to an array containing all the values. This
code prints gato chat.

    print "@{$catref->multi_get('cat')}";

"multi_get" always returns an array reference. If the key was not
found in the database, it will be a reference to an empty array. To
test whether the key was found, you must test the array, and not the
reference.

    $x = $catref->multi_get($key);
    warn "$key not found\n" unless $x;  # WRONG; message never printed
    warn "$key not found\n" unless @$x; # Correct

The "fetch_all" method returns a hashref that maps every key to its
first value in the cdb. This is useful for quickly loading a cdb file
where there is a 1:1 key mapping. In practice it proved to be about
400% faster than iterating a tied hash.

    # Slow
    my %copy = %tied_cdb;

    # Much Faster
    my $copy_hashref = $catref->fetch_all();
288
RETURN VALUES
The routines "tie", "new", and "finish" return undef if the attempted
operation failed; $! contains the reason for failure.

DIAGNOSTICS
The following fatal errors may occur. (See "eval" in perlfunc if you
want to trap them.)

Modification of a CDB_File attempted
    You attempted to modify a hash tied to a CDB_File.

CDB database too large
    You attempted to create a cdb file larger than 4 gigabytes.

[ Write to | Read of | Seek in ] CDB_File failed: <error string>
    If error string is Protocol error, you tried to "use CDB_File" to
    access something that isn't a cdb file. Otherwise a serious OS
    level problem occurred, for example, you have run out of disk
    space.
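
Trapping one of these errors with "eval" could look like the following
sketch. The scratch directory, file name, and hash contents are
assumptions for the example; the message matched is the first
diagnostic listed above.

```perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # hypothetical scratch directory
my $file = "$dir/ro.cdb";           # hypothetical file name

my %t = (key => 'value');
CDB_File::create(%t, $file, "$file.$$") or die "create failed: $!\n";

tie my %h, 'CDB_File', $file or die "tie failed: $!\n";

# A store into the tied hash is fatal; eval leaves the message in $@.
my $ok    = eval { $h{key} = 'changed'; 1 };
my $error = $ok ? '' : $@;
print "trapped: $error" if $error;

untie %h;
```

Copying $@ into a lexical straight after the "eval" keeps the message
safe from later code that may reset it.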

PERFORMANCE
Sometimes you need to get the most performance possible out of a
library. Rumour has it that perl's tie() interface is slow. In order to
get around that you can use CDB_File in an object oriented fashion,
rather than via tie().

    my $cdb = CDB_File->TIEHASH('/path/to/cdbfile.cdb');

    if ($cdb->EXISTS('key')) {
        print "Key is: ", $cdb->FETCH('key'), "\n";
    }

For more information on the methods available on tied hashes see
perltie.

THE ALGORITHM
This algorithm is described at <http://cr.yp.to/cdb/cdb.txt>. It is
small enough that it is included inline in the event that the internet
loses the page:

    A structure for constant databases
    Copyright (c) 1996 D. J. Bernstein, djb@pobox.com

A cdb is an associative array: it maps strings ("keys") to strings
("data").

A cdb contains 256 pointers to linearly probed open hash tables. The
hash tables contain pointers to (key,data) pairs. A cdb is stored in a
single file on disk:

    +----------------+---------+-------+-------+-----+---------+
    | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
    +----------------+---------+-------+-------+-----+---------+

Each of the 256 initial pointers states a position and a length. The
position is the starting byte position of the hash table. The length is
the number of slots in the hash table.

Records are stored sequentially, without special alignment. A record
states a key length, a data length, the key, and the data.

Each hash table slot states a hash value and a byte position. If the
byte position is 0, the slot is empty. Otherwise, the slot points to a
record whose key has that hash value.

Positions, lengths, and hash values are 32-bit quantities, stored in
little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.

The cdb hash function is "h = ((h << 5) + h) ^ c", with a starting hash
of 5381.
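
The hash function and the lookup procedure can be sketched in pure
Perl. This is an illustration of the format as described, not the code
CDB_File itself runs. The names cdb_hash, cdb_image, and cdb_lookup are
hypothetical; cdb_image is a toy in-memory writer included only so
there is something to look up, and its choice of two slots per record
is an assumption. The 32-bit masking assumes a perl built with 64-bit
integers.

```perl
use strict;
use warnings;

# h = ((h << 5) + h) ^ c over each byte, starting at 5381, kept to 32 bits.
sub cdb_hash {
    my ($key) = @_;
    my $h = 5381;
    for my $c (unpack 'C*', $key) {
        $h = ((($h << 5) + $h) ^ $c) & 0xFFFFFFFF;
    }
    return $h;
}

# Toy writer (assumption): returns the bytes of a cdb holding %pairs.
sub cdb_image {
    my (%pairs) = @_;
    my $records = '';
    my @tables  = map { [] } 0 .. 255;
    my $pos     = 2048;                  # records follow the 256 pointers
    for my $k (sort keys %pairs) {
        my $h = cdb_hash($k);
        push @{ $tables[$h % 256] }, [$h, $pos];
        my $rec = pack('V2', length $k, length $pairs{$k}) . $k . $pairs{$k};
        $records .= $rec;
        $pos     += length $rec;
    }
    my ($header, $hashes) = ('', '');
    for my $t (@tables) {
        my $nslots = 2 * @$t;            # assumption: two slots per record
        $header .= pack 'V2', $pos, $nslots;
        my @slots = map { [0, 0] } 1 .. $nslots;
        for my $entry (@$t) {
            my $i = ($entry->[0] >> 8) % $nslots;
            $i = ($i + 1) % $nslots while $slots[$i][1];   # linear probing
            $slots[$i] = $entry;
        }
        $hashes .= pack 'V2', @$_ for @slots;
        $pos    += 8 * $nslots;
    }
    return $header . $records . $hashes;
}

# Locate a record as described above; returns the data or nothing.
sub cdb_lookup {
    my ($cdb, $key) = @_;
    my $h = cdb_hash($key);
    my ($tpos, $nslots) = unpack 'V2', substr($cdb, 8 * ($h % 256), 8);
    return unless $nslots;
    my $slot = ($h >> 8) % $nslots;
    for (1 .. $nslots) {
        my ($slot_hash, $rpos) = unpack 'V2', substr($cdb, $tpos + 8 * $slot, 8);
        return if $rpos == 0;            # empty slot: key is absent
        if ($slot_hash == $h) {
            my ($klen, $dlen) = unpack 'V2', substr($cdb, $rpos, 8);
            return substr($cdb, $rpos + 8 + $klen, $dlen)
                if substr($cdb, $rpos + 8, $klen) eq $key;
        }
        $slot = ($slot + 1) % $nslots;   # wrap to the start of the table
    }
    return;
}

my $image = cdb_image(cat => 'gato', dog => 'perro');
printf "hash(cat) = %u\n", cdb_hash('cat');
print  "cat -> ", cdb_lookup($image, 'cat'), "\n";
print  "cow -> ", cdb_lookup($image, 'cow') // '(not found)', "\n";
```

The lookup compares the stored hash before reading the record, so most
mismatches are rejected without touching the record area at all.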

BUGS
The "create()" interface could be done with "TIEHASH".

SEE ALSO
cdb(3)

AUTHOR
Tim Goodwin, <tjg@star.le.ac.uk>. CDB_File began on 1997-01-08.

Work provided through 2008 by Matt Sergeant, <matt@sergeant.org>

Now maintained by Todd Rinaldo, <toddr@cpan.org>

perl v5.34.0                      2021-07-22                       CDB_File(3)