CDB_File(3)           User Contributed Perl Documentation          CDB_File(3)

NAME

       CDB_File - Perl extension for access to cdb databases

SYNOPSIS

           use CDB_File;
           $c = tie(%h, 'CDB_File', 'file.cdb') or die "tie failed: $!\n";

           # If accessing a utf8 stored CDB_File
           $c = tie(%h, 'CDB_File', 'file.cdb', utf8 => 1) or die "tie failed: $!\n";

           $fh = $c->handle;
           sysseek $fh, $c->datapos, 0 or die ...;
           sysread $fh, $x, $c->datalen;
           undef $c;
           untie %h;

           $t = CDB_File->new('t.cdb', "t.$$") or die ...;
           $t->insert('key', 'value');
           $t->finish;

           CDB_File::create %t, $file, "$file.$$";

       or

           use CDB_File 'create';
           create %t, $file, "$file.$$";

           # If you want to store the data in utf8 mode.
           create %t, $file, "$file.$$", utf8 => 1;

DESCRIPTION

       CDB_File is a module which provides a Perl interface to Dan Bernstein's
       cdb package:

           cdb is a fast, reliable, lightweight package for creating and
           reading constant databases.

   Reading from a cdb
       After the "tie" shown above, accesses to %h will refer to the cdb file
       "file.cdb", as described in "tie" in perlfunc.

       Low level access to the database is provided by the three methods
       "handle", "datapos", and "datalen".  To use them, you must remember the
       "CDB_File" object returned by the "tie" call: $c in the example above.
       The "datapos" and "datalen" methods return the file offset and the
       length, respectively, of the value stored under the most recently
       visited key (for example, via "exists").

       Beware that if you create an extra reference to the "CDB_File" object
       (like $c in the example above) you must destroy it (with "undef")
       before calling "untie" on the hash.  This ensures that the object's
       "DESTROY" method is called.  Note that "perl -w" will check this for
       you; see perltie for further details.

   Creating a cdb
       A cdb file is created in three steps.  First, call "new CDB_File
       ($final, $tmp)", where $final is the name of the database to be
       created, and $tmp is the name of a temporary file which can be
       atomically renamed to $final.  Second, call the "insert" method once
       for each (key, value) pair.  Finally, call the "finish" method to
       complete the creation and renaming of the cdb file.

       Alternatively, call the "insert" method with multiple key/value
       pairs.  This can be significantly faster because there are fewer
       crossings of the bridge from Perl to C code.  One simple way to do this
       is to pass in an entire hash, as in "$cdbmaker->insert(%hash);".

       A simpler interface to cdb file creation is provided by
       "CDB_File::create %t, $final, $tmp".  This creates a cdb file named
       $final containing the contents of %t.  As before, $tmp must name a
       temporary file which can be atomically renamed to $final.
       "CDB_File::create" may be imported.

   UTF8 support
       When CDB_File was created in 1997 (prior even to Perl 5.6), Perl SVs
       didn't really deal with UTF8.  In order to properly store mixed bytes
       and utf8 data in the file, we would normally need to store a bit for
       each string which clarifies the encoding of the keys and values.  This
       would be useful since Perl hash keys are downgraded to bytes when
       possible so as to normalize the hash key access regardless of encoding.

       The CDB_File format is used outside of Perl and so must maintain file
       format compatibility with those systems.  As a result, this module
       provides a utf8 mode which must be enabled when the database is
       generated and again when it is read.  Keys will always be stored as
       UTF8 strings, which is the opposite of how Perl stores the strings.
       This approach had to be taken to ensure that no data corruption occurs
       because SVs were accidentally downgraded before being stored or on
       retrieval.

       You can enable utf8 mode by passing "utf8 => 1" to new, tie, or create.
       All SVs returned while in this mode will be encoded in utf8.  This
       feature is not available on Perl versions below 5.14 due to lack of
       Perl macro support.

       NOTE: reading or writing databases not stored in utf8 mode will often
       be incompatible with any non-ASCII data.

EXAMPLES

       These are all complete programs.

       1. Convert a Berkeley DB (B-tree) database to cdb format.

           use CDB_File;
           use DB_File;

           tie %h, 'DB_File', $ARGV[0], O_RDONLY, undef, $DB_BTREE or
                   die "$0: can't tie to $ARGV[0]: $!\n";

           CDB_File::create %h, $ARGV[1], "$ARGV[1].$$" or
                   die "$0: can't create cdb: $!\n";

       2. Convert a flat file to cdb format.  In this example, the flat file
       consists of one key per line, separated by a colon from the value.
       Blank lines and lines beginning with # are skipped.

           use CDB_File;

           $cdb = new CDB_File("data.cdb", "data.$$") or
                   die "$0: new CDB_File failed: $!\n";
           while (<>) {
                   next if /^$/ or /^#/;
                   chomp;  # strip only a trailing newline, if present
                   ($k, $v) = split /:/, $_, 2;
                   if (defined $v) {
                           $cdb->insert($k, $v);
                   } else {
                           warn "bogus line: $_\n";
                   }
           }
           $cdb->finish or die "$0: CDB_File finish failed: $!\n";

       3. Perl version of cdbdump.

           use CDB_File;

           tie %data, 'CDB_File', $ARGV[0] or
                   die "$0: can't tie to $ARGV[0]: $!\n";
           while (($k, $v) = each %data) {
                   print '+', length $k, ',', length $v, ":$k->$v\n";
           }
           print "\n";

       4. For really enormous data values, you can use "handle", "datapos",
       and "datalen", in combination with "sysseek" and "sysread", to avoid
       reading the values into memory.  Here is the script bun-x.pl, which can
       extract uncompressed files and directories from a bun file.

           use CDB_File;

           sub unnetstrings {
               my($netstrings) = @_;
               my @result;
               while ($netstrings =~ s/^([0-9]+)://) {
                       push @result, substr($netstrings, 0, $1, '');
                       $netstrings =~ s/^,//;
               }
               return @result;
           }

           my $chunk = 8192;

           sub extract {
               my($file, $t, $b) = @_;
               my $head = $$b{"H$file"};
               my ($code, $type) = $head =~ m/^([0-9]+)(.)/;
               if ($type eq "/") {
                       mkdir $file, 0777;
               } elsif ($type eq "_") {
                       my ($total, $now, $got, $x);
                       open OUT, ">$file" or die "open for output: $!\n";
                       exists $$b{"D$code"} or die "corrupt bun file\n";
                       my $fh = $t->handle;
                       sysseek $fh, $t->datapos, 0;
                       $total = $t->datalen;
                       while ($total) {
                               $now = ($total > $chunk) ? $chunk : $total;
                               $got = sysread $fh, $x, $now;
                               if (not $got) { die "read error\n"; }
                               $total -= $got;
                               print OUT $x;
                       }
                       close OUT;
               } else {
                       print STDERR "warning: skipping unknown file type\n";
               }
           }

           die "usage\n" if @ARGV != 1;

           my (%b, $t);
           $t = tie %b, 'CDB_File', $ARGV[0] or die "tie: $!\n";
           map { extract $_, $t, \%b } unnetstrings $b{""};

       5. Although a cdb file is constant, you can simulate updating it in
       Perl.  This is an expensive operation, as you have to create a new
       database, and copy into it everything that's unchanged from the old
       database.  (As compensation, the update does not affect database
       readers.  The old database remains available to them until the moment
       the new one is "finish"ed.)

           use CDB_File;

           $file = 'data.cdb';
           $new = new CDB_File($file, "$file.$$") or
                   die "$0: new CDB_File failed: $!\n";

           # Add the new values; remember which keys we've seen.
           while (<>) {
                   chomp;
                   ($k, $v) = split;
                   $new->insert($k, $v);
                   $seen{$k} = 1;
           }

           # Add any old values that haven't been replaced.
           tie %old, 'CDB_File', $file or die "$0: can't tie to $file: $!\n";
           while (($k, $v) = each %old) {
                   $new->insert($k, $v) unless $seen{$k};
           }

           $new->finish or die "$0: CDB_File finish failed: $!\n";

REPEATED KEYS

       Most users can ignore this section.

       A cdb file can contain repeated keys.  If the "insert" method is called
       more than once with the same key during the creation of a cdb file,
       that key will be repeated.

       Here's an example.

           $cdb = new CDB_File ("$file.cdb", "$file.$$") or die ...;
           $cdb->insert('cat', 'gato');
           $cdb->insert('cat', 'chat');
           $cdb->finish;

       Normally, any attempt to access a key retrieves the first value stored
       under that key.  This code snippet always prints gato.

           $catref = tie %catalogue, 'CDB_File', "$file.cdb" or die ...;
           print "$catalogue{cat}";

       However, all the usual ways of iterating over a hash ("keys",
       "values", and "each") do the Right Thing, even in the presence of
       repeated keys.  This code snippet prints cat cat gato chat.

           print join(' ', keys %catalogue, values %catalogue);

       And these two both print cat:gato cat:chat, although the second is more
       efficient.

           foreach $key (keys %catalogue) {
                   print "$key:$catalogue{$key} ";
           }

           while (($key, $val) = each %catalogue) {
                   print "$key:$val ";
           }

       The "multi_get" method retrieves all the values associated with a key.
       It returns a reference to an array containing all the values.  This
       code prints gato chat.

           print "@{$catref->multi_get('cat')}";

       "multi_get" always returns an array reference.  If the key was not
       found in the database, it will be a reference to an empty array.  To
       test whether the key was found, you must test the array, and not the
       reference.

           $x = $catref->multi_get($key);
           warn "$key not found\n" unless $x; # WRONG; message never printed
           warn "$key not found\n" unless @$x; # Correct

       The "fetch_all" method returns a hashref of all keys with the first
       value in the cdb.  This is useful for quickly loading a cdb file where
       there is a 1:1 key mapping.  In practice it proved to be about 400%
       faster than iterating a tied hash.

           # Slow
           my %copy = %tied_cdb;

           # Much Faster
           my $copy_hashref = $catref->fetch_all();

RETURN VALUES

       The routines "tie", "new", and "finish" return undef if the attempted
       operation failed; $! contains the reason for failure.

DIAGNOSTICS

       The following fatal errors may occur.  (See "eval" in perlfunc if you
       want to trap them.)

       Modification of a CDB_File attempted
           You attempted to modify a hash tied to a CDB_File.

       CDB database too large
           You attempted to create a cdb file larger than 4 gigabytes.

       [ Write to | Read of | Seek in ] CDB_File failed: <error string>
           If error string is Protocol error, you tried to "use CDB_File" to
           access something that isn't a cdb file.  Otherwise a serious OS
           level problem occurred, for example, you have run out of disk
           space.

PERFORMANCE

       Sometimes you need to get the most performance possible out of a
       library. Rumour has it that perl's tie() interface is slow. In order to
       get around that you can use CDB_File in an object oriented fashion,
       rather than via tie().

         my $cdb = CDB_File->TIEHASH('/path/to/cdbfile.cdb');

         if ($cdb->EXISTS('key')) {
             print "Key is: ", $cdb->FETCH('key'), "\n";
         }

       For more information on the methods available on tied hashes see
       perltie.

THE ALGORITHM

       This algorithm is described at <http://cr.yp.to/cdb/cdb.txt>.  It is
       small enough that it is included inline in the event that the internet
       loses the page:

   A structure for constant databases
       Copyright (c) 1996 D. J. Bernstein, djb@pobox.com

       A cdb is an associative array: it maps strings ("keys") to strings
       ("data").

       A cdb contains 256 pointers to linearly probed open hash tables. The
       hash tables contain pointers to (key,data) pairs. A cdb is stored in a
       single file on disk:

           +----------------+---------+-------+-------+-----+---------+
           | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
           +----------------+---------+-------+-------+-----+---------+

       Each of the 256 initial pointers states a position and a length. The
       position is the starting byte position of the hash table. The length is
       the number of slots in the hash table.

       Records are stored sequentially, without special alignment. A record
       states a key length, a data length, the key, and the data.

       Each hash table slot states a hash value and a byte position. If the
       byte position is 0, the slot is empty. Otherwise, the slot points to a
       record whose key has that hash value.

       Positions, lengths, and hash values are 32-bit quantities, stored in
       little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

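       As a rough illustration of the layout described above, the following
       pure-Perl sketch encodes and decodes one header pointer and one record
       with "pack" and "unpack".  The helper names encode_record and
       decode_record are hypothetical and are not part of the CDB_File API;
       the sketch assumes byte strings and does not touch a real cdb file.

```perl
use strict;
use warnings;

# One of the 256 header pointers: a byte position and a slot count,
# each stored as a 32-bit little-endian integer ("V" in pack notation).
my $pointer = pack 'V2', 2048, 4;
my ($position, $slots) = unpack 'V2', $pointer;

# Hypothetical helper: key length, data length, then the key and the data.
sub encode_record {
    my ($key, $data) = @_;
    return pack('V2', length($key), length($data)) . $key . $data;
}

# Hypothetical helper: inverse of encode_record for a single record.
sub decode_record {
    my ($bytes) = @_;
    my ($klen, $dlen) = unpack 'V2', $bytes;
    return (substr($bytes, 8, $klen), substr($bytes, 8 + $klen, $dlen));
}

my ($key, $data) = decode_record(encode_record('cat', 'gato'));
print "pointer: position=$position slots=$slots\n";
print "record:  $key => $data\n";
```

       With 256 pointers of 8 bytes each, the header occupies the first 2048
       bytes of the file, so the first record begins at byte position 2048.
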
       A record is located as follows. Compute the hash value of the key in
       the record. The hash value modulo 256 is the number of a hash table.
       The hash value divided by 256, modulo the length of that table, is a
       slot number. Probe that slot, the next higher slot, and so on, until
       you find the record or run into an empty slot.

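       The lookup steps above can be sketched in pure Perl against an
       in-memory table.  The helper names table_and_slot and candidates are
       hypothetical, the table is a plain array of [hash, position] pairs
       rather than bytes read from a file, and wrapping from the last slot
       back to the first is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: split a 32-bit hash value into a table number and
# a starting slot, given the number of slots in that table.
sub table_and_slot {
    my ($hash, $slots) = @_;
    my $table = $hash % 256;
    return ($table, undef) unless $slots;    # empty table: key is absent
    return ($table, ($hash >> 8) % $slots);
}

# Hypothetical helper: probe a table given as an array of [hash, position]
# pairs and return the positions of candidate records.  A position of 0
# marks an empty slot.  Wrap-around is an assumption of this sketch.
sub candidates {
    my ($table, $hash, $start) = @_;
    my @found;
    for my $i (0 .. $#$table) {
        my ($h, $pos) = @{ $table->[ ($start + $i) % @$table ] };
        last if $pos == 0;                   # empty slot ends the search
        push @found, $pos if $h == $hash;    # caller still compares keys
    }
    return @found;
}

my $hash  = 177604;                          # an example hash value
my @table = ([0, 0], [$hash, 2048], [0, 0], [0, 0]);
my ($number, $slot) = table_and_slot($hash, scalar @table);
my @positions = candidates(\@table, $hash, $slot);
print "table $number, slot $slot, candidate positions: @positions\n";
```

       A matching hash value only identifies a candidate; the key stored at
       that position still has to be compared with the key being looked up.
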
       The cdb hash function is "h = ((h << 5) + h) ^ c", with a starting hash
       of 5381.

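       A minimal pure-Perl sketch of this hash function follows.  The name
       cdb_hash is hypothetical and not part of the CDB_File API; the sketch
       assumes the key is a byte string and that perl was built with 64-bit
       integers, and it masks each step to keep the value within 32 bits.

```perl
use strict;
use warnings;

# Hypothetical helper implementing h = ((h << 5) + h) ^ c over the bytes
# of the key, starting from 5381.  The mask keeps the value within 32
# bits; this assumes a perl built with 64-bit integers.
sub cdb_hash {
    my ($key) = @_;
    my $h = 5381;
    for my $c (unpack 'C*', $key) {
        $h = ((($h << 5) + $h) & 0xFFFFFFFF) ^ $c;
    }
    return $h;
}

printf "%-4s %10u\n", $_, cdb_hash($_) for 'a', 'cat';
```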

BUGS

       The "create()" interface could be done with "TIEHASH".

AUTHOR

       Tim Goodwin, <tjg@star.le.ac.uk>.  CDB_File began on 1997-01-08.

       Work provided through 2008 by Matt Sergeant, <matt@sergeant.org>.

       Now maintained by Todd Rinaldo, <toddr@cpan.org>.

perl v5.34.0                      2021-07-22                       CDB_File(3)