qdbm(3) - f37

1QDBM(3)                     Quick Database Manager                     QDBM(3)
2
3
4

NAME

6       QDBM - quick database manager
7
8

OVERVIEW

10       QDBM is a library of routines for managing a database.  The database is
11       a simple data file containing records, each is a pair of a  key  and  a
12       value.  Every key and value is serial bytes with variable length.  Both
13       binary data and character string can be used as  a  key  and  a  value.
14       There  is  neither  concept of data tables nor data types.  Records are
15       organized in hash table or B+ tree.
16
17       As for database of hash table, each key must be unique within  a  data‐
18       base, so it is impossible to store two or more records with a key over‐
19       laps.  The following access methods are provided to the database: stor‐
20       ing  a  record  with  a  key  and  a value, deleting a record by a key,
21       retrieving a record by a key.  Moreover, traversal access to every  key
22       are  provided,  although  the order is arbitrary.  These access methods
23       are similar to ones of DBM (or its followers: NDBM  and  GDBM)  library
24       defined  in  the UNIX standard.  QDBM is an alternative for DBM because
25       of its higher performance.
26
27       As for database of B+ tree, records whose keys are  duplicated  can  be
28       stored.   Access  methods of storing, deleting, and retrieving are pro‐
29       vided as with the database of hash table.  Records are stored in  order
30       by  a  comparing function assigned by a user.  It is possible to access
31       each record with the cursor in ascending or descending order.   Accord‐
32       ing  to  this  mechanism, forward matching search for strings and range
33       search for integers are realized.  Moreover, transaction  is  available
34       in database of B+ tree.
35
36

EFFECTIVE IMPLEMENTATION OF HASH DATABASE

38       QDBM  is  developed  referring to GDBM for the purpose of the following
39       three points: higher processing speed, smaller size of a database file,
40       and  simpler  API.   They  have been achieved.  Moreover, the following
41       three restrictions of traditional DBM: a process can  handle  only  one
42       database,  the size of a key and a value is bounded, a database file is
43       sparse, are cleared.
44
45       QDBM uses hash algorithm to retrieve records.  If a  bucket  array  has
46       sufficient  number  of  elements,  the  time complexity of retrieval is
47       `O(1)'.  That is, time required for retrieving a  record  is  constant,
48       regardless of the scale of a database.  It is also the same about stor‐
49       ing and deleting.  Collision of hash  values  is  managed  by  separate
50       chaining.  Data structure of the chains is binary search tree.  Even if
51       a bucket array has unusually scarce elements, the  time  complexity  of
52       retrieval is `O(log n)'.
53
54       QDBM  attains improvement in retrieval by loading RAM with the whole of
55       a bucket array.  If a bucket array is on RAM, it is possible to  access
56       a  region  of  a target record by about one path of file operations.  A
57       bucket array saved in a file is not read into RAM with the `read'  call
58       but  directly  mapped to RAM with the `mmap' call.  Therefore, prepara‐
59       tion time on connecting to a database is very short, and  two  or  more
60       processes can share the same memory map.
61
62       If  the  number  of elements of a bucket array is about half of records
63       stored within a database, although it depends on characteristic of  the
64       input,  the  probability  of  collision  of  hash values is about 56.7%
65       (36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if  eight
66       times).   In  such  case, it is possible to retrieve a record by two or
67       less paths of file operations.  If it is made into a performance index,
68       in  order  to  handle  a  database containing one million of records, a
69       bucket array with half a million of elements is needed.   The  size  of
70       each  element  is 4 bytes.  That is, if 2M bytes of RAM is available, a
71       database containing one million records can be handled.
72
73       QDBM provides  two  modes  to  connect  to  a  database:  `reader'  and
74       `writer'.   A  reader  can  perform  retrieving but neither storing nor
75       deleting.  A writer can perform all access methods.  Exclusion  control
76       between  processes  is  performed when connecting to a database by file
77       locking.  While a writer is connected to a  database,  neither  readers
78       nor  writers  can be connected.  While a reader is connected to a data‐
79       base, other readers can be connect, but writers can not.  According  to
80       this  mechanism,  data consistency is guaranteed with simultaneous con‐
81       nections in multitasking environment.
82
83       Traditional DBM provides two modes of the storing operations:  `insert'
84       and  `replace'.   In  the  case  a key overlaps an existing record, the
85       insert mode keeps the existing value, while the replace mode transposes
86       it to the specified value.  In addition to the two modes, QDBM provides
87       `concatenate' mode.  In the mode, the specified value  is  concatenated
88       at  the  end  of the existing value and stored.  This feature is useful
89       when adding a element to a value as an array.  Moreover,  although  DBM
90       has  a  method to fetch out a value from a database only by reading the
91       whole of a region of a record, QDBM has a method to fetch out a part of
92       a region of a value.  When a value is treated as an array, this feature
93       is also useful.
94
95       Generally speaking, while  succession  of  updating,  fragmentation  of
96       available  regions  occurs,  and  the size of a database grows rapidly.
97       QDBM deal with this problem by coalescence of dispensable  regions  and
98       reuse of them, and featuring of optimization of a database.  When over‐
99       writing a record with a value whose size is greater than  the  existing
100       one,  it  is  necessary to remove the region to another position of the
101       file.  Because the time complexity of the operation depends on the size
102       of  the  region  of  a record, extending values successively is ineffi‐
103       cient.  However, QDBM deal with this problem by alignment.   If  incre‐
104       ment can be put in padding, it is not necessary to remove the region.
105
106       As  for many file systems, it is impossible to handle a file whose size
107       is more than 2GB.  To deal with this problem, QDBM provides a directory
108       database  containing  multiple database files.  Due to this feature, it
109       is possible to handle a database whose total size is up to 1TB in  the‐
110       ory.   Moreover,  because  database  files  can be deployed on multiple
111       disks, the speed of updating operations can be improved as with  RAID-0
112       (striping).   It  is  also possible for the database files to deploy on
113       multiple file servers using NFS and so on.
114
115

USEFUL IMPLEMENTATION OF B+ TREE DATABASE

117       Although B+ tree database is slower than  hash  database,  it  features
118       ordering  access  to  each record.  The order can be assigned by users.
119       Records of B+ tree are sorted and arranged in  logical  pages.   Sparse
120       index organized in B tree that is multiway balanced tree are maintained
121       for each page.  Thus, the time complexity of retrieval  and  so  on  is
122       `O(log  n)'.   Cursor  is provided to access each record in order.  The
123       cursor can jump to a position specified by a key and can  step  forward
124       or  backward  from the current position.  Because each page is arranged
125       as double linked list,  the  time  complexity  of  stepping  cursor  is
126       `O(1)'.
127
128       B+ tree database is implemented, based on above hash database.  Because
129       each page of B+ tree is stored as each record of hash database, B+ tree
130       database  inherits  efficiency  of storage management of hash database.
131       Because the header of each record is smaller and alignment of each page
132       is  calculated  statistically, in most cases, the size of database file
133       is cut by half compared to one of hash database.  Although operation of
134       many  pages  are required to update B+ tree, QDBM expedites the process
135       by caching pages and reducing file operations.  In most cases,  because
136       whole  of  the  sparse  index  is  cached  on memory, it is possible to
137       retrieve a record by one or less path of file operations.
138
139       B+ tree database features transaction mechanism.   It  is  possible  to
140       commit  a series of operations between the beginning and the end of the
141       transaction in a lump, or to abort the transaction and perform rollback
142       to  the state before the transaction.  Even if the process of an appli‐
143       cation is crushed while the transaction, the database file is not  bro‐
144       ken.
145
146       In case that QDBM is built with ZLIB, LZO, or BZIP2 enabled, a lossless
147       data-compression library, the content of each page of B+ tree  is  com‐
148       pressed  and stored in a file.  Because each record in a page has simi‐
149       lar patterns, high efficiency of compression is  expected  due  to  the
150       Lempel-Ziv  algorithm  and  the  like.  In case handling text data, the
151       size of a database is reduced to about 25%.  If the scale of a database
152       is  large  and  disk I/O is the bottleneck, featuring compression makes
153       the processing speed improved to a large extent.
154
155

SIMPLE BUT VARIOUS INTERFACES

157       QDBM provides very simple APIs.  You can perform database I/O as  usual
158       file  I/O  with  `FILE' pointer defined in ANSI C.  In the basic API of
159       QDBM, entity of a database is recorded as one file.   In  the  extended
160       API,  entity  of  a database is recorded as several files in one direc‐
161       tory.  Because the two APIs are very similar with each  other,  porting
162       an application from one to the other is easy.
163
164       APIs  which  are  compatible  with NDBM and GDBM are also provided.  As
165       there are a lot of applications using NDBM or GDBM, it is easy to  port
166       them  onto QDBM.  In most cases, it is completed only by replacement of
167       header including (#include) and re-compiling.  However,  QDBM  can  not
168       handle database files made by the original NDBM or GDBM.
169
170       In  order  to  handle records on memory easily, the utility API is pro‐
171       vided.  It implements memory allocating functions,  sorting  functions,
172       extensible datum, array list, hash map, and so on.  Using them, you can
173       handle records in C language cheaply as in  such  script  languages  as
174       Perl or Ruby.
175
176       B+  tree  database  is used with the advanced API.  The advanced API is
177       implemented using the basic API  and  the  utility  API.   Because  the
178       advanced  API is also similar to the basic API and the extended API, it
179       is easy to learn how to use it.
180
181       In order to handle an inverted index which is used by full-text  search
182       systems,  the  inverted  API  is  provided.  If it is easy to handle an
183       inverted index of documents, an application can focus on text  process‐
184       ing  and natural language processing.  Because this API does not depend
185       on character codes  nor  languages,  it  is  possible  to  implement  a
186       full-text  search  system  which  can  respond to various requests from
187       users.
188
189       Along with APIs for C, QDBM provides APIs  for  C++,  Java,  Perl,  and
190       Ruby.   APIs  for  C  are  composed  of seven kinds: the basic API, the
191       extended API, the NDBM-compatible API,  the  GDBM-compatible  API,  the
192       utility  API,  the  advanced  API,  and the inverted API.  Command line
193       interfaces corresponding to each API are also provided.  They are  use‐
194       ful for prototyping, testing, debugging, and so on.  The C++ API encap‐
195       sulates database handling functions of the basic API, the extended API,
196       and  the  advanced  API  with class mechanism of C++.  The Java API has
197       native methods calling  the  basic  API,  the  extended  API,  and  the
198       advanced  API  with  Java  Native  Interface.  The Perl API has methods
199       calling the basic API, the extended API, and the advanced API  with  XS
200       language.   The Ruby API has method calling the basic API, the extended
201       API, and the advanced API as modules of Ruby.   Moreover,  CGI  scripts
202       for administration of databases and full-text search are provided.
203
204

WIDE PORTABILITY

206       QDBM  is  implemented  being  based on syntax of ANSI C (C89) and using
207       only APIs defined in ANSI C or POSIX.  Thus, QDBM works  on  most  UNIX
208       and  its  compatible  OSs.  As for C API, checking operations have been
209       done at least on Linux 2.2, Linux 2.4, FreeBSD 4.8, FreeBSD 5.0,  SunOS
210       5.7,  SunOS  5.8, SunOS 5.9, HP-UX 11.00, Cygwin 1.3.10, Mac OS X 10.2,
211       and RISC OS 5.03.  Although a database file created by QDBM depends  on
212       byte  order  of the processor, to do with it, utilities to dump data in
213       format which is independent to byte orders are provided.
214
215

BUILDING

217       For building a program using QDBM, the program should be linked with  a
218       library  file  `libqdbm.a' or `libqdbm.so'.  For example, the following
219       command is executed to build `sample' from `sample.c'.
220
221              gcc -I/usr/local/include  -o  sample  sample.c  -L/usr/local/lib
222              -lqdbm
223
224

AUTHOR

226       QDBM  is  written  by Mikio Hirabayashi.  You can contact the author by
227       e-mail to <mikio@fallabs.com>.  Any suggestion or bug report is welcome
228       to the author.
229
230

COPYRIGHT

232       Copyright(c) 2000-2003 Mikio Hirabayashi
233
234       QDBM  is  free software; you can redistribute it and/or modify it under
235       the terms of the GNU Lesser General Public License as published by  the
236       Free Software Foundation; either version 2 of the License, or any later
237       version.
238
239       QDBM is distributed in the hope that it will be useful, but WITHOUT ANY
240       WARRANTY;  without even the implied warranty of MERCHANTABILITY or FIT‐
241       NESS FOR A PARTICULAR PURPOSE.   See  the  GNU  Lesser  General  Public
242       License for more details.
243
244       You  should  have  received  a  copy  of  the GNU Lesser General Public
245       License along with QDBM; if not, write to the Free Software Foundation,
246       Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
247
248