tokyocabinet(3)

1TOKYOCABINET(3)                  Tokyo Cabinet                 TOKYOCABINET(3)
2
3
4

NAME

6       tokyocabinet - a modern implementation of DBM
7
8

INTRODUCTION

10       Tokyo  Cabinet  is  a library of routines for managing a database.  The
11       database is a simple data file containing records, each is a pair of  a
12       key  and  a  value.   Every key and value is serial bytes with variable
13       length.  Both binary data and character string can be used as a key and
14       a  value.   There  is  neither  concept  of data tables nor data types.
15       Records are organized in hash table, B+ tree, or fixed-length array.
16
17       As for database of hash table, each key must be unique within  a  data‐
18       base, so it is impossible to store two or more records with a key over‐
19       laps.  The following access methods are provided to the database: stor‐
20       ing  a  record  with  a  key  and  a value, deleting a record by a key,
21       retrieving a record by a key.  Moreover, traversal access to every  key
22       are  provided,  although  the order is arbitrary.  These access methods
23       are similar to ones of DBM (or its followers: NDBM  and  GDBM)  library
24       defined  in the UNIX standard.  Tokyo Cabinet is an alternative for DBM
25       because of its higher performance.
26
27       As for database of B+ tree, records whose keys are  duplicated  can  be
28       stored.   Access  methods of storing, deleting, and retrieving are pro‐
29       vided as with the database of hash table.  Records are stored in  order
30       by  a comparison function assigned by a user.  It is possible to access
31       each record with the cursor in ascending or descending order.   Accord‐
32       ing  to  this  mechanism, forward matching search for strings and range
33       search for integers are realized.
34
35       As for database of fixed-length array, records are stored  with  unique
36       natural  numbers.  It is impossible to store two or more records with a
37       key overlaps.  Moreover, the length of each record is  limited  by  the
38       specified  length.   Provided  operations  are the same as ones of hash
39       database.
40
41       Table database is also provided as a variant of  hash  database.   Each
42       record is identified by the primary key and has a set of named columns.
43       Although there is no concept of data schema, it is possible  to  search
44       for  records  with  complex  conditions efficiently by using indices of
45       arbitrary columns.
46
47       Tokyo Cabinet is written in the C language, and provided as API  of  C,
48       Perl,  Ruby,  Java,  and  Lua.  Tokyo Cabinet is available on platforms
49       which have API conforming to C99 and POSIX.  Tokyo Cabinet  is  a  free
50       software licensed under the GNU Lesser General Public License.
51
52

THE DINOSAUR WING OF THE DBM FORKS

54       Tokyo  Cabinet  is  developed  as the successor of GDBM and QDBM on the
55       following purposes.  They are achieved and Tokyo Cabinet replaces  con‐
56       ventional DBM products.
57
58              improves space efficiency : smaller size of database file.
59              improves time efficiency : faster processing speed.
60              improves  parallelism : higher performance in multi-thread envi‐
61              ronment.
62              improves usability : simplified API.
63              improves robustness : database file is not corrupted even  under
64              catastrophic situation.
65              supports  64-bit  architecture : enormous memory space and data‐
66              base file are available.
67
68       As with QDBM, the following three restrictions of  traditional  DBM:  a
69       process  can handle only one database, the size of a key and a value is
70       bounded, a database file is sparse, are cleared.  Moreover, the follow‐
71       ing  three restrictions of QDBM: the size of a database file is limited
72       to 2GB, environments with different byte orders can not share  a  data‐
73       base  file, only one thread can search a database at the same time, are
74       cleared.
75
76       Tokyo Cabinet runs very fast.  For example, elapsed  time  to  store  1
77       million  records  is 0.7 seconds for hash database, and 1.6 seconds for
78       B+ tree database.  Moreover, the size of database of Tokyo  Cabinet  is
79       very  small.   For  example, overhead for a record is 16 bytes for hash
80       database, and 5 bytes for B+ tree database.   Furthermore,  scalability
81       of Tokyo Cabinet is great.  The database size can be up to 8EB (9.22e18
82       bytes).
83
84

EFFECTIVE IMPLEMENTATION OF HASH DATABASE

86       Tokyo Cabinet uses hash algorithm to retrieve  records.   If  a  bucket
87       array  has  sufficient  number  of  elements,  the  time  complexity of
88       retrieval is "O(1)".  That is, time required for retrieving a record is
89       constant,  regardless  of the scale of a database.  It is also the same
90       about storing and deleting.  Collision of hash  values  is  managed  by
91       separate chaining.  Data structure of the chains is binary search tree.
92       Even if a bucket array has unusually scarce elements, the time complex‐
93       ity of retrieval is "O(log n)".
94
95       Tokyo  Cabinet attains improvement in retrieval by loading RAM with the
96       whole of a bucket array.  If a bucket array is on RAM, it  is  possible
97       to  access a region of a target record by about one path of file opera‐
98       tions.  A bucket array saved in a file is not read into  RAM  with  the
99       `read'  call  but  directly mapped to RAM with the `mmap' call.  There‐
100       fore, preparation time on connecting to a database is very  short,  and
101       two or more processes can share the same memory map.
102
103       If  the  number  of elements of a bucket array is about half of records
104       stored within a database, although it depends on characteristic of  the
105       input,  the  probability  of  collision  of  hash values is about 56.7%
106       (36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if  eight
107       times).   In  such  case, it is possible to retrieve a record by two or
108       less paths of file operations.  If it is made into a performance index,
109       in  order  to  handle  a  database containing one million of records, a
110       bucket array with half a million of elements is needed.   The  size  of
111       each  element  is 4 bytes.  That is, if 2M bytes of RAM is available, a
112       database containing one million records can be handled.
113
114       Traditional DBM provides two modes of the storing operations:  "insert"
115       and  "replace".   In  the  case  a key overlaps an existing record, the
116       insert mode keeps the existing value, while the replace mode transposes
117       it to the specified value.  In addition to the two modes, Tokyo Cabinet
118       provides "concatenate" mode.  In the mode, the specified value is  con‐
119       catenated at the end of the existing value and stored.  This feature is
120       useful when adding an element to a value as an array.
121
122       Generally speaking, while  succession  of  updating,  fragmentation  of
123       available  regions  occurs,  and  the size of a database grows rapidly.
124       Tokyo Cabinet deal with this  problem  by  coalescence  of  dispensable
125       regions  and  reuse  of  them.   When overwriting a record with a value
126       whose size is greater than the existing one, it is necessary to  remove
127       the  region to another position of the file.  Because the time complex‐
128       ity of the operation depends on the size of the  region  of  a  record,
129       extending  values  successively is inefficient.  However, Tokyo Cabinet
130       deal with this problem by alignment.  If increment can be put  in  pad‐
131       ding, it is not necessary to remove the region.
132
133       The  "free block pool" to reuse dispensable regions efficiently is also
134       implemented.  It keeps a list of  dispensable  regions  and  reuse  the
135       "best  fit" region, that is the smallest region in the list, when a new
136       block is requested.  Because fragmentation is inevitable even then, two
137       kinds  of  optimization  (defragmentation)  mechanisms are implemented.
138       The first is called static optimization which deploys all records  into
139       another  file  and  then writes them back to the original file at once.
140       The second is called dynamic optimization which gathers up  dispensable
141       regions  by  replacing the locations of records and dispensable regions
142       gradually.
143
144

USEFUL IMPLEMENTATION OF B+ TREE DATABASE

146       Although B+ tree database is slower than  hash  database,  it  features
147       ordering  access  to  each record.  The order can be assigned by users.
148       Records of B+ tree are sorted and arranged in  logical  pages.   Sparse
149       index organized in B tree that is multiway balanced tree are maintained
150       for each page.  Thus, the time complexity of retrieval  and  so  on  is
151       "O(log  n)".   Cursor  is provided to access each record in order.  The
152       cursor can jump to a position specified by a key and can  step  forward
153       or  backward  from the current position.  Because each page is arranged
154       as double linked list,  the  time  complexity  of  stepping  cursor  is
155       "O(1)".
156
157       B+ tree database is implemented, based on above hash database.  Because
158       each page of B+ tree is stored as each record of hash database, B+ tree
159       database  inherits  efficiency  of storage management of hash database.
160       Because the header of each record is smaller and alignment of each page
161       is  adjusted  according  to  the  page size, in most cases, the size of
162       database file is  cut  by  half  compared  to  one  of  hash  database.
163       Although  operation  of many pages are required to update B+ tree, QDBM
164       expedites the process by caching pages and  reducing  file  operations.
165       In  most  cases, because whole of the sparse index is cached on memory,
166       it is possible to retrieve a record by one or less path of file  opera‐
167       tions.
168
169       Each  pages  of B+ tree can be stored with compressed.  Two compression
170       method; Deflate of ZLIB and Block  Sorting  of  BZIP2,  are  supported.
171       Because  each record in a page has similar patterns, high efficiency of
172       compression is expected due to the Lempel-Ziv or  the  BWT  algorithms.
173       In  case handling text data, the size of a database is reduced to about
174       25%.  If the scale of a database is large and disk I/O is  the  bottle‐
175       neck,  featuring  compression  makes the processing speed improved to a
176       large extent.
177
178

NAIVE IMPLEMENTATION OF FIXED-LENGTH DATABASE

180       Fixed-length database has restrictions that each key should be a  natu‐
181       ral number and that the length of each value is limited.  However, time
182       efficiency and space efficiency are higher than the other  data  struc‐
183       tures as long as the use case is within the restriction.
184
185       Because  the  whole  region  of the database is mapped on memory by the
186       `mmap' call and referred as  a  multidimensional  array,  the  overhead
187       related  to  the  file I/O is minimized.  Due to this simple structure,
188       fixed-length database works faster than hash database, and its  concur‐
189       rency in multi-thread environment is prominent.
190
191       The  size  of the database is proportional to the range of keys and the
192       limit size of each value.  That is, the smaller the range of keys is or
193       the  smaller  the  length  of each value is, the higher the space effi‐
194       ciency is.  For example, if the maximum key is 1000000  and  the  limit
195       size  of the value is 100 bytes, the size of the database will be about
196       100MB.  Because regions around referred records are only loaded on  the
197       RAM,  you can increase the size of the database to the size of the vir‐
198       tual memory.
199
200

FLEXIBLE IMPLEMENTATION OF TABLE DATABASE

202       Table  database  does  not  express  simple  key/value  structure   but
203       expresses a structure like a table of relational database.  Each record
204       is identified by the primary key and has  a  set  of  multiple  columns
205       named with arbitrary strings.  For example, a stuff in your company can
206       be expressed by a record identified by the primary key of the  employee
207       ID  number and structured by columns of his name, division, salary, and
208       so on.  Unlike relational database, table database  does  not  need  to
209       define  any  data  schema and can contain records of various structures
210       different from each other.
211
212       Table database supports query functions with not only the  primary  key
213       but  also  with conditions about arbitrary columns.  Each column condi‐
214       tion is composed of the name of a column and  a  condition  expression.
215       Operators of full matching, forward matching, regular expression match‐
216       ing, and so on are provided for the string  type.   Operators  of  full
217       matching,  range  matching  and so on are provided for the number type.
218       Operators for tag search and full-text search  are  also  provided.   A
219       query can contain multiple conditions for logical intersection.  Search
220       by multiple queries for logical union is also available.  The order  of
221       the result set can be specified as the ascending or descending order of
222       strings or numbers.
223
224       You can create indices for arbitrary columns to improve performance  of
225       search  and  sorting.  Although columns do not have data types, indices
226       have types for strings or numbers.  Inverted indices  for  space  sepa‐
227       rated tokens and character N-gram tokens are also supported.  The query
228       optimizer uses  indices  in  suitable  way  according  to  each  query.
229       Indices are implemented as different files of B+ tree database.
230
231

PRACTICAL FUNCTIONALITY

233       Databases on the filesystem feature transaction mechanisms.  It is pos‐
234       sible to commit a series of operations between the  beginning  and  the
235       end  of the transaction in a lump, or to abort the transaction and per‐
236       form rollback to the state before the transaction.  Two isolation  lev‐
237       els  are  supported;  serializable and read uncommitted.  Durability is
238       secured by write ahead logging and shadow paging.
239
240       Tokyo Cabinet provides two modes to connect to a database: "reader" and
241       "writer".   A  reader  can  perform  retrieving but neither storing nor
242       deleting.  A writer can perform all access methods.  Exclusion  control
243       between  processes  is  performed when connecting to a database by file
244       locking.  While a writer is connected to a  database,  neither  readers
245       nor  writers  can be connected.  While a reader is connected to a data‐
246       base, other readers can be connect, but writers can not.  According  to
247       this  mechanism,  data consistency is guaranteed with simultaneous con‐
248       nections in multitasking environment.
249
250       Functions of API of  Tokyo  cabinet  are  reentrant  and  available  in
251       multi-thread  environment.  Discrete database object can be operated in
252       parallel entirely.  For simultaneous operations of  the  same  database
253       object,  read-write lock is used for exclusion control.  That is, while
254       a writing thread is operating the database, other reading  threads  and
255       writing  threads are blocked.  However, while a reading thread is oper‐
256       ating the database, reading threads are not blocked.  The locking gran‐
257       ularity  of  hash database and fixed-length database is per record, and
258       that of the other databases is per file.
259
260

SIMPLE BUT VARIOUS INTERFACES

262       Tokyo Cabinet provides simple API based on the object oriented  design.
263       Every  operation  for  database  is encapsulated and published as lucid
264       methods as `open'  (connect),  `close'  (disconnect),  `put'  (insert),
265       `out'  (remove),  `get'  (retrieve),  and  so on.  Because the three of
266       hash, B+ tree, and fixed-length array database APIs  are  very  similar
267       with  each other, porting an application from one to the other is easy.
268       Moreover, the abstract API is provided to handle these  databases  with
269       the same interface.  Applications of the abstract API can determine the
270       type of the database in runtime.
271
272       The utility API is also provided.  Such fundamental data  structure  as
273       list  and  map  are  included.  And, some useful features; memory pool,
274       string processing, encoding, are also included.
275
276       Six kinds of API; the utility API, the hash database API, the  B+  tree
277       database  API,  the  fixed-length database API, the table database API,
278       and the abstract database API, are provided for the C  language.   Com‐
279       mand line interfaces are also provided corresponding to each API.  They
280       are useful for prototyping, test, and debugging.  Except for  C,  Tokyo
281       Cabinet  provides  APIs  for Perl, Ruby, Java, and Lua.  APIs for other
282       languages will hopefully be provided by third party.
283
284       In cases that multiple processes access a database at the same time  or
285       some  processes  access a database on a remote host, the remote service
286       is useful.  The remote service is composed of a database server and its
287       access  library.   Applications can access the database server by using
288       the remote database API.  The server implements HTTP and the  memcached
289       protocol  partly  so  that  client programs on almost all platforms can
290       access the server easily.
291
292

HOW TO USE THE LIBRARY

294       Tokyo Cabinet provides API of the C language and  it  is  available  by
295       programs  conforming  to the C89 (ANSI C) standard or the C99 standard.
296       As the header files  of  Tokyo  Cabinet  are  provided  as  `tcutil.h',
297       `tchdb.h',  and  `tcbdb.h',  applications should include one or more of
298       them accordingly to use  the  API.   As  the  library  is  provided  as
299       `libtokyocabinet.a'   and   `libtokyocabinet.so'   and   they   depends
300       `libz.so',  `librt.so',  `libpthread.so',  `libm.so',  and   `libc.so',
301       linker  options  `-ltokyocabinet', `-lz', `-lbz2', `-lrt', `-lpthread',
302       `-lm', and `-lc' are required for build command.  A typical build  com‐
303       mand is the following.
304
305              gcc -I/usr/local/include tc_example.c -o tc_example \
306                -L/usr/local/lib  -ltokyocabinet  -lz -lbz2 -lrt -lpthread -lm
307              -lc
308
309       You can also use Tokyo Cabinet in programs  written  in  C++.   Because
310       each  header is wrapped in C linkage (`extern "C"' block), you can sim‐
311       ply include them into your C++ programs.
312
313

LICENSE

315       Tokyo Cabinet is free software; you can redistribute it  and/or  modify
316       it  under  the  terms  of the GNU Lesser General Public License as pub‐
317       lished by the Free Software  Foundation;  either  version  2.1  of  the
318       License or any later version.
319
320       Tokyo  Cabinet  is  distributed in the hope that it will be useful, but
321       WITHOUT ANY  WARRANTY;  without  even  the  implied  warranty  of  MER‐
322       CHANTABILITY  or  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser
323       General Public License for more details.
324
325       You should have received a  copy  of  the  GNU  Lesser  General  Public
326       License  along  with  Tokyo  Cabinet  (See the file `COPYING'); if not,
327       write to the Free Software Foundation, Inc.,  59  Temple  Place,  Suite
328       330, Boston, MA 02111-1307 USA.
329
330       Tokyo  Cabinet  was written by FAL Labs.  You can contact the author by
331       e-mail to `info@fallabs.com'.
332
333