1TOKYOCABINET(3) Tokyo Cabinet TOKYOCABINET(3)
2
3
4
6 tokyocabinet - a modern implementation of DBM
7
8
10 Tokyo Cabinet is a library of routines for managing a database. The
11 database is a simple data file containing records, each is a pair of a
12 key and a value. Every key and value is serial bytes with variable
13 length. Both binary data and character string can be used as a key and
14 a value. There is neither concept of data tables nor data types.
15 Records are organized in hash table, B+ tree, or fixed-length array.
16
17 As for database of hash table, each key must be unique within a data‐
18 base, so it is impossible to store two or more records with a key over‐
19 laps. The following access methods are provided to the database: stor‐
20 ing a record with a key and a value, deleting a record by a key,
21 retrieving a record by a key. Moreover, traversal access to every key
22 are provided, although the order is arbitrary. These access methods
23 are similar to ones of DBM (or its followers: NDBM and GDBM) library
24 defined in the UNIX standard. Tokyo Cabinet is an alternative for DBM
25 because of its higher performance.
26
27 As for database of B+ tree, records whose keys are duplicated can be
28 stored. Access methods of storing, deleting, and retrieving are pro‐
29 vided as with the database of hash table. Records are stored in order
30 by a comparison function assigned by a user. It is possible to access
31 each record with the cursor in ascending or descending order. Accord‐
32 ing to this mechanism, forward matching search for strings and range
33 search for integers are realized.
34
35 As for database of fixed-length array, records are stored with unique
36 natural numbers. It is impossible to store two or more records with a
37 key overlaps. Moreover, the length of each record is limited by the
38 specified length. Provided operations are the same as ones of hash
39 database.
40
41 Table database is also provided as a variant of hash database. Each
42 record is identified by the primary key and has a set of named columns.
43 Although there is no concept of data schema, it is possible to search
44 for records with complex conditions efficiently by using indices of
45 arbitrary columns.
46
47 Tokyo Cabinet is written in the C language, and provided as API of C,
48 Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms
49 which have API conforming to C99 and POSIX. Tokyo Cabinet is a free
50 software licensed under the GNU Lesser General Public License.
51
52
54 Tokyo Cabinet is developed as the successor of GDBM and QDBM on the
55 following purposes. They are achieved and Tokyo Cabinet replaces con‐
56 ventional DBM products.
57
58 improves space efficiency : smaller size of database file.
59 improves time efficiency : faster processing speed.
60 improves parallelism : higher performance in multi-thread envi‐
61 ronment.
62 improves usability : simplified API.
63 improves robustness : database file is not corrupted even under
64 catastrophic situation.
65 supports 64-bit architecture : enormous memory space and data‐
66 base file are available.
67
68 As with QDBM, the following three restrictions of traditional DBM: a
69 process can handle only one database, the size of a key and a value is
70 bounded, a database file is sparse, are cleared. Moreover, the follow‐
71 ing three restrictions of QDBM: the size of a database file is limited
72 to 2GB, environments with different byte orders can not share a data‐
73 base file, only one thread can search a database at the same time, are
74 cleared.
75
76 Tokyo Cabinet runs very fast. For example, elapsed time to store 1
77 million records is 0.7 seconds for hash database, and 1.6 seconds for
78 B+ tree database. Moreover, the size of database of Tokyo Cabinet is
79 very small. For example, overhead for a record is 16 bytes for hash
80 database, and 5 bytes for B+ tree database. Furthermore, scalability
81 of Tokyo Cabinet is great. The database size can be up to 8EB (9.22e18
82 bytes).
83
84
86 Tokyo Cabinet uses hash algorithm to retrieve records. If a bucket
87 array has sufficient number of elements, the time complexity of
88 retrieval is "O(1)". That is, time required for retrieving a record is
89 constant, regardless of the scale of a database. It is also the same
90 about storing and deleting. Collision of hash values is managed by
91 separate chaining. Data structure of the chains is binary search tree.
92 Even if a bucket array has unusually scarce elements, the time complex‐
93 ity of retrieval is "O(log n)".
94
95 Tokyo Cabinet attains improvement in retrieval by loading RAM with the
96 whole of a bucket array. If a bucket array is on RAM, it is possible
97 to access a region of a target record by about one path of file opera‐
98 tions. A bucket array saved in a file is not read into RAM with the
99 `read' call but directly mapped to RAM with the `mmap' call. There‐
100 fore, preparation time on connecting to a database is very short, and
101 two or more processes can share the same memory map.
102
103 If the number of elements of a bucket array is about half of records
104 stored within a database, although it depends on characteristic of the
105 input, the probability of collision of hash values is about 56.7%
106 (36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if eight
107 times). In such case, it is possible to retrieve a record by two or
108 less paths of file operations. If it is made into a performance index,
109 in order to handle a database containing one million of records, a
110 bucket array with half a million of elements is needed. The size of
111 each element is 4 bytes. That is, if 2M bytes of RAM is available, a
112 database containing one million records can be handled.
113
114 Traditional DBM provides two modes of the storing operations: "insert"
115 and "replace". In the case a key overlaps an existing record, the
116 insert mode keeps the existing value, while the replace mode transposes
117 it to the specified value. In addition to the two modes, Tokyo Cabinet
118 provides "concatenate" mode. In the mode, the specified value is con‐
119 catenated at the end of the existing value and stored. This feature is
120 useful when adding an element to a value as an array.
121
122 Generally speaking, while succession of updating, fragmentation of
123 available regions occurs, and the size of a database grows rapidly.
124 Tokyo Cabinet deal with this problem by coalescence of dispensable
125 regions and reuse of them. When overwriting a record with a value
126 whose size is greater than the existing one, it is necessary to remove
127 the region to another position of the file. Because the time complex‐
128 ity of the operation depends on the size of the region of a record,
129 extending values successively is inefficient. However, Tokyo Cabinet
130 deal with this problem by alignment. If increment can be put in pad‐
131 ding, it is not necessary to remove the region.
132
133 The "free block pool" to reuse dispensable regions efficiently is also
134 implemented. It keeps a list of dispensable regions and reuse the
135 "best fit" region, that is the smallest region in the list, when a new
136 block is requested. Because fragmentation is inevitable even then, two
137 kinds of optimization (defragmentation) mechanisms are implemented.
138 The first is called static optimization which deploys all records into
139 another file and then writes them back to the original file at once.
140 The second is called dynamic optimization which gathers up dispensable
141 regions by replacing the locations of records and dispensable regions
142 gradually.
143
144
146 Although B+ tree database is slower than hash database, it features
147 ordering access to each record. The order can be assigned by users.
148 Records of B+ tree are sorted and arranged in logical pages. Sparse
149 index organized in B tree that is multiway balanced tree are maintained
150 for each page. Thus, the time complexity of retrieval and so on is
151 "O(log n)". Cursor is provided to access each record in order. The
152 cursor can jump to a position specified by a key and can step forward
153 or backward from the current position. Because each page is arranged
154 as double linked list, the time complexity of stepping cursor is
155 "O(1)".
156
157 B+ tree database is implemented, based on above hash database. Because
158 each page of B+ tree is stored as each record of hash database, B+ tree
159 database inherits efficiency of storage management of hash database.
160 Because the header of each record is smaller and alignment of each page
161 is adjusted according to the page size, in most cases, the size of
162 database file is cut by half compared to one of hash database.
163 Although operation of many pages are required to update B+ tree, QDBM
164 expedites the process by caching pages and reducing file operations.
165 In most cases, because whole of the sparse index is cached on memory,
166 it is possible to retrieve a record by one or less path of file opera‐
167 tions.
168
169 Each pages of B+ tree can be stored with compressed. Two compression
170 method; Deflate of ZLIB and Block Sorting of BZIP2, are supported.
171 Because each record in a page has similar patterns, high efficiency of
172 compression is expected due to the Lempel-Ziv or the BWT algorithms.
173 In case handling text data, the size of a database is reduced to about
174 25%. If the scale of a database is large and disk I/O is the bottle‐
175 neck, featuring compression makes the processing speed improved to a
176 large extent.
177
178
180 Fixed-length database has restrictions that each key should be a natu‐
181 ral number and that the length of each value is limited. However, time
182 efficiency and space efficiency are higher than the other data struc‐
183 tures as long as the use case is within the restriction.
184
185 Because the whole region of the database is mapped on memory by the
186 `mmap' call and referred as a multidimensional array, the overhead
187 related to the file I/O is minimized. Due to this simple structure,
188 fixed-length database works faster than hash database, and its concur‐
189 rency in multi-thread environment is prominent.
190
191 The size of the database is proportional to the range of keys and the
192 limit size of each value. That is, the smaller the range of keys is or
193 the smaller the length of each value is, the higher the space effi‐
194 ciency is. For example, if the maximum key is 1000000 and the limit
195 size of the value is 100 bytes, the size of the database will be about
196 100MB. Because regions around referred records are only loaded on the
197 RAM, you can increase the size of the database to the size of the vir‐
198 tual memory.
199
200
202 Table database does not express simple key/value structure but
203 expresses a structure like a table of relational database. Each record
204 is identified by the primary key and has a set of multiple columns
205 named with arbitrary strings. For example, a stuff in your company can
206 be expressed by a record identified by the primary key of the employee
207 ID number and structured by columns of his name, division, salary, and
208 so on. Unlike relational database, table database does not need to
209 define any data schema and can contain records of various structures
210 different from each other.
211
212 Table database supports query functions with not only the primary key
213 but also with conditions about arbitrary columns. Each column condi‐
214 tion is composed of the name of a column and a condition expression.
215 Operators of full matching, forward matching, regular expression match‐
216 ing, and so on are provided for the string type. Operators of full
217 matching, range matching and so on are provided for the number type.
218 Operators for tag search and full-text search are also provided. A
219 query can contain multiple conditions for logical intersection. Search
220 by multiple queries for logical union is also available. The order of
221 the result set can be specified as the ascending or descending order of
222 strings or numbers.
223
224 You can create indices for arbitrary columns to improve performance of
225 search and sorting. Although columns do not have data types, indices
226 have types for strings or numbers. Inverted indices for space sepa‐
227 rated tokens and character N-gram tokens are also supported. The query
228 optimizer uses indices in suitable way according to each query.
229 Indices are implemented as different files of B+ tree database.
230
231
233 Databases on the filesystem feature transaction mechanisms. It is pos‐
234 sible to commit a series of operations between the beginning and the
235 end of the transaction in a lump, or to abort the transaction and per‐
236 form rollback to the state before the transaction. Two isolation lev‐
237 els are supported; serializable and read uncommitted. Durability is
238 secured by write ahead logging and shadow paging.
239
240 Tokyo Cabinet provides two modes to connect to a database: "reader" and
241 "writer". A reader can perform retrieving but neither storing nor
242 deleting. A writer can perform all access methods. Exclusion control
243 between processes is performed when connecting to a database by file
244 locking. While a writer is connected to a database, neither readers
245 nor writers can be connected. While a reader is connected to a data‐
246 base, other readers can be connect, but writers can not. According to
247 this mechanism, data consistency is guaranteed with simultaneous con‐
248 nections in multitasking environment.
249
250 Functions of API of Tokyo cabinet are reentrant and available in
251 multi-thread environment. Discrete database object can be operated in
252 parallel entirely. For simultaneous operations of the same database
253 object, read-write lock is used for exclusion control. That is, while
254 a writing thread is operating the database, other reading threads and
255 writing threads are blocked. However, while a reading thread is oper‐
256 ating the database, reading threads are not blocked. The locking gran‐
257 ularity of hash database and fixed-length database is per record, and
258 that of the other databases is per file.
259
260
262 Tokyo Cabinet provides simple API based on the object oriented design.
263 Every operation for database is encapsulated and published as lucid
264 methods as `open' (connect), `close' (disconnect), `put' (insert),
265 `out' (remove), `get' (retrieve), and so on. Because the three of
266 hash, B+ tree, and fixed-length array database APIs are very similar
267 with each other, porting an application from one to the other is easy.
268 Moreover, the abstract API is provided to handle these databases with
269 the same interface. Applications of the abstract API can determine the
270 type of the database in runtime.
271
272 The utility API is also provided. Such fundamental data structure as
273 list and map are included. And, some useful features; memory pool,
274 string processing, encoding, are also included.
275
276 Six kinds of API; the utility API, the hash database API, the B+ tree
277 database API, the fixed-length database API, the table database API,
278 and the abstract database API, are provided for the C language. Com‐
279 mand line interfaces are also provided corresponding to each API. They
280 are useful for prototyping, test, and debugging. Except for C, Tokyo
281 Cabinet provides APIs for Perl, Ruby, Java, and Lua. APIs for other
282 languages will hopefully be provided by third party.
283
284 In cases that multiple processes access a database at the same time or
285 some processes access a database on a remote host, the remote service
286 is useful. The remote service is composed of a database server and its
287 access library. Applications can access the database server by using
288 the remote database API. The server implements HTTP and the memcached
289 protocol partly so that client programs on almost all platforms can
290 access the server easily.
291
292
294 Tokyo Cabinet provides API of the C language and it is available by
295 programs conforming to the C89 (ANSI C) standard or the C99 standard.
296 As the header files of Tokyo Cabinet are provided as `tcutil.h',
297 `tchdb.h', and `tcbdb.h', applications should include one or more of
298 them accordingly to use the API. As the library is provided as
299 `libtokyocabinet.a' and `libtokyocabinet.so' and they depends
300 `libz.so', `librt.so', `libpthread.so', `libm.so', and `libc.so',
301 linker options `-ltokyocabinet', `-lz', `-lbz2', `-lrt', `-lpthread',
302 `-lm', and `-lc' are required for build command. A typical build com‐
303 mand is the following.
304
305 gcc -I/usr/local/include tc_example.c -o tc_example \
306 -L/usr/local/lib -ltokyocabinet -lz -lbz2 -lrt -lpthread -lm
307 -lc
308
309 You can also use Tokyo Cabinet in programs written in C++. Because
310 each header is wrapped in C linkage (`extern "C"' block), you can sim‐
311 ply include them into your C++ programs.
312
313
315 Tokyo Cabinet is free software; you can redistribute it and/or modify
316 it under the terms of the GNU Lesser General Public License as pub‐
317 lished by the Free Software Foundation; either version 2.1 of the
318 License or any later version.
319
320 Tokyo Cabinet is distributed in the hope that it will be useful, but
321 WITHOUT ANY WARRANTY; without even the implied warranty of MER‐
322 CHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
323 General Public License for more details.
324
325 You should have received a copy of the GNU Lesser General Public
326 License along with Tokyo Cabinet (See the file `COPYING'); if not,
327 write to the Free Software Foundation, Inc., 59 Temple Place, Suite
328 330, Boston, MA 02111-1307 USA.
329
330 Tokyo Cabinet was written by FAL Labs. You can contact the author by
331 e-mail to `info@fallabs.com'.
332
333
335 tcutil(3), tchdb(3), tcbdb(3), tcfdb(3), tctdb(3), tcadb(3)
336
337 Please see http://1978th.net/tokyocabinet/ for detail.
338
339
340
341Man Page 2010-08-05 TOKYOCABINET(3)