1DBD::SQLite::Fulltext_sUesaerrchC(o3n)tributed Perl DocuDmBeDn:t:aStQiLointe::Fulltext_search(3)
2
3
4
6 DBD::SQLite::Fulltext_search - Using fulltext searches with DBD::SQLite
7
9 Introduction
10 SQLite is bundled with an extension module called "FTS" for full-text
11 indexing. Tables with this feature enabled can be efficiently queried
12 to find rows that contain one or more instances of some specified words
13 (also called "tokens"), in any column, even if the table contains many
14 large documents.
15
16 The first full-text search modules for SQLite were called "FTS1" and
17 "FTS2" and are now obsolete. The latest version is "FTS4", but it
18 shares many features with the former module "FTS3", which is why parts
19 of the API and parts of the documentation still refer to "FTS3"; from a
20 client point of view, both can be considered largely equivalent.
21 Detailed documentation can be found at
22 <http://www.sqlite.org/fts3.html>.
23
24 Short example
25 Here is a very short example of using FTS :
26
27 $dbh->do(<<"") or die DBI::errstr;
28 CREATE VIRTUAL TABLE fts_example USING fts4(content)
29
30 my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)");
31 $sth->execute($_) foreach @docs_to_insert;
32
33 my $results = $dbh->selectall_arrayref(<<"");
34 SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'
35
36 The key points in this example are :
37
38 • The syntax for creating FTS tables is
39
40 CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)
41
42 where "<columns>" is a list of column names. Columns may be typed,
43 but the type information is ignored. If no columns are specified,
44 the default is a single column named "content". In addition, FTS
45 tables have an implicit column called "docid" (or also "rowid") for
46 numbering the stored documents.
47
48 • Statements for inserting, updating or deleting records use the same
49 syntax as for regular SQLite tables.
50
51 • Full-text searches are specified with the "MATCH" operator, and an
52 operand which may be a single word, a word prefix ending with '*',
53 a list of words, a "phrase query" in double quotes, or a boolean
54 combination of the above.
55
56 • The builtin function "snippet(...)" builds a formatted excerpt of
57 the document text, where the words pertaining to the query are
58 highlighted.
59
60 There are many more details to building and searching FTS tables, so we
61 strongly invite you to read the full documentation at
62 <http://www.sqlite.org/fts3.html>.
63
65 Here are some explanation about FTS queries, borrowed from the sqlite
66 documentation.
67
68 Token or token prefix queries
69 An FTS table may be queried for all documents that contain a specified
70 term, or for all documents that contain a term with a specified prefix.
71 The query expression for a specific term is simply the term itself. The
72 query expression used to search for a term prefix is the prefix itself
73 with a '*' character appended to it. For example:
74
75 -- Virtual table declaration
76 CREATE VIRTUAL TABLE docs USING fts3(title, body);
77
78 -- Query for all documents containing the term "linux":
79 SELECT * FROM docs WHERE docs MATCH 'linux';
80
81 -- Query for all documents containing a term with the prefix "lin".
82 SELECT * FROM docs WHERE docs MATCH 'lin*';
83
84 If a search token (on the right-hand side of the MATCH operator) begins
85 with "^" then that token must be the first in its field of the document
86 : so for example "^lin*" matches 'linux kernel changes ...' but does
87 not match 'new linux implementation'.
88
89 Column specifications
90 Normally, a token or token prefix query is matched against the FTS
91 table column specified as the right-hand side of the MATCH operator.
92 Or, if the special column with the same name as the FTS table itself is
93 specified, against all columns. This may be overridden by specifying a
94 column-name followed by a ":" character before a basic term query.
95 There may be space between the ":" and the term to query for, but not
96 between the column-name and the ":" character. For example:
97
98 -- Query the database for documents for which the term "linux" appears in
99 -- the document title, and the term "problems" appears in either the title
100 -- or body of the document.
101 SELECT * FROM docs WHERE docs MATCH 'title:linux problems';
102
103 -- Query the database for documents for which the term "linux" appears in
104 -- the document title, and the term "driver" appears in the body of the document
105 -- ("driver" may also appear in the title, but this alone will not satisfy the.
106 -- query criteria).
107 SELECT * FROM docs WHERE body MATCH 'title:linux driver';
108
109 Phrase queries
110 A phrase query is a query that retrieves all documents that contain a
111 nominated set of terms or term prefixes in a specified order with no
112 intervening tokens. Phrase queries are specified by enclosing a space
113 separated sequence of terms or term prefixes in double quotes ("). For
114 example:
115
116 -- Query for all documents that contain the phrase "linux applications".
117 SELECT * FROM docs WHERE docs MATCH '"linux applications"';
118
119 -- Query for all documents that contain a phrase that matches "lin* app*".
120 -- As well as "linux applications", this will match common phrases such
121 -- as "linoleum appliances" or "link apprentice".
122 SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
123
124 NEAR queries.
125 A NEAR query is a query that returns documents that contain a two or
126 more nominated terms or phrases within a specified proximity of each
127 other (by default with 10 or less intervening terms). A NEAR query is
128 specified by putting the keyword "NEAR" between two phrase, term or
129 prefix queries. To specify a proximity other than the default, an
130 operator of the form "NEAR/<N>" may be used, where <N> is the maximum
131 number of intervening terms allowed. For example:
132
133 -- Virtual table declaration.
134 CREATE VIRTUAL TABLE docs USING fts4();
135
136 -- Virtual table data.
137 INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');
138
139 -- Search for a document that contains the terms "sqlite" and "database" with
140 -- not more than 10 intervening terms. This matches the only document in
141 -- table docs (since there are only six terms between "SQLite" and "database"
142 -- in the document).
143 SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';
144
145 -- Search for a document that contains the terms "sqlite" and "database" with
146 -- not more than 6 intervening terms. This also matches the only document in
147 -- table docs. Note that the order in which the terms appear in the document
148 -- does not have to be the same as the order in which they appear in the query.
149 SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
150
151 -- Search for a document that contains the terms "sqlite" and "database" with
152 -- not more than 5 intervening terms. This query matches no documents.
153 SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';
154
155 -- Search for a document that contains the phrase "ACID compliant" and the term
156 -- "database" with not more than 2 terms separating the two. This matches the
157 -- document stored in table docs.
158 SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';
159
160 -- Search for a document that contains the phrase "ACID compliant" and the term
161 -- "sqlite" with not more than 2 terms separating the two. This also matches
162 -- the only document stored in table docs.
163 SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
164
165 More than one NEAR operator may appear in a single query. In this case
166 each pair of terms or phrases separated by a NEAR operator must appear
167 within the specified proximity of each other in the document. Using the
168 same table and data as in the block of examples above:
169
170 -- The following query selects documents that contains an instance of the term
171 -- "sqlite" separated by two or fewer terms from an instance of the term "acid",
172 -- which is in turn separated by two or fewer terms from an instance of the term
173 -- "relational".
174 SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';
175
176 -- This query matches no documents. There is an instance of the term "sqlite" with
177 -- sufficient proximity to an instance of "acid" but it is not sufficiently close
178 -- to an instance of the term "relational".
179 SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
180
181 Phrase and NEAR queries may not span multiple columns within a row.
182
183 Set operations
184 The three basic query types described above may be used to query the
185 full-text index for the set of documents that match the specified
186 criteria. Using the FTS query expression language it is possible to
187 perform various set operations on the results of basic queries. There
188 are currently three supported operations:
189
190 • The AND operator determines the intersection of two sets of
191 documents.
192
193 • The OR operator calculates the union of two sets of documents.
194
195 • The NOT operator may be used to compute the relative complement of
196 one set of documents with respect to another.
197
198 The AND, OR and NOT binary set operators must be entered using capital
199 letters; otherwise, they are interpreted as basic term queries instead
200 of set operators. Each of the two operands to an operator may be a
201 basic FTS query, or the result of another AND, OR or NOT set operation.
202 Parenthesis may be used to control precedence and grouping.
203
204 The AND operator is implicit for adjacent basic queries without any
205 explicit operator. For example, the query expression "implicit
206 operator" is a more succinct version of "implicit AND operator".
207
208 Boolean operations as just described correspond to the so-called
209 "enhanced query syntax" of sqlite; this is the version compiled with
210 "DBD::SQLite", starting from version 1.31. A former version, called
211 the "standard query syntax", used to support tokens prefixed with '+'
212 or '-' signs (for token inclusion or exclusion); if your application
213 needs to support this old syntax, use DBD::SQLite::FTS3Transitional
214 (published in a separate distribution) for doing the conversion.
215
217 Concept
218 The behaviour of full-text indexes strongly depends on how documents
219 are split into tokens; therefore FTS table declarations can explicitly
220 specify how to perform tokenization:
221
222 CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)
223
224 where "<tokenizer>" is a sequence of space-separated words that
225 triggers a specific tokenizer. Tokenizers can be SQLite builtins,
226 written in C code, or Perl tokenizers. Both are as explained below.
227
228 SQLite builtin tokenizers
229 SQLite comes with some builtin tokenizers (see
230 <http://www.sqlite.org/fts3.html#tokenizer>) :
231
232 simple
233 Under the simple tokenizer, a term is a contiguous sequence of
234 eligible characters, where eligible characters are all alphanumeric
235 characters, the "_" character, and all characters with UTF
236 codepoints greater than or equal to 128. All other characters are
237 discarded when splitting a document into terms. They serve only to
238 separate adjacent terms.
239
240 All uppercase characters within the ASCII range (UTF codepoints
241 less than 128), are transformed to their lowercase equivalents as
242 part of the tokenization process. Thus, full-text queries are case-
243 insensitive when using the simple tokenizer.
244
245 porter
246 The porter tokenizer uses the same rules to separate the input
247 document into terms, but as well as folding all terms to lower case
248 it uses the Porter Stemming algorithm to reduce related English
249 language words to a common root.
250
251 icu The icu tokenizer uses the ICU library to decide how to identify
252 word characters in different languages; however, this requires
253 SQLite to be compiled with the "SQLITE_ENABLE_ICU" pre-processor
254 symbol defined. So, to use this tokenizer, you need edit
255 Makefile.PL to add this flag in @CC_DEFINE, and then recompile
256 "DBD::SQLite"; of course, the prerequisite is to have an ICU
257 library available on your system.
258
259 unicode61
260 The unicode61 tokenizer works very much like "simple" except that
261 it does full unicode case folding according to rules in Unicode
262 Version 6.1 and it recognizes unicode space and punctuation
263 characters and uses those to separate tokens. By contrast, the
264 simple tokenizer only does case folding of ASCII characters and
265 only recognizes ASCII space and punctuation characters as token
266 separators.
267
268 By default, "unicode61" also removes all diacritics from Latin
269 script characters. This behaviour can be overridden by adding the
270 tokenizer argument "remove_diacritics=0". For example:
271
272 -- Create tables that remove diacritics from Latin script characters
273 -- as part of tokenization.
274 CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
275 CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
276
277 -- Create a table that does not remove diacritics from Latin script
278 -- characters as part of tokenization.
279 CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
280
281 Additional options can customize the set of codepoints that
282 unicode61 treats as separator characters or as token characters --
283 see the documentation in
284 <http://www.sqlite.org/fts3.html#unicode61>.
285
286 If a more complex tokenizing algorithm is required, for example to
287 implement stemming, discard punctuation, or to recognize compound
288 words, use the perl tokenizer to implement your own logic, as explained
289 below.
290
291 Perl tokenizers
292 Declaring a perl tokenizer
293
294 In addition to the builtin SQLite tokenizers, "DBD::SQLite" implements
295 a perl tokenizer, that can hook to any tokenizing algorithm written in
296 Perl. This is specified as follows :
297
298 CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')
299
300 where "<perl_function>" is a fully qualified Perl function name (i.e.
301 prefixed by the name of the package in which that function is
302 declared). So for example if the function is "my_func" in the main
303 program, write
304
305 CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')
306
307 Writing a perl tokenizer by hand
308
309 That function should return a code reference that takes a string as
310 single argument, and returns an iterator (another function), which
311 returns a tuple "($term, $len, $start, $end, $index)" for each term.
312 Here is a simple example that tokenizes on words according to the
313 current perl locale
314
315 sub locale_tokenizer {
316 return sub {
317 my $string = shift;
318
319 use locale;
320 my $regex = qr/\w+/;
321 my $term_index = 0;
322
323 return sub { # closure
324 $string =~ /$regex/g or return; # either match, or no more token
325 my ($start, $end) = ($-[0], $+[0]);
326 my $len = $end-$start;
327 my $term = substr($string, $start, $len);
328 return ($term, $len, $start, $end, $term_index++);
329 }
330 };
331 }
332
333 There must be three levels of subs, in a kind of "Russian dolls"
334 structure, because :
335
336 • the external, named sub is called whenever accessing a FTS table
337 with that tokenizer
338
339 • the inner, anonymous sub is called whenever a new string needs to
340 be tokenized (either for inserting new text into the table, or for
341 analyzing a query).
342
343 • the innermost, anonymous sub is called repeatedly for retrieving
344 all terms within that string.
345
346 Using Search::Tokenizer
347
348 Instead of writing tokenizers by hand, you can grab one of those
349 already implemented in the Search::Tokenizer module. For example, if
350 you want ignore differences between accented characters, you can write
351 :
352
353 use Search::Tokenizer;
354 $dbh->do(<<"") or die DBI::errstr;
355 CREATE ... USING fts4(<columns>,
356 tokenize=perl 'Search::Tokenizer::unaccent')
357
358 Alternatively, you can use "new" in Search::Tokenizer to build your own
359 tokenizer. Here is an example that treats compound words (words with an
360 internal dash or dot) as single tokens :
361
362 sub my_tokenizer {
363 return Search::Tokenizer->new(
364 regex => qr{\p{Word}+(?:[-./]\p{Word}+)*},
365 );
366 }
367
369 The content of a full-text index can be accessed through the virtual
370 table module "fts4aux". For example, assuming that our database
371 contains a full-text indexed table named "ft", we can declare :
372
373 CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)
374
375 and then query the "ft_terms" table to access the list of terms, their
376 frequency, etc. Examples are documented in
377 <http://www.sqlite.org/fts3.html#fts4aux>.
378
380 By default, FTS stores a complete copy of the indexed documents,
381 together with the fulltext index. On a large collection of documents,
382 this can consume quite a lot of disk space. However, FTS has some
383 options for compressing the documents, or even for not storing them at
384 all -- see <http://www.sqlite.org/fts3.html#fts4_options>.
385
386 In particular, the option for contentless FTS tables only stores the
387 fulltext index, without the original document content. This is
388 specified as "content=""", like in the following example :
389
390 CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)
391
392 Data can be inserted into such an FTS4 table using an INSERT
393 statements. However, unlike ordinary FTS4 tables, the user must supply
394 an explicit integer docid value. For example:
395
396 -- This statement is Ok:
397 INSERT INTO t1(docid, a, b) VALUES(1, 'a b c', 'd e f');
398
399 -- This statement causes an error, as no docid value has been provided:
400 INSERT INTO t1(a, b) VALUES('j k l', 'm n o');
401
402 Of course your application will need an algorithm for finding the
403 external resource corresponding to any docid stored within SQLite.
404
405 When using placeholders, the docid must be explicitly typed to INTEGER,
406 because this is a "hidden column" for which sqlite is not able to
407 automatically infer the proper type. So the following doesn't work :
408
409 my $sth = $dbh->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)");
410 $sth->execute(2, 'aa', 'bb'); # constraint error
411
412 but it works with an explicitly cast :
413
414 my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)",
415 my $sth = $dbh->prepare(sql);
416 $sth->execute(2, 'aa', 'bb');
417
418 or with an explicitly typed "bind_param" in DBI :
419
420 use DBI qw/SQL_INTEGER/;
421 my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)";
422 my $sth = $dbh->prepare(sql);
423 $sth->bind_param(1, 2, SQL_INTEGER);
424 $sth->bind_param(2, "aa");
425 $sth->bind_param(3, "bb");
426 $sth->execute();
427
428 It is not possible to UPDATE or DELETE a row stored in a contentless
429 FTS4 table. Attempting to do so is an error.
430
431 Contentless FTS4 tables also support SELECT statements. However, it is
432 an error to attempt to retrieve the value of any table column other
433 than the docid column. The auxiliary function "matchinfo()" may be
434 used, but "snippet()" and "offsets()" may not, so if such functionality
435 is needed, it has to be directly programmed within the Perl
436 application.
437
439 Laurent Dami <dami@cpan.org>
440
442 Copyright 2014 Laurent Dami.
443
444 Some parts borrowed from the <http://sqlite.org> documentation,
445 copyright 2014.
446
447 This documentation is in the public domain; you can redistribute it
448 and/or modify it under the same terms as Perl itself.
449
450
451
452perl v5.34.0 2021-08-02 DBD::SQLite::Fulltext_search(3)