1DBD::SQLite::Fulltext_sUesaerrchC(o3n)tributed Perl DocuDmBeDn:t:aStQiLointe::Fulltext_search(3)
2
3
4

NAME

6       DBD::SQLite::Fulltext_search - Using fulltext searches with DBD::SQLite
7

DESCRIPTION

9   Introduction
10       SQLite is bundled with an extension module called "FTS" for full-text
11       indexing. Tables with this feature enabled can be efficiently queried
12       to find rows that contain one or more instances of some specified words
13       (also called "tokens"), in any column, even if the table contains many
14       large documents.
15
16       The first full-text search modules for SQLite were called "FTS1" and
17       "FTS2" and are now obsolete. The latest version is "FTS4", but it
18       shares many features with the former module "FTS3", which is why parts
19       of the API and parts of the documentation still refer to "FTS3"; from a
20       client point of view, both can be considered largely equivalent.
21       Detailed documentation can be found at
22       <http://www.sqlite.org/fts3.html>.
23
24   Short example
25       Here is a very short example of using FTS :
26
27         $dbh->do(<<"") or die DBI::errstr;
28         CREATE VIRTUAL TABLE fts_example USING fts4(content)
29
30         my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)");
31         $sth->execute($_) foreach @docs_to_insert;
32
33         my $results = $dbh->selectall_arrayref(<<"");
34         SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'
35
36       The key points in this example are :
37
38       •   The syntax for creating FTS tables is
39
40             CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)
41
42           where "<columns>" is a list of column names. Columns may be typed,
43           but the type information is ignored. If no columns are specified,
44           the default is a single column named "content".  In addition, FTS
45           tables have an implicit column called "docid" (or also "rowid") for
46           numbering the stored documents.
47
48       •   Statements for inserting, updating or deleting records use the same
49           syntax as for regular SQLite tables.
50
51       •   Full-text searches are specified with the "MATCH" operator, and an
52           operand which may be a single word, a word prefix ending with '*',
53           a list of words, a "phrase query" in double quotes, or a boolean
54           combination of the above.
55
56       •   The builtin function "snippet(...)" builds a formatted excerpt of
57           the document text, where the words pertaining to the query are
58           highlighted.
59
60       There are many more details to building and searching FTS tables, so we
61       strongly invite you to read the full documentation at
62       <http://www.sqlite.org/fts3.html>.
63

QUERY SYNTAX

65       Here are some explanation about FTS queries, borrowed from the sqlite
66       documentation.
67
68   Token or token prefix queries
69       An FTS table may be queried for all documents that contain a specified
70       term, or for all documents that contain a term with a specified prefix.
71       The query expression for a specific term is simply the term itself. The
72       query expression used to search for a term prefix is the prefix itself
73       with a '*' character appended to it. For example:
74
75         -- Virtual table declaration
76         CREATE VIRTUAL TABLE docs USING fts3(title, body);
77
78         -- Query for all documents containing the term "linux":
79         SELECT * FROM docs WHERE docs MATCH 'linux';
80
81         -- Query for all documents containing a term with the prefix "lin".
82         SELECT * FROM docs WHERE docs MATCH 'lin*';
83
84       If a search token (on the right-hand side of the MATCH operator) begins
85       with "^" then that token must be the first in its field of the document
86       : so for example "^lin*" matches 'linux kernel changes ...' but does
87       not match 'new linux implementation'.
88
89   Column specifications
90       Normally, a token or token prefix query is matched against the FTS
91       table column specified as the right-hand side of the MATCH operator.
92       Or, if the special column with the same name as the FTS table itself is
93       specified, against all columns. This may be overridden by specifying a
94       column-name followed by a ":" character before a basic term query.
95       There may be space between the ":" and the term to query for, but not
96       between the column-name and the ":" character. For example:
97
98         -- Query the database for documents for which the term "linux" appears in
99         -- the document title, and the term "problems" appears in either the title
100         -- or body of the document.
101         SELECT * FROM docs WHERE docs MATCH 'title:linux problems';
102
103         -- Query the database for documents for which the term "linux" appears in
104         -- the document title, and the term "driver" appears in the body of the document
105         -- ("driver" may also appear in the title, but this alone will not satisfy the.
106         -- query criteria).
107         SELECT * FROM docs WHERE body MATCH 'title:linux driver';
108
109   Phrase queries
110       A phrase query is a query that retrieves all documents that contain a
111       nominated set of terms or term prefixes in a specified order with no
112       intervening tokens. Phrase queries are specified by enclosing a space
113       separated sequence of terms or term prefixes in double quotes ("). For
114       example:
115
116         -- Query for all documents that contain the phrase "linux applications".
117         SELECT * FROM docs WHERE docs MATCH '"linux applications"';
118
119         -- Query for all documents that contain a phrase that matches "lin* app*".
120         -- As well as "linux applications", this will match common phrases such
121         -- as "linoleum appliances" or "link apprentice".
122         SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
123
124   NEAR queries.
125       A NEAR query is a query that returns documents that contain a two or
126       more nominated terms or phrases within a specified proximity of each
127       other (by default with 10 or less intervening terms). A NEAR query is
128       specified by putting the keyword "NEAR" between two phrase, term or
129       prefix queries. To specify a proximity other than the default, an
130       operator of the form "NEAR/<N>" may be used, where <N> is the maximum
131       number of intervening terms allowed. For example:
132
133         -- Virtual table declaration.
134         CREATE VIRTUAL TABLE docs USING fts4();
135
136         -- Virtual table data.
137         INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');
138
139         -- Search for a document that contains the terms "sqlite" and "database" with
140         -- not more than 10 intervening terms. This matches the only document in
141         -- table docs (since there are only six terms between "SQLite" and "database"
142         -- in the document).
143         SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';
144
145         -- Search for a document that contains the terms "sqlite" and "database" with
146         -- not more than 6 intervening terms. This also matches the only document in
147         -- table docs. Note that the order in which the terms appear in the document
148         -- does not have to be the same as the order in which they appear in the query.
149         SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
150
151         -- Search for a document that contains the terms "sqlite" and "database" with
152         -- not more than 5 intervening terms. This query matches no documents.
153         SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';
154
155         -- Search for a document that contains the phrase "ACID compliant" and the term
156         -- "database" with not more than 2 terms separating the two. This matches the
157         -- document stored in table docs.
158         SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';
159
160         -- Search for a document that contains the phrase "ACID compliant" and the term
161         -- "sqlite" with not more than 2 terms separating the two. This also matches
162         -- the only document stored in table docs.
163         SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
164
165       More than one NEAR operator may appear in a single query. In this case
166       each pair of terms or phrases separated by a NEAR operator must appear
167       within the specified proximity of each other in the document. Using the
168       same table and data as in the block of examples above:
169
170         -- The following query selects documents that contains an instance of the term
171         -- "sqlite" separated by two or fewer terms from an instance of the term "acid",
172         -- which is in turn separated by two or fewer terms from an instance of the term
173         -- "relational".
174         SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';
175
176         -- This query matches no documents. There is an instance of the term "sqlite" with
177         -- sufficient proximity to an instance of "acid" but it is not sufficiently close
178         -- to an instance of the term "relational".
179         SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
180
181       Phrase and NEAR queries may not span multiple columns within a row.
182
183   Set operations
184       The three basic query types described above may be used to query the
185       full-text index for the set of documents that match the specified
186       criteria. Using the FTS query expression language it is possible to
187       perform various set operations on the results of basic queries. There
188       are currently three supported operations:
189
190       •   The AND operator determines the intersection of two sets of
191           documents.
192
193       •   The OR operator calculates the union of two sets of documents.
194
195       •   The NOT operator may be used to compute the relative complement of
196           one set of documents with respect to another.
197
198       The AND, OR and NOT binary set operators must be entered using capital
199       letters; otherwise, they are interpreted as basic term queries instead
200       of set operators.  Each of the two operands to an operator may be a
201       basic FTS query, or the result of another AND, OR or NOT set operation.
202       Parenthesis may be used to control precedence and grouping.
203
204       The AND operator is implicit for adjacent basic queries without any
205       explicit operator. For example, the query expression "implicit
206       operator" is a more succinct version of "implicit AND operator".
207
208       Boolean operations as just described correspond to the so-called
209       "enhanced query syntax" of sqlite; this is the version compiled with
210       "DBD::SQLite", starting from version 1.31.  A former version, called
211       the "standard query syntax", used to support tokens prefixed with '+'
212       or '-' signs (for token inclusion or exclusion); if your application
213       needs to support this old syntax, use  DBD::SQLite::FTS3Transitional
214       (published in a separate distribution) for doing the conversion.
215

TOKENIZERS

217   Concept
218       The behaviour of full-text indexes strongly depends on how documents
219       are split into tokens; therefore FTS table declarations can explicitly
220       specify how to perform tokenization:
221
222         CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)
223
224       where "<tokenizer>" is a sequence of space-separated words that
225       triggers a specific tokenizer. Tokenizers can be SQLite builtins,
226       written in C code, or Perl tokenizers.  Both are as explained below.
227
228   SQLite builtin tokenizers
229       SQLite comes with some builtin tokenizers (see
230       <http://www.sqlite.org/fts3.html#tokenizer>) :
231
232       simple
233           Under the simple tokenizer, a term is a contiguous sequence of
234           eligible characters, where eligible characters are all alphanumeric
235           characters, the "_" character, and all characters with UTF
236           codepoints greater than or equal to 128. All other characters are
237           discarded when splitting a document into terms. They serve only to
238           separate adjacent terms.
239
240           All uppercase characters within the ASCII range (UTF codepoints
241           less than 128), are transformed to their lowercase equivalents as
242           part of the tokenization process. Thus, full-text queries are case-
243           insensitive when using the simple tokenizer.
244
245       porter
246           The porter tokenizer uses the same rules to separate the input
247           document into terms, but as well as folding all terms to lower case
248           it uses the Porter Stemming algorithm to reduce related English
249           language words to a common root.
250
251       icu The icu tokenizer uses the ICU library to decide how to identify
252           word characters in different languages; however, this requires
253           SQLite to be compiled with the "SQLITE_ENABLE_ICU" pre-processor
254           symbol defined. So, to use this tokenizer, you need edit
255           Makefile.PL to add this flag in @CC_DEFINE, and then recompile
256           "DBD::SQLite"; of course, the prerequisite is to have an ICU
257           library available on your system.
258
259       unicode61
260           The unicode61 tokenizer works very much like "simple" except that
261           it does full unicode case folding according to rules in Unicode
262           Version 6.1 and it recognizes unicode space and punctuation
263           characters and uses those to separate tokens. By contrast, the
264           simple tokenizer only does case folding of ASCII characters and
265           only recognizes ASCII space and punctuation characters as token
266           separators.
267
268           By default, "unicode61" also removes all diacritics from Latin
269           script characters. This behaviour can be overridden by adding the
270           tokenizer argument "remove_diacritics=0". For example:
271
272             -- Create tables that remove diacritics from Latin script characters
273             -- as part of tokenization.
274             CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
275             CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
276
277             -- Create a table that does not remove diacritics from Latin script
278             -- characters as part of tokenization.
279             CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
280
281           Additional options can customize the set of codepoints that
282           unicode61 treats as separator characters or as token characters --
283           see the documentation in
284           <http://www.sqlite.org/fts3.html#unicode61>.
285
286       If a more complex tokenizing algorithm is required, for example to
287       implement stemming, discard punctuation, or to recognize compound
288       words, use the perl tokenizer to implement your own logic, as explained
289       below.
290
291   Perl tokenizers
292       Declaring a perl tokenizer
293
294       In addition to the builtin SQLite tokenizers, "DBD::SQLite" implements
295       a perl tokenizer, that can hook to any tokenizing algorithm written in
296       Perl. This is specified as follows :
297
298         CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')
299
300       where "<perl_function>" is a fully qualified Perl function name (i.e.
301       prefixed by the name of the package in which that function is
302       declared). So for example if the function is "my_func" in the main
303       program, write
304
305         CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')
306
307       Writing a perl tokenizer by hand
308
309       That function should return a code reference that takes a string as
310       single argument, and returns an iterator (another function), which
311       returns a tuple "($term, $len, $start, $end, $index)" for each term.
312       Here is a simple example that tokenizes on words according to the
313       current perl locale
314
315         sub locale_tokenizer {
316           return sub {
317             my $string = shift;
318
319             use locale;
320             my $regex      = qr/\w+/;
321             my $term_index = 0;
322
323             return sub { # closure
324               $string =~ /$regex/g or return; # either match, or no more token
325               my ($start, $end) = ($-[0], $+[0]);
326               my $len           = $end-$start;
327               my $term          = substr($string, $start, $len);
328               return ($term, $len, $start, $end, $term_index++);
329             }
330           };
331         }
332
333       There must be three levels of subs, in a kind of "Russian dolls"
334       structure, because :
335
336       •   the external, named sub is called whenever accessing a FTS table
337           with that tokenizer
338
339       •   the inner, anonymous sub is called whenever a new string needs to
340           be tokenized (either for inserting new text into the table, or for
341           analyzing a query).
342
343       •   the innermost, anonymous sub is called repeatedly for retrieving
344           all terms within that string.
345
346       Using Search::Tokenizer
347
348       Instead of writing tokenizers by hand, you can grab one of those
349       already implemented in the Search::Tokenizer module. For example, if
350       you want ignore differences between accented characters, you can write
351       :
352
353         use Search::Tokenizer;
354         $dbh->do(<<"") or die DBI::errstr;
355         CREATE ... USING fts4(<columns>,
356                               tokenize=perl 'Search::Tokenizer::unaccent')
357
358       Alternatively, you can use "new" in Search::Tokenizer to build your own
359       tokenizer. Here is an example that treats compound words (words with an
360       internal dash or dot) as single tokens :
361
362         sub my_tokenizer {
363           return Search::Tokenizer->new(
364             regex => qr{\p{Word}+(?:[-./]\p{Word}+)*},
365            );
366         }
367

Fts4aux - Direct Access to the Full-Text Index

369       The content of a full-text index can be accessed through the virtual
370       table module "fts4aux". For example, assuming that our database
371       contains a full-text indexed table named "ft", we can declare :
372
373         CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)
374
375       and then query the "ft_terms" table to access the list of terms, their
376       frequency, etc.  Examples are documented in
377       <http://www.sqlite.org/fts3.html#fts4aux>.
378

How to spare database space

380       By default, FTS stores a complete copy of the indexed documents,
381       together with the fulltext index. On a large collection of documents,
382       this can consume quite a lot of disk space. However, FTS has some
383       options for compressing the documents, or even for not storing them at
384       all -- see <http://www.sqlite.org/fts3.html#fts4_options>.
385
386       In particular, the option for contentless FTS tables only stores the
387       fulltext index, without the original document content. This is
388       specified as "content=""", like in the following example :
389
390         CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)
391
392       Data can be inserted into such an FTS4 table using an INSERT
393       statements. However, unlike ordinary FTS4 tables, the user must supply
394       an explicit integer docid value. For example:
395
396         -- This statement is Ok:
397         INSERT INTO t1(docid, a, b) VALUES(1, 'a b c', 'd e f');
398
399         -- This statement causes an error, as no docid value has been provided:
400         INSERT INTO t1(a, b) VALUES('j k l', 'm n o');
401
402       Of course your application will need an algorithm for finding the
403       external resource corresponding to any docid stored within SQLite.
404
405       When using placeholders, the docid must be explicitly typed to INTEGER,
406       because this is a "hidden column" for which sqlite is not able to
407       automatically infer the proper type. So the following doesn't work :
408
409         my $sth = $dbh->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)");
410         $sth->execute(2, 'aa', 'bb'); # constraint error
411
412       but it works with an explicitly cast  :
413
414         my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)",
415         my $sth = $dbh->prepare(sql);
416         $sth->execute(2, 'aa', 'bb');
417
418       or with an explicitly typed "bind_param" in DBI :
419
420         use DBI qw/SQL_INTEGER/;
421         my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)";
422         my $sth = $dbh->prepare(sql);
423         $sth->bind_param(1, 2, SQL_INTEGER);
424         $sth->bind_param(2, "aa");
425         $sth->bind_param(3, "bb");
426         $sth->execute();
427
428       It is not possible to UPDATE or DELETE a row stored in a contentless
429       FTS4 table. Attempting to do so is an error.
430
431       Contentless FTS4 tables also support SELECT statements. However, it is
432       an error to attempt to retrieve the value of any table column other
433       than the docid column. The auxiliary function "matchinfo()" may be
434       used, but "snippet()" and "offsets()" may not, so if such functionality
435       is needed, it has to be directly programmed within the Perl
436       application.
437

AUTHOR

439       Laurent Dami <dami@cpan.org>
440
442       Copyright 2014 Laurent Dami.
443
444       Some parts borrowed from the <http://sqlite.org> documentation,
445       copyright 2014.
446
447       This documentation is in the public domain; you can redistribute it
448       and/or modify it under the same terms as Perl itself.
449
450
451
452perl v5.34.0                      2021-08-02   DBD::SQLite::Fulltext_search(3)
Impressum