Estraier(3) User Contributed Perl Documentation Estraier(3)

Perl Binding of Hyper Estraier

    use Estraier;

Hyper Estraier is a full-text search system for communities.

This package implements the core API of Hyper Estraier
( http://hyperestraier.sourceforge.net/ ) and includes native code
written in C with XS macros. It works on Linux, Mac OS X, Windows, and
other platforms, but the native libraries for each environment are
required to run programs. This package requires Perl 5.8.8 or later.

Setting
Install the latest version of Hyper Estraier.

Enter the subdirectory `perlnative' in the extracted package, then
perform the installation.

    cd perlnative
    ./configure
    make
    su
    make install

On Linux and other UNIX systems, set the environment variable
LD_LIBRARY_PATH so that "libestraier.so" can be found. On Mac OS X, set
the environment variable DYLD_LIBRARY_PATH so that "libestraier.dylib"
can be found. On Windows, set the environment variable PATH so that
"estraier.dll" can be found.

The package `Estraier' should be loaded in each source file of
application programs.

    use Estraier;

To enable runtime assertions, set the variable `$Estraier::DEBUG' to a
true value.

    $Estraier::DEBUG = 1;

Class Document
$doc = new Document(draft)
    Create a document object. `draft' specifies a string of draft
    data. If it is omitted, an empty document object is created.

$doc->add_attr(name, value)
    Add an attribute. `name' specifies the name of an attribute.
    `value' specifies the value of the attribute. If it is `undef',
    the attribute is removed. The return value is always `undef'.

$doc->add_text(text)
    Add a sentence of text. `text' specifies a sentence of text. The
    return value is always `undef'.

$doc->add_hidden_text(text)
    Add a hidden sentence. `text' specifies a hidden sentence. The
    return value is always `undef'.

$doc->set_keywords(kwords)
    Attach keywords. `kwords' specifies a reference to a hash of
    keywords. Keys of the hash should be keywords of the document and
    values should be their scores as decimal strings. The return
    value is always `undef'.
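
As an illustration of the hash shape that `set_keywords' expects, the following sketch builds one from plain word counts. The helper `keyword_scores' and the use of raw frequencies as scores are assumptions made for the example; they are not part of the Estraier API.

```perl
use strict;
use warnings;

# Hypothetical helper: count lowercase word frequencies in a text and
# return a hash reference shaped the way set_keywords() is documented to
# expect it (keyword => score as a decimal string).
sub keyword_scores {
    my ($text, $limit) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /(\w+)/g;
    # Most frequent first; ties broken alphabetically for stable output.
    my @words = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice(@words, $limit) if @words > $limit;
    return { map { $_ => sprintf("%d", $count{$_}) } @words };
}

my $kwords = keyword_scores("over the rainbow, way over the hill", 2);
# With the module loaded, the hash could then be attached with:
#   $doc->set_keywords($kwords);
```

Raw counts are only one possible scoring; any non-negative number written as a decimal string fits the description above.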

$doc->set_score(score)
    Set the substitute score. `score' specifies the substitute score.
    It should be zero or positive. The return value is always `undef'.

$doc->id()
    Get the ID number. The return value is the ID number of the
    document object. If the object has never been registered, -1 is
    returned.

$doc->attr_names()
    Get an array of attribute names of a document object. The return
    value is a reference to an array of attribute names.

$doc->attr(name)
    Get the value of an attribute. `name' specifies the name of an
    attribute. The return value is the value of the attribute or
    `undef' if it does not exist.

$doc->texts()
    Get an array of sentences of the text. The return value is a
    reference to an array of sentences of the text.

$doc->cat_texts()
    Concatenate sentences of the text of a document object. The return
    value is the concatenated sentences.

$doc->keywords()
    Get attached keywords. The return value is a reference to a hash
    of keywords and their scores as decimal strings. If no keyword is
    attached, `undef' is returned.

$doc->score()
    Get the substitute score. The return value is the substitute score
    or -1 if it is not set.

$doc->dump_draft()
    Dump draft data of a document object. The return value is the
    draft data.

$doc->make_snippet(words, wwidth, hwidth, awidth)
    Make a snippet of the body text. `words' specifies a reference to
    an array of words to be highlighted. `wwidth' specifies the whole
    width of the result. `hwidth' specifies the width of strings
    picked up from the beginning of the text. `awidth' specifies the
    width of strings picked up around each highlighted word. The
    return value is a snippet string of the body text, formatted as
    tab separated values. Each line is a string to be shown. Though
    most lines have only one field, some lines have two fields. If
    the second field exists, the first field is to be shown
    highlighted, and the second field is its normalized form.
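
The snippet format described for `make_snippet' can be turned into display text with a few lines of Perl. The sketch below is a minimal renderer working on a hand-written sample string; the function name `render_snippet', the bracket markup, and the treatment of an empty line as a gap between fragments are assumptions for illustration only.

```perl
use strict;
use warnings;

# Render a snippet string in the format described for make_snippet():
# one string per line, with an optional second tab-separated field that
# holds the normalized form of a highlighted word.
# Assumption: an empty line separates non-adjacent fragments, so it is
# rendered here as " ... ".
sub render_snippet {
    my ($snippet) = @_;
    my $out = "";
    foreach my $line (split(/\n/, $snippet)) {
        if ($line eq "") {
            $out .= " ... ";
            next;
        }
        my ($shown, $normalized) = split(/\t/, $line, 2);
        # Two fields mean the first one is a highlighted word.
        $out .= defined($normalized) ? "[$shown]" : $shown;
    }
    return $out;
}

# Hand-written sample in the documented layout; real input would come
# from $doc->make_snippet(...).
my $sample = "Somewhere over the \nrainbow\trainbow\n. Way up high.\n";
print render_snippet($sample), "\n";
```

A web application would typically replace the brackets with its own markup and escape the shown strings before output.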

Class Condition
Condition::SURE = 1 << 0
    option: check every N-gram key

Condition::USUAL = 1 << 1
    option: check N-gram keys skipping by one

Condition::FAST = 1 << 2
    option: check N-gram keys skipping by two

Condition::AGITO = 1 << 3
    option: check N-gram keys skipping by three

Condition::NOIDF = 1 << 4
    option: without TF-IDF tuning

Condition::SIMPLE = 1 << 10
    option: with the simplified phrase

Condition::ROUGH = 1 << 11
    option: with the rough phrase

Condition::UNION = 1 << 15
    option: with the union phrase

Condition::ISECT = 1 << 16
    option: with the intersection phrase

Condition::ECLSIMURL = 10.0
    eclipse tuning: consider URL

Condition::ECLSERV = 100.0
    eclipse tuning: on server basis

Condition::ECLDIR = 101.0
    eclipse tuning: on directory basis

Condition::ECLFILE = 102.0
    eclipse tuning: on file basis

$cond = new Condition()
    Create a search condition object.

$cond->set_phrase(phrase)
    Set the search phrase. `phrase' specifies a search phrase. The
    return value is always `undef'.

$cond->add_attr(expr)
    Add an expression for an attribute. `expr' specifies an expression
    for an attribute. The return value is always `undef'.

$cond->set_order(expr)
    Set the order of a condition object. `expr' specifies an
    expression for the order. By default, the order is by score
    descending. The return value is always `undef'.

$cond->set_max(max)
    Set the maximum number of documents to retrieve. `max' specifies
    the maximum number. By default, the number of retrieved documents
    is not limited.

$cond->set_skip(skip)
    Set the number of skipped documents. `skip' specifies the number
    of documents to be skipped in the search result. The return value
    is always `undef'.

$cond->set_options(options)
    Set options of retrieval. `options' specifies options:
    `Condition::SURE' specifies that it checks every N-gram key,
    `Condition::USUAL', which is the default, specifies that it checks
    N-gram keys skipping one key, `Condition::FAST' skips two keys,
    `Condition::AGITO' skips three keys, `Condition::NOIDF' specifies
    not to perform TF-IDF tuning, `Condition::SIMPLE' specifies to use
    the simplified phrase, `Condition::ROUGH' specifies to use the
    rough phrase, `Condition::UNION' specifies to use the union
    phrase, `Condition::ISECT' specifies to use the intersection
    phrase. Options can be combined by bitwise or. Skipping keys
    improves search speed but lowers the relevance ratio. The return
    value is always `undef'.
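
To make the bitwise combination concrete, here is a small sketch that builds an option mask and lists the names it contains. The hash `%OPT' is a local copy of the values listed in this document so the example runs without the module, and `option_names' is a hypothetical helper, not part of the Estraier API.

```perl
use strict;
use warnings;

# Local copies of the option values listed in this document, so that the
# sketch runs without the Estraier module installed. In real code, use
# Condition::SURE, Condition::NOIDF, and so on.
my %OPT = (
    SURE  => 1 << 0,  USUAL  => 1 << 1,  FAST  => 1 << 2,
    AGITO => 1 << 3,  NOIDF  => 1 << 4,  SIMPLE => 1 << 10,
    ROUGH => 1 << 11, UNION  => 1 << 15, ISECT => 1 << 16,
);

# Hypothetical helper: list the option names contained in a mask,
# ordered by bit value.
sub option_names {
    my ($mask) = @_;
    return grep { $mask & $OPT{$_} } sort { $OPT{$a} <=> $OPT{$b} } keys %OPT;
}

my $options = $OPT{SURE} | $OPT{NOIDF} | $OPT{SIMPLE};
print join(" | ", option_names($options)), "\n";
# With the module loaded, the equivalent call would be:
#   $cond->set_options(Condition::SURE | Condition::NOIDF | Condition::SIMPLE);
```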

$cond->set_auxiliary(min)
    Set permission to adopt the result of the auxiliary index. `min'
    specifies the minimum hits to adopt the result of the auxiliary
    index. If it is not more than 0, the auxiliary index is not used.
    By default, it is 32.

$cond->set_eclipse(limit)
    Set the lower limit of similarity eclipse. `limit' specifies the
    lower limit of similarity for documents to be eclipsed. Similarity
    is between 0.0 and 1.0. If `Condition::ECLSIMURL' is added to the
    limit, similarity is weighted by URL. If the limit is
    `Condition::ECLSERV', similarity is ignored and documents on the
    same server are eclipsed. If the limit is `Condition::ECLDIR',
    similarity is ignored and documents in the same directory are
    eclipsed. If the limit is `Condition::ECLFILE', similarity is
    ignored and documents of the same file are eclipsed.
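
Because `set_eclipse' packs several meanings into one number, a small decoder may help when reading code that uses it. The sketch below restates the rules above with local copies of the constants; `describe_eclipse' is a hypothetical helper, and the way it classifies values between 10.0 and 100.0 is an interpretation of the text rather than documented behavior.

```perl
use strict;
use warnings;

# Local copies of the eclipse constants listed in this document.
use constant {
    ECLSIMURL => 10.0,
    ECLSERV   => 100.0,
    ECLDIR    => 101.0,
    ECLFILE   => 102.0,
};

# Hypothetical helper: describe how a limit value would be interpreted
# according to the rules above.
sub describe_eclipse {
    my ($limit) = @_;
    return "same file"      if $limit == ECLFILE;
    return "same directory" if $limit == ECLDIR;
    return "same server"    if $limit == ECLSERV;
    if ($limit >= ECLSIMURL) {
        return sprintf("similarity >= %.2f, weighted by URL", $limit - ECLSIMURL);
    }
    return sprintf("similarity >= %.2f", $limit);
}

print describe_eclipse(0.75), "\n";
print describe_eclipse(0.75 + ECLSIMURL), "\n";
print describe_eclipse(ECLSERV), "\n";
# With the module loaded, a URL-weighted limit would be set with:
#   $cond->set_eclipse(0.75 + Condition::ECLSIMURL);
```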

$cond->set_distinct(name)
    Set the attribute distinction filter. `name' specifies the name of
    an attribute to be distinct. The return value is always `undef'.

Class Result
$result->doc_num()
    Get the number of documents. The return value is the number of
    documents in the result.

$result->get_doc_id(index)
    Get the ID number of a document. `index' specifies the index of a
    document. The return value is the ID number of the document or -1
    if the index is out of bounds.

$result->get_dbidx(index)
    Get the index of the container database of a document. `index'
    specifies the index of a document. The return value is the index
    of the container database of the document or -1 if the index is
    out of bounds.

$result->hint_words()
    Get an array of hint words. The return value is a reference to an
    array of hint words.

$result->hint(word)
    Get the value of a hint word. `word' specifies a hint word. An
    empty string means the number of the whole result. The return
    value is the number of documents corresponding to the hint word.
    If the word is in a negative condition, the value is negative.
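
The two hint methods are usually used together. The following sketch gathers them into one hash; it accepts any object that offers `hint_words' and `hint', so a small stand-in class (`FakeResult', invented for this example) lets it run without a database. The helper name `hint_summary' is likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect hint counts from any object that offers
# the hint_words() and hint() methods described above.
sub hint_summary {
    my ($result) = @_;
    my %summary = (total => $result->hint(""), words => {});
    foreach my $word (@{ $result->hint_words() }) {
        next if $word eq "";    # the empty string is the overall total
        $summary{words}{$word} = $result->hint($word);
    }
    return \%summary;
}

# Stand-in for a real result object so the sketch runs on its own.
package FakeResult;
sub new        { my ($class, %hints) = @_; return bless({ %hints }, $class) }
sub hint_words { my ($self) = @_; return [ sort keys %$self ] }
sub hint       { my ($self, $word) = @_; return $self->{$word} }

package main;
my $summary = hint_summary(FakeResult->new("" => 3, rainbow => 5, lullaby => -2));
printf("%d hits in total\n", $summary->{total});
foreach my $word (sort keys %{ $summary->{words} }) {
    printf("%s: %d\n", $word, $summary->{words}{$word});
}
```

With the real module, the argument would be the object returned by `$db->search($cond)'.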

$result->get_score(index)
    Get the score of a document. `index' specifies the index of a
    document. The return value is the score of the document or -1 if
    the index is out of bounds.

$result->get_shadows(id)
    Get an array of ID numbers of eclipsed documents of a document.
    `id' specifies the ID number of a parent document. The return
    value is a reference to an array whose elements express the ID
    numbers and their scores alternately.
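
Since `get_shadows' returns IDs and scores interleaved in one flat array, it is often convenient to regroup them. A minimal sketch, using sample data in the documented layout and a hypothetical helper named `shadow_pairs':

```perl
use strict;
use warnings;

# Hypothetical helper: turn the flat list described for get_shadows()
# (ID, score, ID, score, ...) into a list of [id, score] pairs.
# A trailing element without a partner is ignored.
sub shadow_pairs {
    my ($shadows) = @_;
    my @pairs;
    for (my $i = 0; $i + 1 < @$shadows; $i += 2) {
        push(@pairs, [ $shadows->[$i], $shadows->[$i + 1] ]);
    }
    return @pairs;
}

# Sample data in the documented layout; real data would come from
# $result->get_shadows($id).
foreach my $pair (shadow_pairs([ 12, 9500, 27, 8100 ])) {
    printf("shadow id=%d score=%d\n", @$pair);
}
```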

Class Database
Database::VERSION = "0.0.0"
    version of Hyper Estraier

Database::ERRNOERR = 0
    error code: no error

Database::ERRINVAL = 1
    error code: invalid argument

Database::ERRACCES = 2
    error code: access forbidden

Database::ERRLOCK = 3
    error code: lock failure

Database::ERRDB = 4
    error code: database problem

Database::ERRIO = 5
    error code: I/O problem

Database::ERRNOITEM = 6
    error code: no item

Database::ERRMISC = 9999
    error code: miscellaneous

Database::DBREADER = 1 << 0
    open mode: open as a reader

Database::DBWRITER = 1 << 1
    open mode: open as a writer

Database::DBCREAT = 1 << 2
    open mode: a writer creating

Database::DBTRUNC = 1 << 3
    open mode: a writer truncating

Database::DBNOLCK = 1 << 4
    open mode: open without locking

Database::DBLCKNB = 1 << 5
    open mode: lock without blocking

Database::DBPERFNG = 1 << 10
    open mode: use perfect N-gram analyzer

Database::DBCHRCAT = 1 << 11
    open mode: use character category analyzer

Database::DBSMALL = 1 << 20
    open mode: small tuning

Database::DBLARGE = 1 << 21
    open mode: large tuning

Database::DBHUGE = 1 << 22
    open mode: huge tuning

Database::DBHUGE2 = 1 << 23
    open mode: huge tuning second

Database::DBHUGE3 = 1 << 24
    open mode: huge tuning third

Database::DBSCVOID = 1 << 25
    open mode: store scores as void

Database::DBSCINT = 1 << 26
    open mode: store scores as integer

Database::DBSCASIS = 1 << 27
    open mode: refrain from adjustment of scores

Database::IDXATTRSEQ = 0
    attribute index type: for multipurpose sequential access method

Database::IDXATTRSTR = 1
    attribute index type: for narrowing with attributes as strings

Database::IDXATTRNUM = 2
    attribute index type: for narrowing with attributes as numbers

Database::OPTNOPURGE = 1 << 0
    optimize option: omit purging dispensable regions of deleted
    documents

Database::OPTNODBOPT = 1 << 1
    optimize option: omit optimization of the database files

Database::MGCLEAN = 1 << 0
    merge option: clean up dispensable regions

Database::PDCLEAN = 1 << 0
    put_doc option: clean up dispensable regions

Database::PDWEIGHT = 1 << 1
    put_doc option: weight scores statically when indexing

Database::ODCLEAN = 1 << 0
    out_doc option: clean up dispensable regions

Database::GDNOATTR = 1 << 0
    get_doc option: no attributes

Database::GDNOTEXT = 1 << 1
    get_doc option: no text

Database::GDNOKWD = 1 << 2
    get_doc option: no keywords

$db = new Database()
    Create a database object.

Database::search_meta(dbs, cond)
    Search multiple databases for documents corresponding to a
    condition. `dbs' specifies a reference to an array whose elements
    are database objects. `cond' specifies a condition object. The
    return value is a result object. On error, `undef' is returned.
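
After a meta search, each hit has to be traced back to the database that holds it, which is what `get_dbidx' in Class Result is for. The sketch below shows one way to pair hits with databases; `meta_hits' is a hypothetical helper that relies only on `doc_num', `get_dbidx', and `get_doc_id' as described in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: after a meta search, pair each hit with the
# database it came from. $dbs is the same array reference that was
# given to search_meta(); $result is the returned result object.
sub meta_hits {
    my ($dbs, $result) = @_;
    my @hits;
    foreach my $i (0 .. $result->doc_num() - 1) {
        my $dbidx = $result->get_dbidx($i);
        my $id    = $result->get_doc_id($i);
        next if $dbidx < 0 || $id < 0;    # index out of bounds
        push(@hits, { db => $dbs->[$dbidx], id => $id });
    }
    return @hits;
}

# Intended use with the real module (not executed here):
#   my $dbs    = [ $db1, $db2 ];
#   my $result = Database::search_meta($dbs, $cond);
#   foreach my $hit (meta_hits($dbs, $result)) {
#       my $doc = $hit->{db}->get_doc($hit->{id}, 0);
#   }
```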

$db->err_msg(ecode)
    Get the string of an error code. `ecode' specifies an error code.
    The return value is the string of the error code.

$db->open(name, omode)
    Open a database. `name' specifies the name of a database
    directory. `omode' specifies open modes: `Database::DBWRITER' as a
    writer, `Database::DBREADER' as a reader.

    If the mode is `Database::DBWRITER', the following may be added by
    bitwise or: `Database::DBCREAT', which creates a new database if
    none exists, and `Database::DBTRUNC', which creates a new database
    even if one already exists.

    Both `Database::DBREADER' and `Database::DBWRITER' can be combined
    by bitwise or with `Database::DBNOLCK', which opens the database
    without file locking, or `Database::DBLCKNB', which performs
    locking without blocking. If `Database::DBNOLCK' is used, the
    application is responsible for exclusion control.

    `Database::DBCREAT' can be combined by bitwise or with the
    following options:

      `Database::DBPERFNG': N-gram analysis is performed against
        European text also.
      `Database::DBCHRCAT': character category analysis is performed
        instead of N-gram analysis.
      `Database::DBSMALL': the index is tuned to register less than
        50000 documents.
      `Database::DBLARGE': the index is tuned to register more than
        300000 documents.
      `Database::DBHUGE': the index is tuned to register more than
        1000000 documents.
      `Database::DBHUGE2': the index is tuned to register more than
        5000000 documents.
      `Database::DBHUGE3': the index is tuned to register more than
        10000000 documents.
      `Database::DBSCVOID': scores are stored as void.
      `Database::DBSCINT': scores are stored as 32-bit integers.
      `Database::DBSCASIS': scores are stored as-is and marked not to
        be tuned when searching.

    The return value is true on success, false otherwise.
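
The open modes form a small hierarchy: some flags only make sense for a writer, and some only when creating. The following sketch encodes a strict reading of the description above in a hypothetical helper, `open_mode', using local copies of a few values listed in this document. Whether the library itself rejects these combinations is not stated here, so the checks are an assumption of the example.

```perl
use strict;
use warnings;

# Local copies of some open-mode values listed in this document; real
# code would use Database::DBWRITER and friends.
my %MODE = (
    DBREADER => 1 << 0,  DBWRITER => 1 << 1,  DBCREAT => 1 << 2,
    DBTRUNC  => 1 << 3,  DBNOLCK  => 1 << 4,  DBLCKNB => 1 << 5,
    DBPERFNG => 1 << 10, DBCHRCAT => 1 << 11,
);

# Hypothetical helper: combine mode names by bitwise or and reject
# combinations that the description above does not allow.
sub open_mode {
    my (@names) = @_;
    my $omode = 0;
    foreach my $name (@names) {
        die "unknown open mode: $name\n" unless exists $MODE{$name};
        $omode |= $MODE{$name};
    }
    my $writer = $omode & $MODE{DBWRITER};
    my $creat  = $omode & $MODE{DBCREAT};
    die "DBCREAT and DBTRUNC require DBWRITER\n"
        if !$writer && ($omode & ($MODE{DBCREAT} | $MODE{DBTRUNC}));
    die "analyzer options require DBCREAT\n"
        if !$creat && ($omode & ($MODE{DBPERFNG} | $MODE{DBCHRCAT}));
    return $omode;
}

my $omode = open_mode(qw(DBWRITER DBCREAT DBPERFNG));
printf("omode = %d\n", $omode);
# With the module loaded, the equivalent call would be:
#   $db->open("casket",
#       Database::DBWRITER | Database::DBCREAT | Database::DBPERFNG);
```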

$db->close()
    Close the database. The return value is true on success, false
    otherwise.

$db->error()
    Get the code of the last error. The return value is the code of
    the last error that happened.

$db->fatal()
    Check whether the database has a fatal error. The return value is
    true if the database has a fatal error, else it is false.
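
Most methods of this class report failure through a false return value, with the details available from `error' and `err_msg'. One possible way to avoid repeating that pattern is a small wrapper such as the hypothetical `check' below; it relies only on those two methods and is a sketch, not part of the Estraier API.

```perl
use strict;
use warnings;

# Hypothetical helper: die with the database's own error message when a
# call that reports success as true/false has failed. It only relies on
# the error() and err_msg() methods of the database object.
sub check {
    my ($db, $ok, $what) = @_;
    return 1 if $ok;
    my $ecode = $db->error();
    die sprintf("%s failed: %s (code %d)\n", $what, $db->err_msg($ecode), $ecode);
}

# Intended use with the real module (not executed here):
#   my $db = new Database();
#   check($db, $db->open("casket", Database::DBREADER), "open");
#   check($db, $db->close(), "close");
```

Callers that prefer to continue after a failure can wrap the call in `eval' and inspect `$@'.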

$db->add_attr_index(name, type)
    Add an index for narrowing or sorting with document attributes.
    `name' specifies the name of an attribute. `type' specifies the
    data type of the attribute index: `Database::IDXATTRSEQ' for the
    multipurpose sequential access method, `Database::IDXATTRSTR' for
    narrowing with attributes as strings, `Database::IDXATTRNUM' for
    narrowing with attributes as numbers. The return value is true on
    success, false otherwise.

$db->flush(max)
    Flush index words in the cache. `max' specifies the maximum number
    of words to be flushed. If it is not more than zero, all words are
    flushed. The return value is true on success, false otherwise.

$db->sync()
    Synchronize updating contents. The return value is true on
    success, false otherwise.

$db->optimize(options)
    Optimize the database. `options' specifies options:
    `Database::OPTNOPURGE' to omit purging dispensable regions of
    deleted documents, `Database::OPTNODBOPT' to omit optimization of
    the database files. The two can be specified at the same time by
    bitwise or. The return value is true on success, false otherwise.

$db->merge(name, options)
    Merge another database. `name' specifies the name of another
    database directory. `options' specifies options:
    `Database::MGCLEAN' to clean up dispensable regions of the deleted
    document. The return value is true on success, false otherwise.

$db->put_doc(doc, options)
    Add a document. `doc' specifies a document object. The document
    object should have the URI attribute. `options' specifies options:
    `Database::PDCLEAN' to clean up dispensable regions of the
    overwritten document, `Database::PDWEIGHT' to weight scores
    statically when indexing. The return value is true on success,
    false otherwise.

$db->out_doc(id, options)
    Remove a document. `id' specifies the ID number of a registered
    document. `options' specifies options: `Database::ODCLEAN' to
    clean up dispensable regions of the deleted document. The return
    value is true on success, false otherwise.

$db->edit_doc(doc)
    Edit attributes of a document. `doc' specifies a document object.
    The return value is true on success, false otherwise.

$db->get_doc(id, options)
    Retrieve a document. `id' specifies the ID number of a registered
    document. `options' specifies options: `Database::GDNOATTR' to
    ignore attributes, `Database::GDNOTEXT' to ignore the body text,
    `Database::GDNOKWD' to ignore keywords. The three can be specified
    at the same time by bitwise or. The return value is a document
    object. On error, `undef' is returned.

$db->get_doc_attr(id, name)
    Retrieve the value of an attribute of a document. `id' specifies
    the ID number of a registered document. `name' specifies the name
    of an attribute. The return value is the value of the attribute or
    `undef' if it does not exist.

$db->uri_to_id(uri)
    Get the ID of a document specified by URI. `uri' specifies the URI
    of a registered document. The return value is the ID of the
    document. On error, -1 is returned.

$db->name()
    Get the name. The return value is the name of the database.

$db->doc_num()
    Get the number of documents. The return value is the number of
    documents in the database.

$db->word_num()
    Get the number of unique words. The return value is the number of
    unique words in the database.

$db->size()
    Get the size. The return value is the size of the database.

$db->search(cond)
    Search for documents corresponding to a condition. `cond'
    specifies a condition object. The return value is a result object.
    On error, `undef' is returned.

$db->scan_doc(doc, cond)
    Check whether a document object matches the phrase of a search
    condition object definitely. `doc' specifies a document object.
    `cond' specifies a search condition object. The return value is
    true if the document matches the phrase of the condition object
    definitely, else it is false.

$db->set_cache_size(size, anum, tnum, rnum)
    Set the maximum size of the cache memory. `size' specifies the
    maximum size of the index cache. By default, it is 64MB. If it is
    not more than 0, the current size is not changed. `anum' specifies
    the maximum number of cached records for document attributes. By
    default, it is 8192. If it is not more than 0, the current size is
    not changed. `tnum' specifies the maximum number of cached records
    for document texts. By default, it is 1024. If it is not more
    than 0, the current size is not changed. `rnum' specifies the
    maximum number of cached records for occurrence results. By
    default, it is 256. If it is not more than 0, the current size is
    not changed. The return value is always `undef'.

$db->add_pseudo_index(path)
    Add a pseudo index directory. `path' specifies the path of a
    pseudo index directory. The return value is true on success,
    false otherwise.

$db->set_wildmax(num)
    Set the maximum number of expansions of wild cards. `num'
    specifies the maximum number of expansions of wild cards. The
    return value is always `undef'.

$db->set_informer(informer)
    Set the callback function to inform of database events. `informer'
    specifies the name of an arbitrary function. The function should
    have one parameter for a string of a message of each event. The
    return value is always `undef'.

Gatherer
The following is the simplest implementation of a gatherer.

    use strict;
    use warnings;
    use Estraier;
    $Estraier::DEBUG = 1;

    # create the database object
    my $db = new Database();

    # open the database
    unless($db->open("casket", Database::DBWRITER | Database::DBCREAT)){
        printf("error: %s\n", $db->err_msg($db->error()));
        exit;
    }

    # create a document object
    my $doc = new Document();

    # add attributes to the document object
    $doc->add_attr('@uri', "https://estraier.gov/example.txt");
    $doc->add_attr('@title', "Over the Rainbow");

    # add the body text to the document object
    $doc->add_text("Somewhere over the rainbow. Way up high.");
    $doc->add_text("There's a land that I heard of once in a lullaby.");

    # register the document object to the database
    unless($db->put_doc($doc, Database::PDCLEAN)){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

    # close the database
    unless($db->close()){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

Searcher
The following is the simplest implementation of a searcher.

    use strict;
    use warnings;
    use Estraier;
    $Estraier::DEBUG = 1;

    # create the database object
    my $db = new Database();

    # open the database
    unless($db->open("casket", Database::DBREADER)){
        printf("error: %s\n", $db->err_msg($db->error()));
        exit;
    }

    # create a search condition object
    my $cond = new Condition();

    # set the search phrase to the search condition object
    $cond->set_phrase("rainbow AND lullaby");

    # get the result of search; `undef' is returned on error
    my $result = $db->search($cond);
    unless(defined($result)){
        printf("error: %s\n", $db->err_msg($db->error()));
        $db->close();
        exit;
    }

    # for each document in the result
    my $dnum = $result->doc_num();
    foreach my $i (0..$dnum-1){
        # retrieve the document object
        my $doc = $db->get_doc($result->get_doc_id($i), 0);
        next unless(defined($doc));
        # display attributes
        my $uri = $doc->attr('@uri');
        printf("URI: %s\n", $uri) if defined($uri);
        my $title = $doc->attr('@title');
        printf("Title: %s\n", $title) if defined($title);
        # display the body text
        my $texts = $doc->texts();
        foreach my $text (@$texts){
            printf("%s\n", $text);
        }
    }

    # close the database
    unless($db->close()){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

Copyright (C) 2004-2007 Mikio Hirabayashi
All rights reserved.

Hyper Estraier is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as
published by the Free Software Foundation; either version 2.1 of the
License or any later version. Hyper Estraier is distributed in the
hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with Hyper Estraier; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA.

perl v5.34.0 2021-07-25 Estraier(3)