Estraier(3)          User Contributed Perl Documentation          Estraier(3)

Perl Binding of Hyper Estraier

    use Estraier;

Hyper Estraier is a full-text search system for communities.

This package implements the core API of Hyper Estraier
( http://hyperestraier.sourceforge.net/ ) and includes native code written
in C with XS macros. It works on Linux, Mac OS X, Windows, and other
platforms, but the native libraries for the target environment must be
installed for programs to run. This package requires Perl 5.8.8 or later.

Setting

Install the latest version of Hyper Estraier.

Enter the subdirectory `perlnative' of the extracted package, then perform
the installation.

    cd perlnative
    ./configure
    make
    su
    make install

The native library must be findable at run time. On Linux and other UNIX
systems, set the environment variable LD_LIBRARY_PATH so that
"libestraier.so" can be found. On Mac OS X, set DYLD_LIBRARY_PATH so that
"libestraier.dylib" can be found. On Windows, set PATH so that
"estraier.dll" can be found.

Load the package `Estraier' in each source file of an application program.

    use Estraier;

To enable runtime assertions, set the variable `$Estraier::DEBUG' to a true
value.

    $Estraier::DEBUG = 1;

Class Document

$doc = new Document(draft)
    Create a document object. `draft' specifies a string of draft data.
    If it is omitted, an empty document object is created.

$doc->add_attr(name, value)
    Add an attribute. `name' specifies the name of an attribute. `value'
    specifies the value of the attribute. If it is `undef', the attribute
    is removed. The return value is always `undef'.

$doc->add_text(text)
    Add a sentence of text. `text' specifies a sentence of text. The
    return value is always `undef'.

$doc->add_hidden_text(text)
    Add a hidden sentence. `text' specifies a hidden sentence. The return
    value is always `undef'.

$doc->set_keywords(kwords)
    Attach keywords. `kwords' specifies a reference to a hash of keywords.
    Keys of the hash should be keywords of the document and values should
    be their scores as decimal strings. The return value is always
    `undef'.

$doc->set_score(score)
    Set the substitute score. `score' specifies the substitute score. It
    should be zero or positive. The return value is always `undef'.

$doc->id()
    Get the ID number. The return value is the ID number of the document
    object. If the object has never been registered, -1 is returned.

$doc->attr_names()
    Get an array of attribute names of a document object. The return
    value is a reference to an array of attribute names.

$doc->attr(name)
    Get the value of an attribute. `name' specifies the name of an
    attribute. The return value is the value of the attribute, or `undef'
    if it does not exist.

$doc->texts()
    Get an array of sentences of the text. The return value is a
    reference to an array of sentences of the text.

$doc->cat_texts()
    Concatenate sentences of the text of a document object. The return
    value is the concatenated sentences.

$doc->keywords()
    Get attached keywords. The return value is a reference to a hash of
    keywords and their scores as decimal strings. If no keyword is
    attached, `undef' is returned.

$doc->score()
    Get the substitute score. The return value is the substitute score,
    or -1 if it is not set.

$doc->dump_draft()
    Dump draft data of a document object. The return value is the draft
    data.

$doc->make_snippet(words, wwidth, hwidth, awidth)
    Make a snippet of the body text. `words' specifies a reference to an
    array of words to be highlighted. `wwidth' specifies the whole width
    of the result. `hwidth' specifies the width of strings picked up from
    the beginning of the text. `awidth' specifies the width of strings
    picked up around each highlighted word. The return value is a snippet
    string of the body text, formatted as tab-separated values. Each line
    is a string to be shown. Most lines have only one field, but some
    have two. If the second field exists, the first field is to be shown
    highlighted, and the second field is its normalized form.
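
The snippet format described above can be consumed with ordinary string
handling. The following is a minimal sketch, not part of the binding: the
function name `render_snippet', the default markers, and the sample string
are all hypothetical, and empty lines are simply skipped because the text
above does not say what they mean.

```perl
use strict;
use warnings;

# Turn a make_snippet() result into one display string.
# Lines with a second (normalized) field are wrapped in the given markers.
# render_snippet and its default markers are illustrative assumptions.
sub render_snippet {
    my ($snippet, $open, $close) = @_;
    $open  = '[' unless defined $open;
    $close = ']' unless defined $close;
    my $out = '';
    foreach my $line (split(/\n/, $snippet)) {
        next if $line eq '';    # nothing to show on an empty line
        my ($shown, $normalized) = split(/\t/, $line, 2);
        $out .= defined($normalized) ? $open . $shown . $close : $shown;
    }
    return $out;
}

# Hand-written sample in the documented shape (not real library output).
my $sample = "Somewhere over the \nrainbow\trainbow\n. Way up high.\n";
print render_snippet($sample), "\n";
```

With the binding loaded, the argument would come from a call such as
`$doc->make_snippet(["rainbow"], 120, 40, 40)', where the three widths are
arbitrary example values.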

Class Condition

Condition::SURE = 1 << 0
    option: check every N-gram key

Condition::USUAL = 1 << 1
    option: check N-gram keys skipping by one

Condition::FAST = 1 << 2
    option: check N-gram keys skipping by two

Condition::AGITO = 1 << 3
    option: check N-gram keys skipping by three

Condition::NOIDF = 1 << 4
    option: without TF-IDF tuning

Condition::SIMPLE = 1 << 10
    option: with the simplified phrase

Condition::ROUGH = 1 << 11
    option: with the rough phrase

Condition::UNION = 1 << 15
    option: with the union phrase

Condition::ISECT = 1 << 16
    option: with the intersection phrase

Condition::ECLSIMURL = 10.0
    eclipse tuning: consider URL

Condition::ECLSERV = 100.0
    eclipse tuning: on server basis

Condition::ECLDIR = 101.0
    eclipse tuning: on directory basis

Condition::ECLFILE = 102.0
    eclipse tuning: on file basis

$cond = new Condition()
    Create a search condition object.

$cond->set_phrase(phrase)
    Set the search phrase. `phrase' specifies a search phrase. The return
    value is always `undef'.

$cond->add_attr(expr)
    Add an expression for an attribute. `expr' specifies an expression
    for an attribute. The return value is always `undef'.

$cond->set_order(expr)
    Set the order of a condition object. `expr' specifies an expression
    for the order. By default, the order is by score, descending. The
    return value is always `undef'.

$cond->set_max(max)
    Set the maximum number of documents to retrieve. `max' specifies the
    maximum number. By default, the number of retrieved documents is not
    limited.

$cond->set_skip(skip)
    Set the number of skipped documents. `skip' specifies the number of
    documents to be skipped in the search result. The return value is
    always `undef'.

$cond->set_options(options)
    Set options of retrieval. `options' specifies one or more of the
    following, combined by bitwise or:

        `Condition::SURE'     check every N-gram key
        `Condition::USUAL'    check N-gram keys, skipping one key (default)
        `Condition::FAST'     check N-gram keys, skipping two keys
        `Condition::AGITO'    check N-gram keys, skipping three keys
        `Condition::NOIDF'    do not perform TF-IDF tuning
        `Condition::SIMPLE'   use the simplified phrase
        `Condition::ROUGH'    use the rough phrase
        `Condition::UNION'    use the union phrase
        `Condition::ISECT'    use the intersection phrase

    Skipping keys makes the search faster but lowers the relevance of the
    result. The return value is always `undef'.
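
Combining options by bitwise or works as with any bit flags. The sketch
below uses local stand-in constants whose values are copied from the
constant list of this class, so it runs without the binding; with `use
Estraier' in effect, `Condition::SURE' and the others would be used
instead.

```perl
use strict;
use warnings;

# Local stand-ins with the values listed for Class Condition.
# They are defined here only so the sketch runs without the binding.
use constant {
    SURE   => 1 << 0,
    USUAL  => 1 << 1,
    FAST   => 1 << 2,
    AGITO  => 1 << 3,
    NOIDF  => 1 << 4,
    SIMPLE => 1 << 10,
    ROUGH  => 1 << 11,
    UNION  => 1 << 15,
    ISECT  => 1 << 16,
};

# Check every N-gram key and interpret the phrase as a simplified phrase.
my $options = SURE | SIMPLE;
printf("options = %d\n", $options);

# A single flag can be tested with bitwise and.
print "simplified phrase requested\n" if $options & SIMPLE;

# With the binding loaded, the equivalent call would be:
#   $cond->set_options(Condition::SURE | Condition::SIMPLE);
```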

$cond->set_auxiliary(min)
    Set permission to adopt the result of the auxiliary index. `min'
    specifies the minimum number of hits required to adopt the result of
    the auxiliary index. If it is not more than 0, the auxiliary index is
    not used. By default, it is 32.

$cond->set_eclipse(limit)
    Set the lower limit of similarity eclipse. `limit' specifies the
    lower limit of similarity for documents to be eclipsed. Similarity is
    between 0.0 and 1.0. If `Condition::ECLSIMURL' is added to the limit,
    similarity is weighted by URL. If the limit is `Condition::ECLSERV',
    similarity is ignored and documents on the same server are eclipsed.
    If the limit is `Condition::ECLDIR', similarity is ignored and
    documents in the same directory are eclipsed. If the limit is
    `Condition::ECLFILE', similarity is ignored and documents of the same
    file are eclipsed.
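
Because the eclipse limit mixes a similarity value with special constants,
a small helper can make the intent explicit. This is a sketch under stated
assumptions: `eclipse_limit' is a hypothetical name, and the constant is a
local stand-in carrying the value listed for `Condition::ECLSIMURL'.

```perl
use strict;
use warnings;

# Local stand-in with the value listed for Condition::ECLSIMURL.
use constant ECLSIMURL => 10.0;

# Build the argument for set_eclipse() from a similarity in [0.0, 1.0].
# eclipse_limit is an illustrative helper, not part of the binding.
sub eclipse_limit {
    my ($similarity, $weight_by_url) = @_;
    die "similarity must be between 0.0 and 1.0\n"
        if $similarity < 0.0 || $similarity > 1.0;
    return $weight_by_url ? $similarity + ECLSIMURL : $similarity;
}

printf("plain limit:        %.2f\n", eclipse_limit(0.75, 0));
printf("URL-weighted limit: %.2f\n", eclipse_limit(0.75, 1));

# With the binding loaded, the value would be passed on as:
#   $cond->set_eclipse(eclipse_limit(0.75, 1));
```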

$cond->set_distinct(name)
    Set the attribute distinction filter. `name' specifies the name of an
    attribute to be distinct. The return value is always `undef'.

Class Result

$result->doc_num()
    Get the number of documents. The return value is the number of
    documents in the result.

$result->get_doc_id(index)
    Get the ID number of a document. `index' specifies the index of a
    document. The return value is the ID number of the document, or -1 if
    the index is out of bounds.

$result->get_dbidx(index)
    Get the index of the container database of a document. `index'
    specifies the index of a document. The return value is the index of
    the container database of the document, or -1 if the index is out of
    bounds.

$result->hint_words()
    Get an array of hint words. The return value is a reference to an
    array of hint words.

$result->hint(word)
    Get the value of a hint word. `word' specifies a hint word. An empty
    string stands for the whole result. The return value is the number of
    documents corresponding to the hint word. If the word is in a
    negative condition, the value is negative.

$result->get_score(index)
    Get the score of a document. `index' specifies the index of a
    document. The return value is the score of the document, or -1 if the
    index is out of bounds.

$result->get_shadows(id)
    Get an array of ID numbers of eclipsed documents of a document. `id'
    specifies the ID number of a parent document. The return value is a
    reference to an array whose elements express the ID numbers and their
    scores alternately.
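
Since IDs and scores alternate in one flat array, it is usually convenient
to regroup them into pairs. A minimal sketch follows; `shadow_pairs' is a
hypothetical helper and the sample array is made up for illustration.

```perl
use strict;
use warnings;

# Regroup a flat [id, score, id, score, ...] array into a list of pairs.
# shadow_pairs is an illustrative helper, not part of the binding.
sub shadow_pairs {
    my ($shadows) = @_;
    my @pairs;
    for (my $i = 0; $i + 1 < @$shadows; $i += 2) {
        push(@pairs, { id => $shadows->[$i], score => $shadows->[$i + 1] });
    }
    return \@pairs;
}

# Made-up data in the documented shape; with the binding loaded it would
# come from $result->get_shadows($id).
my $shadows = [12, 980, 15, 770];
foreach my $pair (@{shadow_pairs($shadows)}) {
    printf("shadow id=%d score=%d\n", $pair->{id}, $pair->{score});
}
```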

Class Database

Database::VERSION = "0.0.0"
    version of Hyper Estraier

Database::ERRNOERR = 0
    error code: no error

Database::ERRINVAL = 1
    error code: invalid argument

Database::ERRACCES = 2
    error code: access forbidden

Database::ERRLOCK = 3
    error code: lock failure

Database::ERRDB = 4
    error code: database problem

Database::ERRIO = 5
    error code: I/O problem

Database::ERRNOITEM = 6
    error code: no item

Database::ERRMISC = 9999
    error code: miscellaneous

Database::DBREADER = 1 << 0
    open mode: open as a reader

Database::DBWRITER = 1 << 1
    open mode: open as a writer

Database::DBCREAT = 1 << 2
    open mode: a writer creating

Database::DBTRUNC = 1 << 3
    open mode: a writer truncating

Database::DBNOLCK = 1 << 4
    open mode: open without locking

Database::DBLCKNB = 1 << 5
    open mode: lock without blocking

Database::DBPERFNG = 1 << 10
    open mode: use perfect N-gram analyzer

Database::DBCHRCAT = 1 << 11
    open mode: use character category analyzer

Database::DBSMALL = 1 << 20
    open mode: small tuning

Database::DBLARGE = 1 << 21
    open mode: large tuning

Database::DBHUGE = 1 << 22
    open mode: huge tuning

Database::DBHUGE2 = 1 << 23
    open mode: huge tuning second

Database::DBHUGE3 = 1 << 24
    open mode: huge tuning third

Database::DBSCVOID = 1 << 25
    open mode: store scores as void

Database::DBSCINT = 1 << 26
    open mode: store scores as integer

Database::DBSCASIS = 1 << 27
    open mode: refrain from adjustment of scores

Database::IDXATTRSEQ = 0
    attribute index type: for multipurpose sequential access method

Database::IDXATTRSTR = 1
    attribute index type: for narrowing with attributes as strings

Database::IDXATTRNUM = 2
    attribute index type: for narrowing with attributes as numbers

Database::OPTNOPURGE = 1 << 0
    optimize option: omit purging dispensable regions of deleted documents

Database::OPTNODBOPT = 1 << 1
    optimize option: omit optimization of the database files

Database::MGCLEAN = 1 << 0
    merge option: clean up dispensable regions

Database::PDCLEAN = 1 << 0
    put_doc option: clean up dispensable regions

Database::PDWEIGHT = 1 << 1
    put_doc option: weight scores statically when indexing

Database::ODCLEAN = 1 << 0
    out_doc option: clean up dispensable regions

Database::GDNOATTR = 1 << 0
    get_doc option: no attributes

Database::GDNOTEXT = 1 << 1
    get_doc option: no text

Database::GDNOKWD = 1 << 2
    get_doc option: no keywords

$db = new Database()
    Create a database object.

Database::search_meta(dbs, cond)
    Search multiple databases for documents corresponding to a condition.
    `dbs' specifies a reference to an array whose elements are database
    objects. `cond' specifies a condition object. The return value is a
    result object. On error, `undef' is returned.

$db->err_msg(ecode)
    Get the string of an error code. `ecode' specifies an error code. The
    return value is the string of the error code.

$db->open(name, omode)
    Open a database. `name' specifies the name of a database directory.
    `omode' specifies open modes: `Database::DBWRITER' as a writer,
    `Database::DBREADER' as a reader.

    If the mode is `Database::DBWRITER', the following may be added by
    bitwise or: `Database::DBCREAT', which creates a new database if none
    exists, and `Database::DBTRUNC', which creates a new database even if
    one already exists.

    Both `Database::DBREADER' and `Database::DBWRITER' can be combined by
    bitwise or with `Database::DBNOLCK', which opens the database without
    file locking, or `Database::DBLCKNB', which performs locking without
    blocking. If `Database::DBNOLCK' is used, the application is
    responsible for exclusion control.

    `Database::DBCREAT' can be combined by bitwise or with the following:

        `Database::DBPERFNG'   perform N-gram analysis on European text
                               as well
        `Database::DBCHRCAT'   perform character category analysis
                               instead of N-gram analysis
        `Database::DBSMALL'    tune the index for fewer than 50000
                               documents
        `Database::DBLARGE'    tune the index for more than 300000
                               documents
        `Database::DBHUGE'     tune the index for more than 1000000
                               documents
        `Database::DBHUGE2'    tune the index for more than 5000000
                               documents
        `Database::DBHUGE3'    tune the index for more than 10000000
                               documents
        `Database::DBSCVOID'   store scores as void
        `Database::DBSCINT'    store scores as 32-bit integers
        `Database::DBSCASIS'   store scores as-is and mark them not to
                               be tuned when searching

    The return value is true on success, else it is false.
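
The combination rules for open modes can be captured in a small helper.
This is a sketch only: `open_mode' and its option names are hypothetical,
and the constants are local stand-ins with the values listed for Class
Database, so the code runs without the binding.

```perl
use strict;
use warnings;

# Local stand-ins with the values listed for Class Database.
use constant {
    DBREADER => 1 << 0,
    DBWRITER => 1 << 1,
    DBCREAT  => 1 << 2,
    DBTRUNC  => 1 << 3,
    DBNOLCK  => 1 << 4,
    DBLCKNB  => 1 << 5,
};

# Build an open mode from named options (hypothetical helper).
# Creation and truncation flags are only added for a writer.
sub open_mode {
    my (%opt) = @_;
    my $mode = $opt{writer} ? DBWRITER : DBREADER;
    if ($opt{writer}) {
        $mode |= DBCREAT if $opt{create};
        $mode |= DBTRUNC if $opt{truncate};
    }
    if ($opt{no_lock}) {
        $mode |= DBNOLCK;    # the caller takes over exclusion control
    } elsif ($opt{non_blocking}) {
        $mode |= DBLCKNB;
    }
    return $mode;
}

printf("writer, creating: %d\n", open_mode(writer => 1, create => 1));
printf("reader, no block: %d\n", open_mode(non_blocking => 1));

# With the binding loaded, the same mode would be written directly as:
#   $db->open("casket", Database::DBWRITER | Database::DBCREAT);
```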

$db->close()
    Close the database. The return value is true on success, else it is
    false.

$db->error()
    Get the code of the last error. The return value is the code of the
    last error that occurred.

$db->fatal()
    Check whether the database has a fatal error. The return value is
    true if the database has a fatal error, else it is false.

$db->add_attr_index(name, type)
    Add an index for narrowing or sorting with document attributes.
    `name' specifies the name of an attribute. `type' specifies the data
    type of the attribute index: `Database::IDXATTRSEQ' for the
    multipurpose sequential access method, `Database::IDXATTRSTR' for
    narrowing with attributes as strings, `Database::IDXATTRNUM' for
    narrowing with attributes as numbers. The return value is true on
    success, else it is false.

$db->flush(max)
    Flush index words in the cache. `max' specifies the maximum number of
    words to be flushed. If it is not more than zero, all words are
    flushed. The return value is true on success, else it is false.

$db->sync()
    Synchronize updating contents. The return value is true on success,
    else it is false.

$db->optimize(options)
    Optimize the database. `options' specifies options:
    `Database::OPTNOPURGE' to omit purging dispensable regions of deleted
    documents, `Database::OPTNODBOPT' to omit optimization of the
    database files. The two can be specified at the same time by bitwise
    or. The return value is true on success, else it is false.

$db->merge(name, options)
    Merge another database. `name' specifies the name of another database
    directory. `options' specifies options: `Database::MGCLEAN' to clean
    up dispensable regions of the deleted document. The return value is
    true on success, else it is false.

$db->put_doc(doc, options)
    Add a document. `doc' specifies a document object. The document
    object should have the URI attribute. `options' specifies options:
    `Database::PDCLEAN' to clean up dispensable regions of the
    overwritten document. The return value is true on success, else it is
    false.

$db->out_doc(id, options)
    Remove a document. `id' specifies the ID number of a registered
    document. `options' specifies options: `Database::ODCLEAN' to clean
    up dispensable regions of the deleted document. The return value is
    true on success, else it is false.

$db->edit_doc(doc)
    Edit attributes of a document. `doc' specifies a document object. The
    return value is true on success, else it is false.

$db->get_doc(id, options)
    Retrieve a document. `id' specifies the ID number of a registered
    document. `options' specifies options: `Database::GDNOATTR' to ignore
    attributes, `Database::GDNOTEXT' to ignore the body text,
    `Database::GDNOKWD' to ignore keywords. The three can be specified at
    the same time by bitwise or. The return value is a document object.
    On error, `undef' is returned.

$db->get_doc_attr(id, name)
    Retrieve the value of an attribute of a document. `id' specifies the
    ID number of a registered document. `name' specifies the name of an
    attribute. The return value is the value of the attribute, or `undef'
    if it does not exist.

$db->uri_to_id(uri)
    Get the ID of a document specified by URI. `uri' specifies the URI of
    a registered document. The return value is the ID of the document. On
    error, -1 is returned.

$db->name()
    Get the name. The return value is the name of the database.

$db->doc_num()
    Get the number of documents. The return value is the number of
    documents in the database.

$db->word_num()
    Get the number of unique words. The return value is the number of
    unique words in the database.

$db->size()
    Get the size. The return value is the size of the database.

$db->search(cond)
    Search for documents corresponding to a condition. `cond' specifies a
    condition object. The return value is a result object. On error,
    `undef' is returned.

$db->scan_doc(doc, cond)
    Check whether a document object definitely matches the phrase of a
    search condition object. `doc' specifies a document object. `cond'
    specifies a search condition object. The return value is true if the
    document definitely matches the phrase of the condition object, else
    it is false.

$db->set_cache_size(size, anum, tnum, rnum)
    Set the maximum size of the cache memory. `size' specifies the
    maximum size of the index cache. By default, it is 64MB. If it is not
    more than 0, the current size is not changed. `anum' specifies the
    maximum number of cached records for document attributes. By default,
    it is 8192. If it is not more than 0, the current size is not
    changed. `tnum' specifies the maximum number of cached records for
    document texts. By default, it is 1024. If it is not more than 0, the
    current size is not changed. `rnum' specifies the maximum number of
    cached records for occurrence results. By default, it is 256. If it
    is not more than 0, the current size is not changed. The return value
    is always `undef'.

$db->add_pseudo_index(path)
    Add a pseudo index directory. `path' specifies the path of a pseudo
    index directory. The return value is true on success, else it is
    false.

$db->set_wildmax(num)
    Set the maximum number of expansions of wild cards. `num' specifies
    the maximum number of expansions of wild cards. The return value is
    always `undef'.

$db->set_informer(informer)
    Set the callback function to inform of database events. `informer'
    specifies the name of an arbitrary function. The function should have
    one parameter for a string of a message of each event. The return
    value is always `undef'.
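
Because the callback is passed by name rather than as a code reference, it
must be a named function. The sketch below shows such a function and
imitates the lookup by name so that it runs without the binding. The
package-qualified name "main::informer" is an assumption about how the
name is resolved, and `@events' exists only for this illustration.

```perl
use strict;
use warnings;

our @events;    # collected messages, kept only for this illustration

# Callback with one parameter: the message of each database event.
sub informer {
    my ($message) = @_;
    push(@events, $message);
    print STDERR "estraier: $message\n";
}

# With the binding loaded, the callback would be registered by name
# (the "main::" prefix is an assumption, see the paragraph above):
#   $db->set_informer("main::informer");

# Stand-in for the library: look the function up by name and call it once.
my $name     = "main::informer";
my $callback = \&{$name};
$callback->("example event message");
```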

Gatherer

The following is the simplest implementation of a gatherer.

    use strict;
    use warnings;
    use Estraier;
    $Estraier::DEBUG = 1;

    # create the database object
    my $db = new Database();

    # open the database
    unless($db->open("casket", Database::DBWRITER | Database::DBCREAT)){
        printf("error: %s\n", $db->err_msg($db->error()));
        exit;
    }

    # create a document object
    my $doc = new Document();

    # add attributes to the document object
    $doc->add_attr('@uri', "https://estraier.gov/example.txt");
    $doc->add_attr('@title', "Over the Rainbow");

    # add the body text to the document object
    $doc->add_text("Somewhere over the rainbow.  Way up high.");
    $doc->add_text("There's a land that I heard of once in a lullaby.");

    # register the document object to the database
    unless($db->put_doc($doc, Database::PDCLEAN)){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

    # close the database
    unless($db->close()){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

Searcher

The following is the simplest implementation of a searcher.

    use strict;
    use warnings;
    use Estraier;
    $Estraier::DEBUG = 1;

    # create the database object
    my $db = new Database();

    # open the database
    unless($db->open("casket", Database::DBREADER)){
        printf("error: %s\n", $db->err_msg($db->error()));
        exit;
    }

    # create a search condition object
    my $cond = new Condition();

    # set the search phrase to the search condition object
    $cond->set_phrase("rainbow AND lullaby");

    # get the result of search; `undef' is returned on error
    my $result = $db->search($cond);
    unless(defined($result)){
        printf("error: %s\n", $db->err_msg($db->error()));
        $db->close();
        exit;
    }

    # for each document in the result
    my $dnum = $result->doc_num();
    foreach my $i (0..$dnum-1){
        # retrieve the document object
        my $doc = $db->get_doc($result->get_doc_id($i), 0);
        next unless(defined($doc));
        # display attributes
        my $uri = $doc->attr('@uri');
        printf("URI: %s\n", $uri) if defined($uri);
        my $title = $doc->attr('@title');
        printf("Title: %s\n", $title) if defined($title);
        # display the body text
        my $texts = $doc->texts();
        foreach my $text (@$texts){
            printf("%s\n", $text);
        }
    }

    # close the database
    unless($db->close()){
        printf("error: %s\n", $db->err_msg($db->error()));
    }

Copyright (C) 2004-2007 Mikio Hirabayashi
All rights reserved.

Hyper Estraier is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License or any
later version. Hyper Estraier is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
General Public License for more details. You should have received a copy
of the GNU Lesser General Public License along with Hyper Estraier; if
not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite
330, Boston, MA 02111-1307 USA.

perl v5.8.8                       2007-02-20                      Estraier(3)