Estraier(3)           User Contributed Perl Documentation          Estraier(3)

NAME

       Perl Binding of Hyper Estraier

SYNOPSIS

         use Estraier;

INTRODUCTION

       Hyper Estraier is a full-text search system for communities.

       This is a package implementing the core API of Hyper Estraier (
       http://hyperestraier.sourceforge.net/ ), including native code written
       in C with XS macros.  As it works on Linux, Mac OS X, Windows, and so
       on, native libraries for each environment are required to run programs.
       This package requires Perl 5.8.8 or later.

       Setting

       Install the latest version of Hyper Estraier.

       Enter the subdirectory `perlnative' in the extracted package, then
       perform the installation.

         cd perlnative
         ./configure
         make
         su
         make install

       On Linux and other UNIX systems: set the environment variable
       LD_LIBRARY_PATH so that the library "libestraier.so" can be found.  On
       Mac OS X: set the environment variable DYLD_LIBRARY_PATH so that the
       library "libestraier.dylib" can be found.  On Windows: set the
       environment variable PATH so that the library "estraier.dll" can be
       found.

       The package `Estraier' should be loaded in each source file of
       application programs.

         use Estraier;

       If you want to enable runtime assertions, set the variable
       `$Estraier::DEBUG' to true.

         $Estraier::DEBUG = 1;

DESCRIPTION

       Class Document

       $doc = new Document(draft)
           Create a document object.  `draft' specifies a string of draft
           data.  If it is omitted, an empty document object is created.

       $doc->add_attr(name, value)
           Add an attribute.  `name' specifies the name of an attribute.
           `value' specifies the value of the attribute.  If it is `undef',
           the attribute is removed.  The return value is always `undef'.

       $doc->add_text(text)
           Add a sentence of text.  `text' specifies a sentence of text.  The
           return value is always `undef'.

       $doc->add_hidden_text(text)
           Add a hidden sentence.  `text' specifies a hidden sentence.  The
           return value is always `undef'.

       $doc->set_keywords(kwords)
           Attach keywords.  `kwords' specifies a reference to a hash of
           keywords.  Keys of the hash should be keywords of the document
           and values should be their scores as decimal strings.  The return
           value is always `undef'.
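
           The following is a minimal sketch of preparing the hash that
           `set_keywords' expects.  The helper `keyword_scores' and its
           frequency-based scoring are illustrative assumptions and not part
           of the binding; only the shape of the hash (keyword as key,
           decimal string as value) comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: count word occurrences and return a hash reference
# whose values are decimal strings, the shape set_keywords() expects.
sub keyword_scores {
    my ($text, $limit) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /(\w+)/g;
    # keep the $limit most frequent words, breaking ties alphabetically
    my @top = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice(@top, $limit) if @top > $limit;
    return { map { $_ => sprintf("%d", $count{$_}) } @top };
}

my $kwords = keyword_scores("rainbow over the rainbow", 2);
printf("%s=%s\n", $_, $kwords->{$_}) foreach sort keys %$kwords;
# with the binding loaded: $doc->set_keywords($kwords);
```

           Any other scoring scheme would do as long as the values are
           passed as decimal strings.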

       $doc->set_score(score)
           Set the substitute score.  `score' specifies the substitute score.
           It should be zero or positive.  The return value is always `undef'.

       $doc->id()
           Get the ID number.  The return value is the ID number of the
           document object.  If the object has never been registered, -1 is
           returned.

       $doc->attr_names()
           Get an array of attribute names of a document object.  The return
           value is a reference to an array of attribute names.

       $doc->attr(name)
           Get the value of an attribute.  `name' specifies the name of an
           attribute.  The return value is the value of the attribute or
           `undef' if it does not exist.

       $doc->texts()
           Get an array of sentences of the text.  The return value is a
           reference to an array of sentences of the text.

       $doc->cat_texts()
           Concatenate sentences of the text of a document object.  The return
           value is the concatenated sentences.

       $doc->keywords()
           Get attached keywords.  The return value is a reference to a hash
           of keywords and their scores as decimal strings.  If no keyword is
           attached, `undef' is returned.

       $doc->score()
           Get the substitute score.  The return value is the substitute score
           or -1 if it is not set.

       $doc->dump_draft()
           Dump draft data of a document object.  The return value is draft
           data.

       $doc->make_snippet(words, wwidth, hwidth, awidth)
           Make a snippet of the body text.  `words' specifies a reference to
           an array of words to be highlighted.  `wwidth' specifies the whole
           width of the result.  `hwidth' specifies the width of strings
           picked up from the beginning of the text.  `awidth' specifies the
           width of strings picked up around each highlighted word.  The
           return value is a snippet string of the body text.  It consists of
           lines of tab separated values.  Each line is a string to be shown.
           Though most lines have only one field, some lines have two fields.
           If the second field exists, the first field is to be shown
           highlighted, and the second field is its normalized form.
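
           The snippet format can be consumed as in the following sketch.
           The helper `render_snippet', its marker arguments, and the sample
           string are hypothetical.  Treating an empty line as a gap between
           excerpts is an assumption that the description above does not
           state.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the tab separated snippet into plain text,
# wrapping highlighted fragments in the given markers.
sub render_snippet {
    my ($snippet, $open, $close) = @_;
    my $out = "";
    foreach my $line (split(/\n/, $snippet)) {
        if ($line eq "") {
            # assumption: an empty line marks a gap between excerpts
            $out .= " ... ";
            next;
        }
        my ($shown, $normalized) = split(/\t/, $line, 2);
        # a second field means the first field is a highlighted word
        $out .= defined($normalized) ? $open . $shown . $close : $shown;
    }
    return $out;
}

# made-up sample in the documented format; with the binding loaded the
# string would come from $doc->make_snippet(["over", "rainbow"], 120, 40, 40)
my $sample = "Somewhere \nover\tover\n the \nrainbow\trainbow\n.\n";
print render_snippet($sample, "[", "]"), "\n";
```

           For HTML output the markers could be a pair of tags, provided
           that the shown strings are escaped first.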

       Class Condition

       Condition::SURE = 1 << 0
           option: check every N-gram key

       Condition::USUAL = 1 << 1
           option: check N-gram keys skipping by one

       Condition::FAST = 1 << 2
           option: check N-gram keys skipping by two

       Condition::AGITO = 1 << 3
           option: check N-gram keys skipping by three

       Condition::NOIDF = 1 << 4
           option: without TF-IDF tuning

       Condition::SIMPLE = 1 << 10
           option: with the simplified phrase

       Condition::ROUGH = 1 << 11
           option: with the rough phrase

       Condition::UNION = 1 << 15
           option: with the union phrase

       Condition::ISECT = 1 << 16
           option: with the intersection phrase

       Condition::ECLSIMURL = 10.0
           eclipse tuning: consider URL

       Condition::ECLSERV = 100.0
           eclipse tuning: on server basis

       Condition::ECLDIR = 101.0
           eclipse tuning: on directory basis

       Condition::ECLFILE = 102.0
           eclipse tuning: on file basis

       $cond = new Condition()
           Create a search condition object.

       $cond->set_phrase(phrase)
           Set the search phrase.  `phrase' specifies a search phrase.  The
           return value is always `undef'.

       $cond->add_attr(expr)
           Add an expression for an attribute.  `expr' specifies an expression
           for an attribute.  The return value is always `undef'.

       $cond->set_order(expr)
           Set the order of a condition object.  `expr' specifies an
           expression for the order.  By default, the order is by score
           descending.  The return value is always `undef'.

       $cond->set_max(max)
           Set the maximum number of retrieved documents.  `max' specifies
           the maximum number of documents to retrieve.  By default, the
           number is not limited.

       $cond->set_skip(skip)
           Set the number of skipped documents.  `skip' specifies the number
           of documents to be skipped in the search result.  The return value
           is always `undef'.

       $cond->set_options(options)
           Set options of retrieval.  `options' specifies options:
           `Condition::SURE' specifies that it checks every N-gram key,
           `Condition::USUAL', which is the default, specifies that it checks
           N-gram keys skipping one key at a time, `Condition::FAST' skips
           two keys, `Condition::AGITO' skips three keys, `Condition::NOIDF'
           specifies not to perform TF-IDF tuning, `Condition::SIMPLE'
           specifies to use the simplified phrase, `Condition::ROUGH'
           specifies to use the rough phrase, `Condition::UNION' specifies to
           use the union phrase, `Condition::ISECT' specifies to use the
           intersection phrase.  Options can be combined by bitwise or.  If
           keys are skipped, search speed improves but the relevance ratio
           decreases.  The return value is always `undef'.
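
           Combining options by bitwise or can be sketched as follows.  The
           hash `%OPTION' is a local stand-in that copies the values from the
           constant list above so that the sketch runs without the native
           library; `option_names' is a hypothetical helper.  With the
           binding loaded, `Condition::SURE' and the other constants would be
           used directly.

```perl
use strict;
use warnings;

# Local copies of the documented option values (stand-ins for Condition::*).
my %OPTION = (
    SURE   => 1 << 0,  USUAL => 1 << 1,  FAST  => 1 << 2,
    AGITO  => 1 << 3,  NOIDF => 1 << 4,
    SIMPLE => 1 << 10, ROUGH => 1 << 11,
    UNION  => 1 << 15, ISECT => 1 << 16,
);

# Hypothetical helper: list the option names whose bit is set in a mask,
# ordered by bit position.
sub option_names {
    my ($mask) = @_;
    return grep { $mask & $OPTION{$_} }
           sort { $OPTION{$a} <=> $OPTION{$b} } keys %OPTION;
}

# combine two options with bitwise or
my $options = $OPTION{SURE} | $OPTION{SIMPLE};
printf("%d = %s\n", $options, join(" | ", option_names($options)));
# with a condition object: $cond->set_options($options);
```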

       $cond->set_auxiliary(min)
           Set permission to adopt the result of the auxiliary index.  `min'
           specifies the minimum number of hits required to adopt the result
           of the auxiliary index.  If it is not more than 0, the auxiliary
           index is not used.  By default, it is 32.

       $cond->set_eclipse(limit)
           Set the lower limit of similarity eclipse.  `limit' specifies the
           lower limit of similarity for documents to be eclipsed.  Similarity
           is between 0.0 and 1.0.  If `Condition::ECLSIMURL' is added to the
           limit, similarity is weighted by URL.  If the limit is
           `Condition::ECLSERV', similarity is ignored and documents on the
           same server are eclipsed.  If the limit is `Condition::ECLDIR',
           similarity is ignored and documents in the same directory are
           eclipsed.  If the limit is `Condition::ECLFILE', similarity is
           ignored and documents of the same file are eclipsed.
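
           The addition of `Condition::ECLSIMURL' to a similarity limit can
           be sketched as below.  `ECLSIMURL' here is a local stand-in
           carrying the documented value 10.0, and `eclipse_limit' is a
           hypothetical helper; the range check is an assumption drawn from
           the statement that similarity is between 0.0 and 1.0.

```perl
use strict;
use warnings;

# Value copied from the constant list above; with the binding loaded,
# Condition::ECLSIMURL would be used instead of this local stand-in.
use constant ECLSIMURL => 10.0;

# Hypothetical helper: build the argument for set_eclipse() from a
# similarity threshold and a flag asking for URL weighting.
sub eclipse_limit {
    my ($similarity, $by_url) = @_;
    die "similarity must be between 0.0 and 1.0\n"
        if $similarity < 0.0 || $similarity > 1.0;
    $similarity += ECLSIMURL if $by_url;
    return $similarity;
}

printf("%.2f\n", eclipse_limit(0.75, 1));
# with a condition object: $cond->set_eclipse(eclipse_limit(0.75, 1));
```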

       $cond->set_distinct(name)
           Set the attribute distinction filter.  `name' specifies the name of
           an attribute to be distinct.  The return value is always `undef'.

       Class Result

       $result->doc_num()
           Get the number of documents.  The return value is the number of
           documents in the result.

       $result->get_doc_id(index)
           Get the ID number of a document.  `index' specifies the index of a
           document.  The return value is the ID number of the document or -1
           if the index is out of bounds.

       $result->get_dbidx(index)
           Get the index of the container database of a document.  `index'
           specifies the index of a document.  The return value is the index
           of the container database of the document or -1 if the index is out
           of bounds.

       $result->hint_words()
           Get an array of hint words.  The return value is a reference to an
           array of hint words.

       $result->hint(word)
           Get the value of a hint word.  `word' specifies a hint word.  An
           empty string means the number of documents in the whole result.
           The return value is the number of documents corresponding to the
           hint word.  If the word is in a negative condition, the value is
           negative.
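
           A sketch of reporting hints, using only `hint_words' and `hint' as
           described above.  `hint_report' is a hypothetical helper, and
           `StubResult' is a made-up stand-in for a real result object so
           that the sketch runs without the native library.  Whether
           `hint_words' itself lists the empty string is not stated above, so
           the sketch skips it if present.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise the hints of a result object.
sub hint_report {
    my ($result) = @_;
    # the empty string asks for the number of documents in the whole result
    my @lines = (sprintf("total: %d", $result->hint("")));
    foreach my $word (sort @{ $result->hint_words() }) {
        next if $word eq "";    # already reported as the total
        my $num = $result->hint($word);
        my $note = $num < 0 ? " (negative condition)" : "";
        push @lines, sprintf("%s: %d%s", $word, $num, $note);
    }
    return @lines;
}

# Stand-in for a real result object; the package name and numbers are made up.
{
    package StubResult;
    sub new {
        my ($class, %hints) = @_;
        my $self = { %hints };
        return bless($self, $class);
    }
    sub hint_words { my ($self) = @_; return [ keys %$self ] }
    sub hint       { my ($self, $word) = @_; return $self->{$word} }
}

my $stub = StubResult->new("" => 3, rainbow => 5, storm => -2);
print "$_\n" foreach hint_report($stub);
```

           With the binding loaded, the object returned by `$db->search'
           would take the place of the stub.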

       $result->get_score(index)
           Get the score of a document.  `index' specifies the index of a
           document.  The return value is the score of the document or -1 if
           the index is out of bounds.

       $result->get_shadows(id)
           Get an array of ID numbers of eclipsed documents of a document.
           `id' specifies the ID number of a parent document.  The return
           value is a reference to an array whose elements express the ID
           numbers and their scores alternately.
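
           Because IDs and scores alternate in one flat array, a caller may
           want to pair them up.  The following is a minimal sketch;
           `shadow_pairs' is a hypothetical helper and the sample array is
           made up.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the flat (id, score, id, score, ...) array
# into a list of { id, score } records.  A trailing unpaired element is
# ignored.
sub shadow_pairs {
    my ($shadows) = @_;
    my @pairs;
    for (my $i = 0; $i + 1 < @$shadows; $i += 2) {
        push @pairs, { id => $shadows->[$i], score => $shadows->[$i + 1] };
    }
    return \@pairs;
}

# made-up data; with the binding loaded it would come from
# $result->get_shadows($id)
my $pairs = shadow_pairs([12, 8400, 15, 7900]);
printf("id=%d score=%d\n", $_->{id}, $_->{score}) foreach @$pairs;
```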

       Class Database

       Database::VERSION = "0.0.0"
           version of Hyper Estraier

       Database::ERRNOERR = 0
           error code: no error

       Database::ERRINVAL = 1
           error code: invalid argument

       Database::ERRACCES = 2
           error code: access forbidden

       Database::ERRLOCK = 3
           error code: lock failure

       Database::ERRDB = 4
           error code: database problem

       Database::ERRIO = 5
           error code: I/O problem

       Database::ERRNOITEM = 6
           error code: no item

       Database::ERRMISC = 9999
           error code: miscellaneous

       Database::DBREADER = 1 << 0
           open mode: open as a reader

       Database::DBWRITER = 1 << 1
           open mode: open as a writer

       Database::DBCREAT = 1 << 2
           open mode: a writer creating

       Database::DBTRUNC = 1 << 3
           open mode: a writer truncating

       Database::DBNOLCK = 1 << 4
           open mode: open without locking

       Database::DBLCKNB = 1 << 5
           open mode: lock without blocking

       Database::DBPERFNG = 1 << 10
           open mode: use perfect N-gram analyzer

       Database::DBCHRCAT = 1 << 11
           open mode: use character category analyzer

       Database::DBSMALL = 1 << 20
           open mode: small tuning

       Database::DBLARGE = 1 << 21
           open mode: large tuning

       Database::DBHUGE = 1 << 22
           open mode: huge tuning

       Database::DBHUGE2 = 1 << 23
           open mode: huge tuning second

       Database::DBHUGE3 = 1 << 24
           open mode: huge tuning third

       Database::DBSCVOID = 1 << 25
           open mode: store scores as void

       Database::DBSCINT = 1 << 26
           open mode: store scores as integer

       Database::DBSCASIS = 1 << 27
           open mode: refrain from adjustment of scores

       Database::IDXATTRSEQ = 0
           attribute index type: for multipurpose sequential access method

       Database::IDXATTRSTR = 1
           attribute index type: for narrowing with attributes as strings

       Database::IDXATTRNUM = 2
           attribute index type: for narrowing with attributes as numbers

       Database::OPTNOPURGE = 1 << 0
           optimize option: omit purging dispensable region of deleted
           documents

       Database::OPTNODBOPT = 1 << 1
           optimize option: omit optimization of the database files

       Database::MGCLEAN = 1 << 0
           merge option: clean up dispensable regions

       Database::PDCLEAN = 1 << 0
           put_doc option: clean up dispensable regions

       Database::PDWEIGHT = 1 << 1
           put_doc option: weight scores statically when indexing

       Database::ODCLEAN = 1 << 0
           out_doc option: clean up dispensable regions

       Database::GDNOATTR = 1 << 0
           get_doc option: no attributes

       Database::GDNOTEXT = 1 << 1
           get_doc option: no text

       Database::GDNOKWD = 1 << 2
           get_doc option: no keywords

       $db = new Database()
           Create a database object.

       Database::search_meta(dbs, cond)
           Search multiple databases for documents matching a condition.
           `dbs' specifies a reference to an array whose elements are database
           objects.  `cond' specifies a condition object.  The return value is
           a result object.  On error, `undef' is returned.

       $db->err_msg(ecode)
           Get the string of an error code.  `ecode' specifies an error code.
           The return value is the string of the error code.

       $db->open(name, omode)
           Open a database.  `name' specifies the name of a database
           directory.  `omode' specifies open modes: `Database::DBWRITER' as a
           writer, `Database::DBREADER' as a reader.  If the mode is
           `Database::DBWRITER', the following may be added by bitwise or:
           `Database::DBCREAT', which means it creates a new database if one
           does not exist, `Database::DBTRUNC', which means it creates a new
           database regardless of whether one exists.  Both
           `Database::DBREADER' and `Database::DBWRITER' can be combined by
           bitwise or with `Database::DBNOLCK', which means it opens a
           database file without file locking, or `Database::DBLCKNB', which
           means locking is performed without blocking.  If
           `Database::DBNOLCK' is used, the application is responsible for
           exclusion control.  `Database::DBCREAT' can be combined by bitwise
           or with: `Database::DBPERFNG', which means N-gram analysis is
           performed against European text also, `Database::DBCHRCAT', which
           means character category analysis is performed instead of N-gram
           analysis, `Database::DBSMALL', which means the index is tuned to
           register less than 50000 documents, `Database::DBLARGE', which
           means the index is tuned to register more than 300000 documents,
           `Database::DBHUGE', which means the index is tuned to register
           more than 1000000 documents, `Database::DBHUGE2', which means the
           index is tuned to register more than 5000000 documents,
           `Database::DBHUGE3', which means the index is tuned to register
           more than 10000000 documents, `Database::DBSCVOID', which means
           scores are stored as void, `Database::DBSCINT', which means scores
           are stored as 32-bit integers, `Database::DBSCASIS', which means
           scores are stored as-is and marked not to be tuned when searching.
           The return value is true on success, or false on failure.
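
           The rules for combining open modes can be checked before calling
           `open', as in the sketch below.  `%MODE' is a local stand-in
           copying the documented values, and `check_open_mode' is a
           hypothetical helper that only mirrors the rules stated above; the
           library itself may accept or reject other combinations.

```perl
use strict;
use warnings;

# Local copies of the documented open-mode values (stand-ins for Database::*).
my %MODE = (
    DBREADER => 1 << 0, DBWRITER => 1 << 1, DBCREAT => 1 << 2,
    DBTRUNC  => 1 << 3, DBNOLCK  => 1 << 4, DBLCKNB => 1 << 5,
);

# Hypothetical check: return an empty string if the mode follows the
# documented rules, or a short complaint otherwise.
sub check_open_mode {
    my ($omode) = @_;
    my $writer = $omode & $MODE{DBWRITER};
    my $reader = $omode & $MODE{DBREADER};
    return "choose DBREADER or DBWRITER" unless $writer || $reader;
    # DBCREAT and DBTRUNC are documented as additions to DBWRITER only
    return "DBCREAT and DBTRUNC need DBWRITER"
        if !$writer && ($omode & ($MODE{DBCREAT} | $MODE{DBTRUNC}));
    return "";
}

# a writer that creates the database if needed and does not block on locks
my $omode = $MODE{DBWRITER} | $MODE{DBCREAT} | $MODE{DBLCKNB};
my $complaint = check_open_mode($omode);
print $complaint eq "" ? "mode $omode looks consistent\n" : "$complaint\n";
# with the binding loaded: $db->open("casket", $omode);
```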

       $db->close()
           Close the database.  The return value is true on success, or false
           on failure.

       $db->error()
           Get the code of the last error.  The return value is the code of
           the last error.

       $db->fatal()
           Check whether the database has a fatal error.  The return value is
           true if the database has a fatal error, else it is false.

       $db->add_attr_index(name, type)
           Add an index for narrowing or sorting with document attributes.
           `name' specifies the name of an attribute.  `type' specifies the
           data type of the attribute index: `Database::IDXATTRSEQ' for
           multipurpose sequential access method, `Database::IDXATTRSTR' for
           narrowing with attributes as strings, `Database::IDXATTRNUM' for
           narrowing with attributes as numbers.  The return value is true on
           success, or false on failure.

       $db->flush(max)
           Flush index words in the cache.  `max' specifies the maximum number
           of words to be flushed.  If it is not more than zero, all words are
           flushed.  The return value is true on success, or false on failure.

       $db->sync()
           Synchronize updating contents.  The return value is true on
           success, or false on failure.

       $db->optimize(options)
           Optimize the database.  `options' specifies options:
           `Database::OPTNOPURGE' to omit purging dispensable region of
           deleted documents, `Database::OPTNODBOPT' to omit optimization of
           the database files.  The two can be specified at the same time by
           bitwise or.  The return value is true on success, or false on
           failure.

       $db->merge(name, options)
           Merge another database.  `name' specifies the name of another
           database directory.  `options' specifies options:
           `Database::MGCLEAN' to clean up dispensable regions of the deleted
           document.  The return value is true on success, or false on
           failure.

       $db->put_doc(doc, options)
           Add a document.  `doc' specifies a document object.  The document
           object should have the URI attribute.  `options' specifies options:
           `Database::PDCLEAN' to clean up dispensable regions of the
           overwritten document.  The return value is true on success, or
           false on failure.

       $db->out_doc(id, options)
           Remove a document.  `id' specifies the ID number of a registered
           document.  `options' specifies options: `Database::ODCLEAN' to
           clean up dispensable regions of the deleted document.  The return
           value is true on success, or false on failure.

       $db->edit_doc(doc)
           Edit attributes of a document.  `doc' specifies a document object.
           The return value is true on success, or false on failure.

       $db->get_doc(id, options)
           Retrieve a document.  `id' specifies the ID number of a registered
           document.  `options' specifies options: `Database::GDNOATTR' to
           ignore attributes, `Database::GDNOTEXT' to ignore the body text,
           `Database::GDNOKWD' to ignore keywords.  The three can be specified
           at the same time by bitwise or.  The return value is a document
           object.  On error, `undef' is returned.

       $db->get_doc_attr(id, name)
           Retrieve the value of an attribute of a document.  `id' specifies
           the ID number of a registered document.  `name' specifies the name
           of an attribute.  The return value is the value of the attribute or
           `undef' if it does not exist.

       $db->uri_to_id(uri)
           Get the ID of a document specified by URI.  `uri' specifies the URI
           of a registered document.  The return value is the ID of the
           document.  On error, -1 is returned.

       $db->name()
           Get the name.  The return value is the name of the database.

       $db->doc_num()
           Get the number of documents.  The return value is the number of
           documents in the database.

       $db->word_num()
           Get the number of unique words.  The return value is the number of
           unique words in the database.

       $db->size()
           Get the size.  The return value is the size of the database.

       $db->search(cond)
           Search for documents matching a condition.  `cond' specifies a
           condition object.  The return value is a result object.  On error,
           `undef' is returned.

       $db->scan_doc(doc, cond)
           Check whether a document object definitely matches the phrase of a
           search condition object.  `doc' specifies a document object.
           `cond' specifies a search condition object.  The return value is
           true if the document definitely matches the phrase of the
           condition object, else it is false.

       $db->set_cache_size(size, anum, tnum, rnum)
           Set the maximum size of the cache memory.  `size' specifies the
           maximum size of the index cache.  By default, it is 64MB.  If it is
           not more than 0, the current size is not changed.  `anum' specifies
           the maximum number of cached records for document attributes.  By
           default, it is 8192.  If it is not more than 0, the current size is
           not changed.  `tnum' specifies the maximum number of cached records
           for document texts.  By default, it is 1024.  If it is not more
           than 0, the current size is not changed.  `rnum' specifies the
           maximum number of cached records for occurrence results.  By
           default, it is 256.  If it is not more than 0, the current size is
           not changed.  The return value is always `undef'.

       $db->add_pseudo_index(path)
           Add a pseudo index directory.  `path' specifies the path of a
           pseudo index directory.  The return value is true on success, or
           false on failure.

       $db->set_wildmax(num)
           Set the maximum number of expansions of wild cards.  `num'
           specifies the maximum number of expansions of wild cards.  The
           return value is always `undef'.

       $db->set_informer(informer)
           Set the callback function to inform of database events.  `informer'
           specifies the name of an arbitrary function.  The function should
           have one parameter for a string of a message of each event.  The
           return value is always `undef'.

EXAMPLE

       Gatherer

       The following is the simplest implementation of a gatherer.

         use strict;
         use warnings;
         use Estraier;
         $Estraier::DEBUG = 1;

         # create the database object
         my $db = new Database();

         # open the database
         unless($db->open("casket", Database::DBWRITER | Database::DBCREAT)){
             printf("error: %s\n", $db->err_msg($db->error()));
             exit;
         }

         # create a document object
         my $doc = new Document();

         # add attributes to the document object
         $doc->add_attr('@uri', "https://estraier.gov/example.txt");
         $doc->add_attr('@title', "Over the Rainbow");

         # add the body text to the document object
         $doc->add_text("Somewhere over the rainbow.  Way up high.");
         $doc->add_text("There's a land that I heard of once in a lullaby.");

         # register the document object to the database
         unless($db->put_doc($doc, Database::PDCLEAN)){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

         # close the database
         unless($db->close()){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

       Searcher

       The following is the simplest implementation of a searcher.

         use strict;
         use warnings;
         use Estraier;
         $Estraier::DEBUG = 1;

         # create the database object
         my $db = new Database();

         # open the database
         unless($db->open("casket", Database::DBREADER)){
             printf("error: %s\n", $db->err_msg($db->error()));
             exit;
         }

         # create a search condition object
         my $cond = new Condition();

         # set the search phrase to the search condition object
         $cond->set_phrase("rainbow AND lullaby");

         # get the result of search
         my $result = $db->search($cond);

         # for each document in the result
         my $dnum = $result->doc_num();
         foreach my $i (0..$dnum-1){
             # retrieve the document object
             my $doc = $db->get_doc($result->get_doc_id($i), 0);
             next unless(defined($doc));
             # display attributes
             my $uri = $doc->attr('@uri');
             printf("URI: %s\n", $uri) if defined($uri);
             my $title = $doc->attr('@title');
             printf("Title: %s\n", $title) if defined($title);
             # display the body text
             my $texts = $doc->texts();
             foreach my $text (@$texts){
                 printf("%s\n", $text);
             }
         }

         # close the database
         unless($db->close()){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

LICENSE

        Copyright (C) 2004-2007 Mikio Hirabayashi
        All rights reserved.

       Hyper Estraier is free software; you can redistribute it and/or modify
       it under the terms of the GNU Lesser General Public License as
       published by the Free Software Foundation; either version 2.1 of the
       License or any later version.  Hyper Estraier is distributed in the
       hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
       implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.  See the GNU Lesser General Public License for more details.
       You should have received a copy of the GNU Lesser General Public
       License along with Hyper Estraier; if not, write to the Free Software
       Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
       USA.

perl v5.8.8                       2007-02-20                       Estraier(3)