Estraier(3)           User Contributed Perl Documentation          Estraier(3)

NAME

       Perl Binding of Hyper Estraier

SYNOPSIS

         use Estraier;

INTRODUCTION

       Hyper Estraier is a full-text search system for communities.

       This package implements the core API of Hyper Estraier (
       http://hyperestraier.sourceforge.net/ ), including native code written
       in C with XS macros.  It works on Linux, Mac OS X, Windows, and other
       platforms, but the native libraries for the target environment must be
       installed to run programs.  This package requires Perl 5.8.8 or later.

   Setting
       Install the latest version of Hyper Estraier.

       Enter the subdirectory `perlnative' of the extracted package, then
       perform the installation.

         cd perlnative
         ./configure
         make
         su
         make install

       On Linux and other UNIX systems, set the environment variable
       LD_LIBRARY_PATH so that the library "libestraier.so" can be found.  On
       Mac OS X, set the environment variable DYLD_LIBRARY_PATH so that
       "libestraier.dylib" can be found.  On Windows, set the environment
       variable PATH so that "estraier.dll" can be found.

       The package `Estraier' should be loaded in each source file of
       application programs.

         use Estraier;

       To enable runtime assertions, set the variable `$Estraier::DEBUG' to a
       true value.

         $Estraier::DEBUG = 1;

DESCRIPTION

   Class Document
       $doc = new Document(draft)
           Create a document object.  `draft' specifies a string of draft
           data.  If it is omitted, an empty document object is created.

       $doc->add_attr(name, value)
           Add an attribute.  `name' specifies the name of an attribute.
           `value' specifies the value of the attribute.  If it is `undef',
           the attribute is removed.  The return value is always `undef'.

       $doc->add_text(text)
           Add a sentence of text.  `text' specifies a sentence of text.  The
           return value is always `undef'.

       $doc->add_hidden_text(text)
           Add a hidden sentence.  `text' specifies a hidden sentence.  The
           return value is always `undef'.

       $doc->set_keywords(kwords)
           Attach keywords.  `kwords' specifies a reference to a hash of
           keywords.  Keys of the hash should be keywords of the document
           and values should be their scores as decimal strings.  The return
           value is always `undef'.
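
           The shape of the hash expected by set_keywords() can be sketched in
           plain Perl.  The helper `keyword_scores' below is hypothetical and
           not part of the binding; it simply counts words and stores each
           count as a decimal string.  Scoring by raw word count is an
           assumption made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Estraier): count word occurrences and
# return a hash reference in the shape set_keywords() expects, that is,
# keyword => score as a decimal string.
sub keyword_scores {
    my ($text) = @_;
    my %scores;
    foreach my $word (map { lc($_) } $text =~ /(\w+)/g) {
        $scores{$word}++;
    }
    # Stringify explicitly so that every value is a decimal string.
    $_ = sprintf("%d", $_) foreach values %scores;
    return \%scores;
}

my $kwords = keyword_scores("Rainbow over the rainbow");
printf("%s\t%s\n", $_, $kwords->{$_}) foreach sort keys %$kwords;

# With the binding loaded, the hash would be attached like this:
#   $doc->set_keywords($kwords);
```

           The same hash layout is what keywords() hands back, so a value
           built this way can be compared directly with a retrieved one.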

       $doc->set_score(score)
           Set the substitute score.  `score' specifies the substitute score.
           It should be zero or positive.  The return value is always `undef'.

       $doc->id()
           Get the ID number.  The return value is the ID number of the
           document object.  If the object has never been registered, -1 is
           returned.

       $doc->attr_names()
           Get an array of attribute names of a document object.  The return
           value is a reference to an array of attribute names.

       $doc->attr(name)
           Get the value of an attribute.  `name' specifies the name of an
           attribute.  The return value is the value of the attribute or
           `undef' if it does not exist.

       $doc->texts()
           Get an array of sentences of the text.  The return value is a
           reference to an array of sentences of the text.

       $doc->cat_texts()
           Concatenate sentences of the text of a document object.  The return
           value is the concatenated sentences.

       $doc->keywords()
           Get attached keywords.  The return value is a reference to a hash
           of keywords and their scores as decimal strings.  If no keyword is
           attached, `undef' is returned.

       $doc->score()
           Get the substitute score.  The return value is the substitute score
           or -1 if it is not set.

       $doc->dump_draft()
           Dump draft data of a document object.  The return value is draft
           data.

       $doc->make_snippet(words, wwidth, hwidth, awidth)
           Make a snippet of the body text.  `words' specifies a reference to
           an array of words to be highlighted.  `wwidth' specifies the whole
           width of the result.  `hwidth' specifies the width of strings
           picked up from the beginning of the text.  `awidth' specifies the
           width of strings picked up around each highlighted word.  The
           return value is a snippet string of the body text, formatted as
           tab separated values.  Each line is a string to be shown.  Though
           most lines have only one field, some lines have two fields.  If
           the second field exists, the first field is to be shown
           highlighted, and the second field is its normalized form.
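
           The snippet format described above can be consumed with a few lines
           of plain Perl.  The helper `render_snippet' is hypothetical and not
           part of the binding, and the sample string is hand-written in the
           documented layout rather than real library output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Estraier): turn a snippet string in the
# tab separated format described above into plain text, wrapping each
# highlighted string in the given opening and closing markers.
sub render_snippet {
    my ($snippet, $open, $close) = @_;
    my $out = "";
    foreach my $line (split(/\n/, $snippet)) {
        # Skip empty lines; this sketch does not interpret them.
        next if $line eq "";
        my ($shown, $normalized) = split(/\t/, $line, 2);
        # A second field marks the first field as a highlighted word.
        $out .= defined($normalized) ? $open . $shown . $close : $shown;
    }
    return $out;
}

# Hand-written sample input in the documented format.
my $sample = "Somewhere over the \nRainbow\trainbow\n.  Way up high.\n";
print render_snippet($sample, "[", "]"), "\n";

# With the binding loaded, the input would come from a document object:
#   my $snippet = $doc->make_snippet(["rainbow"], 120, 40, 40);
```

           The width arguments in the commented call are arbitrary values
           chosen for the sketch.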

   Class Condition
       Condition::SURE = 1 << 0
           option: check every N-gram key

       Condition::USUAL = 1 << 1
           option: check N-gram keys skipping by one

       Condition::FAST = 1 << 2
           option: check N-gram keys skipping by two

       Condition::AGITO = 1 << 3
           option: check N-gram keys skipping by three

       Condition::NOIDF = 1 << 4
           option: without TF-IDF tuning

       Condition::SIMPLE = 1 << 10
           option: with the simplified phrase

       Condition::ROUGH = 1 << 11
           option: with the rough phrase

       Condition::UNION = 1 << 15
           option: with the union phrase

       Condition::ISECT = 1 << 16
           option: with the intersection phrase

       Condition::ECLSIMURL = 10.0
           eclipse tuning: consider URL

       Condition::ECLSERV = 100.0
           eclipse tuning: on server basis

       Condition::ECLDIR = 101.0
           eclipse tuning: on directory basis

       Condition::ECLFILE = 102.0
           eclipse tuning: on file basis

       $cond = new Condition()
           Create a search condition object.

       $cond->set_phrase(phrase)
           Set the search phrase.  `phrase' specifies a search phrase.  The
           return value is always `undef'.

       $cond->add_attr(expr)
           Add an expression for an attribute.  `expr' specifies an expression
           for an attribute.  The return value is always `undef'.

       $cond->set_order(expr)
           Set the order of a condition object.  `expr' specifies an
           expression for the order.  By default, the order is by score
           descending.  The return value is always `undef'.

       $cond->set_max(max)
           Set the maximum number of documents to retrieve.  `max' specifies
           the maximum number of documents to retrieve.  By default, the
           number is not limited.

       $cond->set_skip(skip)
           Set the number of skipped documents.  `skip' specifies the number
           of documents to be skipped in the search result.  The return value
           is always `undef'.

       $cond->set_options(options)
           Set options of retrieval.  `options' specifies options:
           `Condition::SURE' specifies that it checks every N-gram key,
           `Condition::USUAL', which is the default, specifies that it checks
           N-gram keys skipping one key, `Condition::FAST' skips two keys,
           `Condition::AGITO' skips three keys, `Condition::NOIDF' specifies
           not to perform TF-IDF tuning, `Condition::SIMPLE' specifies to use
           the simplified phrase, `Condition::ROUGH' specifies to use the
           rough phrase, `Condition::UNION' specifies to use the union
           phrase, `Condition::ISECT' specifies to use the intersection
           phrase.  Options can be combined by bitwise or.  Skipping keys
           improves search speed but lowers the relevance ratio.  The return
           value is always `undef'.
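
           Combining options by bitwise or can be sketched as follows.  So
           that the sketch runs without the binding, it defines stand-in
           constants that mirror the values listed for this class; real
           programs would use `Condition::SURE' and the others directly.

```perl
use strict;
use warnings;

# Stand-in constants mirroring the values listed for Class Condition.
# They are local to this sketch and are not exported by the binding.
use constant {
    SURE   => 1 << 0,
    USUAL  => 1 << 1,
    FAST   => 1 << 2,
    AGITO  => 1 << 3,
    NOIDF  => 1 << 4,
    SIMPLE => 1 << 10,
};

# Combine a key-checking mode with independent switches by bitwise or.
my $options = SURE | NOIDF | SIMPLE;
printf("options = %d\n", $options);

# Test individual bits with bitwise and.
printf("SURE set: %s\n", ($options & SURE) ? "yes" : "no");
printf("FAST set: %s\n", ($options & FAST) ? "yes" : "no");

# With the binding loaded, the value would be passed like this:
#   $cond->set_options(Condition::SURE | Condition::NOIDF | Condition::SIMPLE);
```

           Because each option occupies its own bit, the order in which the
           constants are combined does not matter.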

       $cond->set_auxiliary(min)
           Set permission to adopt results of the auxiliary index.  `min'
           specifies the minimum number of hits required to adopt results of
           the auxiliary index.  If it is not more than 0, the auxiliary
           index is not used.  By default, it is 32.

       $cond->set_eclipse(limit)
           Set the lower limit of similarity eclipse.  `limit' specifies the
           lower limit of similarity for documents to be eclipsed.  Similarity
           is between 0.0 and 1.0.  If `Condition::ECLSIMURL' is added to the
           limit, similarity is weighted by URL.  If the limit is
           `Condition::ECLSERV', similarity is ignored and documents on the
           same server are eclipsed.  If the limit is `Condition::ECLDIR',
           similarity is ignored and documents in the same directory are
           eclipsed.  If the limit is `Condition::ECLFILE', similarity is
           ignored and documents of the same file are eclipsed.

       $cond->set_distinct(name)
           Set the attribute distinction filter.  `name' specifies the name of
           an attribute to be distinct.  The return value is always `undef'.

   Class Result
       $result->doc_num()
           Get the number of documents.  The return value is the number of
           documents in the result.

       $result->get_doc_id(index)
           Get the ID number of a document.  `index' specifies the index of a
           document.  The return value is the ID number of the document or -1
           if the index is out of bounds.

       $result->get_dbidx(index)
           Get the index of the container database of a document.  `index'
           specifies the index of a document.  The return value is the index
           of the container database of the document or -1 if the index is out
           of bounds.

       $result->hint_words()
           Get an array of hint words.  The return value is a reference to an
           array of hint words.

       $result->hint(word)
           Get the value of a hint word.  `word' specifies a hint word.  An
           empty string means the number of the whole result.  The return
           value is the number of documents corresponding to the hint word.
           If the word is in a negative condition, the value is negative.
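
           A minimal sketch of reading hints, using only hint_words() and
           hint() as described above.  The helper `hint_counts' is
           hypothetical and not part of the binding; it accepts any object
           that offers those two methods.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Estraier): collect the hint counts of a
# result object into a hash of word => number of documents.
sub hint_counts {
    my ($result) = @_;
    my %counts;
    foreach my $word (@{ $result->hint_words() }) {
        $counts{$word} = $result->hint($word);
    }
    return \%counts;
}

# Typical use after a search, assuming $db and $cond are set up as in the
# EXAMPLE section:
#   my $result = $db->search($cond);
#   my $counts = hint_counts($result);
#   printf("%s: %d\n", $_, $counts->{$_}) foreach sort keys %$counts;
#   printf("total: %d\n", $result->hint(""));
```

           Negative counts are kept as they are, so a caller can tell which
           words came from a negative condition.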

       $result->get_score(index)
           Get the score of a document.  `index' specifies the index of a
           document.  The return value is the score of the document or -1 if
           the index is out of bounds.

       $result->get_shadows(id)
           Get an array of ID numbers of eclipsed documents of a document.
           `id' specifies the ID number of a parent document.  The return
           value is a reference to an array whose elements express the ID
           numbers and their scores alternately.
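
           The alternating layout returned by get_shadows() can be regrouped
           into pairs as sketched below.  The helper `shadow_pairs' is
           hypothetical and not part of the binding, and the sample array is
           hand-written in the documented layout.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Estraier): regroup a flat array of
# alternating ID numbers and scores into a list of [id, score] pairs.
sub shadow_pairs {
    my ($shadows) = @_;
    my @pairs;
    for (my $i = 0; $i + 1 < @$shadows; $i += 2) {
        push(@pairs, [$shadows->[$i], $shadows->[$i + 1]]);
    }
    return @pairs;
}

# Hand-written sample data: ID, score, ID, score.
foreach my $pair (shadow_pairs([12, 9800, 15, 7200])) {
    printf("ID: %d  score: %d\n", @$pair);
}

# With the binding loaded, the input would come from a result object:
#   my @pairs = shadow_pairs($result->get_shadows($id));
```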

   Class Database
       Database::VERSION = "0.0.0"
           version of Hyper Estraier

       Database::ERRNOERR = 0
           error code: no error

       Database::ERRINVAL = 1
           error code: invalid argument

       Database::ERRACCES = 2
           error code: access forbidden

       Database::ERRLOCK = 3
           error code: lock failure

       Database::ERRDB = 4
           error code: database problem

       Database::ERRIO = 5
           error code: I/O problem

       Database::ERRNOITEM = 6
           error code: no item

       Database::ERRMISC = 9999
           error code: miscellaneous

       Database::DBREADER = 1 << 0
           open mode: open as a reader

       Database::DBWRITER = 1 << 1
           open mode: open as a writer

       Database::DBCREAT = 1 << 2
           open mode: a writer creating

       Database::DBTRUNC = 1 << 3
           open mode: a writer truncating

       Database::DBNOLCK = 1 << 4
           open mode: open without locking

       Database::DBLCKNB = 1 << 5
           open mode: lock without blocking

       Database::DBPERFNG = 1 << 10
           open mode: use perfect N-gram analyzer

       Database::DBCHRCAT = 1 << 11
           open mode: use character category analyzer

       Database::DBSMALL = 1 << 20
           open mode: small tuning

       Database::DBLARGE = 1 << 21
           open mode: large tuning

       Database::DBHUGE = 1 << 22
           open mode: huge tuning

       Database::DBHUGE2 = 1 << 23
           open mode: huge tuning second

       Database::DBHUGE3 = 1 << 24
           open mode: huge tuning third

       Database::DBSCVOID = 1 << 25
           open mode: store scores as void

       Database::DBSCINT = 1 << 26
           open mode: store scores as integer

       Database::DBSCASIS = 1 << 27
           open mode: refrain from adjustment of scores

       Database::IDXATTRSEQ = 0
           attribute index type: for multipurpose sequential access method

       Database::IDXATTRSTR = 1
           attribute index type: for narrowing with attributes as strings

       Database::IDXATTRNUM = 2
           attribute index type: for narrowing with attributes as numbers

       Database::OPTNOPURGE = 1 << 0
           optimize option: omit purging dispensable region of deleted
           documents

       Database::OPTNODBOPT = 1 << 1
           optimize option: omit optimization of the database files

       Database::MGCLEAN = 1 << 0
           merge option: clean up dispensable regions

       Database::PDCLEAN = 1 << 0
           put_doc option: clean up dispensable regions

       Database::PDWEIGHT = 1 << 1
           put_doc option: weight scores statically when indexing

       Database::ODCLEAN = 1 << 0
           out_doc option: clean up dispensable regions

       Database::GDNOATTR = 1 << 0
           get_doc option: no attributes

       Database::GDNOTEXT = 1 << 1
           get_doc option: no text

       Database::GDNOKWD = 1 << 2
           get_doc option: no keywords

       $db = new Database()
           Create a database object.

       Database::search_meta(dbs, cond)
           Search multiple databases for documents matching a condition.
           `dbs' specifies a reference to an array whose elements are database
           objects.  `cond' specifies a condition object.  The return value is
           a result object.  On error, `undef' is returned.

       $db->err_msg(ecode)
           Get the string of an error code.  `ecode' specifies an error code.
           The return value is the string of the error code.

       $db->open(name, omode)
           Open a database.  `name' specifies the name of a database
           directory.  `omode' specifies open modes: `Database::DBWRITER' as a
           writer, `Database::DBREADER' as a reader.  If the mode is
           `Database::DBWRITER', the following may be added by bitwise or:
           `Database::DBCREAT', which means it creates a new database if one
           does not exist, `Database::DBTRUNC', which means it creates a new
           database regardless of whether one exists.  Both
           `Database::DBREADER' and `Database::DBWRITER' can be combined by
           bitwise or with `Database::DBNOLCK', which means it opens a
           database file without file locking, or `Database::DBLCKNB', which
           means locking is performed without blocking.  If
           `Database::DBNOLCK' is used, the application is responsible for
           exclusion control.  `Database::DBCREAT' can be combined by bitwise
           or with `Database::DBPERFNG', which means N-gram analysis is
           performed against European text also, `Database::DBCHRCAT', which
           means character category analysis is performed instead of N-gram
           analysis, `Database::DBSMALL', which means the index is tuned to
           register less than 50000 documents, `Database::DBLARGE', which
           means the index is tuned to register more than 300000 documents,
           `Database::DBHUGE', which means the index is tuned to register more
           than 1000000 documents, `Database::DBHUGE2', which means the index
           is tuned to register more than 5000000 documents,
           `Database::DBHUGE3', which means the index is tuned to register
           more than 10000000 documents, `Database::DBSCVOID', which means
           scores are stored as void, `Database::DBSCINT', which means scores
           are stored as 32-bit integers, `Database::DBSCASIS', which means
           scores are stored as-is and marked not to be tuned when searching.
           The return value is true on success, false otherwise.
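
           The rules for combining open modes can be sketched in plain Perl.
           So that the sketch runs without the binding, it defines stand-in
           constants mirroring the values listed for this class.  The helper
           `omode_is_consistent' is hypothetical and not part of the binding;
           it only checks that creation flags accompany the writer mode.

```perl
use strict;
use warnings;

# Stand-in constants mirroring the values listed for Class Database.
# Real programs would write Database::DBWRITER and so on instead.
use constant {
    DBREADER => 1 << 0,
    DBWRITER => 1 << 1,
    DBCREAT  => 1 << 2,
    DBTRUNC  => 1 << 3,
    DBNOLCK  => 1 << 4,
    DBLCKNB  => 1 << 5,
};

# Hypothetical helper (not part of Estraier): return 1 if creation flags
# are only used together with the writer mode, 0 otherwise.
sub omode_is_consistent {
    my ($omode) = @_;
    my $creating = $omode & (DBCREAT | DBTRUNC);
    return 0 if $creating && !($omode & DBWRITER);
    return 1;
}

# Writer that creates the database if missing and does not block on the lock.
my $writer = DBWRITER | DBCREAT | DBLCKNB;
# Reader that opens the database without file locking.
my $reader = DBREADER | DBNOLCK;
printf("writer mode %d consistent: %d\n", $writer, omode_is_consistent($writer));
printf("reader mode %d consistent: %d\n", $reader, omode_is_consistent($reader));

# With the binding loaded, the mode would be passed like this:
#   $db->open("casket", Database::DBWRITER | Database::DBCREAT);
```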

       $db->close()
           Close the database.  The return value is true on success, false
           otherwise.

       $db->error()
           Get the code of the last error.  The return value is the code of
           the last error.

       $db->fatal()
           Check whether the database has a fatal error.  The return value is
           true if the database has a fatal error, false otherwise.

       $db->add_attr_index(name, type)
           Add an index for narrowing or sorting with document attributes.
           `name' specifies the name of an attribute.  `type' specifies the
           data type of the attribute index: `Database::IDXATTRSEQ' for the
           multipurpose sequential access method, `Database::IDXATTRSTR' for
           narrowing with attributes as strings, `Database::IDXATTRNUM' for
           narrowing with attributes as numbers.  The return value is true on
           success, false otherwise.

       $db->flush(max)
           Flush index words in the cache.  `max' specifies the maximum number
           of words to be flushed.  If it is not more than zero, all words are
           flushed.  The return value is true on success, false otherwise.

       $db->sync()
           Synchronize updating contents.  The return value is true on
           success, false otherwise.

       $db->optimize(options)
           Optimize the database.  `options' specifies options:
           `Database::OPTNOPURGE' to omit purging dispensable regions of
           deleted documents, `Database::OPTNODBOPT' to omit optimization of
           the database files.  The two can be specified at the same time by
           bitwise or.  The return value is true on success, false otherwise.

       $db->merge(name, options)
           Merge another database.  `name' specifies the name of another
           database directory.  `options' specifies options:
           `Database::MGCLEAN' to clean up dispensable regions of the deleted
           document.  The return value is true on success, false otherwise.

       $db->put_doc(doc, options)
           Add a document.  `doc' specifies a document object.  The document
           object should have the URI attribute.  `options' specifies options:
           `Database::PDCLEAN' to clean up dispensable regions of the
           overwritten document, `Database::PDWEIGHT' to weight scores
           statically when indexing.  The return value is true on success,
           false otherwise.

       $db->out_doc(id, options)
           Remove a document.  `id' specifies the ID number of a registered
           document.  `options' specifies options: `Database::ODCLEAN' to
           clean up dispensable regions of the deleted document.  The return
           value is true on success, false otherwise.

       $db->edit_doc(doc)
           Edit attributes of a document.  `doc' specifies a document object.
           The return value is true on success, false otherwise.

       $db->get_doc(id, options)
           Retrieve a document.  `id' specifies the ID number of a registered
           document.  `options' specifies options: `Database::GDNOATTR' to
           ignore attributes, `Database::GDNOTEXT' to ignore the body text,
           `Database::GDNOKWD' to ignore keywords.  The three can be specified
           at the same time by bitwise or.  The return value is a document
           object.  On error, `undef' is returned.

       $db->get_doc_attr(id, name)
           Retrieve the value of an attribute of a document.  `id' specifies
           the ID number of a registered document.  `name' specifies the name
           of an attribute.  The return value is the value of the attribute or
           `undef' if it does not exist.

       $db->uri_to_id(uri)
           Get the ID of a document specified by URI.  `uri' specifies the URI
           of a registered document.  The return value is the ID of the
           document.  On error, -1 is returned.

       $db->name()
           Get the name.  The return value is the name of the database.

       $db->doc_num()
           Get the number of documents.  The return value is the number of
           documents in the database.

       $db->word_num()
           Get the number of unique words.  The return value is the number of
           unique words in the database.

       $db->size()
           Get the size.  The return value is the size of the database.

       $db->search(cond)
           Search for documents matching a condition.  `cond' specifies a
           condition object.  The return value is a result object.  On error,
           `undef' is returned.

       $db->scan_doc(doc, cond)
           Check whether a document object definitely matches the phrase of a
           search condition object.  `doc' specifies a document object.
           `cond' specifies a search condition object.  The return value is
           true if the document definitely matches the phrase of the
           condition object, false otherwise.

       $db->set_cache_size(size, anum, tnum, rnum)
           Set the maximum size of the cache memory.  `size' specifies the
           maximum size of the index cache.  By default, it is 64MB.  If it is
           not more than 0, the current size is not changed.  `anum' specifies
           the maximum number of cached records for document attributes.  By
           default, it is 8192.  If it is not more than 0, the current size is
           not changed.  `tnum' specifies the maximum number of cached records
           for document texts.  By default, it is 1024.  If it is not more
           than 0, the current size is not changed.  `rnum' specifies the
           maximum number of cached records for occurrence results.  By
           default, it is 256.  If it is not more than 0, the current size is
           not changed.  The return value is always `undef'.

       $db->add_pseudo_index(path)
           Add a pseudo index directory.  `path' specifies the path of a
           pseudo index directory.  The return value is true on success, false
           otherwise.

       $db->set_wildmax(num)
           Set the maximum number of expansions of wild cards.  `num'
           specifies the maximum number of expansions of wild cards.  The
           return value is always `undef'.

       $db->set_informer(informer)
           Set the callback function to inform of database events.  `informer'
           specifies the name of an arbitrary function.  The function should
           have one parameter for a string of a message of each event.  The
           return value is always `undef'.
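
           A callback in the shape described above can be sketched as follows.
           The function name `log_event' and the sample message are made up
           for illustration.  Whether the name passed to set_informer() needs
           a package qualifier is not stated here, so the commented call is an
           assumption.

```perl
use strict;
use warnings;

# A callback taking one parameter, the message string of an event.
# It keeps each message in memory and echoes it to standard error.
my @events;
sub log_event {
    my ($message) = @_;
    push(@events, $message);
    printf(STDERR "estraier: %s\n", $message);
}

# Simulate one event so that the sketch runs without the binding.
log_event("sample event");

# With the binding loaded, the function would be registered by name
# (the "main::" qualifier is an assumption of this sketch):
#   $db->set_informer("main::log_event");
```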

EXAMPLE

   Gatherer
       The following is the simplest implementation of a gatherer.

         use strict;
         use warnings;
         use Estraier;
         $Estraier::DEBUG = 1;

         # create the database object
         my $db = new Database();

         # open the database
         unless($db->open("casket", Database::DBWRITER | Database::DBCREAT)){
             printf("error: %s\n", $db->err_msg($db->error()));
             exit;
         }

         # create a document object
         my $doc = new Document();

         # add attributes to the document object
         $doc->add_attr('@uri', "https://estraier.gov/example.txt");
         $doc->add_attr('@title', "Over the Rainbow");

         # add the body text to the document object
         $doc->add_text("Somewhere over the rainbow.  Way up high.");
         $doc->add_text("There's a land that I heard of once in a lullaby.");

         # register the document object to the database
         unless($db->put_doc($doc, Database::PDCLEAN)){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

         # close the database
         unless($db->close()){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

   Searcher
       The following is the simplest implementation of a searcher.

         use strict;
         use warnings;
         use Estraier;
         $Estraier::DEBUG = 1;

         # create the database object
         my $db = new Database();

         # open the database
         unless($db->open("casket", Database::DBREADER)){
             printf("error: %s\n", $db->err_msg($db->error()));
             exit;
         }

         # create a search condition object
         my $cond = new Condition();

         # set the search phrase to the search condition object
         $cond->set_phrase("rainbow AND lullaby");

         # get the result of search; `undef' is returned on error
         my $result = $db->search($cond);
         unless(defined($result)){
             printf("error: %s\n", $db->err_msg($db->error()));
             $db->close();
             exit;
         }

         # for each document in the result
         my $dnum = $result->doc_num();
         foreach my $i (0..$dnum-1){
             # retrieve the document object
             my $doc = $db->get_doc($result->get_doc_id($i), 0);
             next unless(defined($doc));
             # display attributes
             my $uri = $doc->attr('@uri');
             printf("URI: %s\n", $uri) if defined($uri);
             my $title = $doc->attr('@title');
             printf("Title: %s\n", $title) if defined($title);
             # display the body text
             my $texts = $doc->texts();
             foreach my $text (@$texts){
                 printf("%s\n", $text);
             }
         }

         # close the database
         unless($db->close()){
             printf("error: %s\n", $db->err_msg($db->error()));
         }

LICENSE

        Copyright (C) 2004-2007 Mikio Hirabayashi
        All rights reserved.

       Hyper Estraier is free software; you can redistribute it and/or modify
       it under the terms of the GNU Lesser General Public License as
       published by the Free Software Foundation; either version 2.1 of the
       License or any later version.  Hyper Estraier is distributed in the
       hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
       implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.  See the GNU Lesser General Public License for more details.
       You should have received a copy of the GNU Lesser General Public
       License along with Hyper Estraier; if not, write to the Free Software
       Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
       USA.

perl v5.12.0                      2007-02-20                       Estraier(3)