AI::Categorizer::Document(3pm)

AI::Categorizer::Document(3)  User Contributed Perl Documentation  AI::Categorizer::Document(3)

NAME

       AI::Categorizer::Document - Embodies a document

SYNOPSIS

        use AI::Categorizer::Document;
        use AI::Categorizer::Learner::NaiveBayes;

        # Simplest way to create a document:
        my $d = AI::Categorizer::Document->new(name => $string,
                                               content => $string);

        # Other parameters are accepted:
        $d = AI::Categorizer::Document->new(name => $string,
                                            categories => \@category_objects,
                                            content => { subject => $string,
                                                         body => $string2, ... },
                                            content_weights => { subject => 3,
                                                                 body => 1, ... },
                                            stopwords => \%skip_these_words,
                                            stemming => $string,
                                            front_bias => $float,
                                            use_features => $feature_vector,
                                           );

        # Specify explicit feature vector:
        $d = AI::Categorizer::Document->new(name => $string);
        $d->features( $feature_vector );

        # Now pass the document to a categorization algorithm:
        my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
        my $hypothesis = $learner->categorize($d);

DESCRIPTION

       The Document class embodies the data in a single document, and contains
       methods for turning this data into a FeatureVector.  Usually documents
       are plain text, but subclasses of the Document class may handle any
       kind of data.

METHODS

       new(%parameters)
           Creates a new Document object.  Document objects are used during
           training (for the training documents), testing (for the test
           documents), and when categorizing new unseen documents in an
           application (for the unseen documents).  However, you'll typically
           only call new() in the last case, since the KnowledgeSet or
           Collection classes will create Document objects for you in the
           first two cases.

           The new() method accepts the following parameters:

           name
               A string that identifies this document.  Required.

           content
               The raw content of this document.  May be specified as either a
               string or as a hash reference, allowing structured document
               types.

           content_weights
               A hash reference indicating the weights that should be assigned
               to features in different sections of a structured document when
               creating its feature vector.  The weight is a multiplier of the
               feature vector values.  For instance, if a "subject" section
               has a weight of 3 and a "body" section has a weight of 1, and
               word counts are used as feature vector values, then it will be
               as if all words appearing in the "subject" appeared 3 times.

               If no weights are specified, all weights are set to 1.
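
               The multiplier effect can be illustrated with a minimal,
               self-contained sketch.  It does not use the module itself:
               weighted_counts() is a hypothetical helper, and splitting on
               whitespace is an assumption made for brevity, not necessarily
               how the module tokenizes content.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): count the
# whitespace-separated words in each section and scale every count by
# that section's weight.
sub weighted_counts {
    my ($content, $weights) = @_;
    my %features;
    for my $section (keys %$content) {
        # Sections without an explicit weight default to 1.
        my $w = exists $weights->{$section} ? $weights->{$section} : 1;
        for my $word (split ' ', $content->{$section}) {
            $features{lc $word} += $w;
        }
    }
    return \%features;
}

my $fv = weighted_counts(
    { subject => 'perl help', body => 'need help with perl modules' },
    { subject => 3, body => 1 },
);
printf "%s => %s\n", $_, $fv->{$_} for sort keys %$fv;
```

               Here "perl" and "help" each score 4 (3 from the subject plus 1
               from the body), while words found only in the body score 1.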

           front_bias
               Allows smooth bias of the weights of words in a document
               according to their position.  The value should be a number
               between -1 and 1.  Positive numbers indicate that words toward
               the beginning of the document should have higher weight than
               words toward the end of the document.  Negative numbers
               indicate the opposite.  A bias of 0 indicates that no biasing
               should be done.
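
               The exact weighting curve is not specified here.  Purely as an
               illustration of the idea, the sketch below assumes a linear
               ramp; position_weights() is a hypothetical helper, and the
               module's real curve may well differ.

```perl
use strict;
use warnings;

# Hypothetical linear scheme, for illustration only: the weight moves
# linearly from 1 + bias at the first word to 1 - bias at the last word.
sub position_weights {
    my ($n, $bias) = @_;
    die "bias must be between -1 and 1\n" if $bias < -1 || $bias > 1;
    my @weights;
    for my $i (0 .. $n - 1) {
        # Relative position: 0 for the first word, 1 for the last.
        # A single-word document is treated as sitting in the middle.
        my $pos = $n > 1 ? $i / ($n - 1) : 0.5;
        push @weights, 1 + $bias * (1 - 2 * $pos);
    }
    return @weights;
}

my @w = position_weights(5, 0.5);
print join(' ', @w), "\n";
```

               With a positive bias the first word gets the largest weight
               and the last word the smallest; a bias of 0 leaves every
               weight at 1.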

           categories
               A reference to an array of Category objects that this document
               belongs to.  Optional.

           stopwords
               A list/hash of features (words) that should be ignored when
               parsing document content.  A hash reference is preferred, with
               the features as the keys.  If you pass an array reference
               containing the features, it will be converted to a hash
               reference internally.
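
               A rough sketch of that conversion, and of why the hash form is
               convenient for lookups, follows.  normalize_stopwords() is a
               hypothetical helper written for this example; it is not the
               module's internal code.

```perl
use strict;
use warnings;

# Hypothetical helper: accept stopwords as either a hash reference or an
# array reference, turning the array form into a hash keyed by word.
sub normalize_stopwords {
    my ($stopwords) = @_;
    return $stopwords if ref $stopwords eq 'HASH';
    my %set = map { ($_ => 1) } @$stopwords;
    return \%set;
}

my $stop   = normalize_stopwords([qw(the a of)]);
my @tokens = grep { !$stop->{$_} } qw(the art of a good document);
print "@tokens\n";
```

               Passing a hash reference up front skips the conversion step.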

           use_features
               A Feature Vector specifying the only features that should be
               considered when parsing this document.  This is an alternative
               to using "stopwords".

           stemming
               Indicates the linguistic procedure that should be used to
               convert tokens in the document to features.  Possible values
               are "none", which indicates that the tokens should be used
               without change, or "porter", indicating that the Porter
               stemming algorithm should be applied to each token.  This
               requires the "Lingua::Stem" module from CPAN.

           stopword_behavior
               There are a few ways you might want the stopword list
               (specified with the "stopwords" parameter) to interact with the
               stemming algorithm (specified with the "stemming" parameter).
               These options can be controlled with the "stopword_behavior"
               parameter, which can take the following values:

               no_stem
                   Match stopwords against non-stemmed document words.

               stem
                   Stem stopwords according to the "stemming" parameter, then
                   match them against stemmed document words.

               pre_stemmed
                   Stopwords are already stemmed; match them against stemmed
                   document words.

               The default value is "stem", which seems to produce the best
               results in most cases I've tried.  I'm not aware of any studies
               comparing the "no_stem" behavior to the "stem" behavior in the
               general case.

               This parameter has no effect if there are no stopwords being
               used, or if stemming is not being used.  In the latter case,
               the list of stopwords will always be matched as-is against the
               document words.

               Note that if the "stem" option is used, the data structure
               passed as the "stopwords" parameter will be modified in-place
               to contain the stemmed versions of the stopwords supplied.
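
               The three behaviors can be compared with a small standalone
               sketch.  Everything in it is an assumption made for
               illustration: toy_stem() is a crude stand-in for Porter
               stemming, and filter_tokens() is a hypothetical helper, not
               the module's code.

```perl
use strict;
use warnings;

# Toy stemmer standing in for Porter stemming (an assumption made for
# brevity): strip a trailing "ing" or "s".
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|s)$//;
    return $word;
}

# Hypothetical filter showing the three documented behaviors.
sub filter_tokens {
    my ($tokens, $stopwords, $behavior) = @_;
    if ($behavior eq 'no_stem') {
        # Drop stopwords by matching raw tokens, then stem what is left.
        my @kept = grep { !$stopwords->{$_} } @$tokens;
        return map { toy_stem($_) } @kept;
    }
    if ($behavior eq 'stem') {
        # Stem the stopword list itself, replacing its keys in place.
        my %stemmed = map { (toy_stem($_) => 1) } keys %$stopwords;
        %$stopwords = %stemmed;
    }
    # 'stem' and 'pre_stemmed' both match against stemmed tokens.
    my @stems = map { toy_stem($_) } @$tokens;
    return grep { !$stopwords->{$_} } @stems;
}

my @tokens = qw(walking walks dogs);
for my $behavior (qw(no_stem stem pre_stemmed)) {
    my %stopwords = (walks => 1);    # fresh copy: 'stem' modifies it
    my @kept = filter_tokens(\@tokens, \%stopwords, $behavior);
    print "$behavior: @kept (stopwords now: @{[ sort keys %stopwords ]})\n";
}
```

               In this toy run, "no_stem" removes only the literal token
               "walks", "stem" also removes "walking" because both reduce to
               the same stem, and "pre_stemmed" removes nothing because the
               unstemmed stopword "walks" never matches a stemmed token.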

       read( path => $path )
           An alternative constructor method which reads a file on disk and
           returns a document with that file's contents.

       parse( content => $content )
       name()
           Returns this document's "name" property as specified when the
           document was created.

       features()
           Returns the Feature Vector associated with this document.

       categories()
           In a list context, returns a list of Category objects to which this
           document belongs.  In a scalar context, returns the number of such
           categories.
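
           The list/scalar distinction can be sketched with a stand-in class.
           My::Doc below is hypothetical and stores plain strings rather than
           Category objects; it only shows how such a context-sensitive
           accessor might be written with Perl's wantarray.

```perl
use strict;
use warnings;

# Hypothetical stand-in class (not the real AI::Categorizer::Document).
package My::Doc;

sub new {
    my ($class, %args) = @_;
    my $self = { categories => $args{categories} || [] };
    return bless $self, $class;
}

sub categories {
    my ($self) = @_;
    my @cats = @{ $self->{categories} };
    # wantarray is true in list context and false in scalar context.
    return wantarray ? @cats : scalar @cats;
}

package main;

my $doc   = My::Doc->new(categories => [qw(sports politics)]);
my @cats  = $doc->categories;    # list context: the categories themselves
my $count = $doc->categories;    # scalar context: how many there are
print "$count categories: @cats\n";
```

           Assigning the result to an array or using it in a loop gives the
           list; assigning it to a scalar gives the count.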

       create_feature_vector()
           Creates this document's Feature Vector by parsing its content.  You
           won't call this method directly; it's called by new().

AUTHOR

       Ken Williams <ken@mathforum.org>

COPYRIGHT

       This distribution is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself.  These terms apply to
       every file in the distribution - if you have questions, please contact
       the author.

perl v5.36.0                      2023-01-19      AI::Categorizer::Document(3)