AI::Categorizer::Document(3)  User Contributed Perl Documentation  AI::Categorizer::Document(3)

NAME
       AI::Categorizer::Document - Embodies a document

SYNOPSIS
        use AI::Categorizer::Document;
        use AI::Categorizer::Learner::NaiveBayes;

        # Simplest way to create a document:
        my $d = AI::Categorizer::Document->new(name    => $string,
                                               content => $string);

        # Other parameters are accepted:
        $d = AI::Categorizer::Document->new(name => $string,
                                            categories => \@category_objects,
                                            content => { subject => $string,
                                                         body => $string2, ... },
                                            content_weights => { subject => 3,
                                                                 body => 1, ... },
                                            stopwords => \%skip_these_words,
                                            stemming => $string,
                                            front_bias => $float,
                                            use_features => $feature_vector,
                                           );

        # Specify explicit feature vector:
        $d = AI::Categorizer::Document->new(name => $string);
        $d->features( $feature_vector );

        # Now pass the document to a categorization algorithm:
        my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
        my $hypothesis = $learner->categorize($d);

DESCRIPTION
       The Document class embodies the data in a single document, and contains
       methods for turning this data into a FeatureVector. Usually documents
       are plain text, but subclasses of the Document class may handle any
       kind of data.

METHODS
       new(%parameters)
           Creates a new Document object. Document objects are used during
           training (for the training documents), testing (for the test
           documents), and when categorizing new unseen documents in an
           application (for the unseen documents). However, you'll typically
           only call "new()" in the last case, since the KnowledgeSet or
           Collection classes will create Document objects for you in the
           first two.

           The "new()" method accepts the following parameters:

           name
               A string that identifies this document. Required.

           content
               The raw content of this document. May be specified as either a
               string or as a hash reference, allowing structured document
               types.

           content_weights
               A hash reference indicating the weights that should be assigned
               to features in different sections of a structured document when
               creating its feature vector. The weight is a multiplier of the
               feature vector values. For instance, if a "subject" section
               has a weight of 3 and a "body" section has a weight of 1, and
               word counts are used as feature vector values, then it will be
               as if all words appearing in the "subject" appeared 3 times.

               If no weights are specified, all weights are set to 1.
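
               The multiplier described above can be sketched in plain Perl.
               This is only an illustration: "weighted_counts" is a
               hypothetical helper, not part of AI::Categorizer, and it
               assumes simple word counts as the feature values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): count the words in
# each section and scale every count by that section's weight.
sub weighted_counts {
    my ($content, $weights) = @_;
    my %features;
    for my $section (keys %$content) {
        # Sections without an explicit weight default to 1.
        my $w = exists $weights->{$section} ? $weights->{$section} : 1;
        $features{lc $_} += $w for $content->{$section} =~ /\w+/g;
    }
    return \%features;
}

my $features = weighted_counts(
    { subject => 'Perl release', body => 'A new Perl release is out' },
    { subject => 3, body => 1 },
);

# "perl" occurs once in the subject (1 * 3) and once in the body (1 * 1),
# so it ends up with a value of 4.
printf "%s => %d\n", $_, $features->{$_} for sort keys %$features;
```

               Words that appear only in the body keep a value of 1, which
               matches the "as if they appeared 3 times" description for
               subject words.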

           front_bias
               Allows smooth bias of the weights of words in a document
               according to their position. The value should be a number
               between -1 and 1. Positive numbers indicate that words toward
               the beginning of the document should have higher weight than
               words toward the end of the document. Negative numbers
               indicate the opposite. A bias of 0 indicates that no biasing
               should be done.
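
               To make the sign convention concrete, here is a minimal
               sketch of one possible position weighting. The linear ramp
               and the name "position_weights" are assumptions chosen for
               illustration; this page does not say which curve the module
               actually applies.

```perl
use strict;
use warnings;

# Hypothetical linear ramp, NOT the module's actual formula: the first
# token gets weight 1 + $bias and the last gets 1 - $bias.
sub position_weights {
    my ($bias, $n) = @_;
    die "bias must be between -1 and 1\n" if $bias < -1 || $bias > 1;
    return ((1) x $n) if $n < 2;    # nothing to ramp over
    return map {
        my $pos = $_ / ($n - 1);    # 0 at the front, 1 at the end
        1 + $bias * (1 - 2 * $pos);
    } 0 .. $n - 1;
}

# A positive bias favors early tokens; a bias of 0 leaves every weight at 1.
my @w = position_weights(0.5, 5);
print join(' ', @w), "\n";
```

               With a negative bias the same ramp runs the other way, so
               later tokens receive the larger weights.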

           categories
               A reference to an array of Category objects that this document
               belongs to. Optional.

           stopwords
               A list/hash of features (words) that should be ignored when
               parsing document content. A hash reference is preferred, with
               the features as the keys. If you pass an array reference
               containing the features, it will be converted to a hash
               reference internally.
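
               The array-to-hash conversion might look roughly like the
               following. "stopword_hash" is a hypothetical name used for
               illustration and is not the module's internal code.

```perl
use strict;
use warnings;

# Hypothetical normalizer (an assumption, not the module's code): accept
# either form and always hand back a hash reference keyed by feature.
sub stopword_hash {
    my ($stopwords) = @_;
    return $stopwords if ref $stopwords eq 'HASH';
    my %lookup = map { ($_ => 1) } @$stopwords;
    return \%lookup;
}

my $skip = stopword_hash([qw(the a of)]);

# Hash lookups make the per-token test cheap.
my @kept = grep { !exists $skip->{$_} } qw(the art of perl);
print "@kept\n";
```

               Passing a hash reference up front avoids the conversion step
               entirely, which is presumably why it is the preferred form.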

           use_features
               A Feature Vector specifying the only features that should be
               considered when parsing this document. This is an alternative
               to using "stopwords".

           stemming
               Indicates the linguistic procedure that should be used to
               convert tokens in the document to features. Possible values
               are "none", which indicates that the tokens should be used
               without change, or "porter", indicating that the Porter
               stemming algorithm should be applied to each token. This
               requires the "Lingua::Stem" module from CPAN.

           stopword_behavior
               There are a few ways you might want the stopword list
               (specified with the "stopwords" parameter) to interact with the
               stemming algorithm (specified with the "stemming" parameter).
               These options can be controlled with the "stopword_behavior"
               parameter, which can take the following values:

               no_stem
                   Match stopwords against non-stemmed document words.

               stem
                   Stem stopwords according to 'stemming' parameter, then
                   match them against stemmed document words.

               pre_stemmed
                   Stopwords are already stemmed, match them against stemmed
                   document words.

               The default value is "stem", which seems to produce the best
               results in most cases I've tried. I'm not aware of any studies
               comparing the "no_stem" behavior to the "stem" behavior in the
               general case.

               This parameter has no effect if there are no stopwords being
               used, or if stemming is not being used. In the latter case,
               the list of stopwords will always be matched as-is against the
               document words.

               Note that if the "stem" option is used, the data structure
               passed as the "stopwords" parameter will be modified in-place
               to contain the stemmed versions of the stopwords supplied.
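
               The three behaviors can be compared with a small
               self-contained sketch. Everything in it is an assumption made
               for illustration: "toy_stem" is a crude stand-in for the
               Porter stemmer so the example runs without "Lingua::Stem",
               "filter_tokens" is a hypothetical helper, and the sketch
               assumes surviving tokens are still stemmed under "no_stem".
               Unlike the module, it works on a copy of the stopword list
               rather than modifying it in place.

```perl
use strict;
use warnings;

# Toy stand-in for a real stemmer such as Porter's: strip a trailing
# "ing" or "s". Real stemmers handle far more cases.
sub toy_stem { my ($w) = @_; $w =~ s/(?:ing|s)$//; return $w }

# Hypothetical filter illustrating the three documented behaviors.
sub filter_tokens {
    my ($tokens, $stopwords, $behavior) = @_;
    my %stop = map { ($_ => 1) } @$stopwords;

    if ($behavior eq 'no_stem') {
        # Compare raw tokens to raw stopwords, then stem the survivors.
        return map { toy_stem($_) } grep { !$stop{$_} } @$tokens;
    }
    if ($behavior eq 'stem') {
        # Stem the stopwords first so they can match stemmed tokens.
        %stop = map { (toy_stem($_) => 1) } keys %stop;
    }
    # 'stem' and 'pre_stemmed' both compare against stemmed tokens.
    return grep { !$stop{$_} } map { toy_stem($_) } @$tokens;
}

my @tokens    = qw(walking cats);
my @stopwords = qw(walks);

for my $behavior (qw(no_stem stem pre_stemmed)) {
    my @kept = filter_tokens(\@tokens, \@stopwords, $behavior);
    print "$behavior: @kept\n";
}
```

               Only "stem" removes "walking" here, because "walks" and
               "walking" share the toy stem "walk". Under "pre_stemmed" the
               unstemmed stopword "walks" never matches a stemmed token,
               which shows why that option expects an already stemmed list.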

       read( path => $path )
           An alternative constructor method which reads a file on disk and
           returns a document with that file's contents.

       parse( content => $content )
       name()
           Returns this document's "name" property as specified when the
           document was created.

       features()
           Returns the Feature Vector associated with this document.

       categories()
           In a list context, returns a list of Category objects to which this
           document belongs. In a scalar context, returns the number of such
           categories.
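
           Context-sensitive return values of this kind are a common Perl
           idiom. The sketch below uses a hypothetical stand-in class,
           "My::Doc", with plain strings in place of Category objects; it is
           not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in class (not AI::Categorizer::Document itself)
# showing one way an accessor can respond to calling context.
package My::Doc;

sub new {
    my ($class, %args) = @_;
    return bless { categories => $args{categories} || [] }, $class;
}

sub categories {
    my $self = shift;
    # wantarray is true when the caller expects a list.
    return wantarray
        ? @{ $self->{categories} }
        : scalar @{ $self->{categories} };
}

package main;

my $doc   = My::Doc->new(categories => [qw(sports politics)]);
my @cats  = $doc->categories;   # list context: the categories themselves
my $count = $doc->categories;   # scalar context: how many there are
print "$count categories: @cats\n";
```

           A consequence worth remembering is that "if ($doc->categories)"
           evaluates the call in scalar context, so it tests whether the
           document has any categories at all.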

       create_feature_vector()
           Creates this document's Feature Vector by parsing its content. You
           won't call this method directly; it's called by "new()".

AUTHOR
       Ken Williams <ken@mathforum.org>

COPYRIGHT
       This distribution is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself. These terms apply to
       every file in the distribution - if you have questions, please contact
       the author.

perl v5.32.0                      2020-07-28      AI::Categorizer::Document(3)