AI::Categorizer::KnowledgeSet(3)  User Contributed Perl Documentation  AI::Categorizer::KnowledgeSet(3)

NAME
       AI::Categorizer::KnowledgeSet - Encapsulates a set of documents

SYNOPSIS
        use AI::Categorizer::KnowledgeSet;
        use AI::Categorizer::Learner::NaiveBayes;
        my $k  = AI::Categorizer::KnowledgeSet->new(...parameters...);
        my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
        $nb->train(knowledge_set => $k);

DESCRIPTION
       The KnowledgeSet class provides an interface to a set of documents, a
       set of categories, and a mapping between the two. Many parameters for
       controlling the processing of documents are managed by the
       KnowledgeSet class.

METHODS
       new()
           Creates a new KnowledgeSet and returns it. Accepts the following
           parameters:

           load
               If a "load" parameter is present, the load() method will be
               invoked immediately. If the "load" parameter is a string, it
               will be passed as the "path" parameter to load(). If the
               "load" parameter is a hash reference, it will represent all the
               parameters to pass to load().

           categories
               An optional reference to an array of Category objects
               representing the complete set of categories in a KnowledgeSet.
               If used, the "documents" parameter should also be specified.

           documents
               An optional reference to an array of Document objects
               representing the complete set of documents in a KnowledgeSet.
               If used, the "categories" parameter should also be specified.

           features_kept
               A number indicating how many features (words) should be
               considered when training the Learner or categorizing new
               documents. May be specified as a positive integer (e.g. 2000)
               indicating the absolute number of features to be kept, or as a
               decimal between 0 and 1 (e.g. 0.2) indicating the fraction of
               the total number of features to be kept, or as 0 to indicate
               that no feature selection should be done and that the entire
               set of features should be used. The default is 0.2.
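The three forms of "features_kept" can be illustrated with a small stand-alone sketch. The helper name resolve_features_kept() is hypothetical and is not part of AI::Categorizer; rounding a fractional result down and capping an absolute count at the total are assumptions of this sketch, not documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): turn a features_kept
# value into an absolute number of features, following the rules above.
sub resolve_features_kept {
    my ($features_kept, $total_features) = @_;

    # 0 disables feature selection, so every feature is kept.
    return $total_features if $features_kept == 0;

    # A value strictly between 0 and 1 is a fraction of the total.
    # Rounding down with int() is an assumption of this sketch.
    return int($total_features * $features_kept) if $features_kept < 1;

    # Anything else is an absolute count, capped at the total (assumption).
    return $features_kept > $total_features ? $total_features : $features_kept;
}

# Show how each form resolves against a vocabulary of 10,000 features.
printf "features_kept=%s keeps %d features\n", $_, resolve_features_kept($_, 10_000)
    for (0, 0.2, 2000);
```

Note that the value 1 falls in the "absolute count" branch here, so it would keep a single feature rather than the whole set.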

           feature_selection
               A string indicating the type of feature selection that should
               be performed. Currently the only option is also the default
               option: "document_frequency".
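As a rough illustration of selection by document frequency, the sketch below counts how many documents contain each feature and keeps the most widespread ones. It is not the module's implementation: the helper select_by_document_frequency() and the representation of a document as a plain hash of feature counts are assumptions, as is the choice to rank higher document frequencies first.

```perl
use strict;
use warnings;

# Hypothetical helper: rank features by document frequency and keep the
# top $keep. Each document is a hash reference of feature => count.
sub select_by_document_frequency {
    my ($docs, $keep) = @_;

    my %df;
    for my $doc (@$docs) {
        $df{$_}++ for keys %$doc;    # each feature counts once per document
    }

    # Highest document frequency first; ties are broken alphabetically so
    # that the result is deterministic.
    my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
    $keep = @ranked if $keep > @ranked;

    # Return feature => document frequency pairs for the kept features.
    return map { ($_ => $df{$_}) } @ranked[0 .. $keep - 1];
}

my @docs = (
    { perl => 3, module  => 1 },
    { perl => 1, learner => 2 },
    { perl => 2, module  => 4, vector => 1 },
);
my %kept = select_by_document_frequency(\@docs, 2);
print "$_ appears in $kept{$_} documents\n" for sort keys %kept;
```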

           tfidf_weighting
               Specifies how document word counts should be converted to
               vector values. Uses the three-character specification strings
               from Salton & Buckley's paper "Term-weighting approaches in
               automatic text retrieval". The three characters indicate the
               three factors that will be multiplied for each feature to find
               the final vector value for that feature. The default weighting
               is "xxx".

               The first character specifies the "term frequency" component,
               which can take the following values:

               b   Binary weighting - 1 for terms present in a document, 0 for
                   terms absent.

               t   Raw term frequency - equal to the number of times a feature
                   occurs in the document.

               x   A synonym for 't'.

               n   Normalized term frequency - 0.5 + 0.5 * t/max(t), where
                   max(t) is the highest raw term frequency in the document.
                   This is the same as the 't' specification, but with term
                   frequency normalized to lie between 0.5 and 1.

               The second character specifies the "collection frequency"
               component, which can take the following values:

               f   Inverse document frequency - multiply term "t"'s value by
                   "log(N/n)", where "N" is the total number of documents in
                   the collection, and "n" is the number of documents in
                   which term "t" is found.

               p   Probabilistic inverse document frequency - multiply term
                   "t"'s value by "log((N-n)/n)" (same variable meanings as
                   above).

               x   No change - multiply by 1.

               The third character specifies the "normalization" component,
               which can take the following values:

               c   Apply cosine normalization - multiply by
                   1/length(document_vector).

               x   No change - multiply by 1.

               The three components may alternatively be specified by the
               "term_weighting", "collection_weighting", and
               "normalize_weighting" parameters respectively.
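The way the three characters combine can be sketched in plain Perl. This is an illustration of the formulas listed above, not the code used by weigh_features(): the helper weigh_corpus() and the plain-hash document representation are assumptions, length(document_vector) is taken to be the Euclidean length, and the 'p' branch returns 0 for a term found in every document (where log((N-n)/n) is undefined), which is also an assumption.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: weight term-count hashes according to a
# three-character specification such as "tfc" or "nxx".
sub weigh_corpus {
    my ($spec, @docs) = @_;
    my ($tf_mode, $cf_mode, $norm_mode) = split //, $spec;

    # N is the number of documents; $n{term} is its document frequency.
    my $N = @docs;
    my %n;
    for my $doc (@docs) { $n{$_}++ for keys %$doc }

    my @weighted;
    for my $doc (@docs) {
        my $max_tf = max(values %$doc);
        my %w;
        for my $term (keys %$doc) {
            my $t = $doc->{$term};

            # First character: term frequency component.
            my $tf = $tf_mode eq 'b' ? 1
                   : $tf_mode eq 'n' ? 0.5 + 0.5 * $t / $max_tf
                   :                   $t;    # 't' or 'x'

            # Second character: collection frequency component.
            # For 'p', a term present in every document would need log(0);
            # this sketch uses 0 in that case (an assumption).
            my $cf = $cf_mode eq 'f' ? log($N / $n{$term})
                   : $cf_mode eq 'p' ? ($N > $n{$term}
                                           ? log(($N - $n{$term}) / $n{$term})
                                           : 0)
                   :                   1;     # 'x'

            $w{$term} = $tf * $cf;
        }

        # Third character: cosine normalization divides every value by the
        # Euclidean length of the document vector.
        if ($norm_mode eq 'c') {
            my $length = sqrt(sum(0, map { $_ ** 2 } values %w));
            if ($length > 0) { $_ /= $length for values %w }
        }
        push @weighted, \%w;
    }
    return @weighted;
}

my @result = weigh_corpus('tfc',
    { perl => 2, module => 1 },
    { perl => 1, vector => 3 },
);
for my $i (0 .. $#result) {
    printf "doc %d: %s = %.3f\n", $i, $_, $result[$i]{$_}
        for sort keys %{ $result[$i] };
}
```

With the "f" component, a term that occurs in every document gets log(N/N) = 0, so it contributes nothing to the weighted vector.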

           verbose
               If set to a true value, some status/debugging information will
               be printed to "STDOUT".

       categories()
           In list context, returns a list of all Category objects in this
           KnowledgeSet. In scalar context, returns the number of such
           objects.

       documents()
           In list context, returns a list of all Document objects in this
           KnowledgeSet. In scalar context, returns the number of such
           objects.
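The difference between the two calling contexts can be shown with a stand-in class. Local::Set below is purely hypothetical and only mimics the context-sensitive return described above; it says nothing about how KnowledgeSet stores its objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, used only to demonstrate calling context.
package Local::Set;

sub new {
    my ($class, @items) = @_;
    return bless { items => [@items] }, $class;
}

# Returns the items in list context and their count in scalar context.
sub documents {
    my $self = shift;
    return wantarray ? @{ $self->{items} } : scalar @{ $self->{items} };
}

package main;

my $set   = Local::Set->new(qw(doc1 doc2 doc3));
my @all   = $set->documents;    # list context: the items themselves
my $count = $set->documents;    # scalar context: how many there are

# print() supplies list context, so scalar() is needed to get the count.
print "count: ", scalar($set->documents), "\n";
```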

       document()
           Given a document name, returns the Document object with that name,
           or "undef" if no such Document object exists in this KnowledgeSet.

       features()
           Returns a FeatureSet object which represents the features of all
           the documents in this KnowledgeSet.

       verbose()
           Returns the "verbose" parameter of this KnowledgeSet, or sets it
           with an optional argument.

       scan_stats()
           Scans all the documents of a Collection and returns a hash
           reference containing several statistics about the Collection. The
           individual statistics are not yet documented.

       scan_features()
           This method scans through a Collection object and determines the
           "best" features (words) to use when loading the documents and
           training the Learner. This process is known as "feature
           selection", and it's a very important part of categorization.

           The Collection object should be specified as a "collection"
           parameter, or by giving the arguments to pass to the Collection's
           new() method.

           The process of feature selection is governed by the
           "feature_selection" and "features_kept" parameters given to the
           KnowledgeSet's new() method.

           This method returns the features as a FeatureVector whose values
           are the "quality" of each feature, by whatever measure the
           "feature_selection" parameter specifies. Normally you won't need
           to use the return value, because this FeatureVector will become the
           "use_features" parameter of any Document objects created by this
           KnowledgeSet.

       save_features()
           Given the name of a file, this method writes the features (as
           determined by the "scan_features" method) to the file.

       restore_features()
           Given the name of a file written by "save_features", loads the
           features from that file and passes them as the "use_features"
           parameter for any Document objects created in the future by this
           KnowledgeSet.

       read()
           Iterates through a Collection of documents and adds them to the
           KnowledgeSet. The Collection can be specified using a "collection"
           parameter - otherwise, specify the arguments to pass to the new()
           method of the Collection class.

       load()
           This method can do feature selection and load a Collection in one
           step (though it currently uses two steps internally).

       add_document()
           Given a Document object as an argument, this method will add it,
           along with any categories it belongs to, to the KnowledgeSet.

       make_document()
           This method will create a Document object with the given data and
           then call add_document() to add it to the KnowledgeSet. A
           "categories" parameter should specify an array reference containing
           a list of categories by name. These are the categories that the
           document belongs to. Any other parameters will be passed to the
           Document class's new() method.

       finish()
           This method will be called prior to training the Learner. Its
           purpose is to perform any operations (such as feature vector
           weighting) that may require examination of the entire KnowledgeSet.

       weigh_features()
           This method will be called during finish() to adjust the weights of
           the features according to the "tfidf_weighting" parameter.

       document_frequency()
           Given a single feature (word) as an argument, this method will
           return the number of documents in the KnowledgeSet that contain
           that feature.

       partition()
           Divides the KnowledgeSet into several subsets. This may be useful
           for performing cross-validation. The relative sizes of the subsets
           should be passed as arguments. For example, to split the
           KnowledgeSet into four KnowledgeSets of equal size, pass the
           arguments .25, .25, .25 (the final size is 1 minus the sum of the
           other sizes). The partitions will be returned as a list.
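The size arithmetic can be sketched on an ordinary list. The helper partition_list() is hypothetical and works on an array reference rather than a KnowledgeSet; it rounds each subset size to the nearest whole item and takes items in their original order, both of which are assumptions of this sketch rather than documented behavior of partition().

```perl
use strict;
use warnings;

# Hypothetical helper: split a list into subsets whose relative sizes are
# given as fractions; the final subset receives whatever remains.
sub partition_list {
    my ($items, @sizes) = @_;
    my @pool  = @$items;    # copy, so the caller's array is left untouched
    my $total = @pool;

    my @parts;
    for my $fraction (@sizes) {
        my $n = int($total * $fraction + 0.5);    # round to nearest item
        $n = @pool if $n > @pool;                 # never take more than is left
        push @parts, [ splice @pool, 0, $n ];
    }
    push @parts, [@pool];    # final subset: 1 minus the sum of the others
    return @parts;
}

# Three fractions of .25 yield four subsets of equal size.
my @parts = partition_list([1 .. 20], .25, .25, .25);
printf "subset %d has %d items\n", $_, scalar @{ $parts[$_] } for 0 .. $#parts;
```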

AUTHOR
       Ken Williams, ken@mathforum.org

COPYRIGHT
       Copyright 2000-2003 Ken Williams. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3)

perl v5.36.0                      2023-01-19  AI::Categorizer::KnowledgeSet(3)