AI::Categorizer::FeatureSelector(3)  User Contributed Perl Documentation  AI::Categorizer::FeatureSelector(3)

NAME
       AI::Categorizer::FeatureSelector - Abstract Feature Selection class

SYNOPSIS
        ...

DESCRIPTION
       The KnowledgeSet class provides an interface to a set of documents, a
       set of categories, and a mapping between the two.  Many parameters for
       controlling the processing of documents are managed by the
       KnowledgeSet class.

METHODS
       new()
           Creates a new KnowledgeSet and returns it.  Accepts the following
           parameters:

           load
               If a "load" parameter is present, the load() method will be
               invoked immediately.  If the "load" parameter is a string, it
               will be passed as the "path" parameter to load().  If the
               "load" parameter is a hash reference, it will represent all the
               parameters to pass to load().

           categories
               An optional reference to an array of Category objects
               representing the complete set of categories in a KnowledgeSet.
               If used, the "documents" parameter should also be specified.

           documents
               An optional reference to an array of Document objects
               representing the complete set of documents in a KnowledgeSet.
               If used, the "categories" parameter should also be specified.

           features_kept
               A number indicating how many features (words) should be
               considered when training the Learner or categorizing new
               documents.  May be specified as a positive integer (e.g. 2000)
               indicating the absolute number of features to be kept, or as a
               decimal between 0 and 1 (e.g. 0.2) indicating the fraction of
               the total number of features to be kept, or as 0 to indicate
               that no feature selection should be done and that the entire
               set of features should be used.  The default is 0.2.

           feature_selection
               A string indicating the type of feature selection that should
               be performed.  Currently the only option is also the default
               option: "document_frequency".

           tfidf_weighting
               Specifies how document word counts should be converted to
               vector values.  Uses the three-character specification strings
               from Salton & Buckley's paper "Term-weighting approaches in
               automatic text retrieval".  The three characters indicate the
               three factors that will be multiplied for each feature to find
               the final vector value for that feature.  The default weighting
               is "xxx".

               The first character specifies the "term frequency" component,
               which can take the following values:

               b   Binary weighting - 1 for terms present in a document, 0 for
                   terms absent.

               t   Raw term frequency - equal to the number of times a feature
                   occurs in the document.

               x   A synonym for 't'.

               n   Normalized term frequency - "0.5 + 0.5 * t/max(t)".  This is
                   the same as the 't' specification, but with term frequency
                   normalized to lie between 0.5 and 1.

               The second character specifies the "collection frequency"
               component, which can take the following values:

               f   Inverse document frequency - multiply term "t"'s value by
                   "log(N/n)", where "N" is the total number of documents in
                   the collection, and "n" is the number of documents in which
                   term "t" is found.

               p   Probabilistic inverse document frequency - multiply term
                   "t"'s value by "log((N-n)/n)" (same variable meanings as
                   above).

               x   No change - multiply by 1.

               The third character specifies the "normalization" component,
               which can take the following values:

               c   Apply cosine normalization - multiply by
                   "1/length(document_vector)".

               x   No change - multiply by 1.

               The three components may alternatively be specified by the
               "term_weighting", "collection_weighting", and
               "normalize_weighting" parameters respectively.

           verbose
               If set to a true value, some status/debugging information will
               be printed to "STDOUT".

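       The three-character "tfidf_weighting" specification described above
       can be illustrated with a short sketch.  The Python below is not the
       module's implementation; it only mirrors the formulas listed for each
       character.  The function name weigh_documents and the list-of-
       dictionaries input are hypothetical, and "length(document_vector)" is
       assumed to mean the Euclidean norm of the weighted vector.

```python
import math


def weigh_documents(docs, spec="xxx"):
    """Weight term counts using a three-character specification.

    docs is a list of dicts mapping each term to its raw count in that
    document (a hypothetical input format chosen for this sketch).
    Returns a new list of dicts holding the weighted values.
    """
    tf_code, cf_code, norm_code = spec
    n_docs = len(docs)

    # n: the number of documents in which each term is found.
    doc_freq = {}
    for counts in docs:
        for term, count in counts.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1

    weighted = []
    for counts in docs:
        max_tf = max(counts.values(), default=0)
        vector = {}
        for term, t in counts.items():
            if t <= 0:
                continue  # absent terms contribute nothing

            # First character: term frequency component.
            if tf_code == "b":
                tf = 1.0
            elif tf_code in ("t", "x"):
                tf = float(t)
            elif tf_code == "n":
                tf = 0.5 + 0.5 * t / max_tf
            else:
                raise ValueError(f"unknown term frequency code {tf_code!r}")

            # Second character: collection frequency component.
            n = doc_freq[term]
            if cf_code == "f":
                cf = math.log(n_docs / n)
            elif cf_code == "p":
                # log(0) is undefined, so this raises ValueError when a
                # term occurs in every document.
                cf = math.log((n_docs - n) / n)
            elif cf_code == "x":
                cf = 1.0
            else:
                raise ValueError(f"unknown collection code {cf_code!r}")

            vector[term] = tf * cf

        # Third character: normalization component.
        if norm_code == "c":
            length = math.sqrt(sum(v * v for v in vector.values()))
            if length > 0:
                vector = {term: v / length for term, v in vector.items()}
        elif norm_code != "x":
            raise ValueError(f"unknown normalization code {norm_code!r}")

        weighted.append(vector)
    return weighted
```

       For instance, weigh_documents(docs, "tfc") would combine raw term
       frequency, inverse document frequency, and cosine normalization,
       while the default "xxx" leaves the raw counts unchanged.
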
       categories()
           In a list context returns a list of all Category objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       documents()
           In a list context returns a list of all Document objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       document()
           Given a document name, returns the Document object with that name,
           or "undef" if no such Document object exists in this KnowledgeSet.

       features()
           Returns a FeatureSet object which represents the features of all
           the documents in this KnowledgeSet.

       verbose()
           Returns the "verbose" parameter of this KnowledgeSet, or sets it
           with an optional argument.

       scan_stats()
           Scans all the documents of a Collection and returns a hash
           reference containing several statistics about the Collection.
           (The statistics themselves are not yet documented.)

       scan_features()
           This method scans through a Collection object and determines the
           "best" features (words) to use when loading the documents and
           training the Learner.  This process is known as "feature
           selection", and it is an important part of categorization.

           The Collection object should be specified as a "collection"
           parameter, or by giving the arguments to pass to the Collection's
           new() method.

           The process of feature selection is governed by the
           "feature_selection" and "features_kept" parameters given to the
           KnowledgeSet's new() method.

           This method returns the features as a FeatureVector whose values
           are the "quality" of each feature, by whatever measure the
           "feature_selection" parameter specifies.  Normally you won't need
           to use the return value, because this FeatureVector will become the
           "use_features" parameter of any Document objects created by this
           KnowledgeSet.

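       The interaction between "features_kept" and the "document_frequency"
       selection type can be sketched as follows.  This is an illustration
       in Python under stated assumptions, not the module's code: the
       function names are hypothetical, fractions are assumed to round down,
       a higher document frequency is assumed to mean a "better" feature,
       and ties are broken alphabetically only to make the result
       deterministic.

```python
def resolve_features_kept(features_kept, total):
    """Turn a features_kept setting into an absolute feature count.

    0 keeps every feature, a value below 1 is treated as a fraction of
    the total, and anything else as an absolute number (capped at total).
    """
    if features_kept == 0:
        return total
    if features_kept < 1:
        return int(total * features_kept)  # assumption: round down
    return min(int(features_kept), total)


def select_by_document_frequency(docs, features_kept=0.2):
    """Keep the features that occur in the most documents.

    docs is a list of word lists, one per document (a hypothetical
    input format).  Returns a dict mapping each kept feature to its
    document frequency, which plays the role of the feature's "quality".
    """
    doc_freq = {}
    for words in docs:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1

    keep = resolve_features_kept(features_kept, len(doc_freq))
    # Highest document frequency first; alphabetical order breaks ties.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:keep])
```

       With five distinct words in the input and "features_kept" set to 0.4,
       this sketch would keep the two words found in the most documents.
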
       save_features()
           Given the name of a file, this method writes the features (as
           determined by the "scan_features" method) to the file.

       restore_features()
           Given the name of a file written by "save_features", loads the
           features from that file and passes them as the "use_features"
           parameter for any Document objects created in the future by this
           KnowledgeSet.

       read()
           Iterates through a Collection of documents and adds them to the
           KnowledgeSet.  The Collection can be specified using a "collection"
           parameter; otherwise, specify the arguments to pass to the new()
           method of the Collection class.

       load()
           This method can do feature selection and load a Collection in one
           step (though it currently uses two steps internally).

       add_document()
           Given a Document object as an argument, this method will add it,
           along with any categories it belongs to, to the KnowledgeSet.

       make_document()
           This method will create a Document object with the given data and
           then call add_document() to add it to the KnowledgeSet.  A
           "categories" parameter should specify an array reference containing
           a list of categories by name.  These are the categories that the
           document belongs to.  Any other parameters will be passed to the
           Document class's new() method.

       finish()
           This method will be called prior to training the Learner.  Its
           purpose is to perform any operations (such as feature vector
           weighting) that may require examination of the entire KnowledgeSet.

       weigh_features()
           This method will be called during finish() to adjust the weights of
           the features according to the "tfidf_weighting" parameter.

       document_frequency()
           Given a single feature (word) as an argument, this method will
           return the number of documents in the KnowledgeSet that contain
           that feature.

       partition()
           Divides the KnowledgeSet into several subsets.  This may be useful
           for performing cross-validation.  The relative sizes of the subsets
           should be passed as arguments.  For example, to split the
           KnowledgeSet into four KnowledgeSets of equal size, pass the
           arguments .25, .25, .25 (the final size is 1 minus the sum of the
           other sizes).  The partitions will be returned as a list.

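       The size arithmetic described for partition() can be sketched as
       below.  This Python is only an illustration: the function name
       partition_items is hypothetical, and both the random shuffling and
       the rounding of subset boundaries are assumptions rather than
       documented behaviour of the module.

```python
import random


def partition_items(items, *sizes, seed=None):
    """Split items into len(sizes) + 1 subsets.

    Each entry in sizes is a fraction of the whole; the final subset
    receives whatever remains, i.e. 1 minus the sum of the other sizes.
    """
    if any(size <= 0 for size in sizes) or sum(sizes) >= 1:
        raise ValueError("sizes must be positive and sum to less than 1")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment

    total = len(shuffled)
    parts = []
    start = 0
    cumulative = 0.0
    for size in sizes:
        cumulative += size
        end = round(cumulative * total)  # assumption: round boundaries
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the remainder forms the last subset
    return parts
```

       Passing .25, .25, .25 yields four subsets, matching the example in
       the text: three sized explicitly and a fourth holding the remainder.
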
AUTHOR
       Ken Williams, ken@mathforum.org

COPYRIGHT
       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3)

perl v5.36.0                      2023-01-19  AI::Categorizer::FeatureSelector(3)