AI::Categorizer::FeatureSelector(3pm)

AI::Categorizer::FeatureSelector(3)    User Contributed Perl Documentation    AI::Categorizer::FeatureSelector(3)

NAME

       AI::Categorizer::FeatureSelector - Abstract Feature Selection class

SYNOPSIS

        ...

DESCRIPTION

       The KnowledgeSet class provides an interface to a set of
       documents, a set of categories, and a mapping between the two.  Many
       parameters for controlling the processing of documents are managed by
       the KnowledgeSet class.

METHODS

       new()
           Creates a new KnowledgeSet and returns it.  Accepts the following
           parameters:

           load
               If a "load" parameter is present, the load() method will be
               invoked immediately.  If the "load" parameter is a string, it
               will be passed as the "path" parameter to load().  If the
               "load" parameter is a hash reference, it will represent all the
               parameters to pass to load().

           categories
               An optional reference to an array of Category objects
               representing the complete set of categories in a KnowledgeSet.
               If used, the "documents" parameter should also be specified.

           documents
               An optional reference to an array of Document objects
               representing the complete set of documents in a KnowledgeSet.
               If used, the "categories" parameter should also be specified.

           features_kept
               A number indicating how many features (words) should be
               considered when training the Learner or categorizing new
               documents.  May be specified as a positive integer (e.g. 2000)
               indicating the absolute number of features to be kept, or as a
               decimal between 0 and 1 (e.g. 0.2) indicating the fraction of
               the total number of features to be kept, or as 0 to indicate
               that no feature selection should be done and that the entire
               set of features should be used.  The default is 0.2.

           feature_selection
               A string indicating the type of feature selection that should
               be performed.  Currently the only option is also the default
               option: "document_frequency".

           tfidf_weighting
               Specifies how document word counts should be converted to
               vector values.  Uses the three-character specification strings
               from Salton & Buckley's paper "Term-weighting approaches in
               automatic text retrieval".  The three characters indicate the
               three factors that will be multiplied for each feature to find
               the final vector value for that feature.  The default weighting
               is "xxx".

               The first character specifies the "term frequency" component,
               which can take the following values:

               b   Binary weighting - 1 for terms present in a document, 0 for
                   terms absent.

               t   Raw term frequency - equal to the number of times a feature
                   occurs in the document.

               x   A synonym for 't'.

               n   Normalized term frequency - 0.5 + 0.5 * t/max(t).  This is
                   the same as the 't' specification, but with term frequency
                   normalized to lie between 0.5 and 1.

               The second character specifies the "collection frequency"
               component, which can take the following values:

               f   Inverse document frequency - multiply term "t"'s value by
                   log(N/n), where "N" is the total number of documents in the
                   collection, and "n" is the number of documents in which
                   term "t" is found.

               p   Probabilistic inverse document frequency - multiply term
                   "t"'s value by "log((N-n)/n)" (same variable meanings as
                   above).

               x   No change - multiply by 1.

               The third character specifies the "normalization" component,
               which can take the following values:

               c   Apply cosine normalization - multiply by
                   1/length(document_vector).

               x   No change - multiply by 1.

               The three components may alternatively be specified by the
               "term_weighting", "collection_weighting", and
               "normalize_weighting" parameters respectively.

           verbose
               If set to a true value, some status/debugging information will
               be output on "STDOUT".
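
           The three-character "tfidf_weighting" specification described
           above can be illustrated with a short sketch.  It is written in
           Python purely to show the arithmetic and is not the module's Perl
           code.  The function name weigh() and its arguments are
           hypothetical, and the sketch assumes a natural logarithm and a
           Euclidean vector length, neither of which is stated above.

```python
import math

def weigh(doc_counts, spec, doc_freq, num_docs):
    """Apply a three-character weighting spec to one document's term counts.

    doc_counts: {term: occurrences in this document}
    doc_freq:   {term: number of documents containing the term} (n)
    num_docs:   total number of documents in the collection (N)
    """
    tf_code, cf_code, norm_code = spec
    max_count = max(doc_counts.values(), default=0)
    weights = {}
    for term, count in doc_counts.items():
        # First character: term frequency component.
        if tf_code == "b":
            value = 1.0 if count > 0 else 0.0
        elif tf_code in ("t", "x"):
            value = float(count)
        elif tf_code == "n":
            value = 0.5 + 0.5 * count / max_count if max_count else 0.0
        else:
            raise ValueError(f"unknown term frequency code {tf_code!r}")

        # Second character: collection frequency component.
        n = doc_freq[term]
        if cf_code == "f":
            value *= math.log(num_docs / n)
        elif cf_code == "p":
            # Assumes n < N; log((N-n)/n) is undefined when n == N.
            value *= math.log((num_docs - n) / n)
        elif cf_code != "x":
            raise ValueError(f"unknown collection frequency code {cf_code!r}")
        weights[term] = value

    # Third character: normalization component.
    if norm_code == "c":
        # Assumption: "length" means the Euclidean norm of the vector.
        length = math.sqrt(sum(v * v for v in weights.values()))
        if length > 0:
            weights = {t: v / length for t, v in weights.items()}
    elif norm_code != "x":
        raise ValueError(f"unknown normalization code {norm_code!r}")
    return weights
```

           With the default "xxx" specification every factor is 1 except
           the raw term frequency, so the returned weights equal the word
           counts.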

       categories()
           In a list context returns a list of all Category objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       documents()
           In a list context returns a list of all Document objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       document()
           Given a document name, returns the Document object with that name,
           or "undef" if no such Document object exists in this KnowledgeSet.

       features()
           Returns a FeatureSet object which represents the features of all
           the documents in this KnowledgeSet.

       verbose()
           Returns the "verbose" parameter of this KnowledgeSet, or sets it
           with an optional argument.

       scan_stats()
           Scans all the documents of a Collection and returns a hash
           reference containing several statistics about the Collection.  (XXX
           need to describe stats)

       scan_features()
           This method scans through a Collection object and determines the
           "best" features (words) to use when loading the documents and
           training the Learner.  This process is known as "feature
           selection", and it's a very important part of categorization.

           The Collection object should be specified as a "collection"
           parameter, or by giving the arguments to pass to the Collection's
           new() method.

           The process of feature selection is governed by the
           "feature_selection" and "features_kept" parameters given to the
           KnowledgeSet's new() method.

           This method returns the features as a FeatureVector whose values
           are the "quality" of each feature, by whatever measure the
           "feature_selection" parameter specifies.  Normally you won't need
           to use the return value, because this FeatureVector will become the
           "use_features" parameter of any Document objects created by this
           KnowledgeSet.
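
           As a rough illustration of how "document_frequency" selection
           and the three forms of "features_kept" fit together, here is a
           minimal Python sketch (not the module's Perl code).  The function
           name is hypothetical.  It assumes that features found in more
           documents rank higher, that a fractional "features_kept" is
           truncated to a whole number of features, and that ties are
           broken alphabetically; none of these details is specified above.

```python
from collections import Counter

def select_by_document_frequency(documents, features_kept=0.2):
    """Return {feature: document frequency} for the features that are kept.

    documents:     iterable of word lists, one list per document
    features_kept: 0 keeps everything, a value between 0 and 1 keeps that
                   fraction of the features, and a positive integer keeps
                   that many features
    """
    # Count, for each word, the number of documents it appears in.
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))

    if features_kept == 0:
        return dict(doc_freq)          # no feature selection

    if features_kept < 1:
        keep = int(len(doc_freq) * features_kept)   # assumption: truncate
    else:
        keep = int(features_kept)

    # Assumption: higher document frequency means a "better" feature.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])
```

           The returned mapping plays the role of the FeatureVector
           described above: its keys are the kept features and its values
           are their "quality" scores.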

       save_features()
           Given the name of a file, this method writes the features (as
           determined by the "scan_features" method) to the file.

       restore_features()
           Given the name of a file written by "save_features", loads the
           features from that file and passes them as the "use_features"
           parameter for any Document objects created in the future by this
           KnowledgeSet.

       read()
           Iterates through a Collection of documents and adds them to the
           KnowledgeSet.  The Collection can be specified using a "collection"
           parameter - otherwise, specify the arguments to pass to the new()
           method of the Collection class.

       load()
           This method can do feature selection and load a Collection in one
           step (though it currently uses two steps internally).

       add_document()
           Given a Document object as an argument, this method will add it,
           along with any categories it belongs to, to the KnowledgeSet.

       make_document()
           This method will create a Document object with the given data and
           then call add_document() to add it to the KnowledgeSet.  A
           "categories" parameter should specify an array reference containing
           a list of categories by name.  These are the categories that the
           document belongs to.  Any other parameters will be passed to the
           Document class's new() method.

       finish()
           This method will be called prior to training the Learner.  Its
           purpose is to perform any operations (such as feature vector
           weighting) that may require examination of the entire KnowledgeSet.

       weigh_features()
           This method will be called during finish() to adjust the weights of
           the features according to the "tfidf_weighting" parameter.

       document_frequency()
           Given a single feature (word) as an argument, this method will
           return the number of documents in the KnowledgeSet that contain
           that feature.

       partition()
           Divides the KnowledgeSet into several subsets.  This may be useful
           for performing cross-validation.  The relative sizes of the subsets
           should be passed as arguments.  For example, to split the
           KnowledgeSet into four KnowledgeSets of equal size, pass the
           arguments .25, .25, .25 (the final size is 1 minus the sum of the
           other sizes).  The partitions will be returned as a list.
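
           The size arithmetic can be sketched as follows.  This is a
           Python illustration only, not the module's Perl code: the
           function signature and the "seed" argument are hypothetical, and
           whether the real method shuffles the documents before splitting
           is an assumption made here.

```python
import random

def partition(items, *sizes, seed=None):
    """Split items into len(sizes) + 1 parts by relative size.

    The last part's size is implied: 1 minus the sum of the given sizes.
    """
    if any(s <= 0 for s in sizes) or sum(sizes) >= 1:
        raise ValueError("sizes must be positive and sum to less than 1")

    # Assumption: documents are shuffled so each part is a random sample.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    total = len(shuffled)
    parts, start, cumulative = [], 0, 0.0
    for size in sizes:
        cumulative += size
        end = round(cumulative * total)   # boundary of this part
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])        # the implied final part
    return parts
```

           Calling partition(range(8), .25, .25, .25) yields four lists
           of two items each, matching the four-way split in the example
           above.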

AUTHOR

       Ken Williams, ken@mathforum.org

COPYRIGHT

       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.
