AI::Categorizer::KnowledgeSet(3)  User Contributed Perl Documentation  AI::Categorizer::KnowledgeSet(3)

NAME
       AI::Categorizer::KnowledgeSet - Encapsulates a set of documents

SYNOPSIS
         use AI::Categorizer::KnowledgeSet;
         my $k  = AI::Categorizer::KnowledgeSet->new(...parameters...);
         my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
         $nb->train(knowledge_set => $k);

DESCRIPTION
       The KnowledgeSet class provides an interface to a set of documents, a
       set of categories, and a mapping between the two. Many parameters for
       controlling the processing of documents are managed by the
       KnowledgeSet class.

METHODS
   new()
       Creates a new KnowledgeSet and returns it. Accepts the following
       parameters:

       load
           If a "load" parameter is present, the "load()" method will be
           invoked immediately. If the "load" parameter is a string, it will
           be passed as the "path" parameter to "load()". If the "load"
           parameter is a hash reference, its contents will be passed as the
           parameters to "load()".

       categories
           An optional reference to an array of Category objects
           representing the complete set of categories in a KnowledgeSet.
           If used, the "documents" parameter should also be specified.

       documents
           An optional reference to an array of Document objects
           representing the complete set of documents in a KnowledgeSet. If
           used, the "categories" parameter should also be specified.

       features_kept
           A number indicating how many features (words) should be
           considered when training the Learner or categorizing new
           documents. May be specified as a positive integer (e.g. 2000)
           indicating the absolute number of features to be kept, or as a
           decimal between 0 and 1 (e.g. 0.2) indicating the fraction of the
           total number of features to be kept, or as 0 to indicate that no
           feature selection should be done and that the entire set of
           features should be used. The default is 0.2.

       feature_selection
           A string indicating the type of feature selection that should be
           performed. Currently the only option is also the default option:
           "document_frequency".

       tfidf_weighting
           Specifies how document word counts should be converted to vector
           values. Uses the three-character specification strings from
           Salton & Buckley's paper "Term-weighting approaches in automatic
           text retrieval". The three characters indicate the three factors
           that will be multiplied for each feature to find the final vector
           value for that feature. The default weighting is "xxx".

           The first character specifies the "term frequency" component,
           which can take the following values:

           b   Binary weighting - 1 for terms present in a document, 0 for
               terms absent.

           t   Raw term frequency - equal to the number of times a feature
               occurs in the document.

           x   A synonym for 't'.

           n   Normalized term frequency - 0.5 + 0.5 * t/max(t). This is
               the same as the 't' specification, but with term frequency
               normalized to lie between 0.5 and 1.

           The second character specifies the "collection frequency"
           component, which can take the following values:

           f   Inverse document frequency - multiply term "t"'s value by
               "log(N/n)", where "N" is the total number of documents in the
               collection, and "n" is the number of documents in which term
               "t" is found.

           p   Probabilistic inverse document frequency - multiply term
               "t"'s value by "log((N-n)/n)" (same variable meanings as
               above).

           x   No change - multiply by 1.

           The third character specifies the "normalization" component,
           which can take the following values:

           c   Apply cosine normalization - multiply by
               1/length(document_vector).

           x   No change - multiply by 1.

           The three components may alternatively be specified by the
           "term_weighting", "collection_weighting", and
           "normalize_weighting" parameters respectively.

       verbose
           If set to a true value, some status/debugging information will be
           output on "STDOUT".
   categories()
       In a list context returns a list of all Category objects in this
       KnowledgeSet. In a scalar context returns the number of such objects.

   documents()
       In a list context returns a list of all Document objects in this
       KnowledgeSet. In a scalar context returns the number of such objects.

   document()
       Given a document name, returns the Document object with that name, or
       "undef" if no such Document object exists in this KnowledgeSet.

   features()
       Returns a FeatureSet object which represents the features of all the
       documents in this KnowledgeSet.

   verbose()
       Returns the "verbose" parameter of this KnowledgeSet, or sets it with
       an optional argument.

   scan_stats()
       Scans all the documents of a Collection and returns a hash reference
       containing several statistics about the Collection. (XXX need to
       describe stats)

   scan_features()
       This method scans through a Collection object and determines the
       "best" features (words) to use when loading the documents and
       training the Learner. This process is known as "feature selection",
       and it's a very important part of categorization.

       The Collection object should be specified as a "collection"
       parameter, or by giving the arguments to pass to the Collection's
       "new()" method.

       The process of feature selection is governed by the
       "feature_selection" and "features_kept" parameters given to the
       KnowledgeSet's "new()" method.

       This method returns the features as a FeatureVector whose values are
       the "quality" of each feature, by whatever measure the
       "feature_selection" parameter specifies. Normally you won't need to
       use the return value, because this FeatureVector will become the
       "use_features" parameter of any Document objects created by this
       KnowledgeSet.

   save_features()
       Given the name of a file, this method writes the features (as
       determined by the "scan_features" method) to the file.

   restore_features()
       Given the name of a file written by "save_features", loads the
       features from that file and passes them as the "use_features"
       parameter for any Document objects created in the future by this
       KnowledgeSet.
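
       For example, a typical round trip looks like this (the file name and
       the $collection variable are hypothetical):

           # First run: scan the collection, select features, save them
           $k->scan_features(collection => $collection);
           $k->save_features('features.txt');

           # Later run: reuse the saved selection instead of re-scanning
           $k->restore_features('features.txt');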

   read()
       Iterates through a Collection of documents and adds them to the
       KnowledgeSet. The Collection can be specified using a "collection"
       parameter - otherwise, specify the arguments to pass to the "new()"
       method of the Collection class.

   load()
       This method can do feature selection and load a Collection in one
       step (though it currently uses two steps internally).

   add_document()
       Given a Document object as an argument, this method will add it,
       along with any categories it belongs to, to the KnowledgeSet.

   make_document()
       This method will create a Document object with the given data and
       then call "add_document()" to add it to the KnowledgeSet. A
       "categories" parameter should specify an array reference containing a
       list of categories by name. These are the categories that the
       document belongs to. Any other parameters will be passed to the
       Document class's "new()" method.
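
       For example (the field values here are hypothetical; "name" and
       "content" are simply passed through to the Document class's "new()"
       method):

           $k->make_document(
               name       => 'sample-1',
               categories => ['sports', 'news'],
               content    => 'The home team won in overtime.',
           );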

   finish()
       This method will be called prior to training the Learner. Its purpose
       is to perform any operations (such as feature vector weighting) that
       may require examination of the entire KnowledgeSet.

   weigh_features()
       This method will be called during "finish()" to adjust the weights of
       the features according to the "tfidf_weighting" parameter.

   document_frequency()
       Given a single feature (word) as an argument, this method will return
       the number of documents in the KnowledgeSet that contain that
       feature.

   partition()
       Divides the KnowledgeSet into several subsets. This may be useful for
       performing cross-validation. The relative sizes of the subsets should
       be passed as arguments. For example, to split the KnowledgeSet into
       four KnowledgeSets of equal size, pass the arguments .25, .25, .25
       (the final size is 1 minus the sum of the other sizes). The
       partitions will be returned as a list.
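
       For example, a single 70/30 train/test split (variable names
       hypothetical):

           my ($train, $test) = $k->partition(0.7);   # test size 0.3 implied

           # Or four equal folds for cross-validation:
           my @folds = $k->partition(0.25, 0.25, 0.25);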

AUTHOR
       Ken Williams, ken@mathforum.org

COPYRIGHT
       Copyright 2000-2003 Ken Williams. All rights reserved.

       This library is free software; you can redistribute it and/or modify
       it under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3)

perl v5.30.1                      2020-01-29   AI::Categorizer::KnowledgeSet(3)