AI::Categorizer(3)    User Contributed Perl Documentation   AI::Categorizer(3)

NAME

       AI::Categorizer - Automatic Text Categorization

SYNOPSIS

        use AI::Categorizer;
        my $c = AI::Categorizer->new(...parameters...);

        # Run a complete experiment - training on a corpus, testing on a test
        # set, printing a summary of results to STDOUT
        $c->run_experiment;

        # Or, run the parts of $c->run_experiment separately
        $c->scan_features;
        $c->read_training_set;
        $c->train;
        $c->evaluate_test_set;
        print $c->stats_table;

        # After training, use the Learner for categorization
        my $l = $c->learner;
        while (...) {
          my $d = ...create a document...
          my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
          # Parenthesize join() so the trailing newline is not joined in as a category
          print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
          print "Best category: ", $hypothesis->best_category, "\n";
        }

DESCRIPTION

       "AI::Categorizer" is a framework for automatic text categorization.  It
       consists of a collection of Perl modules that implement common
       categorization tasks, and a set of defined relationships among those
       modules.  The various details are flexible - for example, you can
       choose what categorization algorithm to use, what features (words or
       otherwise) of the documents should be used (or how to automatically
       choose these features), what format the documents are in, and so on.

       The basic process of using this module will typically involve obtaining
       a collection of pre-categorized documents, creating a "knowledge set"
       representation of those documents, training a categorizer on that
       knowledge set, and saving the trained categorizer for later use.  There
       are several ways to carry out this process.  The top-level
       "AI::Categorizer" module provides an umbrella class for high-level
       operations, or you may use the interfaces of the individual classes in
       the framework.

       A simple sample script that reads a training corpus, trains a
       categorizer, and tests the categorizer on a test corpus is distributed
       as eg/demo.pl.

       Disclaimer: the results of any of the machine learning algorithms are
       far from infallible (close to fallible?).  Categorization of documents
       is often a difficult task even for humans well-trained in the
       particular domain of knowledge, and there are many things a human would
       consider that none of these algorithms consider.  These are only
       statistical tests - at best they are neat tricks or helpful assistants,
       and at worst they are totally unreliable.  If you plan to use this
       module for anything really important, human supervision is essential,
       both of the categorization process and the final results.

       For the usage details, please see the documentation of each individual
       module.

FRAMEWORK COMPONENTS

       This section explains the major pieces of the "AI::Categorizer" object
       framework.  We give a conceptual overview, but don't get into any of
       the details about interfaces or usage.  See the documentation for the
       individual classes for more details.

       A diagram of the various classes in the framework can be seen in
       "doc/classes-overview.png", and a more detailed view of the same thing
       can be seen in "doc/classes.png".

   Knowledge Sets
       A "knowledge set" is defined as a collection of documents, together
       with some information on the categories each document belongs to.  Note
       that this term is somewhat specific to this project - other sources may
       call it a "training corpus", or "prior knowledge".  A knowledge set
       also contains some information on how documents will be parsed and how
       their features (words) will be extracted and turned into meaningful
       representations.  In this sense, a knowledge set represents not only a
       collection of data, but a particular view on that data.

       A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
       class.  Before you can start playing with categorizers, you will have
       to start playing with knowledge sets, so that the categorizers have
       some data to train on.  See the documentation for the
       "AI::Categorizer::KnowledgeSet" module for information on its
       interface.

       Feature selection

       Deciding which features are the most important is a very large part of
       the categorization task - you cannot simply consider all the words in
       all the documents when training, and all the words in the document
       being categorized.  There are two main reasons for this - first, it
       would mean that your training and categorizing processes would take
       forever and use tons of memory, and second, the significant stuff of
       the documents would get lost in the "noise" of the insignificant stuff.

       The process of selecting the most important features in the training
       set is called "feature selection".  It is managed by the
       "AI::Categorizer::KnowledgeSet" class, and you will find the details of
       feature selection processes in that class's documentation.
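
       To make the idea of feature selection concrete, here is a minimal
       standalone sketch of one simple criterion, document frequency: keep
       the features that occur in the largest number of training documents.
       The helper name "select_by_document_frequency" and the plain-hash
       document representation are assumptions made for illustration; this is
       not the code used by "AI::Categorizer::KnowledgeSet".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): keep the $keep
# features that occur in the most documents.  Each document is a hash
# ref of feature => count.
sub select_by_document_frequency {
    my ($docs, $keep) = @_;

    # Count the number of documents each feature appears in.
    my %df;
    for my $doc (@$docs) {
        $df{$_}++ for keys %$doc;
    }

    # Rank by descending document frequency, breaking ties by name so
    # the result is deterministic, then truncate to the requested size.
    my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
    $#ranked = $keep - 1 if @ranked > $keep;
    return @ranked;
}

my @docs = (
    { perl => 3, module => 1 },
    { perl => 1, categorizer => 2 },
    { perl => 2, module => 4, weka => 1 },
);
my @selected = select_by_document_frequency(\@docs, 2);
print "Selected features: @selected\n";
```

       Other criteria can be swapped in by changing how the features are
       ranked; the truncation step stays the same.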

   Collections
       Because documents may be stored in lots of different formats, a
       "collection" class has been created as an abstraction of a stored set
       of documents, together with a way to iterate through the set and return
       Document objects.  A knowledge set contains a single collection object.
       A "Categorizer" doing a complete test run generally contains two
       collections, one for training and one for testing.  A "Learner" can
       mass-categorize a collection.

       The "AI::Categorizer::Collection" class and its subclasses instantiate
       the idea of a collection in this sense.

   Documents
       Each document is represented by an "AI::Categorizer::Document" object,
       or an object of one of its subclasses.  Each document class contains
       methods for turning a bunch of data into a Feature Vector.  Each
       document also has a method to report which categories it belongs to.

   Categories
       Each category is represented by an "AI::Categorizer::Category" object.
       Its main purpose is to keep track of which documents belong to it,
       though you can also examine statistical properties of an entire
       category, such as obtaining a Feature Vector representing an
       amalgamation of all the documents that belong to it.

   Machine Learning Algorithms
       There are lots of different ways to make the inductive leap from the
       training documents to unseen documents.  The Machine Learning community
       has studied many algorithms for this purpose.  To allow flexibility in
       choosing and configuring categorization algorithms, each such algorithm
       is a subclass of "AI::Categorizer::Learner".  There are currently four
       categorizers included in the distribution:

       AI::Categorizer::Learner::NaiveBayes
           A pure-perl implementation of a Naive Bayes classifier.  No
           dependencies on external modules or other resources.  Naive Bayes
           is usually very fast to train and fast to make categorization
           decisions, but isn't always the most accurate categorizer.

       AI::Categorizer::Learner::SVM
           An interface to Corey Spencer's "Algorithm::SVM", which implements
           a Support Vector Machine classifier.  SVMs can take a while to
           train (though in certain conditions there are optimizations to make
           them quite fast), but are pretty quick to categorize.  They often
           have very good accuracy.

       AI::Categorizer::Learner::DecisionTree
           An interface to "AI::DecisionTree", which implements a Decision
           Tree classifier.  Decision Trees generally take longer to train
           than Naive Bayes or SVM classifiers, but they are also quite fast
           when categorizing.  Decision Trees have the advantage that you can
           scrutinize the structures of trained decision trees to see how
           decisions are being made.

       AI::Categorizer::Learner::Weka
           An interface to version 2 of the Weka Knowledge Analysis system
           that lets you use any of the machine learners it defines.  This
           gives you access to lots and lots of machine learning algorithms in
           use by machine learning researchers.  The main drawback is that
           Weka tends to be quite slow and use a lot of memory, and the
           current interface between Weka and "AI::Categorizer" is a bit
           clumsy.

       Other machine learning methods that may be implemented soonish include
       Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
       combiner for ensemble learning.  No timetable for their creation has
       yet been set.

       Please see the documentation of these individual modules for more
       details on their guts and quirks.  See the "AI::Categorizer::Learner"
       documentation for a description of the general categorizer interface.

       If you wish to create your own classifier, you should inherit from
       "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean",
       which are abstract classes that manage some of the work for you.
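
       As a rough illustration of what "training" and "categorizing" mean for
       the simplest of these algorithms, the following standalone sketch
       implements a tiny multinomial Naive Bayes model with add-one
       smoothing.  It is only a teaching aid under stated assumptions: the
       function names "nb_train" and "nb_scores" and the data layout are
       invented here, and the code is not taken from
       "AI::Categorizer::Learner::NaiveBayes" and does not use the Learner
       interface.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the counts a multinomial Naive Bayes
# model needs from a list of [category, { feature => count }] pairs.
sub nb_train {
    my (@examples) = @_;
    my (%doc_count, %feature_count, %total, %vocab);
    for my $ex (@examples) {
        my ($cat, $features) = @$ex;
        $doc_count{$cat}++;
        for my $f (keys %$features) {
            $feature_count{$cat}{$f} += $features->{$f};
            $total{$cat}             += $features->{$f};
            $vocab{$f} = 1;
        }
    }
    return {
        docs     => \%doc_count,
        features => \%feature_count,
        total    => \%total,
        vocab    => scalar(keys %vocab),
        n        => scalar(@examples),
    };
}

# Return category => log-probability score for one document, using
# add-one (Laplace) smoothing so unseen features do not zero a score.
sub nb_scores {
    my ($model, $features) = @_;
    my %score;
    for my $cat (keys %{ $model->{docs} }) {
        my $s           = log($model->{docs}{$cat} / $model->{n});
        my $denominator = ($model->{total}{$cat} || 0) + $model->{vocab};
        for my $f (keys %$features) {
            my $count = $model->{features}{$cat}{$f} || 0;
            $s += $features->{$f} * log(($count + 1) / $denominator);
        }
        $score{$cat} = $s;
    }
    return %score;
}

my $model = nb_train(
    [ sports  => { ball  => 2, team   => 1 } ],
    [ sports  => { team  => 2, score  => 1 } ],
    [ finance => { stock => 3, market => 1 } ],
);
my %score = nb_scores($model, { team => 1, ball => 1 });
my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
print "Best category: $best\n";
```

       Both steps are a single pass over hash counts, which is in line with
       the remark above that Naive Bayes is fast to train and fast to
       categorize.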

   Feature Vectors
       Most categorization algorithms don't deal directly with documents'
       data; instead, they deal with a vector representation of a document's
       features.  The features may be any properties of the document that seem
       helpful for determining its category, but they are usually some version
       of the "most important" words in the document.  A list of features and
       their weights in each document is encapsulated by the
       "AI::Categorizer::FeatureVector" class.  You may think of this class as
       roughly analogous to a Perl hash, where the keys are the names of
       features and the values are their weights.
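
       The hash analogy can be taken literally for a quick experiment.  The
       sketch below uses plain hash references in place of FeatureVector
       objects and computes the cosine similarity of two documents; the
       function names are assumptions for illustration and are not methods of
       "AI::Categorizer::FeatureVector".

```perl
use strict;
use warnings;

# Euclidean length of a feature => weight hash.
sub vector_length {
    my ($v) = @_;
    my $sum = 0;
    $sum += $_ ** 2 for values %$v;
    return sqrt($sum);
}

# Dot product; only features present in both vectors contribute.
sub dot_product {
    my ($x, $y) = @_;
    # Iterate over the smaller hash to save lookups.
    ($x, $y) = ($y, $x) if keys(%$x) > keys(%$y);
    my $sum = 0;
    for my $f (keys %$x) {
        $sum += $x->{$f} * $y->{$f} if exists $y->{$f};
    }
    return $sum;
}

# Cosine of the angle between two vectors; 0 if either is empty.
sub cosine_similarity {
    my ($x, $y) = @_;
    my $norm = vector_length($x) * vector_length($y);
    return $norm ? dot_product($x, $y) / $norm : 0;
}

my %doc_a = (perl => 3, module => 4);
my %doc_b = (perl => 3, module => 4, weka => 0);
my %doc_c = (weka => 2);
printf "a vs b: %.2f\n", cosine_similarity(\%doc_a, \%doc_b);
printf "a vs c: %.2f\n", cosine_similarity(\%doc_a, \%doc_c);
```

       Because absent keys count as zero weight, a sparse hash only needs to
       store the features a document actually contains.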

   Hypotheses
       The result of asking a categorizer to categorize a previously unseen
       document is called a hypothesis, because it is some kind of
       "statistical guess" of what categories this document should be assigned
       to.  Since you may be interested in any of several pieces of
       information about the hypothesis (for instance, which categories were
       assigned, which category was the single most likely category, the
       scores assigned to each category, etc.), the hypothesis is returned as
       an object of the "AI::Categorizer::Hypothesis" class, and you can use
       its object methods to get information about the hypothesis.  See its
       class documentation for the details.
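
       The kinds of questions a hypothesis answers can be sketched with a
       plain data structure.  In the example below, "summarize_scores" is a
       hypothetical helper, and the use of a single score threshold to decide
       which categories are assigned is an assumption of this sketch; it is
       not the "AI::Categorizer::Hypothesis" class.

```perl
use strict;
use warnings;

# Hypothetical helper, not the AI::Categorizer::Hypothesis class: given
# category => score pairs and a threshold, report the best category,
# the categories that meet the threshold, and a copy of the scores.
sub summarize_scores {
    my ($scores, $threshold) = @_;

    # Highest score first; ties are broken by name for determinism.
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b }
                 keys %$scores;
    return {
        best       => $ranked[0],
        categories => [ grep { $scores->{$_} >= $threshold } @ranked ],
        scores     => { %$scores },
    };
}

my $h = summarize_scores(
    { sports => 0.82, finance => 0.55, travel => 0.10 },
    0.5,
);
print "Assigned categories: ", join(', ', @{ $h->{categories} }), "\n";
print "Best category: $h->{best}\n";
```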

   Experiments
       The "AI::Categorizer::Experiment" class helps you organize the results
       of categorization experiments.  As you get lots of categorization
       results (Hypotheses) back from the Learner, you can feed these results
       to the Experiment class, along with the correct answers.  When all
       results have been collected, you can get a report on accuracy,
       precision, recall, F1, and so on, with both micro-averaging and
       macro-averaging over categories.  We use the "Statistics::Contingency"
       module from CPAN to manage the calculations. See the docs for
       "AI::Categorizer::Experiment" for more details.
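
       The difference between micro-averaging and macro-averaging is easiest
       to see in a small calculation.  The standalone sketch below computes
       F1 both ways from per-category counts of true positives, false
       positives, and false negatives.  The function names and the input
       layout are assumptions for illustration; the module itself delegates
       these calculations to "Statistics::Contingency", whose interface is
       not shown here.

```perl
use strict;
use warnings;

# Precision, recall and F1 from true-positive, false-positive and
# false-negative counts.  Empty denominators are treated as 0.
sub prf {
    my ($tp, $fp, $fn) = @_;
    my $p = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
    my $r = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
    my $f = ($p + $r)   ? 2 * $p * $r / ($p + $r) : 0;
    return ($p, $r, $f);
}

# $counts maps category => { tp => .., fp => .., fn => .. }.
# Micro-averaging pools the counts before computing F1;
# macro-averaging computes F1 per category and takes the mean.
sub averaged_f1 {
    my ($counts) = @_;
    my ($tp, $fp, $fn, $f1_sum) = (0, 0, 0, 0);
    for my $c (values %$counts) {
        $tp += $c->{tp};
        $fp += $c->{fp};
        $fn += $c->{fn};
        $f1_sum += (prf(@$c{qw(tp fp fn)}))[2];
    }
    my $n = keys %$counts;
    return {
        micro => (prf($tp, $fp, $fn))[2],
        macro => $n ? $f1_sum / $n : 0,
    };
}

my $avg = averaged_f1({
    common => { tp => 90, fp => 10, fn => 10 },
    rare   => { tp => 1,  fp => 1,  fn => 3  },
});
printf "micro F1 %.3f, macro F1 %.3f\n", @$avg{qw(micro macro)};
```

       With these made-up counts the micro-average is dominated by the large
       category, while the macro-average is pulled down by the poorly handled
       rare one, which is why reports usually give both.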

METHODS

       new()
           Creates a new Categorizer object and returns it.  Accepts lots of
           parameters controlling behavior.  In addition to the parameters
           listed here, you may pass any parameter accepted by any class that
           we create internally (the KnowledgeSet, Learner, Experiment, or
           Collection classes), or any class that they create.  This is
           managed by the "Class::Container" module, so see its documentation
           for the details of how this works.

           The specific parameters accepted here are:

           progress_file
               A string that indicates a place where objects will be saved
               during several of the methods of this class.  The default value
               is the string "save", which means files like
               "save-01-knowledge_set" will get created.  The exact names of
               these files may change in future releases, since they're just
               used internally to resume where we last left off.

           verbose
               If true, a few status messages will be printed during
               execution.

           training_set
               Specifies the "path" parameter that will be fed to the
               KnowledgeSet's "scan_features()" and "read()" methods during
               our "scan_features()" and "read_training_set()" methods.

           test_set
               Specifies the "path" parameter that will be used when creating
               a Collection during the "evaluate_test_set()" method.

           data_root
               A shortcut for setting the "training_set", "test_set", and
               "category_file" parameters separately.  Sets "training_set" to
               "$data_root/training", "test_set" to "$data_root/test", and
               "category_file" (used by some of the Collection classes) to
               "$data_root/cats.txt".

       learner()
           Returns the Learner object associated with this Categorizer.
           Before "train()", the Learner will of course not be trained yet.

       knowledge_set()
           Returns the KnowledgeSet object associated with this Categorizer.
           If "read_training_set()" has not yet been called, the KnowledgeSet
           will not yet be populated with any training data.

       run_experiment()
           Runs a complete experiment on the training and testing data,
           reporting the results on "STDOUT".  Internally, this is just a
           shortcut for calling the "scan_features()", "read_training_set()",
           "train()", and "evaluate_test_set()" methods, then printing the
           value of the "stats_table()" method.

       scan_features()
           Scans the Collection specified in the "training_set" parameter to
           determine the set of features (words) that will be considered when
           training the Learner.  Internally, this calls the "scan_features()"
           method of the KnowledgeSet, then saves a list of the KnowledgeSet's
           features for later use.

           This step is not strictly necessary, but it can dramatically reduce
           memory requirements if you scan for features before reading the
           entire corpus into memory.

       read_training_set()
           Populates the KnowledgeSet with the data specified in the
           "training_set" parameter.  Internally, this calls the "read()"
           method of the KnowledgeSet.  Returns the KnowledgeSet.  Also saves
           the KnowledgeSet object for later use.

       train()
           Calls the Learner's "train()" method, passing it the KnowledgeSet
           created during "read_training_set()".  Returns the Learner object.
           Also saves the Learner object for later use.

       evaluate_test_set()
           Creates a Collection based on the value of the "test_set"
           parameter, and calls the Learner's "categorize_collection()" method
           using this Collection.  Returns the resultant Experiment object.
           Also saves the Experiment object for later use in the
           "stats_table()" method.

       stats_table()
           Returns the value of the Experiment's (as created by
           "evaluate_test_set()") "stats_table()" method.  This is a string
           that shows various statistics about the
           accuracy/precision/recall/F1/etc. of the assignments made during
           testing.
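
       The "data_root" shortcut described under "new()" can be pictured as a
       small parameter expansion.  The sketch below is a standalone
       illustration, not the module's constructor: "expand_data_root" is a
       hypothetical helper, and letting an explicitly passed "training_set",
       "test_set", or "category_file" take precedence over the derived value
       is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented defaults.  Assumption:
# parameters passed explicitly win over the values derived from
# data_root.  The data_root key itself is consumed.
sub expand_data_root {
    my (%args) = @_;
    if (defined(my $root = delete $args{data_root})) {
        $args{training_set}  //= "$root/training";
        $args{test_set}      //= "$root/test";
        $args{category_file} //= "$root/cats.txt";
    }
    return %args;
}

my %params = expand_data_root(
    data_root => '/corpus',
    test_set  => '/elsewhere/test',
);
print "$_ => $params{$_}\n" for sort keys %params;
```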

HISTORY

       This module is a revised and redesigned version of the previous
       "AI::Categorize" module by the same author.  Note the added 'r' in the
       new name.  The older module has a different interface, and no attempt
       at backward compatibility has been made - that's why I changed the
       name.

       You can have both "AI::Categorize" and "AI::Categorizer" installed at
       the same time on the same machine, if you want.  They don't know about
       each other or use conflicting namespaces.

AUTHOR

       Ken Williams <ken@mathforum.org>

       Discussion about this module can be directed to the perl-AI list at
       <perl-ai@perl.org>.  For more info about the list, see
       http://lists.perl.org/showlist.cgi?name=perl-ai

REFERENCES

       An excellent introduction to the academic field of Text Categorization
       is Fabrizio Sebastiani's "Machine Learning in Automated Text
       Categorization": ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp.
       1-47.

COPYRIGHT

       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This distribution is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself.  These terms apply to
       every file in the distribution - if you have questions, please contact
       the author.

perl v5.32.0                      2020-07-28                AI::Categorizer(3)