AI::Categorizer::Learner::NaiveBayes(3)  User Contributed Perl Documentation  AI::Categorizer::Learner::NaiveBayes(3)

AI::Categorizer::Learner::NaiveBayes - Naive Bayes Algorithm For
AI::Categorizer

    use AI::Categorizer::Learner::NaiveBayes;

    # Here $k is an AI::Categorizer::KnowledgeSet object

    my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
    $nb->train(knowledge_set => $k);
    $nb->save_state('filename');

    # ... time passes ...

    $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
    my $c = AI::Categorizer::Collection::Files->new( path => ... );
    while (my $document = $c->next) {
        my $hypothesis = $nb->categorize($document);
        print "Best assigned category: ", $hypothesis->best_category, "\n";
        print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
    }

This is an implementation of the Naive Bayes decision-making algorithm,
applied to the task of document categorization (as defined by the
AI::Categorizer module). See AI::Categorizer for a complete
description of the interface.

This module is now a wrapper around the stand-alone
"Algorithm::NaiveBayes" module. I moved the discussion of Bayes'
Theorem into that module's documentation.

This class inherits from the "AI::Categorizer::Learner" class, so all
of its methods are available unless explicitly mentioned here.

new()
Creates a new Naive Bayes Learner and returns it. In addition to the
parameters accepted by the "AI::Categorizer::Learner" class, the Naive
Bayes subclass accepts the following parameters:

• threshold

Sets the score threshold for category membership. The default is
currently 0.3. Set the threshold lower to assign more categories
per document, or higher to assign fewer. This can be an
effective way to trade off between precision and recall.

threshold()
Returns the current threshold value. If given an optional numeric
argument, sets the threshold to that value.

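As a rough illustration of how the threshold trades precision against
recall, the sketch below applies a cutoff to a set of made-up category
scores. The categories_above() helper, the scores, and the choice of
">=" for the comparison are assumptions made for this example; none of
them is taken from AI::Categorizer, whose own scoring may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): return the categories
# whose score is at least $threshold, highest score first.
sub categories_above {
    my ($scores, $threshold) = @_;
    my @kept = grep { $scores->{$_} >= $threshold } keys %$scores;
    my @sorted = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } @kept;
    return @sorted;
}

# Made-up scores for a single document.
my %scores = (sports => 0.62, politics => 0.31, weather => 0.12);

# A lower threshold assigns more categories, a higher one assigns fewer.
for my $threshold (0.1, 0.3, 0.5) {
    my @assigned = categories_above(\%scores, $threshold);
    print "threshold $threshold: @assigned\n";
}
```

With these scores, raising the threshold from 0.1 to 0.5 shrinks the
assigned set from three categories to one.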
train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in categorizing
documents. The "knowledge_set" parameter must provide an object of the
class "AI::Categorizer::KnowledgeSet" (or a subclass thereof),
populated with lots of documents and categories. See
AI::Categorizer::KnowledgeSet for the details of how to create such an
object.

categorize($document)
Returns an "AI::Categorizer::Hypothesis" object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See AI::Categorizer::Hypothesis for more
details on how to use this object.

save_state($path)
Saves the categorizer for later use. This method is inherited from
"AI::Categorizer::Storable".

The various probabilities used in the Naive Bayes calculations (see
Algorithm::NaiveBayes) are estimated directly from the training
documents. For instance, if there are 5,000 total tokens (words) in the
"sports" training documents and 200 of them are the word "curling",
then P(curling|sports) = 200/5,000 = 0.04. If there are 10,000 total
tokens in the training corpus and 5,000 of them are in documents
belonging to the category "sports", then P(sports) = 5,000/10,000 =
0.5.

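The arithmetic in this example can be reproduced with a few lines of
Perl. This is only a sketch of the relative-frequency estimate
described in the text; the estimate() helper and the variable names
are invented for illustration and are not part of this module.

```perl
use strict;
use warnings;

# Token counts taken from the example in the text.
my $curling_in_sports = 200;     # occurrences of "curling" in "sports" documents
my $tokens_in_sports  = 5_000;   # all tokens in "sports" documents
my $tokens_in_corpus  = 10_000;  # all tokens in the training corpus

# Hypothetical helper: relative-frequency estimate of a probability.
sub estimate {
    my ($count, $total) = @_;
    return $count / $total;
}

my $p_curling_given_sports = estimate($curling_in_sports, $tokens_in_sports);
my $p_sports               = estimate($tokens_in_sports,  $tokens_in_corpus);

printf "P(curling|sports) = %.2f\n", $p_curling_given_sports;
printf "P(sports)         = %.2f\n", $p_sports;
```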
Because the probabilities involved are often very small and we multiply
many of them together, the result is often an extremely small number.
This could cause floating-point underflow, so instead of multiplying
the actual probabilities we add the logarithms of the probabilities.
This also speeds up various calculations in the categorize() method.

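The underflow problem is easy to demonstrate. The sketch below assumes
a document of 400 tokens, each with probability 0.001; these numbers
and the two helper functions are made up for illustration and do not
reflect how this module is implemented internally.

```perl
use strict;
use warnings;

# Hypothetical helper: multiply the probabilities directly.
sub direct_product {
    my @probs = @_;
    my $product = 1;
    $product *= $_ for @probs;
    return $product;
}

# Hypothetical helper: add the natural logarithms of the probabilities.
sub log_score {
    my @probs = @_;
    my $sum = 0;
    $sum += log($_) for @probs;
    return $sum;
}

# Assumed input: 400 tokens, each with probability 0.001.
my @probs = (0.001) x 400;

# The true product is 1e-1200, far below the smallest double-precision
# number, so the direct product underflows; the sum of logs stays finite.
print  "direct product: ", direct_product(@probs), "\n";
printf "sum of logs:    %.3f\n", log_score(@probs);
```

Since the logarithm is monotonically increasing, the category with the
largest sum of logs is also the one with the largest product, so the
ranking of categories is unchanged.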
To do: more work is needed on the confidence scores. Right now the
winning category tends to dominate the scores overwhelmingly, when the
scores should probably be more evenly distributed.

Ken Williams, ken@forum.swarthmore.edu

Copyright 2000-2003 Ken Williams. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

AI::Categorizer(3), Algorithm::NaiveBayes(3)

"A re-examination of text categorization methods" by Yiming Yang
<http://www.cs.cmu.edu/~yiming/publications.html>

"On the Optimality of the Simple Bayesian Classifier under Zero-One
Loss" by Pedro Domingos
<http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>

A simple but complete example of Bayes' Theorem from Dr. Math
<http://www.mathforum.com/dr.math/problems/battisfore.03.22.99.html>

perl v5.36.0                      2023-01-19  AI::Categorizer::Learner::NaiveBayes(3)