NAME
    Algorithm::NaiveBayes - Bayesian prediction of categories

SYNOPSIS
    use Algorithm::NaiveBayes;
    my $nb = Algorithm::NaiveBayes->new;

    $nb->add_instance
      (attributes => {foo => 1, bar => 1, baz => 3},
       label => 'sports');

    $nb->add_instance
      (attributes => {foo => 2, blurp => 1},
       label => ['sports', 'finance']);

    ... repeat for several more instances, then:
    $nb->train;

    # Find results for unseen instances
    my $result = $nb->predict
      (attributes => {bar => 3, blurp => 2});

DESCRIPTION
    This module implements the classic "Naive Bayes" machine learning
    algorithm.  It is a well-studied probabilistic algorithm often used in
    automatic text categorization.  Compared to other algorithms (kNN,
    SVM, Decision Trees), it's pretty fast and reasonably competitive in
    the quality of its results.

    A paper by Fabrizio Sebastiani provides a really good introduction to
    text categorization:
    <http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf>

METHODS
    new()
        Creates a new "Algorithm::NaiveBayes" object and returns it.  The
        following parameters are accepted:

        purge
            If set to a true value, the do_purge() method will be invoked
            during train().  The default is true.  Set this to a false
            value if you'd like to be able to add additional instances
            after training and then call train() again.

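        For example, a small sketch of keeping the training data around
        so that more instances can be added after training (the attribute
        data here is purely illustrative):

            my $nb = Algorithm::NaiveBayes->new(purge => 0);
            $nb->add_instance(attributes => {foo => 1}, label => 'sports');
            $nb->train;

            # Training data was kept, so we can add more and retrain
            $nb->add_instance(attributes => {bar => 2}, label => 'finance');
            $nb->train;
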
    add_instance( attributes => HASH, label => STRING|ARRAY )
        Adds a training instance to the categorizer.  The "attributes"
        parameter contains a hash reference whose keys are string
        attributes and whose values are the weights of those attributes.
        For instance, if you're categorizing text documents, the
        attributes might be the words of the document, and the weights
        might be the number of times each word occurs in the document.

        The "label" parameter can contain a single string or an array of
        strings, with each string representing a label for this instance.
        The labels can be any arbitrary strings.  To indicate that a
        document has no applicable labels, pass an empty array reference.

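        For example, a minimal sketch of turning a document into weighted
        attributes (splitting on whitespace is only an assumption about
        preprocessing; the module itself doesn't tokenize text for you):

            my %attrs;
            $attrs{lc $_}++ for split /\s+/, "the game went to extra innings";
            $nb->add_instance(attributes => \%attrs, label => 'sports');

            # An instance with no applicable labels
            $nb->add_instance(attributes => {penguin => 1}, label => []);
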
    train()
        Calculates the probabilities that will be necessary for
        categorization using the predict() method.

    predict( attributes => HASH )
        Use this method to predict the label of an unknown instance.  The
        attributes should be of the same format as you passed to
        add_instance().  predict() returns a hash reference whose keys are
        the names of labels, and whose values are the score for each
        label.  Scores are between 0 and 1, where 0 means the label
        doesn't seem to apply to this instance, and 1 means it does.

        In practice, scores using Naive Bayes tend to be very close to 0
        or 1 because of the way normalization is performed.  I might try
        to alleviate this in future versions of the code.

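        For example, a short sketch of ranking the known labels by score
        for a new instance (the attributes are illustrative):

            my $result = $nb->predict(attributes => {bar => 3, blurp => 2});
            for my $label (sort { $result->{$b} <=> $result->{$a} } keys %$result) {
                printf "%s: %.3f\n", $label, $result->{$label};
            }
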
    labels()
        Returns a list of all the labels the object knows about (in no
        particular order), or the number of labels if called in a scalar
        context.

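        For example, a quick sketch of both contexts:

            my @labels = $nb->labels;               # e.g. ('sports', 'finance')
            my $num_labels = scalar $nb->labels;    # scalar context: a count
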
    do_purge()
        Purges training instances and their associated information from
        the NaiveBayes object.  This can save memory after training.

    purge()
        Returns true or false depending on the value of the object's
        "purge" property.  An optional boolean argument sets the property.

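        For example, a brief sketch of managing purging by hand rather
        than letting train() do it automatically:

            my $nb = Algorithm::NaiveBayes->new(purge => 0);
            # ... add instances, train, add more, retrain ...
            print "kept training data\n" unless $nb->purge;  # query the property
            $nb->purge(1);    # turn automatic purging back on for later train() calls
            $nb->do_purge;    # or discard the training data right now
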
    save_state($path)
        This object method saves the object to disk for later use.  The
        $path argument indicates the place on disk where the object
        should be saved:

            $nb->save_state($path);

    restore_state($path)
        This class method reads the file specified by $path and returns
        the object that was previously stored there using save_state():

            $nb = Algorithm::NaiveBayes->restore_state($path);

THEORY
    Bayes' Theorem is a way of inverting a conditional probability.  It
    states:

                 P(y|x) P(x)
        P(x|y) = -----------
                     P(y)

    The notation P(x|y) means "the probability of x given y".  See also
    <http://mathforum.org/dr.math/problems/battisfore.03.22.99.html> for
    a simple but complete example of Bayes' Theorem.

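    As a tiny worked check (these numbers are made up purely for
    illustration), suppose P(y|x) = 0.8, P(x) = 0.1 and P(y) = 0.2.  Then
    P(x|y) = 0.8 * 0.1 / 0.2 = 0.4, or in Perl:

        my ($p_y_given_x, $p_x, $p_y) = (0.8, 0.1, 0.2);
        my $p_x_given_y = $p_y_given_x * $p_x / $p_y;   # 0.4
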
    In this case, we want to know the probability of a given category
    given a certain string of words in a document, so we have:

                         P(words | cat) P(cat)
        P(cat | words) = ---------------------
                               P(words)

    We have applied Bayes' Theorem because P(cat | words) is a difficult
    quantity to compute directly, but P(words | cat) and P(cat) are
    accessible (see below).

    The greater the expression above, the greater the probability that
    the given document belongs to the given category.  So we want to find
    the maximum value.  We write this as

                                  P(words | cat) P(cat)
        Best category =  ArgMax   ---------------------
                       cat in cats       P(words)

    Since P(words) doesn't change over the range of categories, we can
    get rid of it.  That's good, because we didn't want to have to
    compute these values anyway.  So our new formula is:

        Best category =  ArgMax   P(words | cat) P(cat)
                       cat in cats

    Finally, we note that if "w1, w2, ... wn" are the words in the
    document, then this expression is equivalent to:

        Best category =  ArgMax   P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                       cat in cats

    That's the formula I use in my document categorization code.  The
    last step is the only non-rigorous one in the derivation, and this is
    the "naive" part of the Naive Bayes technique.  It assumes that the
    probability of each word appearing in a document is unaffected by the
    presence or absence of each other word in the document.  We assume
    this even though we know this isn't true: for example, the word
    "iodized" is far more likely to appear in a document that contains
    the word "salt" than it is to appear in a document that contains the
    word "subroutine".  Luckily, as it turns out, making this assumption
    even when it isn't true may have little effect on our results, as the
    following paper by Pedro Domingos argues:
    <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>

HISTORY
    My first implementation of a Naive Bayes algorithm was in the now-
    obsolete AI::Categorize module, first released in May 2001.  I
    replaced it with the Naive Bayes implementation in AI::Categorizer
    (note the extra 'r'), first released in July 2002.  I then extracted
    that implementation into its own module that could be used outside
    the framework, and that's what you see here.

AUTHOR
    Ken Williams, ken@mathforum.org

COPYRIGHT
    Copyright 2003-2004 Ken Williams.  All rights reserved.

    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    AI::Categorizer(3), perl.