Algorithm::NaiveBayes(3)   User Contributed Perl Documentation   Algorithm::NaiveBayes(3)

NAME

       Algorithm::NaiveBayes - Bayesian prediction of categories

SYNOPSIS

         use Algorithm::NaiveBayes;
         my $nb = Algorithm::NaiveBayes->new;

         $nb->add_instance
           (attributes => {foo => 1, bar => 1, baz => 3},
            label => 'sports');

         $nb->add_instance
           (attributes => {foo => 2, blurp => 1},
            label => ['sports', 'finance']);

         ... repeat for several more instances, then:
         $nb->train;

         # Find results for unseen instances
         my $result = $nb->predict
           (attributes => {bar => 3, blurp => 2});

DESCRIPTION

       This module implements the classic "Naive Bayes" machine learning
       algorithm.  It is a well-studied probabilistic algorithm often used
       in automatic text categorization.  Compared to other algorithms
       (kNN, SVM, Decision Trees), it's pretty fast and reasonably
       competitive in the quality of its results.

       A paper by Fabrizio Sebastiani provides a really good introduction
       to text categorization:
       <http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf>

METHODS

       new()
           Creates a new "Algorithm::NaiveBayes" object and returns it.  The
           following parameters are accepted:

           purge
               If set to a true value, the do_purge() method will be invoked
               during train().  The default is true.  Set this to a false
               value if you'd like to be able to add additional instances
               after training and then call train() again.

       add_instance( attributes => HASH, label => STRING|ARRAY )
           Adds a training instance to the categorizer.  The "attributes"
           parameter contains a hash reference whose keys are string
           attributes and whose values are the weights of those attributes.
           For instance, if you're categorizing text documents, the
           attributes might be the words of the document, and the weights
           might be the number of times each word occurs in the document.

           The "label" parameter can contain a single string or an array of
           strings, with each string representing a label for this instance.
           The labels can be arbitrary strings.  To indicate that a document
           has no applicable labels, pass an empty array reference.

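           For text documents, the attributes hash is typically built by
           counting word occurrences.  A minimal sketch in plain Perl (the
           tokenization scheme here is an illustrative assumption, not
           something the module prescribes):

```perl
# Build a word-count hash suitable for the attributes parameter
# of add_instance().
sub word_counts {
    my ($text) = @_;
    my %counts;
    # Naive tokenization: lowercase, split on non-word characters.
    $counts{$_}++ for grep { length } split /\W+/, lc $text;
    return \%counts;
}

my $attrs = word_counts("The iodized salt, the salt");
# $attrs now holds { the => 2, iodized => 1, salt => 2 }
# $nb->add_instance(attributes => $attrs, label => 'cooking');
```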
       train()
           Calculates the probabilities that will be necessary for
           categorization using the predict() method.

       predict( attributes => HASH )
           Use this method to predict the label of an unknown instance.  The
           attributes should be in the same format as you passed to
           add_instance().  predict() returns a hash reference whose keys
           are the names of labels, and whose values are the scores for
           those labels.  Scores are between 0 and 1, where 0 means the
           label doesn't seem to apply to this instance, and 1 means it
           does.

           In practice, scores produced by Naive Bayes tend to be very close
           to 0 or 1 because of the way normalization is performed.  I might
           try to alleviate this in future versions of the code.

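           Because predict() returns a plain hash reference of
           label => score pairs, ranking the candidate labels is a
           one-line sort.  A sketch using a made-up result hash (the
           scores below are illustrative, not real module output):

```perl
# A hypothetical result hash, shaped like predict()'s return value.
my $result = { sports => 0.97, finance => 0.41, cooking => 0.02 };

# Sort labels from most to least likely.
my @ranked = sort { $result->{$b} <=> $result->{$a} } keys %$result;
my $best   = $ranked[0];           # 'sports'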
       labels()
           Returns a list of all the labels the object knows about (in no
           particular order), or the number of labels if called in a scalar
           context.

       do_purge()
           Purges training instances and their associated information from
           the NaiveBayes object.  This can save memory after training.

       purge()
           Returns true or false depending on the value of the object's
           "purge" property.  An optional boolean argument sets the
           property.

       save_state($path)
           This object method saves the object to disk for later use.  The
           $path argument indicates the place on disk where the object
           should be saved:

             $nb->save_state($path);

       restore_state($path)
           This class method reads the file specified by $path and returns
           the object that was previously stored there using save_state():

             $nb = Algorithm::NaiveBayes->restore_state($path);

THEORY

       Bayes' Theorem is a way of inverting a conditional probability.  It
       states:

                      P(y|x) P(x)
            P(x|y) = -------------
                         P(y)

       The notation P(x|y) means "the probability of x given y."  See also
       <http://mathforum.org/dr.math/problems/battisfore.03.22.99.html>
       for a simple but complete example of Bayes' Theorem.

       In this case, we want to know the probability of a given category
       given a certain string of words in a document, so we have:

                          P(words | cat) P(cat)
         P(cat | words) = ---------------------
                                P(words)

       We have applied Bayes' Theorem because P(cat | words) is a difficult
       quantity to compute directly, but P(words | cat) and P(cat) are
       accessible (see below).

       The greater the expression above, the greater the probability that
       the given document belongs to the given category.  So we want to
       find the maximum value.  We write this as

                                        P(words | cat) P(cat)
         Best category =   ArgMax      ---------------------
                          cat in cats        P(words)

       Since P(words) doesn't change over the range of categories, we can
       get rid of it.  That's good, because we didn't want to have to
       compute these values anyway.  So our new formula is:

         Best category =   ArgMax      P(words | cat) P(cat)
                          cat in cats

       Finally, we note that if "w1, w2, ... wn" are the words in the
       document, then this expression is equivalent to:

         Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                          cat in cats

       That's the formula I use in my document categorization code.  The
       last step is the only non-rigorous one in the derivation, and this
       is the "naive" part of the Naive Bayes technique.  It assumes that
       the probability of each word appearing in a document is unaffected
       by the presence or absence of each other word in the document.  We
       assume this even though we know it isn't true: for example, the word
       "iodized" is far more likely to appear in a document that contains
       the word "salt" than in a document that contains the word
       "subroutine".  Luckily, as it turns out, making this assumption even
       when it isn't true may have little effect on our results, as the
       following paper by Pedro Domingos argues:
       <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>

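       The final formula above can be condensed into a few lines of plain
       Perl.  The sketch below is an independent toy implementation, not
       the module's actual code; it uses add-one smoothing and sums
       logarithms instead of multiplying raw probabilities, so that long
       documents don't underflow to zero:

```perl
use List::Util qw(sum);

# Toy training data: category => { word => count in that category }.
my %freq = (
    sports  => { ball => 4, goal => 3, salt => 0 },
    cooking => { ball => 0, goal => 0, salt => 5 },
);
my %docs_in_cat = (sports => 4, cooking => 3);
my $total_docs  = sum values %docs_in_cat;
my @vocab       = ('ball', 'goal', 'salt');

# Best category = ArgMax  P(w1|cat)*...*P(wn|cat)*P(cat),
# computed as a sum of logs with add-one smoothing.
sub best_category {
    my @words = @_;
    my ($best, $best_score);
    for my $cat (keys %freq) {
        my $cat_total = sum(values %{ $freq{$cat} });
        my $score = log($docs_in_cat{$cat} / $total_docs);      # log P(cat)
        for my $w (@words) {
            my $count = $freq{$cat}{$w} // 0;
            $score += log(($count + 1) / ($cat_total + @vocab)); # log P(w|cat)
        }
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

print best_category('goal', 'ball'), "\n";   # sports
print best_category('salt'), "\n";           # cooking
```

       Working in log space also sidesteps the normalization issue
       mentioned under predict() above: the raw products are tiny, but
       their logarithms compare cleanly.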

HISTORY

       My first implementation of a Naive Bayes algorithm was in the now-
       obsolete AI::Categorize module, first released in May 2001.  I
       replaced it with the Naive Bayes implementation in AI::Categorizer
       (note the extra 'r'), first released in July 2002.  I then
       extracted that implementation into its own module that could be
       used outside the framework, and that's what you see here.


AUTHOR

       Ken Williams, ken@mathforum.org

COPYRIGHT

       Copyright 2003-2004 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself.

SEE ALSO

       AI::Categorizer(3), perl.

perl v5.36.0                      2023-01-19          Algorithm::NaiveBayes(3)