AI::DecisionTree(3pm)

1AI::DecisionTree(3)   User Contributed Perl Documentation  AI::DecisionTree(3)
2
3
4

NAME

6       AI::DecisionTree - Automatically Learns Decision Trees
7

VERSION

9       version 0.11
10

SYNOPSIS

12         use AI::DecisionTree;
13         my $dtree = new AI::DecisionTree;
14
15         # A set of training data for deciding whether to play tennis
16         $dtree->add_instance
17           (attributes => {outlook     => 'sunny',
18                           temperature => 'hot',
19                           humidity    => 'high'},
20            result => 'no');
21
22         $dtree->add_instance
23           (attributes => {outlook     => 'overcast',
24                           temperature => 'hot',
25                           humidity    => 'normal'},
26            result => 'yes');
27
28         ... repeat for several more instances, then:
29         $dtree->train;
30
31         # Find results for unseen instances
32         my $result = $dtree->get_result
33           (attributes => {outlook     => 'sunny',
34                           temperature => 'hot',
35                           humidity    => 'normal'});
36

DESCRIPTION

38       The "AI::DecisionTree" module automatically creates so-called "decision
39       trees" to explain a set of training data.  A decision tree is a kind of
40       categorizer that use a flowchart-like process for categorizing new
41       instances.  For instance, a learned decision tree might look like the
42       following, which classifies for the concept "play tennis":
43
44                          OUTLOOK
45                          /  |  \
46                         /   |   \
47                        /    |    \
48                  sunny/  overcast \rainy
49                      /      |      \
50                 HUMIDITY    |       WIND
51                 /  \       *no*     /  \
52                /    \              /    \
53           high/      \normal      /      \
54              /        \    strong/        \weak
55            *no*      *yes*      /          \
56                               *no*        *yes*
57
58       (This example, and the inspiration for the "AI::DecisionTree" module,
59       come directly from Tom Mitchell's excellent book "Machine Learning",
60       available from McGraw Hill.)
61
62       A decision tree like this one can be learned from training data, and
63       then applied to previously unseen data to obtain results that are
64       consistent with the training data.
65
66       The usual goal of a decision tree is to somehow encapsulate the
67       training data in the smallest possible tree.  This is motivated by an
68       "Occam's Razor" philosophy, in which the simplest possible explanation
69       for a set of phenomena should be preferred over other explanations.
70       Also, small trees will make decisions faster than large trees, and they
71       are much easier for a human to look at and understand.  One of the
72       biggest reasons for using a decision tree instead of many other machine
73       learning techniques is that a decision tree is a much more scrutable
74       decision maker than, say, a neural network.
75
76       The current implementation of this module uses an extremely simple
77       method for creating the decision tree based on the training instances.
78       It uses an Information Gain metric (based on expected reduction in
79       entropy) to select the "most informative" attribute at each node in the
80       tree.  This is essentially the ID3 algorithm, developed by J. R.
81       Quinlan in 1986.  The idea is that the attribute with the highest
82       Information Gain will (probably) be the best attribute to split the
83       tree on at each point if we're interested in making small trees.
84

METHODS

86   Building and Querying the Tree
87       new(...parameters...)
88           Creates a new decision tree object and returns it.  Accepts the
89           following parameters:
90
91           noise_mode
92               Controls the behavior of the "train()" method when "noisy" data
93               is encountered.  Here "noisy" means that two or more training
94               instances contradict each other, such that they have identical
95               attributes but different results.
96
97               If "noise_mode" is set to "fatal" (the default), the "train()"
98               method will throw an exception (die).  If "noise_mode" is set
99               to "pick_best", the most frequent result at each noisy node
100               will be selected.
101
102           prune
103               A boolean "prune" parameter which specifies whether the tree
104               should be pruned after training.  This is usually a good idea,
105               so the default is to prune.  Currently we prune using a simple
106               minimum-description-length criterion.
107
108           verbose
109               If set to a true value, some status information will be output
110               while training a decision tree.  Default is false.
111
112           purge
113               If set to a true value, the "do_purge()" method will be invoked
114               during "train()".  The default is true.
115
116           max_depth
117               Controls the maximum depth of the tree that will be created
118               during "train()".  The default is 0, which means that trees of
119               unlimited depth can be constructed.
120
121       add_instance(attributes => \%hash, result => $string, name => $string)
122           Adds a training instance to the set of instances which will be used
123           to form the tree.  An "attributes" parameter specifies a hash of
124           attribute-value pairs for the instance, and a "result" parameter
125           specifies the result.
126
127           An optional "name" parameter lets you give a unique name to each
128           training instance.  This can be used in coordination with the
129           "set_results()" method below.
130
131       train()
132           Builds the decision tree from the list of training instances.  If a
133           numeric "max_depth" parameter is supplied, the maximum tree depth
134           can be controlled (see also the "new()" method).
135
136       get_result(attributes => \%hash)
137           Returns the most likely result (from the set of all results given
138           to "add_instance()") for the set of attribute values given.  An
139           "attributes" parameter specifies a hash of attribute-value pairs
140           for the instance.  If the decision tree doesn't have enough
141           information to find a result, it will return "undef".
142
143       do_purge()
144           Purges training instances and their associated information from the
145           DecisionTree object.  This can save memory after training, and
146           since the training instances are implemented as C structs, this
147           turns the DecisionTree object into a pure-perl data structure that
148           can be more easily saved with "Storable.pm", for instance.
149
150       purge()
151           Returns true or false depending on the value of the tree's "purge"
152           property.  An optional boolean argument sets the property.
153
154       copy_instances(from => $other_tree)
155           Allows two trees to share the same set of training instances.  More
156           commonly, this lets you train one tree, then re-use its instances
157           in another tree (possibly changing the instance "result" values
158           using "set_results()"), which is much faster than re-populating the
159           second tree's instances from scratch.
160
161       set_results(\%results)
162           Given a hash that relates instance names to instance result values,
163           change the result values as specified.
164
165   Tree Introspection
166       instances()
167           Returns a reference to an array of the training instances used to
168           build this tree.
169
170       nodes()
171           Returns the number of nodes in the trained decision tree.
172
173       depth()
174           Returns the depth of the tree.  This is the maximum number of
175           decisions that would need to be made to classify an unseen
176           instance, i.e. the length of the longest path from the tree's root
177           to a leaf.  A tree with a single node would have a depth of zero.
178
179       rule_tree()
180           Returns a data structure representing the decision tree.  For
181           instance, for the tree diagram above, the following data structure
182           is returned:
183
184            [ 'outlook', {
185                'rain' => [ 'wind', {
186                    'strong' => 'no',
187                    'weak' => 'yes',
188                } ],
189                'sunny' => [ 'humidity', {
190                    'normal' => 'yes',
191                    'high' => 'no',
192                } ],
193                'overcast' => 'yes',
194            } ]
195
196           This is slightly remniscent of how XML::Parser returns the parsed
197           XML tree.
198
199           Note that while the ordering in the hashes is unpredictable, the
200           nesting is in the order in which the criteria will be checked at
201           decision-making time.
202
203       rule_statements()
204           Returns a list of strings that describe the tree in rule-form.  For
205           instance, for the tree diagram above, the following list would be
206           returned (though not necessarily in this order - the order is
207           unpredictable):
208
209             if outlook='rain' and wind='strong' -> 'no'
210             if outlook='rain' and wind='weak' -> 'yes'
211             if outlook='sunny' and humidity='normal' -> 'yes'
212             if outlook='sunny' and humidity='high' -> 'no'
213             if outlook='overcast' -> 'yes'
214
215           This can be helpful for scrutinizing the structure of a tree.
216
217           Note that while the order of the rules is unpredictable, the order
218           of criteria within each rule reflects the order in which the
219           criteria will be checked at decision-making time.
220
221       as_graphviz()
222           Returns a "GraphViz" object representing the tree.  Requires that
223           the GraphViz module is already installed, of course.  The object
224           returned will allow you to create PNGs, GIFs, image maps, or
225           whatever graphical representation of your tree you might want.
226
227           A "leaf_colors" argument can specify a fill color for each leaf
228           node in the tree.  The keys of the hash should be the same as the
229           strings appearing as the "result" parameters given to
230           "add_instance()", and the values should be any GraphViz-style color
231           specification.
232
233           Any additional arguments given to "as_graphviz()" will be passed on
234           to GraphViz's "new()" method.  See the GraphViz docs for more info.
235

LIMITATIONS

237       A few limitations exist in the current version.  All of them could be
238       removed in future versions - especially with your help. =)
239
240       No continuous attributes
241           In the current implementation, only discrete-valued attributes are
242           supported.  This means that an attribute like "temperature" can
243           have values like "cool", "medium", and "hot", but using actual
244           temperatures like 87 or 62.3 is not going to work.  This is because
245           the values would split the data too finely - the tree-building
246           process would probably think that it could make all its decisions
247           based on the exact temperature value alone, ignoring all other
248           attributes, because each temperature would have only been seen once
249           in the training data.
250
251           The usual way to deal with this problem is for the tree-building
252           process to figure out how to place the continuous attribute values
253           into a set of bins (like "cool", "medium", and "hot") and then
254           build the tree based on these bin values.  Future versions of
255           "AI::DecisionTree" may provide support for this.  For now, you have
256           to do it yourself.
257

TO DO

259       All the stuff in the LIMITATIONS section.  Also, revisit the pruning
260       algorithm to see how it can be improved.
261

AUTHOR

263       Ken Williams, ken@mathforum.org
264