Artificial Intelligence
7. Decision Trees
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka
Outline
• What is a decision tree?
• How to build a decision tree
– Entropy
– Information gain
• Overfitting
– Generalization performance
– Pruning
• Lecture slides
– http://www.jaist.ac.jp/~tsuruoka/lectures/
Decision trees
Chapter 3 of Mitchell, T., Machine Learning (1997)
• Decision trees
– Represent a disjunction of conjunctions
– Successfully applied to a broad range of tasks
• Diagnosing medical cases
• Assessing the credit risk of loan applications
• Nice characteristics
– Understandable to humans
– Robust to noise
• Concept: PlayTennis
A decision tree

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
Classification by a decision tree
• Instance: <Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>
• The tree first tests Outlook (Sunny), then Humidity (High), and classifies the instance as No.
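The classification walk above can be sketched in code. This is a minimal illustration, not the lecture's implementation; the nested-dict encoding of the tree is an assumption made here for clarity.

```python
# The PlayTennis tree from the slides, encoded as nested dicts (an assumed
# representation): an internal node maps one attribute to its branches,
# and a leaf is just the class label string.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Walk the tree: at each internal node, follow the branch matching the
    instance's value for the tested attribute, until a leaf label is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))  # → No
```

Note that Temperature is never consulted: the tree simply does not test it on this path.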
Disjunction of conjunctions
• The tree is equivalent to the logical formula:
  (Outlook = Sunny ∧ Humidity = Normal)
  ∨ (Outlook = Overcast)
  ∨ (Outlook = Rain ∧ Wind = Weak)
Problems suited to decision trees
• Instanced are represented by attribute-value pairs• The target function has discrete target values• Disjunctive descriptions may be required• The training data may contain errors• The training data may contain missing
attribute values
Training data

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Which attribute should be tested at each node?
• We want to build a small decision tree
• Information gain
– How well a given attribute separates the training examples according to their target classification
– Reduction in entropy
• Entropy
– (im)purity of an arbitrary collection of examples
Entropy
• If there are only two classes:
  Entropy(S) = -p⊕ log₂ p⊕ - p⊖ log₂ p⊖
• Example:
  Entropy([9+, 5-]) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940
Entropy
• In general, for c classes:
  Entropy(S) = Σᵢ₌₁ᶜ -pᵢ log₂ pᵢ
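The entropy formula is easy to check numerically. A minimal sketch (the function name and the counts-list interface are choices made here, not the lecture's notation):

```python
import math

def entropy(counts):
    """Entropy of a collection given its per-class counts, e.g. [9, 5]
    for 9 positive and 5 negative examples. Terms with a zero count are
    skipped, following the convention 0 * log2(0) = 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))  # → 0.94, matching the slide's 0.940
```

A 50/50 split such as [3, 3] gives the maximum entropy of 1.0, and a pure collection such as [4, 0] gives 0.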
Information gain
• The expected reduction in entropy achieved by splitting the training examples on attribute A:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
Example
• Values(Wind) = {Weak, Strong}
• S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

  Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
                = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
                = 0.940 - (8/14) × 0.811 - (6/14) × 1.00
                = 0.048
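The same calculation in code, as a minimal sketch. The helper names and the counts-based interface are assumptions made here; the entropy function is repeated so the block is self-contained:

```python
import math

def entropy(counts):
    """Entropy from per-class counts, e.g. [9, 5]; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts_list):
    """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v),
    with each subset S_v given by its own per-class counts."""
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(child) / total * entropy(child) for child in child_counts_list)

# Wind splits S = [9+, 5-] into S_Weak = [6+, 2-] and S_Strong = [3+, 3-]
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))  # → 0.048
```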
Computing information gain
• Humidity: S = [9+, 5-], E = 0.940
– High: [3+, 4-], E = 0.985
– Normal: [6+, 1-], E = 0.592
  Gain(S, Humidity) = 0.940 - (7/14) × 0.985 - (7/14) × 0.592 = 0.151
• Wind: S = [9+, 5-], E = 0.940
– Weak: [6+, 2-], E = 0.811
– Strong: [3+, 3-], E = 1.00
  Gain(S, Wind) = 0.940 - (8/14) × 0.811 - (6/14) × 1.00 = 0.048
Which attribute is the best classifier?
• Information gain
– Gain(S, Outlook) = 0.246
– Gain(S, Humidity) = 0.151
– Gain(S, Wind) = 0.048
– Gain(S, Temperature) = 0.029
• Outlook yields the largest information gain, so it is tested at the root
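All four gains can be recomputed directly from the training table. This is a sketch written for this illustration (the tuple encoding and helper names are choices made here); the printed values agree with the slide's figures up to rounding in the third decimal place:

```python
import math
from collections import Counter

# The 14 PlayTennis examples D1..D14 from the training-data table:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    """Information gain of splitting rows on the attribute in column i."""
    n = len(rows)
    remainder = 0.0
    for v in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {a: gain(data, i) for i, a in enumerate(attrs)}
for a in attrs:
    print(a, round(gains[a], 3))
print(max(gains, key=gains.get))  # → Outlook
```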
Splitting training data with Outlook

Outlook: {D1, D2, ..., D14} [9+, 5-]
├─ Sunny → {D1, D2, D8, D9, D11} [2+, 3-] → ?
├─ Overcast → {D3, D7, D12, D13} [4+, 0-] → Yes
└─ Rain → {D4, D5, D6, D10, D14} [3+, 2-] → ?
Overfitting
• Growing each branch of the tree deeply enough to perfectly classify the training examples is not a good strategy
– The resulting tree may overfit the training data
• Overfitting
– The tree explains the training data very well but performs poorly on new data
Alleviating the overfitting problem
• Several approaches
– Stop growing the tree earlier
– Post-prune the tree
• How can we evaluate the classification performance of the tree on new data?
– Separate the available data into two sets of examples: a training set and a validation (development) set
Validation (development) set
• Use a portion of the original training data to estimate the generalization performance
[Figure: the original training set is divided into a training set and a validation set; the test set is kept separate throughout]
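The split in the figure can be sketched in a few lines. The 70/30 ratio, seed, and variable names below are assumptions made for this illustration, not values from the lecture:

```python
import random

random.seed(0)                       # fixed seed so the split is reproducible
examples = list(range(14))           # stand-ins for training examples D1..D14
random.shuffle(examples)             # shuffle before splitting

split = int(0.7 * len(examples))     # e.g. keep 70% for training
train_set = examples[:split]         # used to grow the tree
validation_set = examples[split:]    # used to estimate generalization
                                     # (e.g. when deciding what to prune)

print(len(train_set), len(validation_set))  # → 9 5
```

The held-out test set is never touched during training or pruning; it is reserved for the final performance estimate.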