decision trees - university of...
TRANSCRIPT
![Page 1: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/1.jpg)
Decision TreesCS 760@UW-Madison
![Page 2: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/2.jpg)
Zoo of machine learning models
This is a joke
![Page 3: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/3.jpg)
The lectures aheadorganized according to different machine learning models/methods
1. supervised learning• non-parametric: decision tree, nearest neighbors• parametric
• discriminative: linear/logistic regression, SVM, NN• generative: Naïve Bayes, Bayesian networks
2. unsupervised learning: clustering, dimension reduction3. reinforcement learning4. other settings: ensemble, active, semi-supervised
intertwined with experimental methodologies, theory, etc.1. evaluation of learning algorithms2. learning theory: PAC, bias-variance, mistake-bound3. feature selection
![Page 4: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/4.jpg)
Goals for this lectureyou should understand the following concepts
• the decision tree representation• the standard top-down approach to learning a tree• Occam’s razor• entropy and information gain• test sets and unbiased estimates of accuracy• overfitting• early stopping and pruning• validation sets• regression trees• probability estimation trees
![Page 5: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/5.jpg)
Decision Tree Representation
![Page 6: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/6.jpg)
A decision tree to predict heart disease thal
#_major_vessels > 0 present
normal fixed_defect
true false
1 2
present
reversible_defect
chest_pain_type absent
absentabsentabsent present
3 4
Each internal node tests one feature xi
Each branch from an internal node represents one outcome of the test
Each leaf predicts y or P(y | x)
![Page 7: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/7.jpg)
Decision tree exerciseSuppose X1 … X5 are Boolean features, and Y is also Boolean
How would you represent the following with decision trees?
) (i.e., 5252 XXYXXY Ù==
52 XXY Ú=
1352 XXXXY ¬Ú=
![Page 8: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/8.jpg)
Decision Tree Learning
![Page 9: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/9.jpg)
History of decision tree learning
dates of seminal publications: work on these 2 was contemporaneous
many DT variants have been developed since CART and ID3
1963 1973 1980 1984 1986
AID
CH
AID
THAI
D
CAR
T
ID3
CART developed by Leo Breiman, Jerome Friedman, Charles Olshen, R.A. Stone
ID3, C4.5, C5.0 developed by Ross Quinlan
![Page 10: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/10.jpg)
Top-down decision tree learning
MakeSubtree(set of training instances D)
C = DetermineCandidateSplits(D)
if stopping criteria met
make a leaf node N
determine class label/probabilities for N
else
make an internal node N
S = FindBestSplit(D, C)
for each outcome k of S
Dk = subset of instances that have outcome k
kth child of N = MakeSubtree(Dk)
return subtree rooted at N
![Page 11: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/11.jpg)
Candidate splits in ID3, C4.5• splits on nominal features have one branch per value
• splits on numeric features use a threshold
thal
normal fixed_defect reversible_defect
weight ≤ 35
true false
![Page 12: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/12.jpg)
Candidate splits on numeric features
weight ≤ 35
true false
weight
17 35
given a set of training instances D and a specific feature Xi
• sort the values of Xi in D• evaluate split thresholds in intervals between instances of
different classes
• could use midpoint of each considered interval as the threshold• C4.5 instead picks the largest value of Xi in the entire training set that does not
exceed the midpoint
![Page 13: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/13.jpg)
Candidate splits on numeric features(in more detail)
// Run this subroutine for each numeric feature at each node of DT induction
DetermineCandidateNumericSplits(set of training instances D, feature Xi)C = {} // initialize set of candidate splits for feature Xi
S = partition instances in D into sets s1 … sV where the instances in each set have the same value for Xi
let vj denote the value of Xi for set sj
sort the sets in S using vj as the key for each sj
for each pair of adjacent sets sj, sj+1 in sorted Sif sj and sj+1 contain a pair of instances with different class labels
// assume we’re using midpoints for splits
add candidate split Xi ≤ (vj + vj+1)/2 to C
return C
![Page 14: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/14.jpg)
Candidate splits• instead of using k-way splits for k-valued features, could
require binary splits on all discrete features (CART does this)
thal
normal reversible_defect∨ fixed_defect
color
red ∨blue green ∨ yellow
![Page 15: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/15.jpg)
Finding The Best Splits
![Page 16: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/16.jpg)
Finding the best split
• How should we select the best feature to split on at each step?
• Key hypothesis: the simplest tree that classifies the training instances accurately will work well on previously unseen instances
![Page 17: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/17.jpg)
Occam’s razor
• attributed to 14th century William of Ockham
• “Nunquam ponenda est pluralitis sin necesitate”
• “Entities should not be multiplied beyond necessity”
• “when you have two competing theories that make exactly the same predictions, the simpler one is the better”
![Page 18: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/18.jpg)
But a thousand years earlier, I said, “We consider it a goodprinciple to explain the phenomena by the simplest hypothesis possible.”
![Page 19: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/19.jpg)
Occam’s razor and decision trees
• there are fewer short models (i.e. small trees) than long ones
• a short model is unlikely to fit the training data well by chance
• a long model is more likely to fit the training data well coincidentally
Why is Occam’s razor a reasonable heuristic for decision tree learning?
![Page 20: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/20.jpg)
Finding the best splits
• Can we find and return the smallest possible decision tree that accurately classifies the training set?
• Instead, we’ll use an information-theoretic heuristic to greedily choose splits
NO! This is an NP-hard problem[Hyafil & Rivest, Information Processing Letters, 1976]
![Page 21: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/21.jpg)
Information theory background
• consider a problem in which you are using a code to communicate information to a receiver
• example: as bikes go past, you are communicating the manufacturer of each bike
![Page 22: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/22.jpg)
Information theory background
• suppose there are only four types of bikes• we could use the following code
11
10
01
00
• expected number of bits we have to communicate: 2 bits/bike
Trek
Specialized
Cervelo
Serrota
type code
![Page 23: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/23.jpg)
Information theory background• we can do better if the bike types aren’t equiprobable• optimal code uses bits for event with
probability− log2 P(y)
P(y)
1
€
P(Trek) = 0.5P(Specialized) = 0.25P(Cervelo) = 0.125P(Serrota) = 0.125
2
3
3
1
01
001
000
Type/probability # bits code
• expected number of bits we have to communicate: 1.75 bits/bike
åÎ
-)(values
2 )(log)(Yy
yPyP
![Page 24: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/24.jpg)
Entropy• entropy is a measure of uncertainty associated with a
random variable
• defined as the expected number of bits required to communicate the value of the variable
entropy function forbinary variable
åÎ
-=)(values
2 )(log)()(Yy
yPyPYH
![Page 25: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/25.jpg)
Conditional entropy
• What’s the entropy of Y if we condition on some other variable X?
where
𝐻(𝑌|𝑋) = ( 𝑃(𝑋 = 𝑥)𝐻(𝑌|𝑋 = 𝑥)�
,∈values(4)
𝐻(𝑌|𝑋 = 𝑥) = − ( 𝑃(𝑌 = 𝑦|𝑋 = 𝑥) log9 𝑃 (𝑌 = 𝑦|𝑋 = 𝑥)�
:∈values(;)
![Page 26: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/26.jpg)
Information gain (a.k.a. mutual information)
• choosing splits in ID3: select the split S that most reduces the conditional entropy of Y for training set D
InfoGain(D,S) = HD (Y )− HD (Y | S)
D indicates that we’re calculating probabilities using the specific sample D
![Page 27: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/27.jpg)
Relations between the concepts
Figure from wikipedia.org
![Page 28: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/28.jpg)
Information gain example
![Page 29: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/29.jpg)
Information gain example
Humidity
high normal
D: [3+, 4-]
D: [9+, 5-]
D: [6+, 1-]
• What’s the information gain of splitting on Humidity?
940.0145log
145
149log
149)( 22 =÷
øö
çèæ-÷
øö
çèæ-=YHD
592.0 71log
71
76log
76)normal|( 22
=
÷øö
çèæ-÷
øö
çèæ-=YHD
985.0 74log
74
73log
73)high|( 22
=
÷øö
çèæ-÷
øö
çèæ-=YHD
151.0
)592.0(147)985.0(
147940.0
)Humidity|()()Humidity,(InfoGain
=
úûù
êëé +-=
-= YHYHD DD
![Page 30: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/30.jpg)
Information gain example
Humidity
high normal
D: [3+, 4-]
D: [9+, 5-]
D: [6+, 1-]
• Is it better to split on Humidity or Wind?
HD (Y |weak) = 0.811
Wind
weak strong
D: [6+, 2-]
D: [9+, 5-]
D: [3+, 3-]
HD (Y | strong) = 1.0
✔
151.0
)592.0(147)985.0(
147940.0)Humidity,(InfoGain
=
úûù
êëé +-=D
048.0
)0.1(146)811.0(
148940.0)Wind,(InfoGain
=
úûù
êëé +-=D
![Page 31: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/31.jpg)
One limitation of information gain
• information gain is biased towards tests with many outcomes
• e.g. consider a feature that uniquely identifies each training instance
• splitting on this feature would result in many branches, each of which is “pure” (has instances of only one class)
• maximal information gain!
![Page 32: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/32.jpg)
Gain ratio• to address this limitation, C4.5 uses a splitting criterion
called gain ratio
• gain ratio normalizes the information gain by the entropy of the split being considered
GainRatio(D,S) = InfoGain(D,S)HD (S)
=HD (Y )−HD (Y | S)
HD (S)
![Page 33: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/33.jpg)
Stopping criteria
We should form a leaf when• all of the given subset of instances are of the same class• we’ve exhausted all of the candidate splits
Is there a reason to stop earlier, or to prune back the tree?
![Page 34: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/34.jpg)
How to assess the accuracy of a tree?• can we just calculate the fraction of training instances
that are correctly classified?
• consider a problem domain in which instances are assigned labels at random with P(Y = t) = 0.5
• how accurate would a learned decision tree be on previously unseen instances?
• how accurate would it be on its training set?
![Page 35: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/35.jpg)
How can we assess the accuracy of a tree?• to get an unbiased estimate of a learned model’s
accuracy, we must use a set of instances that are held-aside during learning
• this is called a test set
all instances
test
train
![Page 36: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/36.jpg)
Overfitting
![Page 37: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/37.jpg)
Overfitting• consider error of model h over
• training data:• entire distribution of data:
• model overfits the training data if there is an alternative model such that
errorD(h)
errorD (h)
errorD (h) > errorD (h ')
errorD(h) < errorD(h ')
HhÎHh Î'
![Page 38: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/38.jpg)
Example 1: overfitting with noisy datasuppose
• the target concept is • there is noise in some feature values• we’re given the following training set
X1 X2 X3 X4 X5 … Y
t t t t t … t
t t f f t … t
t f t t f … t
t f f t f … f
t f t f f … f
f t t f t … f
noisy value
21 XXY Ù=
![Page 39: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/39.jpg)
Example 1: overfitting with noisy data
X1
X2
T F
X3t
f
f
f
X4
t
X1
X2
T F
t f
f
correct tree tree that fits noisy training data
![Page 40: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/40.jpg)
Example 2: overfitting with noise-free datasuppose
• the target concept is • P(X3 = t) = 0.5 for both classes• P(Y = t) = 0.67• we’re given the following training set
X1 X2 X3 X4 X5 … Y
t t t t t … t
t t t f t … t
t t t t f … t
t f f t f … f
f t f f t … f
21 XXY Ù=
![Page 41: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/41.jpg)
Example 2: overfitting with noise-free data
X3
T F
t f
t
training setaccuracy
test setaccuracy
100%
66% 66%
50%
• because the training set is a limited sample, there might be (combinations of) features that are correlated with the target concept by chance
![Page 42: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/42.jpg)
Overfitting in decision trees
![Page 43: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/43.jpg)
Example 3: regression using polynomial
𝑡 = sin(2𝜋𝑥) + 𝜖
Figure from Machine Learning and Pattern Recognition, Bishop
![Page 44: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/44.jpg)
Regression using polynomial of
degree M
𝑡 = sin(2𝜋𝑥) + 𝜖
Example 3: regression using polynomial
![Page 45: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/45.jpg)
𝑡 = sin(2𝜋𝑥) + 𝜖
Example 3: regression using polynomial
![Page 46: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/46.jpg)
𝑡 = sin(2𝜋𝑥) + 𝜖
Example 3: regression using polynomial
![Page 47: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/47.jpg)
𝑡 = sin(2𝜋𝑥) + 𝜖
Example 3: regression using polynomial
![Page 48: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/48.jpg)
Example 3: regression using polynomial
![Page 49: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/49.jpg)
General phenomenon
Figure from Deep Learning, Goodfellow, Bengio and Courville
![Page 50: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/50.jpg)
Prevent overfitting• cause: training error and expected error are different
1. there may be noise in the training data
2. training data is of limited size, resulting in difference from the true distribution
3. larger the hypothesis class, easier to find a hypothesis that fits the difference between the training data and the true distribution
• prevent overfitting:
1. cleaner training data help!
2. more training data help!
3. throwing away unnecessary hypotheses helps! (Occam’s Razor)
![Page 51: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/51.jpg)
Avoiding overfitting in DT learning
two general strategies to avoid overfitting1. early stopping: stop if further splitting not justified by
a statistical test• Quinlan’s original approach in ID3
2. post-pruning: grow a large tree, then prune back some nodes• more robust to myopia of greedy tree learning
![Page 52: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/52.jpg)
Pruning in C4.5
1. split given data into training and validation (tuning) sets
2. grow a complete tree3. do until further pruning is harmful
• evaluate impact on tuning-set accuracy of pruning each node
• greedily remove the one that most improves tuning-set accuracy
![Page 53: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/53.jpg)
Validation sets• a validation set (a.k.a. tuning set) is a subset of the training set that is
held aside• not used for primary training process (e.g. tree growing)• but used to select among models (e.g. trees pruned to varying
degrees)
all instances
testtrain
tuning
![Page 54: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/54.jpg)
Variants
![Page 55: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/55.jpg)
Regression trees
X5 > 10
X3
X2 > 2.1Y=5
Y=24Y=3.5
Y=3.2
• in a regression tree, leaves have functions that predict numeric values instead of class labels
• the form of these functions depends on the method• CART uses constants• some methods use linear functions
X5 > 10
X3
X2 > 2.1Y=2X4+5
Y=3X4+X6
Y=3.2
Y=1
![Page 56: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/56.jpg)
Regression trees in CART
• CART does least squares regression which tries to minimize
target value for ith
training instancevalue predicted by tree for ith training instance (average value of y for training instances reaching the leaf)
• at each internal node, CART chooses the split that most reduces this quantity
å=
-||
1
2)()( )ˆ(D
i
ii yy
å åÎ Î
-=leaves
2)()( )ˆ(L Li
ii yy
![Page 57: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/57.jpg)
Probability estimation trees
X5 > 10
X3
P(Y=pos) = 0.5P(Y=neg) = 0.5
P(Y=pos) = 0.1P(Y=neg) = 0.9
P(Y=pos) = 0.8P(Y=neg) = 0.2
• in a PE tree, leaves estimate the probability of each class
• could simply use training instances at a leaf to estimate probabilities, but …
• smoothing is used to make estimates less extreme (we’ll revisit this topic when we cover Bayes nets)
D: [3+, 3-] D: [0+, 8-]
D: [3+, 0-]
![Page 58: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/58.jpg)
m-of-n splits• a few DT algorithms have used m-of-n splits [Murphy & Pazzani ‘91]• each split is constructed using a heuristic search process• this can result in smaller, easier to comprehend trees
test is satisfied if 5 of 10conditions are true
tree for exchange rate prediction [Craven & Shavlik, 1997]
![Page 59: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/59.jpg)
Searching for m-of-n splitsm-of-n splits are found via a hill-climbing search
• initial state: best 1-of-1 (ordinary) binary split
• evaluation function: information gain
• operators:
m-of-n è m-of-(n+1)
1 of { X1=t, X3=f } è 1 of { X1=t, X3=f, X7=t }
m-of-n è (m+1)-of-(n+1)
1 of { X1=t, X3=f } è 2 of { X1=t, X3=f, X7=t }
![Page 60: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/60.jpg)
Lookahead
• most DT learning methods use a hill-climbing search• a limitation of this approach is myopia: an important
feature may not appear to be informative until used in conjunction with other features
• can potentially alleviate this limitation by using a lookahead search [Norton ‘89; Murphy & Salzberg ‘95]
• empirically, often doesn’t improve accuracy or tree size
![Page 61: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/61.jpg)
Choosing best split in ordinary DT learning
OrdinaryFindBestSplit(set of training instances D, set of candidate splits C)maxgain = -∞for each split S in C
gain = InfoGain(D, S)if gain > maxgain
maxgain = gainSbest = S
return Sbest
![Page 62: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/62.jpg)
Choosing best split with lookahead(part 1)
LookaheadFindBestSplit(set of training instances D, set of candidate splits C)maxgain = -∞for each split S in C
gain = EvaluateSplit(D, C, S)if gain > maxgain
maxgain = gainSbest = S
return Sbest
![Page 63: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/63.jpg)
Choosing best split with lookahead(part 2)
EvaluateSplit(D, C, S)if a split on S separates instances by class (i.e. )
// no need to split furtherreturn
elsefor each outcome k of S
// see what the splits at the next level would beDk = subset of instances that have outcome kSk = OrdinaryFindBestSplit(Dk, C – S)
// return information gain that would result from this 2-level subtreereturn
HD (Y | S) = 0
HD (Y )− HD (Y | S)
HD (Y )−Dk
Dk∑ HDk
(Y | S = k,Sk⎛⎝⎜
⎞⎠⎟
![Page 64: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/64.jpg)
Calculating information gain with lookahead
Humidity
Wind Temperature
D: [12-, 11+]
D: [6-, 8+] D: [6-, 3+]
D: [2-, 3+] D: [4-, 5+] D: [2-, 2+] D: [4-, 1+]
Suppose that when considering Humidity as a split, we find that Wind and Temperature are the best features to split on at the next level
high normal
strong weak high low
We can assess value of choosing Humidity as our split by
HD (Y )− 14
23HD (Y | Humidity = high,Wind)+ 9
23HD (Y | Humidity = low,Temperature)⎛
⎝⎜⎞⎠⎟
![Page 65: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/65.jpg)
Calculating information gain with lookahead
1423HD (Y | Humidity = high,Wind)+ 9
23HD (Y | Humidity = low,Temperature)
= 523HD (Y | Humidity = high,Wind = strong)+
923HD (Y | Humidity = high,Wind = weak)+
423HD (Y | Humidity = low,Temperature = high)+
523HD (Y | Humidity = low,Temperature = low)
HD (Y | Humidity = high,Wind = strong) = − 25
log 25
"#$
%&'−
35
log 35
"#$
%&'
!
Using the tree from the previous slide:
![Page 66: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/66.jpg)
Comments on decision tree learning
• widely used approach• many variations• provides humanly comprehensible models when trees
not too big• insensitive to monotone transformations of numeric
features• standard methods learn axis-parallel hypotheses*
• standard methods not suited to on-line setting*
• usually not among most accurate learning methods
* although variants exist that are exceptions to this
![Page 67: Decision Trees - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/cs760/02-decision-tree.pdf · Goals for this lecture you should understand the following concepts •](https://reader034.vdocuments.net/reader034/viewer/2022042222/5ec8fbb4174e812a312c158d/html5/thumbnails/67.jpg)
THANK YOUSome of the slides in these lectures have been adapted/borrowed from materials developed by Yingyu Liang, Mark Craven, David
Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.