Chapter 6. Decision Tree (Chapter 5 of PKKSC)
Yongdai Kim
Seoul National University
1. Introduction
Decision Tree
• A supervised learning (classification and prediction) method.
• Generates rules of the form "if-then".
• The rules are easily represented in a database language such as SQL.
• Good interpretability.
• The first method for high-dimensional nonlinear function estimation.
Prediction and Interpretation
• Prediction accuracy is often the first concern:
– Example: A company wants to select a small subset of its customers who will respond, with high probability, to a direct mail sent by the company.
• In general, not only prediction accuracy but also explaining the reasons for a decision is important.
– Example: A bank must explain the reasons for rejection to a loan applicant.
• Decision trees have good interpretive power.
Example
[Figure: an example decision tree, with income at the root node; the figure is omitted in the transcript.]
Elements of Decision Tree
[Figure: the elements of a decision tree (root node, internal nodes, terminal nodes, and branches); the figure is omitted in the transcript.]
2. Construction of a Decision Tree
Questions for tree construction
• To construct a tree, we must answer questions such as:
– Why does the root node ask about income?
– Why is node 6 an internal node?
– What value does the tree assign to node 7?
Four ingredients in decision tree construction
• Splitting rule.
• Stopping rule.
• Pruning rule.
• Assigning a prediction value for each terminal node.
Four steps for construction of a decision tree
• Growing: Find an optimal splitting rule for each node and grow the tree. Stop growing when the stopping rule is satisfied.
• Pruning: Remove nodes that increase the prediction error or that have inappropriate inference rules, and remove unnecessary (redundant) nodes.
• Validation: Validate the tree using a gain chart, a risk chart, test-sample error, cross-validation, etc., to decide how much to prune.
• Interpretation and prediction: Interpret the constructed tree and use it for prediction.
Splitting Rule
• For each node, determine a splitting variable and splitting
criterion.
• For a continuous splitting variable X, the splitting criterion c is a number. In general, the tree assigns cases with X < c to the left child node and cases with X ≥ c to the right child node.
• For a categorical splitting variable, the splitting criterion divides the range of the variable into two parts. For example, {1, 2, 4} and {3} is a splitting criterion for a splitting variable X whose range is {1, 2, 3, 4}. If a case has X ∈ {1, 2, 4}, the tree assigns it to the left child; otherwise, the tree assigns it to the right child.
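As an illustrative sketch of how such split rules route cases (the function and variable names, and the income threshold, are my own, not from the slides):

```python
def go_left(x, split_var, criterion):
    """Route a case x (a dict of input values) by a split rule.

    For a continuous variable the criterion is a number c (left if x < c);
    for a categorical variable it is the set of categories sent left.
    """
    value = x[split_var]
    if isinstance(criterion, set):   # categorical split
        return value in criterion
    return value < criterion         # continuous split: X < c

# Continuous rule: income < 30000 goes left (threshold is illustrative).
print(go_left({"income": 25000}, "income", 30000))  # True
# Categorical rule from the slide: {1, 2, 4} vs. {3}.
print(go_left({"X": 3}, "X", {1, 2, 4}))            # False
```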
Purity
• Purity (or impurity) measures the homogeneity of the target variable within a given node.
• For example, a node in which the ratio of group 0 to group 1 is 9:11 has lower purity than a node in which the ratio is 1:9.
• For each node, we select the splitting variable and splitting criterion that maximize the sum of the purities of the two child nodes.
Purity (impurity) measure
• A decision tree grows by splitting each node.
• After a node is split into two child nodes, the sum of the purities of the child nodes is greater than the purity of the parent node.
• That is, the child nodes are purer than the parent node.
• The splitting variable and criterion are chosen to maximize the reduction of impurity achieved by the split.
• An obvious candidate for an impurity measure is the error rate: the tree would select the splitting variable and splitting criterion that minimize the error rate of the child nodes.
Problem of error rate as an impurity measure
• Split 2 has the same error rate as Split 1, but Split 2 is better in the sense that one of its child nodes is pure and the other child node can be split further.
• For instance (illustrative numbers, since the slide's figure is omitted): start from a parent node with 400 cases of each class. Split 1 gives children with class counts (300, 100) and (100, 300); Split 2 gives (200, 400) and (200, 0). Both splits misclassify 200 of the 800 cases (error rate 25%), but Split 2 produces one pure child and concentrates all the remaining impurity in the other child, which can be split further.
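A quick numerical check of this point, using the illustrative counts above (the error rate is the weighted misclassification rate; the Gini sum follows the unweighted convention used later in these slides):

```python
def error_rate(children):
    """Weighted misclassification rate over child nodes.

    Each child is a (count_class0, count_class1) pair; the predicted
    class in a child is its majority class.
    """
    n = sum(a + b for a, b in children)
    return sum(min(a, b) for a, b in children) / n

def gini_sum(children):
    """Unweighted sum of child Gini impurities p * (1 - p)."""
    return sum((a / (a + b)) * (b / (a + b)) for a, b in children)

split1 = [(300, 100), (100, 300)]
split2 = [(200, 400), (200, 0)]
print(error_rate(split1), error_rate(split2))  # 0.25 0.25 -- error rate cannot tell them apart
print(gini_sum(split1), gini_sum(split2))      # 0.375 0.2222... -- Gini prefers Split 2
```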
Conditions for impurity functions
• We have seen that the error rate is not a good impurity measure.
• A good impurity measure should be small when one of the child nodes has an extremely small error rate.
• The impurity function ϕ : [0, 1] → [0, ∞) should satisfy
– ϕ(0) = ϕ(1) = 0
– ϕ(1/2) = maximum
– ϕ(p) = ϕ(1 − p)
• Also, to assign more impurity to p around 1/2, we require the impurity function to be concave.
• Proposition. For a given node t, let ∆i(t) = ϕ(p_t) − (p_L ϕ(p_{t_L}) + p_R ϕ(p_{t_R})), where p_L and p_R are the proportions of the node's cases sent to the left and right child. Then ∆i(t) ≥ 0, by concavity. (See Proposition 4.4 in Breiman et al. (1984).)
Impurity functions
• Classification model
– χ2 statistic.
– Gini index: ϕ(p) = p(1 − p).
– Entropy index: ϕ(p) = −p log p − (1 − p) log(1 − p).
• Regression model
– F statistic from ANOVA.
– Decrease in variance.
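A small sketch of the two classification impurity functions, with the sign convention above so that both are nonnegative and peak at p = 1/2:

```python
import math

def gini(p):
    """Gini impurity phi(p) = p(1 - p)."""
    return p * (1 - p)

def entropy(p):
    """Entropy impurity phi(p) = -p log p - (1-p) log(1-p), with phi(0) = phi(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p:.1f}  gini={gini(p):.4f}  entropy={entropy(p):.4f}")
# Both are 0 at p = 0 and p = 1, symmetric about 1/2, and maximal at p = 1/2.
```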
χ2 statistics
• For a given splitting variable and splitting criterion, we form the following 2 × 2 table of class counts in the two child nodes.
• This table is called the observed-frequency table, O.

Observed frequencies O (reconstructed from the computations below):
              Good   Bad   Total
Left child      32    48      80
Right child    178    42     220
Total          210    90     300
• The expected frequency E of each cell is computed from the margins of the O table: E_ij = (row i total) × (column j total) / (grand total).

Expected frequencies E:
              Good   Bad
Left child      56    24
Right child    154    66
• χ2 statistic:
χ2 = Σ_{i,j} (E_ij − O_ij)² / E_ij
• Applying this to the table above:
χ2 = (56 − 32)²/56 + (24 − 48)²/24 + (154 − 178)²/154 + (66 − 42)²/66 = 46.75
• Find the splitting variable and the splitting criterion that maximize the χ2 statistic.
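A sketch reproducing this computation, cross-checked with scipy (for a 2 × 2 table, chi2_contingency applies the Yates continuity correction unless correction=False):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = child nodes, columns = classes (Good, Bad).
O = np.array([[32.0, 48.0],
              [178.0, 42.0]])

# Expected frequencies from the margins: E_ij = row_i total * col_j total / N.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
chi2 = ((O - E) ** 2 / E).sum()
print(E)               # [[ 56.  24.] [154.  66.]]
print(round(chi2, 2))  # 46.75

# Cross-check with scipy (correction=False disables the Yates correction).
stat, p, dof, expected = chi2_contingency(O, correction=False)
print(round(stat, 2))  # 46.75
```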
Gini index
• Gini index
Gini index = Probability of Good at left child
× Probability of Bad at left child
+ Probability of Good at right child
× Probability of Bad at right child
• Applying this to the table above:
Gini index = (32/80) ∗ (48/80) + (178/220) ∗ (42/220) = 0.3944
• Find the splitting variable and the splitting criterion minimizing the Gini index.
Entropy index
• Entropy index
Entropy index = − Probability of Good at left child
× log(Probability of Good at left child)
− Probability of Bad at left child
× log(Probability of Bad at left child)
− Probability of Good at right child
× log(Probability of Good at right child)
− Probability of Bad at right child
× log(Probability of Bad at right child)
• Applying this to the table above (with base-10 logarithms, which only rescale the index and do not change which split is chosen):
entropy = −[(32/80) ∗ log(32/80) + (48/80) ∗ log(48/80) + (178/220) ∗ log(178/220) + (42/220) ∗ log(42/220)] = 0.5040
• Find the splitting variable and the splitting criterion minimizing the entropy index.
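A sketch reproducing both index computations for this table; the base-10 logarithm is an assumption carried over from the worked entropy value above:

```python
import math

# Class counts (Good, Bad) in the two child nodes of the table above.
left, right = (32, 48), (178, 42)

def gini_index(children):
    """Unweighted sum of p_good * p_bad over the child nodes."""
    return sum((g / (g + b)) * (b / (g + b)) for g, b in children)

def entropy_index(children, log=math.log10):
    """Unweighted sum of -sum_k p_k log p_k over the child nodes."""
    total = 0.0
    for g, b in children:
        for k in (g, b):
            p = k / (g + b)
            if p > 0:
                total -= p * log(p)
    return total

print(round(gini_index([left, right]), 4))     # 0.3945 (the slide truncates to 0.3944)
print(round(entropy_index([left, right]), 4))  # 0.504
```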
Example : Splitting method
• Using the Gini index, find an optimal split for the following table.
[Table: 14 cases with input variables temperature ∈ {hot, mild, cold}, humidity ∈ {high, normal}, windy ∈ {false, true}, and a binary target; the table itself is omitted in the transcript.]
1. Split by temperature.
1-1. left node = {hot}, right node = {mild, cold}.
• Gini index = 3/4 ∗ 1/4 + 3/10 ∗ 7/10 = 0.3975
1-2. left node = {mild}, right node = {hot, cold}.
• Gini index = 1/6 ∗ 5/6 + 5/8 ∗ 3/8 = 0.373
1-3. left node = {cold}, right node = {hot, mild}.
• Gini index = 2/4 ∗ 2/4 + 4/10 ∗ 6/10 = 0.49
2. Split by humidity.
2-1. left node = {high}, right node = {normal}.
• Gini index = 3/7 ∗ 4/7 + 3/7 ∗ 4/7 = 0.489
3. Split by windy.
3-1. left node = {false}, right node = {true}.
• Gini index = 4/8 ∗ 4/8 + 2/6 ∗ 4/6 = 0.472
• Select the split with the smallest impurity: split 1-2 (temperature, {mild} vs. {hot, cold}).
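A sketch reproducing this search. The per-category class counts are inferred from the fractions in the worked Gini values (e.g., left = {hot} giving 3/4 ∗ 1/4 implies 4 hot cases split 3:1), since the table itself is missing:

```python
from itertools import combinations

# Class counts (n_class0, n_class1) per category, inferred from the slides.
counts = {
    "temperature": {"hot": (3, 1), "mild": (1, 5), "cold": (2, 2)},
    "humidity":    {"high": (3, 4), "normal": (3, 4)},
    "windy":       {"false": (4, 4), "true": (2, 4)},
}

def child_gini(cats, table):
    """Gini impurity p0 * p1 of the node formed by the given categories."""
    n0 = sum(table[c][0] for c in cats)
    n1 = sum(table[c][1] for c in cats)
    return (n0 / (n0 + n1)) * (n1 / (n0 + n1))

best = None
for var, table in counts.items():
    cats = sorted(table)
    # Enumerate binary partitions (left sets of size 1 .. half, up to symmetry).
    for k in range(1, len(cats) // 2 + 1):
        for left in combinations(cats, k):
            right = tuple(c for c in cats if c not in left)
            if len(left) == len(right) and left > right:
                continue  # skip mirror-image duplicates
            g = child_gini(left, table) + child_gini(right, table)
            print(f"{var}: {left} vs {right}  Gini = {g:.4f}")
            if best is None or g < best[0]:
                best = (g, var, left, right)

print("best:", best)  # temperature, ('mild',) vs ('cold', 'hot'), Gini 0.3733
```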
Measurement of impurity in regression model
• Use the split with the smallest significance probability (p-value) of the t statistic that tests the difference between the means of the two child nodes.
• Or use the split with the smallest sum of the variances of the two child nodes.
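A sketch of the variance criterion for one continuous input; the data values and candidate thresholds are made up for illustration:

```python
import numpy as np

def split_score(x, y, c):
    """Sum of target variances in the two children induced by the rule x < c."""
    left, right = y[x < c], y[x >= c]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # not a valid split
    return left.var() + right.var()

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 3.2, 2.9, 3.1])

# Try midpoints between consecutive x values and keep the best threshold.
cands = (x[:-1] + x[1:]) / 2
best = min(cands, key=lambda c: split_score(x, y, c))
print(best)  # 3.5 -- separates the low-mean cases from the high-mean cases
```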
Remark on impurity
• Impurity is defined for each node.
• For a given node, the split rule is chosen to minimize the sum of the impurities of the two child nodes. Since the impurity of the parent node is fixed, this is the same as maximizing the difference between the impurity of the parent node and the sum of the impurities of the child nodes.
• When choosing the optimal split among several candidate nodes, however, do not minimize the sum of the child impurities; instead, maximize the difference in impurity between the parent node and its child nodes.
Remark on multi-way split
• So far we have considered binary splits.
• That is, each node has at most two children.
• A multi-way split, in which a node can have more than two children, is also possible.
• SAS E-Miner offers such an option.
• However, multi-way splits are known to be inferior to binary splits.
• This is partly because a multi-way split is too greedy.
• Also, note that any multi-way split can be represented by several binary splits.
Stopping rules
• A stopping rule terminates further splitting. Typical examples (see the sketch after this list for how such rules appear in practice):
– All observations in a node belong to a single class.
– The number of observations in a node is small.
– The decrease in impurity is small.
– The depth of the node exceeds a given limit.
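For instance, in scikit-learn these stopping rules correspond directly to constructor parameters of DecisionTreeClassifier (the threshold values below are arbitrary; splitting also stops automatically once a node is pure):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=4,                 # stop when a node's depth exceeds the limit
    min_samples_split=10,        # stop when a node has too few observations
    min_impurity_decrease=0.01,  # stop when the impurity decrease is too small
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```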
Pruning
• A tree with too many nodes will have a large prediction error rate on new observations.
• It is therefore appropriate to prune away some branches of the tree for a good prediction error rate.
• To determine the size of the tree, we estimate the prediction error using a validation set or cross-validation.
Pruning Process
• For a given tree T and a positive number a, the cost-complexity of T is defined by
cost-complexity(a) = error rate of T + a|T|
where |T| is the number of terminal nodes of T.
• In general, the larger the tree (the larger |T|), the smaller the training error rate. But the cost-complexity does not necessarily decrease as |T| increases.
• Given the fully grown tree Tm, T(a) is the subtree of Tm that minimizes cost-complexity(a).
• In general, the larger a, the smaller |T(a)|.
• One important property is that T(a) is a subtree of T(b) whenever a > b.
• This nesting significantly reduces the computation needed to search for T(a).
• See Breiman et al. (1984) for a proof.
• For a given a, we estimate the generalization error of T(a) by cross-validation.
• Choose the a∗ (and the corresponding T(a∗)) that minimizes the estimated generalization error, as in the sketch below.
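In scikit-learn this procedure is exposed directly: the complexity parameter is called ccp_alpha, and cost_complexity_pruning_path returns the breakpoints at which the minimizing subtree changes. A minimal sketch (the dataset and cv=5 are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate values of a at which the minimizing subtree T(a) changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the last alpha, which prunes to the root

# Estimate the generalization error of each T(a) by cross-validation.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in alphas
]
best = alphas[int(np.argmax(scores))]
print(f"chosen a* = {best:.5f}, CV accuracy = {max(scores):.3f}")
```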
3. Some Algorithms for Decision Trees
CART
• Classification And Regression Trees.
• Breiman et al., 1984.
• A result of machine learning research.
• One of the most popular decision tree algorithms.
• Uses the Gini index or entropy as the impurity measure.
• Cost-complexity pruning is an important and unique feature.
• Can consider split rules based on linear combinations of variables.
• Missing data can be handled using surrogate variables.
C4.5
• Developed by J. Ross Quinlan.
• The early version: ID3 (Iterative Dichotomizer 3), 1986.
• Multi-way splits are available.
• A categorical input variable splits a node into as many children as the variable has categories.
• C4.5 employs entropy as the impurity measure.
• C4.5 uses a test data set in the pruning process.
• C4.5 provides a program that generates rules from the tree.
• Example:
– Watch a game, home team wins → drink beer.
– Watch a game, home team wins → drink soda.
– Watch a game, home team loses → drink beer.
– Watch a game, home team loses → drink milk.
• Since drinking beer is unrelated to whether the home team wins or loses, the beer rules simplify to: watch a game → drink beer.
CHAID
• Chi-squared Automatic Interaction Detection.
• G. V. Kass, 1980.
• A successor of AID, described by J. N. Morgan and J. A. Sonquist (1963).
• CHAID has no pruning process; it stops growing at a certain size.
• Categorical input variables only.
• CHAID employs the χ2 statistic as the impurity measure.
4. Advantages and Disadvantages of Decision Trees
Advantages
• The tree generates easy-to-understand rules.
• Classification is easy.
• Handles both categorical and continuous variables.
• Identifies the most significant variables.
• Robust to outliers in the inputs.
• A nonparametric model.
Disadvantages
• Poor prediction accuracy when the target variable is continuous and the true relationship is linear.
• When the depth is large, not only accuracy but also interpretability deteriorates.
• Heavy computational cost.
• Unstable: small changes in the data can yield a very different tree.
• Absence of linearity and main effects (every node corresponds to a high-order interaction).
5. Decision Trees as High-Dimensional Nonlinear Function Estimation
• We can write the decision tree as
f̂(x) = Σ_{t∈T̃} α_t I(x ∈ R_t)
where T̃ is the set of terminal nodes, α_t is the predictive value at node t, and R_t is the region of the input space belonging to node t.
• R_t has the form R_t = {x : x_1 ∈ A_1, . . . , x_p ∈ A_p}, where A_i is a subset of the domain of x_i.
• In this view, a decision tree can be considered a locally constant model (sketched in code below).
• The art of the decision tree is to find R_t, t ∈ T̃.
• A global search for the optimal R_t is NP-complete.
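A sketch of this locally constant view for p = 2 continuous inputs; the rectangles R_t and values α_t below are arbitrary illustrations, not from the slides:

```python
# Each terminal node t is a rectangle R_t = {x : lo_i <= x_i < hi_i} with value alpha_t.
regions = [
    # ((x1_lo, x1_hi), (x2_lo, x2_hi)), alpha_t
    (((0.0, 0.5), (0.0, 1.0)), 1.0),
    (((0.5, 1.0), (0.0, 0.3)), 2.5),
    (((0.5, 1.0), (0.3, 1.0)), -1.0),
]

def f_hat(x):
    """Evaluate f_hat(x) = sum_t alpha_t * I(x in R_t); the R_t partition the space."""
    for bounds, alpha in regions:
        if all(lo <= xi < hi for xi, (lo, hi) in zip(x, bounds)):
            return alpha
    raise ValueError("x is outside the modeled domain")

print(f_hat((0.25, 0.8)))  # 1.0
print(f_hat((0.75, 0.1)))  # 2.5
print(f_hat((0.75, 0.8)))  # -1.0
```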
• A decision tree finds the R_t greedily, similarly to a forward selection procedure.
• At each node, we find a split rule that uses only one variable. Hence, we can say that a decision tree applies a univariate function estimation procedure (a local constant fit) repeatedly and greedily.
• Most high-dimensional nonlinear function estimation methods use this idea to overcome the curse of dimensionality.
• An important disadvantage of this approach (repeated univariate function estimation) is that the final model may be sub-optimal and can be unstable.