
Chapter 6. Decision Tree (Chapter 5 of PKKSC)

Yongdai Kim

Seoul National University

1. Introduction


Decision Tree

• A supervised learning (classification and prediction) method.

• Generates rules of the form "if-then".

• The rules are easy to express in a database language such as SQL.

• Good interpretability.

* The first method for high-dimensional nonlinear function estimation.


Prediction and Interpretation

• Prediction accuracy is often the first concern:

– Example: A company wants to select the small part of its customers who will respond, with high probability, to a direct mail sent by the company.

• In general, not only prediction accuracy but also explaining the reasons for a decision is important.

– Example: A bank must explain the reasons for rejection to a loan applicant.

• Decision trees have good interpretive power.


Example


Elements of Decision Tree


2. Construction of Decision Tree


Questions for tree construction

• We ask the following questions to construct a tree.

– Why does the root node ask for income?

– Why is node 6 an internal node?

– What value does the tree assign to node 7?


Four ingredients in decision tree construction

• Splitting rule.

• Stopping rule.

• Pruning rule.

• Assigning a prediction value for each terminal node.


Four steps for construction of a decision tree

• Growing: Find an optimal splitting rule for each node and grow the tree. Stop growing when the stopping rule is satisfied.

• Pruning: Remove nodes that increase the prediction error or that have inappropriate inference rules, and also remove unnecessary (redundant) nodes.

• Validation: Validate the tree using a gain chart, a risk chart, test-sample error, cross-validation, etc. (to decide how much to prune the tree).

• Interpretation and prediction: Interpret the constructed tree and make predictions.


Splitting Rule

• For each node, determine a splitting variable and a splitting criterion.

• For a continuous splitting variable X, the splitting criterion c is a number. In general, the tree assigns a case with X < c to the left child node and a case with X ≥ c to the right child node.

• For a categorical splitting variable, the splitting criterion divides the range of the variable into two parts. For example, {1, 2, 4} and {3} is a splitting criterion for a splitting variable X whose range is {1, 2, 3, 4}. If a case has X ∈ {1, 2, 4}, the tree assigns it to the left child; otherwise, the tree assigns it to the right child (see the sketch below).
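A minimal sketch of how such a split rule can be applied to a single case, assuming a simple dictionary-based rule representation (the field names, variables, and values are illustrative, not from the slides):

# Apply a split rule to one case; the rule format here is an assumption.
def go_left(rule, case):
    """Return True if the case is sent to the left child under the rule."""
    value = case[rule["variable"]]
    if rule["type"] == "continuous":
        return value < rule["threshold"]          # X < c goes left, X >= c goes right
    return value in rule["left_categories"]       # e.g. {1, 2, 4} goes left, {3} goes right

continuous_rule = {"type": "continuous", "variable": "income", "threshold": 50.0}
categorical_rule = {"type": "categorical", "variable": "X", "left_categories": {1, 2, 4}}

print(go_left(continuous_rule, {"income": 42.0}))  # True: 42.0 < 50.0
print(go_left(categorical_rule, {"X": 3}))         # False: 3 is not in {1, 2, 4}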


Purity

• Purity (or impurity) is a measure of the homogeneity of the target variable within a given node.

• For example, a node in which the ratio of group 0 to group 1 is 9:11 has lower purity than a node in which the ratio of group 0 to group 1 is 1:9.

• For each node, we select the splitting variable and splitting criterion that maximize the sum of the purities of the two child nodes.


Purity (impurity) measure

• A decision tree grows by splitting each node.

• After a node is split into two child nodes, the sum of the purities of the child nodes is greater than the purity of the parent node.

• That is, the child nodes are purer than the parent node.

• The splitting variable and criterion are chosen to maximize the reduction of impurity achieved by the split.

• An easy candidate for the impurity measure is the error rate. That is, the tree selects the splitting variable and the splitting criterion which minimize the error rate of the child nodes.


Problem of error rate as an impurity measure

• Split 2 has the same error rate as Split 1, but Split 2 is better in

the sense that the left child node can be split further.


Conditions for impurity functions

• We have seen that the error rate is not a good impurity measure.

• It is appropriate that the impurity measure be small when one of the child nodes has an extremely small error rate.

• The impurity function ϕ : [0, 1] → [0, ∞) should satisfy

– ϕ(0) = ϕ(1) = 0

– ϕ(1/2) = maximum

– ϕ(p) = ϕ(1 − p)

• Also, to give more impurity to p around 1/2, we require that the impurity function be concave.

• Proposition. For a given node t, let ∆i(t) = ϕ(pt) − (πL ϕ(ptL) + πR ϕ(ptR)), where πL and πR are the proportions of cases in node t sent to the left and right child nodes (so that pt = πL ptL + πR ptR). Then ∆i(t) ≥ 0 (see Proposition 4.4 in Breiman et al. (1984)).
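A quick numerical check of the proposition, in the weighted form written above, using the Gini impurity ϕ(p) = p(1 − p); the counts are the ones that reappear in the worked table later in this chapter:

# Numerical check that Delta_i(t) >= 0 for the Gini impurity phi(p) = p(1 - p).
def phi(p):
    return p * (1.0 - p)

# Node t with 300 cases: 80 go to the left child, 220 to the right child.
pi_L, pi_R = 80 / 300, 220 / 300
p_tL, p_tR = 32 / 80, 178 / 220        # class proportions in the two child nodes
p_t = pi_L * p_tL + pi_R * p_tR        # parent proportion is the weighted mixture

delta_i = phi(p_t) - (pi_L * phi(p_tL) + pi_R * phi(p_tR))
print(round(delta_i, 4))  # 0.0327 >= 0, as the proposition states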


Impurity functions

• Classification model

– χ² statistic.

– Gini index: ϕ(p) = p(1 − p)

– Entropy index: ϕ(p) = −p log p − (1 − p) log(1 − p)

• Regression model

– F statistic of ANOVA.

– Reduction in variance.


χ² statistics

• For a given splitting variable and splitting criterion, we make the following table of Good/Bad counts in the two child nodes (the counts below are those used in the worked example that follows):

                Good    Bad    Total
  Left child      32     48       80
  Right child    178     42      220
  Total          210     90      300

• This table is called the observed frequency table O.


χ² statistics

• We can compute the expected frequency E for each cell of the previous table as E = (row total × column total)/(grand total); for example, E(Good, left) = 210 × 80/300 = 56, and similarly E = 24, 154, and 66 for the other cells.


χ² statistics

• χ² statistic:

χ² = ∑ (Eij − Oij)² / Eij, where the sum runs over all cells of the table.

• Applying this to the previous table,

χ² = (56 − 32)²/56 + (24 − 48)²/24 + (154 − 178)²/154 + (66 − 42)²/66 = 46.75

• Find the splitting variable and the splitting criterion that maximize the χ² statistic.
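A short sketch that reproduces this calculation from the observed table; the function name and layout are illustrative, not from the slides:

# Chi-square statistic for a (left/right child) x (Good/Bad) observed table.
def chi_square(observed):
    """observed[i][j]: count of class j (Good, Bad) in child node i (left, right)."""
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected frequency E_ij
            stat += (e - o) ** 2 / e
    return stat

# Observed table from above: left child (32 Good, 48 Bad), right child (178 Good, 42 Bad).
print(round(chi_square([[32, 48], [178, 42]]), 2))  # 46.75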


Gini index

• Gini index

Gini index = Probability of Good at left child
× Probability of Bad at left child
+ Probability of Good at right child
× Probability of Bad at right child

• Applying this to the previous table,

Gini index = (32/80) ∗ (48/80) + (178/220) ∗ (42/220) = 0.3944

• Find the splitting variable and the splitting criterion minimizing the Gini index.
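The same quantity in code, following this slide's (unweighted, two-child) definition; the function name is mine:

# Gini index of a split as defined above: sum over the two children of p_Good * p_Bad.
def split_gini(left_good, left_bad, right_good, right_bad):
    n_left = left_good + left_bad
    n_right = right_good + right_bad
    return ((left_good / n_left) * (left_bad / n_left)
            + (right_good / n_right) * (right_bad / n_right))

print(split_gini(32, 48, 178, 42))  # 0.39446..., the 0.3944 reported above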


Entropy index

• Entropy index

Entropy index = − [ Probability of Good at left child
× log(Probability of Good at left child)
+ Probability of Bad at left child
× log(Probability of Bad at left child)
+ Probability of Good at right child
× log(Probability of Good at right child)
+ Probability of Bad at right child
× log(Probability of Bad at right child) ]


• Applying this to the previous table (with base-10 logarithms),

entropy = −[ (32/80) ∗ log(32/80) + (48/80) ∗ log(48/80)
+ (178/220) ∗ log(178/220) + (42/220) ∗ log(42/220) ]
= 0.5040

• Find the splitting variable and the splitting criterion minimizing the entropy index.
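A sketch that checks this number, with base-10 logarithms as in the arithmetic above; the function name is mine:

import math

# Entropy index of a split (base-10 logs, matching the worked example above).
def split_entropy(left_good, left_bad, right_good, right_bad):
    n_left, n_right = left_good + left_bad, right_good + right_bad
    probs = [left_good / n_left, left_bad / n_left,
             right_good / n_right, right_bad / n_right]
    return -sum(p * math.log10(p) for p in probs)

print(f"{split_entropy(32, 48, 178, 42):.4f}")  # 0.5040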


Example : Splitting method

• Using the Gini index, find an optimal split for the following table.


Example : Splitting method

1. Split by temperature.

1-1. left node={hot}, right node={mild,cold}.

• Gini index = 3/4 ∗ 1/4 + 3/10 ∗ 7/10 = 0.3975


Example : Splitting method

1. Split by temperature.

1-2. left node={mild}, right node={hot,cold}.

• Gini index = 1/6 ∗ 5/6 + 5/8 ∗ 3/8 = 0.373


Example : Splitting method

1. Split by temperature.

1-3. left node={cold}, right node={hot,mild}.

• Gini index = 2/4 ∗ 2/4 + 4/10 ∗ 6/10 = 0.49


Example : Splitting method

2. Split by humidity.

2-1. left node={high}, right node={normal}.

• Gini index = 3/7 ∗ 4/7 + 3/7 ∗ 4/7 = 0.489


Example : Splitting method

3. Split by windy.

3-1. left node={false}, right node={true}.

• Gini index = 4/8 ∗ 4/8 + 2/6 ∗ 4/6 = 0.472


Example : Splitting method

• Select a split with the smallest impurity.

• Thus, we select split 1-2, which has the smallest Gini index (0.373); a sketch of this search follows.
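A sketch of the search above. The per-category (Good, Bad) counts are inferred from the fractions in splits 1-1 through 3-1 and are therefore an assumption; which class is called "Good" is arbitrary here, and the Gini values do not depend on it:

# Evaluate the "one category vs. the rest" binary splits considered above and
# select the one with the smallest Gini index (as defined on the earlier slide).
# Per-category class counts are inferred from the reported fractions (assumption).
counts = {
    "temperature": {"hot": (3, 1), "mild": (1, 5), "cold": (2, 2)},
    "humidity":    {"high": (3, 4), "normal": (3, 4)},
    "windy":       {"false": (4, 4), "true": (2, 4)},
}

def child_gini(good, bad):
    n = good + bad
    return (good / n) * (bad / n)

best = None
for variable, by_category in counts.items():
    total_good = sum(g for g, _ in by_category.values())
    total_bad = sum(b for _, b in by_category.values())
    for category, (g, b) in by_category.items():
        # Left child = this category; right child = all remaining categories.
        value = child_gini(g, b) + child_gini(total_good - g, total_bad - b)
        if best is None or value < best[0]:
            best = (value, variable, category)

print(best)  # (0.3732..., 'temperature', 'mild'), i.e. split 1-2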


Measurement of impurity in regression model

• Use the split with the smallest significance probability (p-value) of the t statistic that tests the difference between the means of the two child nodes.

• Use the split with the smallest sum of variances of the two child nodes.


Remark on impurity

• Impurity is defined for each node.

• For a given node, the split rule is selected as the one giving the smallest sum of impurities of the child nodes. This maximizes the difference between the impurity of the parent node and the sum of the impurities of the child nodes.

• When choosing the optimal split rule among several nodes, find the split that maximizes the difference in impurity between the parent node and its child nodes, not the one that merely minimizes the sum of the impurities of the child nodes.


Remark on multi-way split

• So far, we have considered binary splits.

• That is, each node can have only two children.

• We can also consider a multi-way split, where a node can have more than two children.

• There is such an option in SAS E-Miner.

• However, it is known that the multi-way split is inferior to the binary split.

• This is partly because the multi-way split is too greedy.

• Also, note that a multi-way split can be represented by several binary splits.


Stopping rules

• A stopping rule terminates further splitting.

• For example (see the configuration sketch below):

– All observations in a node belong to a single group.

– The number of observations in a node is small.

– The decrease in impurity is small.

– The depth of a node exceeds a given limit.
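Purely as an illustration (not from the slides), such stopping criteria correspond to hyperparameters of a typical tree implementation; here scikit-learn's DecisionTreeClassifier is assumed, with illustrative values:

from sklearn.tree import DecisionTreeClassifier

# Stopping criteria expressed as hyperparameters (values are illustrative).
tree = DecisionTreeClassifier(
    criterion="gini",            # impurity measure (Gini index; "entropy" also available)
    min_samples_split=20,        # do not split nodes with few observations
    min_impurity_decrease=1e-3,  # do not split if the decrease in impurity is small
    max_depth=5,                 # do not split nodes deeper than this limit
)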


Pruning

• A tree with too many nodes will have a large prediction error for new observations.

• It is therefore appropriate to prune away some branches of the tree to obtain a good prediction error.

• To determine the size of the tree, we estimate the prediction error using a validation set or cross-validation.


Pruning Process

• For a given tree T and a positive number a, cost-complexity pruning is based on the criterion

cost-complexity(a) = error rate of T + a|T|

where |T| is the number of terminal nodes of T.

• In general, the larger the tree (the larger |T|), the smaller the error rate. But the cost-complexity does not necessarily decrease as |T| increases.

• For the grown tree Tm, T(a) is the subtree which minimizes cost-complexity(a).

• In general, the larger a is, the smaller |T(a)| is.


Pruning Process

• One important property of T(a) is that T(a) is a subtree of T(b) when a > b.

• This significantly reduces the computation needed to search for T(a).

• See Breiman et al. (1984) for a proof.

• For a given a, we estimate the generalization error of T(a) by cross-validation.

• Choose a∗ (and the corresponding T(a∗)) which minimizes the (estimated) generalization error. A sketch of this procedure follows.
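A sketch of this procedure using scikit-learn's cost-complexity pruning support; the library, the synthetic data set, and all settings here are assumptions for illustration, not part of the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values of a (called ccp_alpha here), taken from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate the generalization error of T(a) by cross-validation and choose a*.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())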


3. Some Algorithms for Decision Tree


CART

• Classification And Regression Tree

• Breiman et al., 1984.

• A result of machine learning research.

• One of the most popular decision tree algorithms.

• Uses the Gini index or entropy as the impurity measure.

• Cost-complexity pruning is an important unique feature.

• Can consider a split rule based on a linear combination of

variables.

• Missing data can be processed using surrogate variables.


C4.5

• J. Ross Quinlan

• The early version: ID3 (Iterative Dichotomizer 3), 1986.

• Multi-way splits are available.

• For a categorical input variable, a node splits into as many children as there are categories.

• C4.5 employs entropy as the impurity measure.

• C4.5 uses a test data set for the pruning process.


C4.5

• There is a program to generate rules from the tree.

• Example

– Watch a game, home team wins, drink beer.

– Watch a game, home team wins, drink soda.

– Watch a game, home team loses, drink beer.

– Watch a game, home team loses, drink milk.

• There is no relation between the home team winning or losing and drinking beer. → Watch a game, drink beer.


CHAID

• Chi-squared Automatic Interaction Detection

• J. A. Hartigan, 1975.

• Successor of AID, described by J. N. Morgan and J. A. Sonquist, 1963.

• CHAID has no pruning process; it stops growing at a certain size.

• Categorical input variables only.

• CHAID employs the χ² statistic as the impurity measure.


4. Advantages and Disadvantages of Decision Tree


Advantages

• The tree generates simple rules.

• Classification is easy.

• Handles both categorical and continuous variables.

• Identifies the most significant variables.

• Robust to input outliers.

• Nonparametric model.


Disadvantages

• Poor prediction accuracy for a linear model with a continuous target variable.

• When the depth is large, both accuracy and interpretability are poor.

• Heavy computational cost.

• Unstable.

• Absence of linearity and main effects (all nodes represent high-order interactions).


5. Decision Tree as High-Dimensional Nonlinear Function Estimation

• We can write the decision tree as

f̂(x) = ∑t∈T̃ αt I(x ∈ Rt)

where T̃ is the set of terminal nodes, αt is the predicted value at node t, and Rt is the region of input values corresponding to node t.

• Rt is a rectangle of the form Rt = {x : x1 ∈ A1, . . . , xp ∈ Ap}, where Ai is a subset of the domain of xi.

• In this view, the decision tree can be considered a local constant model (a sketch is given below).

• The art of the decision tree is to find Rt, t ∈ T̃.

• Global search for the optimal Rt is NP-complete.
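A minimal sketch of the local constant representation referred to above, with two terminal regions on a single input; the split point and the values αt are made up for illustration:

# f_hat(x) = sum over terminal nodes t of alpha_t * I(x in R_t),
# illustrated with a depth-1 tree on one input: R_1 = {x < 3}, R_2 = {x >= 3}.
terminal_nodes = [
    {"region": lambda x: x < 3.0,  "alpha": 1.5},   # alpha_t: predicted value on R_t
    {"region": lambda x: x >= 3.0, "alpha": 4.0},
]

def f_hat(x):
    # Exactly one indicator I(x in R_t) equals 1, so the sum returns that node's alpha_t.
    return sum(node["alpha"] * (1 if node["region"](x) else 0) for node in terminal_nodes)

print(f_hat(2.0), f_hat(7.0))  # 1.5 4.0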


• The decision tree finds Rt greedily, in a way similar to the forward selection procedure.

• In each node, we find a split rule which uses only one variable. Hence, we can say that the decision tree applies a univariate function estimation procedure (i.e., a local constant fit) repeatedly (greedily).

• Most high-dimensional nonlinear function estimation methods use this idea to overcome the curse of dimensionality.

• An important disadvantage of this approach (repeated univariate function estimation) is that the final model might be sub-optimal and can be unstable.

