Chapter 6. Decision Tree (Chapter 5 of PKKSC)

Yongdai Kim

Seoul National University

1. Introduction

Decision Tree

• A supervised learning (classification and prediction) method.
• Generates rules of the form "if-then".
• The rules are easily expressed in a database language such as SQL.
• Good interpretability.
* The first method for high-dimensional nonlinear function estimation.

Prediction and Interpretation

• Sometimes prediction accuracy is the primary concern:
  – Example: A company wants to select a small subset of its customers who will respond, with high probability, to a direct mailing sent by the company.
• In general, not only prediction accuracy but also explaining the reasons for a decision is important.
  – Example: A bank must explain the reasons for rejection to a loan applicant.
• A decision tree has good interpretive power.

Example

[Figure: an example decision tree; its root node splits on income, and the questions on the following slides refer to its numbered nodes.]

Elements of Decision Tree

[Figure: the elements of a decision tree.]

2. Construction of Decision Tree

Questions for tree construction

• We ask the following questions (about the example tree shown earlier) to construct a tree:
  – Why does the root node ask about income?
  – Why is node 6 an internal node?
  – What value does the tree assign to node 7?

Four ingredients in decision tree construction

• Splitting rule.
• Stopping rule.
• Pruning rule.
• Assignment of a prediction value to each terminal node.

Four steps for construction of a decision tree

• Growing: Find an optimal splitting rule for each node and grow the tree. Stop growing when a stopping rule is satisfied.
• Pruning: Remove nodes that increase the prediction error or that have inappropriate inference rules, and also remove unnecessary (redundant) nodes.
• Validation: Validate the tree using a gain chart, a risk chart, test-sample error, cross-validation, etc., to decide how much to prune.
• Interpretation and prediction: Interpret the constructed tree and use it for prediction.

Splitting Rule

• For each node, determine a splitting variable and a splitting criterion.
• For a continuous splitting variable X, the splitting criterion is a number c. In general, the tree assigns cases with X < c to the left child node and cases with X ≥ c to the right child node.
• For a categorical splitting variable, the splitting criterion divides the range of the variable into two parts. For example, {1, 2, 4} and {3} is a splitting criterion for a splitting variable X whose range is {1, 2, 3, 4}. If a case has X ∈ {1, 2, 4}, the tree assigns it to the left child; otherwise it goes to the right child (see the sketch below).
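To make this concrete, here is a small Python sketch (not from the slides; the function names and sample values are made up) that enumerates candidate binary splits: midpoint thresholds for a continuous variable and two-part partitions of a categorical range.

```python
from itertools import combinations

def continuous_splits(values):
    # Candidate thresholds c for a continuous variable: midpoints between
    # consecutive distinct observed values; a case goes left if x < c.
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def categorical_splits(categories):
    # Candidate binary partitions of a categorical range, each listed once,
    # e.g. {1, 2, 3, 4} -> ({1, 2, 4}, {3}) among others.
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits

print(continuous_splits([3.1, 2.0, 2.0, 5.5]))   # [2.55, 4.3]
print(categorical_splits({1, 2, 3, 4}))          # 7 candidate partitions
```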

Purity

• Purity (or impurity) measures the homogeneity of the target variable within a given node.
• For example, a node in which the ratio of group 0 to group 1 is 9:11 has lower purity than a node in which the ratio is 1:9.
• For each node, we select the splitting variable and splitting criterion that maximize the sum of the purities of the two child nodes.

Purity (impurity) measure

• A decision tree grows by splitting each node.
• After a node is split into two child nodes, the sum of the purities of the child nodes is greater than the purity of the parent node.
• That is, the child nodes are purer than the parent node.
• The splitting variable and criterion are chosen to maximize the reduction of impurity achieved by the split.
• An easy candidate for the impurity measure is the error rate: the tree selects the splitting variable and splitting criterion that minimize the error rate of the child nodes.

Problem of error rate as an impurity measure

[Figure: two candidate splits, Split 1 and Split 2, with the same error rate.]

• Split 2 has the same error rate as Split 1, but Split 2 is better in the sense that its left child node can be split further.

Conditions for impurity functions

• We have seen that the error rate is not a good impurity measure.
• The impurity measure should be small when one of the child nodes has an extremely small error rate.
• An impurity function ϕ : [0, 1] → [0, ∞) should satisfy
  – ϕ(0) = ϕ(1) = 0,
  – ϕ attains its maximum at p = 1/2,
  – ϕ(p) = ϕ(1 − p).
• Also, to give more impurity to p around 1/2, we require the impurity function to be concave.
• Proposition. For a given node t, let ∆i(t) = ϕ(p_t) − π_L ϕ(p_tL) − π_R ϕ(p_tR), where p_t, p_tL, p_tR are the class proportions in the parent and in the left and right child nodes, and π_L, π_R are the proportions of cases in t sent to the left and right child nodes. Then ∆i(t) ≥ 0 (see Proposition 4.4 in Breiman et al. (1984)). A numeric check appears below.
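A quick numeric check of this proposition, assuming the Gini function ϕ(p) = p(1 − p) and, for illustration, the class counts that appear in the χ2 example later in this chapter (80 cases in the left child with 32 of one class; 220 in the right child with 178 of that class):

```python
def phi(p):
    # Gini impurity function: phi(p) = p(1 - p), concave with phi(0) = phi(1) = 0.
    return p * (1 - p)

n_L, a_L = 80, 32      # left child: 80 cases, 32 in the first class
n_R, a_R = 220, 178    # right child: 220 cases, 178 in the first class
n, a = n_L + n_R, a_L + a_R

pi_L, pi_R = n_L / n, n_R / n
delta = phi(a / n) - pi_L * phi(a_L / n_L) - pi_R * phi(a_R / n_R)
print(round(delta, 4))  # 0.0327, nonnegative as the proposition guarantees
```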

Impurity functions

• Classification model:
  – χ2 statistic.
  – Gini index: ϕ(p) = p(1 − p).
  – Entropy index: ϕ(p) = p log p + (1 − p) log(1 − p).
• Regression model:
  – F statistic from ANOVA.
  – Decrease in variance (see the sketch below).
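The node-level measures above can be written down directly; the following Python sketch (illustrative, using the sign conventions of these slides) implements the Gini, entropy, and variance measures. The χ2 and F statistics compare the two child nodes jointly and are computed from the split's frequency table instead.

```python
import math

def gini(p):
    # Gini index: phi(p) = p(1 - p); zero for a pure node, maximal at p = 1/2.
    return p * (1 - p)

def entropy(p):
    # Entropy index as written on the slide: phi(p) = p log p + (1 - p) log(1 - p).
    # It is non-positive and most negative at p = 1/2; the slides therefore minimize it.
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def variance(y):
    # Regression impurity: variance of the target values in a node.
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

print(gini(0.5), gini(0.1))   # 0.25 0.09
print(round(entropy(0.5), 4)) # -0.6931 (natural logarithm)
```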

χ2 statistics

• For a given splitting variable and splitting criterion, we make a two-way table of class frequencies for the two child nodes (the table on the slide).
• This table is called the observed frequency (O) table.

χ2 statistics

• From this table we can compute the table of expected frequencies E, where E_ij = (row i total) × (column j total) / (grand total).

χ2 statistics

• The χ2 statistic is

  χ2 = Σ_{i,j} (O_ij − E_ij)² / E_ij

• Applying it to the previous table:

  χ2 = (56 − 32)²/56 + (24 − 48)²/24 + (154 − 178)²/154 + (66 − 42)²/66 = 46.75 (reproduced in the sketch below)

• Find the splitting variable and the splitting criterion that maximize the χ2 statistic.
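The computation can be checked in plain Python; the observed counts below are the ones recovered from the arithmetic above (rows are the two child nodes, columns the two classes).

```python
# Observed frequency (O) table: rows = left/right child node, columns = the two classes.
O = [[32, 48], [178, 42]]
row = [sum(r) for r in O]                        # [80, 220]
col = [sum(c) for c in zip(*O)]                  # [210, 90]
total = sum(row)                                 # 300

# Expected frequencies: E_ij = row total x column total / grand total.
E = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

chi2 = sum((O[i][j] - E[i][j]) ** 2 / E[i][j] for i in range(2) for j in range(2))
print([[round(e) for e in r] for r in E])        # [[56, 24], [154, 66]]
print(round(chi2, 2))                            # 46.75
```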

Gini index

• Gini index (as defined on these slides):

  Gini index = Probability of Good at left child × Probability of Bad at left child
             + Probability of Good at right child × Probability of Bad at right child

• Applying it to the previous table:

  Gini index = (32/80) × (48/80) + (178/220) × (42/220) = 0.3944

• Find the splitting variable and the splitting criterion that minimize the Gini index.

Entropy index

• Entropy index:

  Entropy index = Probability of Good at left child × log(Probability of Good at left child)
                + Probability of Bad at left child × log(Probability of Bad at left child)
                + Probability of Good at right child × log(Probability of Good at right child)
                + Probability of Bad at right child × log(Probability of Bad at right child)

• Applying it to the previous table (with base-10 logarithms):

  Entropy index = (32/80) × log(32/80) + (48/80) × log(48/80)
                + (178/220) × log(178/220) + (42/220) × log(42/220)
                ≈ −0.504

• Find the splitting variable and the splitting criterion that minimize the entropy index. (The sketch below reproduces the Gini and entropy values.)
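The Gini and entropy indices of the same split can be checked the same way (class counts recovered from the fractions on the slides; base-10 logarithms, matching the computation above):

```python
import math

# (class-1, class-2) counts in the two child nodes; which count is "Good" does not matter here.
left, right = (32, 48), (178, 42)

def child_term(counts, f):
    a, b = counts
    n = a + b
    return f(a / n, b / n)

# Gini index as defined on these slides: sum over child nodes of P(class 1) * P(class 2).
gini = sum(child_term(c, lambda p, q: p * q) for c in (left, right))

# Entropy index as defined on these slides, with base-10 logarithms.
entropy = sum(child_term(c, lambda p, q: p * math.log10(p) + q * math.log10(q))
              for c in (left, right))

print(round(gini, 4))     # 0.3945 (the slide reports 0.3944, truncated)
print(round(entropy, 3))  # -0.504
```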

Example: Splitting method

• Using the Gini index, find the optimal split for the data in the table on the slide (14 cases described by temperature, humidity, and windy).

Example: Splitting method

1. Split by temperature.
1-1. Left node = {hot}, right node = {mild, cold}.
• Gini index = 3/4 × 1/4 + 3/10 × 7/10 = 0.3975

Example: Splitting method

1. Split by temperature.
1-2. Left node = {mild}, right node = {hot, cold}.
• Gini index = 1/6 × 5/6 + 5/8 × 3/8 = 0.373

Example: Splitting method

1. Split by temperature.
1-3. Left node = {cold}, right node = {hot, mild}.
• Gini index = 2/4 × 2/4 + 4/10 × 6/10 = 0.49

Example: Splitting method

2. Split by humidity.
2-1. Left node = {high}, right node = {normal}.
• Gini index = 3/7 × 4/7 + 3/7 × 4/7 = 0.489

Example: Splitting method

3. Split by windy.
3-1. Left node = {false}, right node = {true}.
• Gini index = 4/8 × 4/8 + 2/6 × 4/6 = 0.472

Example: Splitting method

• Select the split with the smallest impurity.
• Thus, select split 1-2, which splits temperature into {mild} and {hot, cold}. (The sketch below reproduces these computations.)
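The whole comparison can be reproduced from the class counts implied by the fractions on these slides (which class label belongs to which count cannot be recovered from the transcript, but the Gini index depends only on the two proportions in each child node):

```python
def node_gini(a, b):
    # p(1 - p) for a node containing a cases of one class and b of the other.
    n = a + b
    return (a / n) * (b / n)

# (left-node counts, right-node counts) for each candidate split.
splits = {
    "1-1 temperature {hot} | {mild, cold}": ((3, 1), (3, 7)),
    "1-2 temperature {mild} | {hot, cold}": ((1, 5), (5, 3)),
    "1-3 temperature {cold} | {hot, mild}": ((2, 2), (4, 6)),
    "2-1 humidity {high} | {normal}":       ((3, 4), (3, 4)),
    "3-1 windy {false} | {true}":           ((4, 4), (2, 4)),
}

for name, (left, right) in splits.items():
    print(name, round(node_gini(*left) + node_gini(*right), 4))
    # 0.3975, 0.3733, 0.49, 0.4898, 0.4722 -- matching the slides up to rounding

best = min(splits, key=lambda k: sum(node_gini(*c) for c in splits[k]))
print("best split:", best)   # 1-2, the split with the smallest Gini index
```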

Measurement of impurity in regression model

• Use the split with the smallest significance probability (p-value) of the t statistic testing the difference between the means of the two child nodes.
• Alternatively, use the split with the smallest sum of the variances of the two child nodes (a sketch of this criterion follows below).
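A minimal sketch of the variance criterion with made-up target values; for the t-statistic criterion one could, for example, compare the p-values returned by scipy.stats.ttest_ind on the two child nodes.

```python
def variance(y):
    # Population variance of the target values in a node.
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

def split_score(left_y, right_y):
    # Regression split criterion from the slide: sum of the variances
    # of the two child nodes (smaller is better).
    return variance(left_y) + variance(right_y)

# Hypothetical target values falling into the two child nodes.
print(round(split_score([1.0, 1.2, 0.9], [3.0, 3.5, 2.8, 3.1]), 4))  # 0.0806
```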

Remark on impurity

• Impurity is defined for each node.
• For a given node, the split rule is selected so that the sum of the impurities of the child nodes is smallest. This maximizes the difference between the impurity of the parent node and the sum of the impurities of the child nodes.
• When choosing the optimal split rule among several nodes, select the split that maximizes this difference in impurity between the parent node and its child nodes, not the one that merely minimizes the sum of the impurities of the child nodes.

Remark on multi-way split

• So far we have considered binary splits; that is, each node can have only two children.
• We can also consider multi-way splits, where a node can have more than two children; SAS E-Miner offers such an option.
• However, multi-way splits are known to be inferior to binary splits, partly because a multi-way split is too greedy.
• Also, note that a multi-way split can be represented by several binary splits.

Stopping rules

• A stopping rule terminates further splitting. For example:
  – All observations in a node belong to a single class.
  – The number of observations in a node is too small.
  – The decrease in impurity is too small.
  – The depth of the node exceeds a given limit (in practice, such rules are tuning parameters; see the sketch below).
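As one possible mapping of these rules onto software options (an illustration using scikit-learn, not prescribed by the slides):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping rules listed above;
# a node whose observations all belong to one class is never split anyway.
tree = DecisionTreeClassifier(
    min_samples_split=20,         # a node with fewer observations is not split
    min_impurity_decrease=1e-3,   # stop when the decrease in impurity is too small
    max_depth=5,                  # stop when the node depth reaches this limit
)
```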

Pruning

• A tree with too many nodes will have a large prediction error for new observations.
• It is therefore appropriate to prune away some branches of the tree to reduce the prediction error.
• To determine the size of the tree, we estimate the prediction error using a validation set or cross-validation.

Pruning Process

• For a given tree T and a positive number a, the cost-complexity of T is defined by

  cost-complexity(a) = error rate of T + a|T|,

  where |T| is the number of nodes of T.

• In general, the larger the tree (the larger |T|), the smaller the error rate; but the cost-complexity does not necessarily decrease as |T| increases.
• For the grown tree Tm, T(a) is the subtree that minimizes cost-complexity(a).
• In general, the larger a is, the smaller |T(a)| is.

Pruning Process

• One important property of T(a) is that T(a) is a subtree of T(b) whenever a > b. This significantly reduces the computation needed to search for T(a); see Breiman et al. (1984) for a proof.
• For a given a, we estimate the generalization error of T(a) by cross-validation.
• Choose a* (and the corresponding T(a*)) that minimizes the estimated generalization error (see the sketch below).
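As an illustration (not part of the slides), scikit-learn implements CART-style cost-complexity pruning, with the penalty a called ccp_alpha; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values of a along the pruning path; T(a) shrinks as a grows.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate the generalization error of each T(a) by cross-validation and pick a*.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_a = path.ccp_alphas[int(np.argmax(cv_scores))]
print(best_a)
```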

3. Some Algorithms for Decision Tree

CART

• Classification And Regression Trees, Breiman et al. (1984).
• A result of machine learning research and one of the most popular decision tree algorithms.
• Uses the Gini index or entropy as the impurity measure.
• Cost-complexity pruning is an important and distinctive feature.
• Split rules based on a linear combination of variables can also be considered.
• Missing data can be handled using surrogate variables.

C4.5

• Developed by J. Ross Quinlan; the early version is ID3 (Iterative Dichotomizer 3), 1986.
• Multi-way splits are available: a categorical input variable splits a node into as many children as the variable has categories.
• C4.5 employs entropy as the impurity measure.
• C4.5 uses a test data set for the pruning process.

C4.5

• There is a program that generates rules from the tree.
• Example:
  – Watch a game, home team wins, drink beer.
  – Watch a game, home team wins, drink soda.
  – Watch a game, home team loses, drink beer.
  – Watch a game, home team loses, drink milk.
• Since there is no relation between the home team winning or losing and drinking beer, the rules simplify to: watch a game, drink beer.

CHAID

• Chi-squared Automatic Interaction Detection.
• Proposed by G. V. Kass (1980), as a successor of AID, described by J. N. Morgan and J. A. Sonquist (1963).
• CHAID has no pruning process; it stops growing when the tree reaches a certain size.
• Only categorical input variables can be used.
• CHAID employs the χ2 statistic as the impurity measure.

4. Advantages and Disadvantages of Decision Tree

Advantages

• Generates easy-to-understand rules.
• Classification of new cases is easy.
• Handles both categorical and continuous variables.
• Identifies the most significant variables.
• Robust to outliers in the input variables.
• A nonparametric model.

Disadvantages

• Poor prediction accuracy when the target variable is continuous and the true relationship is linear (a linear regression model).
• When the depth is large, both accuracy and interpretability deteriorate.
• Heavy computational cost.
• Unstable: small changes in the data can produce a very different tree.
• Absence of linearity and main effects (all nodes correspond to high-order interactions).

5. Decision Tree as High-Dimensional Nonlinear Function Estimation

• We can write the decision tree as

  f̂(x) = Σ_{t∈T̃} α_t I(x ∈ R_t),

  where T̃ is the set of terminal nodes, α_t is the predicted value at node t, and R_t is the region of the input space corresponding to node t.

• R_t has the form R_t = {x : x_1 ∈ A_1, . . . , x_p ∈ A_p}, where A_i is a subset of the domain of x_i.
• In this view, a decision tree can be considered a locally constant model (illustrated below).
• The art of the decision tree is to find R_t, t ∈ T̃.
• A global search for the optimal R_t is NP-complete.
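A tiny illustration of this representation with made-up regions and predicted values: the fitted function is constant on each region R_t, and the regions partition the input space.

```python
# f_hat(x) = sum over terminal nodes t of alpha_t * I(x in R_t),
# where each R_t = {x : x1 in A1, x2 in A2} for subsets A1, A2 of the domains.
regions = [
    # (interval A1 for the continuous x1, set A2 for the categorical x2, alpha_t)
    ((float("-inf"), 3.0), {"a", "b"}, 0.2),
    ((float("-inf"), 3.0), {"c"}, 0.7),
    ((3.0, float("inf")), {"a", "b", "c"}, 0.5),
]

def f_hat(x1, x2):
    for (lo, hi), cats, alpha in regions:
        if lo <= x1 < hi and x2 in cats:   # I(x in R_t)
            return alpha
    raise ValueError("the regions should partition the input space")

print(f_hat(2.0, "c"))   # 0.7
print(f_hat(4.5, "a"))   # 0.5
```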

• A decision tree finds the regions R_t greedily, in a manner similar to forward selection.
• At each node, the split rule uses only one variable. Hence, a decision tree applies a univariate function estimation procedure (a local constant fit) repeatedly and greedily.
• Most high-dimensional nonlinear function estimation methods use this idea to overcome the curse of dimensionality.
• An important disadvantage of this approach (repeated univariate function estimation) is that the final model may be sub-optimal and can be unstable.
