
Chapter 6. Decision Tree (Chapter 5 of PKKSC)

Yongdai Kim

Seoul National University

1. Introduction


Decision Tree

• A supervised learning (classification and prediction) method.

• Generates rules of the form "if-then".

• The rules are easy to express in a database language such as SQL.

• Good interpretability.

* The first method for high-dimensional nonlinear function estimation.


Prediction and Interpretation

• Prediction accuracy is often the first concern:

– Example: A company wants to select the small part of its customers who will respond, with high probability, to a direct mail sent by the company.

• In general, not only prediction accuracy but also explaining the reasons for a decision is important.

– Example: A bank must explain the reasons for rejection to a loan applicant.

• Decision trees have good interpretive power.


Example


Elements of Decision Tree


2. Construction of Decision Tree


Questions for tree construction

• We ask the following questions to construct a tree.

– Why does the root node ask for income?

– Why is node 6 an internal node?

– What value does the tree assign to node 7?


Four ingredients in decision tree construction

• Splitting rule.

• Stopping rule.

• Pruning rule.

• Assigning a prediction value for each terminal node.


Four steps for construction of a decision tree

• Growing: Find an optimal splitting rule for each node and grow the tree. Stop growing when the stopping rule is satisfied.

• Pruning: Remove nodes that increase the prediction error or that have inappropriate inference rules, and also remove unnecessary (redundant) nodes.

• Validation: Validate the tree using a gain chart, a risk chart, test-sample error, cross-validation, etc. (to decide how much to prune the tree).

• Interpretation and prediction: Interpret the constructed tree and make predictions.


Splitting Rule

• For each node, determine a splitting variable and a splitting criterion.

• For a continuous splitting variable X, the splitting criterion c is a number. In general, the tree assigns a case with X < c to the left child node and a case with X ≥ c to the right child node.

• For a categorical splitting variable, the splitting criterion divides the range of the variable into two parts. For example, {1, 2, 4} and {3} is a splitting criterion for a splitting variable X whose range is {1, 2, 3, 4}. If a case has X ∈ {1, 2, 4}, the tree assigns it to the left child; otherwise, the tree assigns it to the right child (see the sketch below).
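A minimal sketch of how such a split rule can be applied to a single case, assuming a simple dictionary-based rule representation (the field names, variables, and values are illustrative, not from the slides):

# Apply a split rule to one case; the rule format here is an assumption.
def go_left(rule, case):
    """Return True if the case is sent to the left child under the rule."""
    value = case[rule["variable"]]
    if rule["type"] == "continuous":
        return value < rule["threshold"]          # X < c goes left, X >= c goes right
    return value in rule["left_categories"]       # e.g. {1, 2, 4} goes left, {3} goes right

continuous_rule = {"type": "continuous", "variable": "income", "threshold": 50.0}
categorical_rule = {"type": "categorical", "variable": "X", "left_categories": {1, 2, 4}}

print(go_left(continuous_rule, {"income": 42.0}))  # True: 42.0 < 50.0
print(go_left(categorical_rule, {"X": 3}))         # False: 3 is not in {1, 2, 4}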


Purity

• Purity (or impurity) is a measure of the homogeneity of the target variable within a given node.

• For example, a node in which the ratio of group 0 to group 1 is 9:11 has lower purity than a node in which the ratio of group 0 to group 1 is 1:9.

• For each node, we select the splitting variable and splitting criterion that maximize the sum of the purities of the two child nodes.


Purity (impurity) measure

• A decision tree grows by splitting each node.

• After a node is split into two child nodes, the sum of the purities of the child nodes is greater than the purity of the parent node.

• That is, the child nodes are purer than the parent node.

• The splitting variable and criterion are chosen to maximize the reduction of impurity achieved by the split.

• An easy candidate for the impurity measure is the error rate. That is, the tree selects the splitting variable and the splitting criterion which minimize the error rate of the child nodes.


Problem of error rate as an impurity measure

• Split 2 has the same error rate as Split 1, but Split 2 is better in

the sense that the left child node can be split further.


Conditions for impurity functions

• We have seen that the error rate is not a good impurity measure.

• It is appropriate that the impurity measure be small when one of the child nodes has an extremely small error rate.

• The impurity function ϕ : [0, 1] → [0, ∞) should satisfy

– ϕ(0) = ϕ(1) = 0

– ϕ(1/2) = maximum

– ϕ(p) = ϕ(1 − p)

• Also, to give more impurity to p around 1/2, we require that the impurity function be concave.

• Proposition. For a given node t, let ∆i(t) = ϕ(pt) − (πL ϕ(ptL) + πR ϕ(ptR)), where πL and πR are the proportions of cases in node t sent to the left and right child nodes (so that pt = πL ptL + πR ptR). Then ∆i(t) ≥ 0 (see Proposition 4.4 in Breiman et al. (1984)).
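A quick numerical check of the proposition, in the weighted form written above, using the Gini impurity ϕ(p) = p(1 − p); the counts are the ones that reappear in the worked table later in this chapter:

# Numerical check that Delta_i(t) >= 0 for the Gini impurity phi(p) = p(1 - p).
def phi(p):
    return p * (1.0 - p)

# Node t with 300 cases: 80 go to the left child, 220 to the right child.
pi_L, pi_R = 80 / 300, 220 / 300
p_tL, p_tR = 32 / 80, 178 / 220        # class proportions in the two child nodes
p_t = pi_L * p_tL + pi_R * p_tR        # parent proportion is the weighted mixture

delta_i = phi(p_t) - (pi_L * phi(p_tL) + pi_R * phi(p_tR))
print(round(delta_i, 4))  # 0.0327 >= 0, as the proposition states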


Impurity functions

• Classification model

– χ² statistic.

– Gini index: ϕ(p) = p(1 − p)

– Entropy index: ϕ(p) = −p log p − (1 − p) log(1 − p)

• Regression model

– F statistic of ANOVA.

– Reduction in variance.


χ² statistics

• For a given splitting variable and splitting criterion, we make the following table of Good/Bad counts in the two child nodes (the counts below are those used in the worked example that follows):

                Good    Bad    Total
  Left child      32     48       80
  Right child    178     42      220
  Total          210     90      300

• This table is called the observed frequency table O.


χ² statistics

• We can compute the expected frequency E for each cell of the previous table as E = (row total × column total)/(grand total); for example, E(Good, left) = 210 × 80/300 = 56, and similarly E = 24, 154, and 66 for the other cells.


χ² statistics

• χ² statistic:

χ² = ∑ (Eij − Oij)² / Eij, where the sum runs over all cells of the table.

• Applying this to the previous table,

χ² = (56 − 32)²/56 + (24 − 48)²/24 + (154 − 178)²/154 + (66 − 42)²/66 = 46.75

• Find the splitting variable and the splitting criterion that maximize the χ² statistic.
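A short sketch that reproduces this calculation from the observed table; the function name and layout are illustrative, not from the slides:

# Chi-square statistic for a (left/right child) x (Good/Bad) observed table.
def chi_square(observed):
    """observed[i][j]: count of class j (Good, Bad) in child node i (left, right)."""
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected frequency E_ij
            stat += (e - o) ** 2 / e
    return stat

# Observed table from above: left child (32 Good, 48 Bad), right child (178 Good, 42 Bad).
print(round(chi_square([[32, 48], [178, 42]]), 2))  # 46.75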


Gini index

• Gini index

Gini index = Probability of Good at left child
× Probability of Bad at left child
+ Probability of Good at right child
× Probability of Bad at right child

• Applying this to the previous table,

Gini index = (32/80) ∗ (48/80) + (178/220) ∗ (42/220) = 0.3944

• Find the splitting variable and the splitting criterion minimizing the Gini index.
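The same quantity in code, following this slide's (unweighted, two-child) definition; the function name is mine:

# Gini index of a split as defined above: sum over the two children of p_Good * p_Bad.
def split_gini(left_good, left_bad, right_good, right_bad):
    n_left = left_good + left_bad
    n_right = right_good + right_bad
    return ((left_good / n_left) * (left_bad / n_left)
            + (right_good / n_right) * (right_bad / n_right))

print(split_gini(32, 48, 178, 42))  # 0.39446..., the 0.3944 reported above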


Entropy index

• Entropy index

Entropy index = − [ Probability of Good at left child
× log(Probability of Good at left child)
+ Probability of Bad at left child
× log(Probability of Bad at left child)
+ Probability of Good at right child
× log(Probability of Good at right child)
+ Probability of Bad at right child
× log(Probability of Bad at right child) ]


• Applying this to the previous table (with base-10 logarithms),

entropy = −[ (32/80) ∗ log(32/80) + (48/80) ∗ log(48/80)
+ (178/220) ∗ log(178/220) + (42/220) ∗ log(42/220) ]
= 0.5040

• Find the splitting variable and the splitting criterion minimizing the entropy index.
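A sketch that checks this number, with base-10 logarithms as in the arithmetic above; the function name is mine:

import math

# Entropy index of a split (base-10 logs, matching the worked example above).
def split_entropy(left_good, left_bad, right_good, right_bad):
    n_left, n_right = left_good + left_bad, right_good + right_bad
    probs = [left_good / n_left, left_bad / n_left,
             right_good / n_right, right_bad / n_right]
    return -sum(p * math.log10(p) for p in probs)

print(f"{split_entropy(32, 48, 178, 42):.4f}")  # 0.5040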


Example : Splitting method

• Using the Gini index, find an optimal split for the following table.


Example : Splitting method

1. Split by temperature.

1-1. left node={hot}, right node={mild,cold}.

• Gini index = 3/4 ∗ 1/4 + 3/10 ∗ 7/10 = 0.3975


Example : Splitting method

1. Split by temperature.

1-2. left node={mild}, right node={hot,cold}.

• Gini index = 1/6 ∗ 5/6 + 5/8 ∗ 3/8 = 0.373


Example : Splitting method

1. Split by temperature.

1-3. left node={cold}, right node={hot,mild}.

• Gini index = 2/4 ∗ 2/4 + 4/10 ∗ 6/10 = 0.49


Example : Splitting method

2. Split by humidity.

2-1. left node={high}, right node={normal}.

• Gini index = 3/7 ∗ 4/7 + 3/7 ∗ 4/7 = 0.489


Example : Splitting method

3. Split by windy.

3-1. left node={false}, right node={true}.

• Gini index = 4/8 ∗ 4/8 + 2/6 ∗ 4/6 = 0.472


Example : Splitting method

• Select a split with the smallest impurity.

• Thus, we select split 1-2, which has the smallest Gini index (0.373); a sketch of this search follows.
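A sketch of the search above. The per-category (Good, Bad) counts are inferred from the fractions in splits 1-1 through 3-1 and are therefore an assumption; which class is called "Good" is arbitrary here, and the Gini values do not depend on it:

# Evaluate the "one category vs. the rest" binary splits considered above and
# select the one with the smallest Gini index (as defined on the earlier slide).
# Per-category class counts are inferred from the reported fractions (assumption).
counts = {
    "temperature": {"hot": (3, 1), "mild": (1, 5), "cold": (2, 2)},
    "humidity":    {"high": (3, 4), "normal": (3, 4)},
    "windy":       {"false": (4, 4), "true": (2, 4)},
}

def child_gini(good, bad):
    n = good + bad
    return (good / n) * (bad / n)

best = None
for variable, by_category in counts.items():
    total_good = sum(g for g, _ in by_category.values())
    total_bad = sum(b for _, b in by_category.values())
    for category, (g, b) in by_category.items():
        # Left child = this category; right child = all remaining categories.
        value = child_gini(g, b) + child_gini(total_good - g, total_bad - b)
        if best is None or value < best[0]:
            best = (value, variable, category)

print(best)  # (0.3732..., 'temperature', 'mild'), i.e. split 1-2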


Measurement of impurity in regression model

• Use the split with the smallest significance probability (p-value) of the t statistic that tests the difference between the means of the two child nodes.

• Use the split with the smallest sum of variances of the two child nodes.


Remark on impurity

• Impurity is defined for each node.

• For a given node, the split rule is selected as the one giving the smallest sum of impurities of the child nodes. This maximizes the difference between the impurity of the parent node and the sum of the impurities of the child nodes.

• When choosing the optimal split rule among several nodes, find the split that maximizes the difference in impurity between the parent node and its child nodes, not the one that merely minimizes the sum of the impurities of the child nodes.


Remark on multi-way split

• So far, we have considered binary splits.

• That is, each node can have only two children.

• We can also consider a multi-way split, where a node can have more than two children.

• There is such an option in SAS E-Miner.

• However, it is known that the multi-way split is inferior to the binary split.

• This is partly because the multi-way split is too greedy.

• Also, note that a multi-way split can be represented by several binary splits.


Stopping rules

• A stopping rule terminates further splitting.

• For example (see the configuration sketch below):

– All observations in a node belong to a single group.

– The number of observations in a node is small.

– The decrease in impurity is small.

– The depth of a node exceeds a given limit.
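Purely as an illustration (not from the slides), such stopping criteria correspond to hyperparameters of a typical tree implementation; here scikit-learn's DecisionTreeClassifier is assumed, with illustrative values:

from sklearn.tree import DecisionTreeClassifier

# Stopping criteria expressed as hyperparameters (values are illustrative).
tree = DecisionTreeClassifier(
    criterion="gini",            # impurity measure (Gini index; "entropy" also available)
    min_samples_split=20,        # do not split nodes with few observations
    min_impurity_decrease=1e-3,  # do not split if the decrease in impurity is small
    max_depth=5,                 # do not split nodes deeper than this limit
)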


Pruning

• A tree with too many nodes will have a large prediction error for new observations.

• It is therefore appropriate to prune away some branches of the tree to obtain a good prediction error.

• To determine the size of the tree, we estimate the prediction error using a validation set or cross-validation.


Pruning Process

• For a given tree T and a positive number a, cost-complexity pruning is based on the criterion

cost-complexity(a) = error rate of T + a|T|

where |T| is the number of terminal nodes of T.

• In general, the larger the tree (the larger |T|), the smaller the error rate. But the cost-complexity does not necessarily decrease as |T| increases.

• For the grown tree Tm, T(a) is the subtree which minimizes cost-complexity(a).

• In general, the larger a is, the smaller |T(a)| is.


Pruning Process

• One important property of T(a) is that T(a) is a subtree of T(b) when a > b.

• This significantly reduces the computation needed to search for T(a).

• See Breiman et al. (1984) for a proof.

• For a given a, we estimate the generalization error of T(a) by cross-validation.

• Choose a∗ (and the corresponding T(a∗)) which minimizes the (estimated) generalization error. A sketch of this procedure follows.
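A sketch of this procedure using scikit-learn's cost-complexity pruning support; the library, the synthetic data set, and all settings here are assumptions for illustration, not part of the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values of a (called ccp_alpha here), taken from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate the generalization error of T(a) by cross-validation and choose a*.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())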


3. Some Algorithms for Decision Tree


CART

• Classification And Regression Tree

• Breiman et al., 1984.

• A result of machine learning research.

• One of the most popular decision tree algorithms.

• Uses the Gini index or entropy as the impurity measure.

• Cost-complexity pruning is an important unique feature.

• Can consider a split rule based on a linear combination of

variables.

• Missing data can be processed using surrogate variables.


C4.5

• J. Ross Quinlan

• The early version: ID3 (Iterative Dichotomizer 3), 1986.

• Multi-way splits are available.

• For a categorical input variable, a node splits into as many children as there are categories.

• C4.5 employs entropy as the impurity measure.

• C4.5 uses a test data set for the pruning process.


C4.5

• There is a program to generate rules from the tree.

• Example

– Watch a game, home team wins, drink beer.

– Watch a game, home team wins, drink soda.

– Watch a game, home team loses, drink beer.

– Watch a game, home team loses, drink milk.

• There is no relation between the home team winning or losing and drinking beer. → Watch a game, drink beer.


CHAID

• Chi-squared Automatic Interaction Detection

• J. A. Hartigan, 1975.

• Successor of AID, described by J. N. Morgan and J. A. Sonquist, 1963.

• CHAID has no pruning process; it stops growing at a certain size.

• Categorical input variables only.

• CHAID employs the χ² statistic as the impurity measure.


4. Advantages and Disadvantages of Decision Tree


Advantages

• The tree generates simple rules.

• Classification is easy.

• Handles both categorical and continuous variables.

• Identifies the most significant variables.

• Robust to input outliers.

• Nonparametric model.


Disadvantages

• Poor prediction accuracy for a linear model with a continuous target variable.

• When the depth is large, both accuracy and interpretability are poor.

• Heavy computational cost.

• Unstable.

• Absence of linearity and main effects (all nodes represent high-order interactions).


5. Decision Tree as High-Dimensional Nonlinear Function Estimation

• We can write the decision tree as

f̂(x) = ∑t∈T̃ αt I(x ∈ Rt)

where T̃ is the set of terminal nodes, αt is the predicted value at node t, and Rt is the region of input values corresponding to node t.

• Rt is a rectangle of the form Rt = {x : x1 ∈ A1, . . . , xp ∈ Ap}, where Ai is a subset of the domain of xi.

• In this view, the decision tree can be considered a local constant model (a sketch is given below).

• The art of the decision tree is to find Rt, t ∈ T̃.

• Global search for the optimal Rt is NP-complete.
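A minimal sketch of the local constant representation referred to above, with two terminal regions on a single input; the split point and the values αt are made up for illustration:

# f_hat(x) = sum over terminal nodes t of alpha_t * I(x in R_t),
# illustrated with a depth-1 tree on one input: R_1 = {x < 3}, R_2 = {x >= 3}.
terminal_nodes = [
    {"region": lambda x: x < 3.0,  "alpha": 1.5},   # alpha_t: predicted value on R_t
    {"region": lambda x: x >= 3.0, "alpha": 4.0},
]

def f_hat(x):
    # Exactly one indicator I(x in R_t) equals 1, so the sum returns that node's alpha_t.
    return sum(node["alpha"] * (1 if node["region"](x) else 0) for node in terminal_nodes)

print(f_hat(2.0), f_hat(7.0))  # 1.5 4.0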


• The decision tree finds Rt greedily, in a way similar to the forward selection procedure.

• In each node, we find a split rule which uses only one variable. Hence, we can say that the decision tree applies a univariate function estimation procedure (i.e., a local constant fit) repeatedly (greedily).

• Most high-dimensional nonlinear function estimation methods use this idea to overcome the curse of dimensionality.

• An important disadvantage of this approach (repeated univariate function estimation) is that the final model might be sub-optimal and can be unstable.

