
Page 1: Comparative Review of Classification Trees (nymetro.chapter.informs.org/prac_cor_pubs/05-13...)

Comparative Review of Classification Trees

by Leonardo Auslender,

leoldv12 ‘at’ gmail ‘dot’ com

Independent Statistical Research Consultant 2013


— 2 —

1) Trees/CART: varieties, algorithm
2) Model Deployment: scoring
3) Examples
4) Concluding Remarks: brains, the future

Review of trees: Auslender, L. (1998), "Alacart, Poor Man's Classification Trees," North Eastern SAS Users Group Conference.

Contents


— 3 —

A Field Guide to Tree Varieties

CART Tree (S+, R)

AID THAID CHAID

ID3 C4.5 C5.0

1) Varieties of Tree Methods


— 4 —

CART

Classification and

Regression Trees

Source: Breiman L., Friedman J., Olshen R., Stone C.: Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.


— 5 —

Aim: separate two classes by using X1 and X2, producing more homogeneous rectangular regions.


— 6 —

CART: underlying classification algorithm, using misclassification.

Y  X1  X2  X3  X4
0   1  10  21   1
1   1  30   8   1
0   2   0   8   0
0   3  10   8   0

Misscl(Y | X1 <= 1) = .5, Misscl(Y | X1 > 1) = 0. Repeat for every value of X1 and for every other X variable, then select the optimal variable and split (CART actually uses Gini).
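The slide's split search can be sketched in a few lines (toy data copied from the table above; note CART itself uses Gini rather than misclassification):

```python
# Brute-force split search: for each variable and each observed threshold,
# compute the size-weighted misclassification rate of the two child nodes
# and keep the best split found.
data = {"X1": [1, 1, 2, 3], "X2": [10, 30, 0, 8],
        "X3": [21, 8, 8, 8], "X4": [1, 1, 0, 0]}
y = [0, 1, 0, 0]

def misclass(labels):
    """Misclassification rate when the node predicts its majority class."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return min(p1, 1 - p1)

def best_split(data, y):
    best = None  # (weighted error, variable, threshold)
    n = len(y)
    for var, x in data.items():
        for t in sorted(set(x)):
            left = [yi for xi, yi in zip(x, y) if xi <= t]
            right = [yi for xi, yi in zip(x, y) if xi > t]
            err = (len(left) * misclass(left)
                   + len(right) * misclass(right)) / n
            if best is None or err < best[0]:
                best = (err, var, t)
    return best

print(best_split(data, y))  # (0.0, 'X2', 10): X2 <= 10 separates perfectly
```

On this toy table the search finds the same answer as the slide's hand computation: splitting on X1 at 1 leaves a weighted error of .25, while X2 <= 10 yields pure children.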


— 7 —

Basic CART Algorithm: binary dependent variable or target (0,1)

[Figure: Y vs. the range of continuous variable A (Xi), with a splitting point; the original mix of '0's and '1's of the dependent variable (50%) changes to roughly 70% and 20% on the two sides of the split.]


— 8 —

Divide and Conquer: recursive partitioning

Root: n = 5,000, 10% HELOC.
Split on Debits < 19:
  yes: n = 1,650, 21% HELOC
  no:  n = 3,350,  5% HELOC
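A quick consistency check on the split above (a sketch; pairing the 21% rate with the n = 1,650 branch is an assumption inferred from the arithmetic, since the root rate must be the size-weighted average of the child rates):

```python
# Root: n = 5,000 with 10% HELOC; children from the split "Debits < 19".
n_root = 5000
n_yes, rate_yes = 1650, 0.21   # Debits < 19
n_no, rate_no = 3350, 0.05     # Debits >= 19

# The root event rate must equal the weighted average of the child rates.
weighted = (n_yes * rate_yes + n_no * rate_no) / n_root
print(round(weighted, 4))  # 0.1028, i.e. ~10% as reported at the root
```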


— 9 —

Ideal SAS code to find splits:

proc summary data=.... nway;
   class /* all independent vars */;
   var depvar;   /* this is 'target', 0/1 */
   output out=.... sum=;
run;

For large data sets (large N, large NVAR), hardware and software constraints prevent completion.


— 10 —

Fitted Decision Tree: interpretation and structure

[Figure: fitted tree. Splits on VAR A (< 19 vs. >= 19), VAR B (0-52 vs. > 52), and VAR C (0,1 vs. > 1), with node event rates of 5%, 21%, 25%, and 45%.]


— 11 —

The Cultivation of Trees

• Split Search

– Which splits are to be considered?

• Splitting Criterion

– Which split is best?

• Stopping Rule

– When should the splitting stop?

• Pruning Rule

– Should some branches be lopped off?


— 12 —

Possible splits to consider: the most common is binary, because...

[Figure: number of candidate splits vs. number of input levels (2-20), for nominal and ordinal inputs; the nominal count grows into the hundreds of thousands.]

If an ordinal input has 1,000 levels, there are 999 possible binary splits, 999 * 998 / 2 ternary splits, etc. (nominal inputs admit vastly more groupings).
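The slide's counts follow from choosing cut points among the k - 1 gaps between ordered levels; a quick check:

```python
# For an ordered input with k distinct levels, a binary split chooses 1 of
# the k-1 possible cut points, a ternary split chooses 2 of them, etc.
from math import comb

k = 1000
binary = comb(k - 1, 1)   # 999
ternary = comb(k - 1, 2)  # 999 * 998 / 2 = 498,501
print(binary, ternary)
```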


— 13 —

Splitting criterion: Gini, twoing, misclassification, entropy...

A) Minimize the Gini impurity criterion (favors node homogeneity).
B) Maximize the twoing impurity criterion (favors class separation).

Empirical results: for binary dependent variables, Gini and twoing are equivalent. For trinomial targets, Gini provides more accurate trees; beyond three categories, twoing performs better.
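Both criteria fit in a few lines; a minimal sketch (standard textbook formulas, not ALACART's actual code):

```python
# Impurity measures used as splitting criteria.
def gini(counts):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def twoing(left_counts, right_counts):
    """Twoing criterion for a candidate split (larger is better):
    (pL * pR / 4) * (sum_k |p(k|L) - p(k|R)|)^2."""
    nl, nr = sum(left_counts), sum(right_counts)
    n = nl + nr
    pl, pr = nl / n, nr / n
    spread = sum(abs(l / nl - r / nr)
                 for l, r in zip(left_counts, right_counts))
    return (pl * pr / 4.0) * spread ** 2

print(gini([50, 50]))            # 0.5: maximally impure binary node
print(gini([100, 0]))            # 0.0: pure node
print(twoing([50, 0], [0, 50]))  # 0.25: perfect class separation
```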


— 14 —

The Right-Sized Tree

Stunting

Pruning



— 18 —

Benefits of Trees

• Interpretability

– Tree structured presentation

• Mixed Measurement Scales

– Nominal, ordinal, interval

– Regression trees

• Robustness

• Missing Values


— 19 —

…Benefits

• Automatically
– Detects interactions (AID) via hierarchical conditioning search, not à la regression analysis.
– Selects input variables.

[Figure: fitted probability as a multivariate step function of the inputs.]


— 20 —

Drawbacks of Trees

• Unstable: small perturbations in the data can lead to big changes in trees.
• Linear structures are approximated in very rough form.
• Applications may require that rule descriptions for different categories not share the same attributes.
• It is a conditional structure, and interpretation many times misunderstands the conditioning effect.


— 21 —

Drawbacks of Trees (cont.)

• Tends to over-fit => overly optimistic accuracy.
• Large trees are very difficult to interpret.
• Tree size is conditioned by data set size.
• No valid inferential procedures at present (if it matters).
• Greedy search algorithm.


— 22 —

Note on Missing Values

1) Missingness NOT in Y (see Wang and Sheng, 2007, JMLR, for a semi-supervised method for missing Y).

2) Different methods of imputation:
   1) C4.5 probabilistic split: observations with missing values are attached to child nodes with weights equal to the proportion of non-missing values.
   2) Complete case: eliminate all observations with missing values, and train.
   3) Grand mode/mean: imputed if categorical/continuous.
   4) Separate class: appropriate for categorical variables. For continuous ones, create an extreme large value and thus separate missings from non-missings.
   5) Complete variable case: delete all variables with missing values.
   6) Surrogates (CART default): use surrogate variable(s) whenever a variable is missing. At testing or scoring, if the variable is missing, use the surrogate(s).
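Methods 3) and 4) above are one-liners in practice; a minimal sketch for a continuous variable (variable and function names are hypothetical):

```python
# Grand-mean imputation, and "separate class" imputation that pushes
# missings to an extreme value of their own so a tree split can isolate
# them from the non-missings.
def impute_mean(values):
    obs = [v for v in values if v is not None]
    mean = sum(obs) / len(obs)
    return [mean if v is None else v for v in values]

def impute_separate_class(values, factor=10.0):
    obs = [v for v in values if v is not None]
    extreme = max(obs) * factor + 1.0  # far beyond the observed range
    return [extreme if v is None else v for v in values]

x = [1.0, None, 3.0, None, 5.0]
print(impute_mean(x))            # [1.0, 3.0, 3.0, 3.0, 5.0]
print(impute_separate_class(x))  # [1.0, 51.0, 3.0, 51.0, 5.0]
```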


Tree Derivative: Random Forests (Breiman, 1999)

— 23 —

Random Forests proceed in the following steps; notice that there is no need to create separate training, validation, and test data sets:

1. Take a random sample of N observations with replacement ("bagging") from the data set. On average, this selects about 2/3 of the rows; the remaining 1/3 are called "out-of-bag (OOB)" observations. A new random selection is performed for each tree constructed.

2. Using the observations selected in step 1, construct a decision tree to its maximum size, without pruning. As the tree is built, allow only a random subset of the total set of predictor variables to be considered as possible splitters for each node. For example, if there are ten predictors, choose five of them randomly as candidate splitters. Perform a new random selection for each split. Some predictors (possibly the best one) will not be considered for a given split, but a predictor excluded from one split may be used for another split in the same tree.
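The "about 2/3 in-bag, 1/3 out-of-bag" claim in step 1 follows from bootstrap sampling: each row is missed by a sample of size N with probability (1 - 1/N)^N, which tends to e^(-1), about 0.368. A quick simulation:

```python
# Draw a bootstrap sample of N row indices and count how many distinct
# rows were never selected (the out-of-bag set).
import math
import random

random.seed(42)
N = 20000
in_bag = set(random.randrange(N) for _ in range(N))
oob_fraction = 1 - len(in_bag) / N

print(round(oob_fraction, 3))  # close to...
print(round(math.exp(-1), 3))  # ...0.368
```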


— 24 —

No Overfitting or Pruning

The "over-fitting" problem appears in large, single-tree models when the model fits noise in the data, which causes poor generalization power and is the basis for pruning those models. In nearly all cases, decision tree forests do not have a problem with over-fitting, and there is no need to prune the trees in the forest. Generally, the more trees in a forest, the better the fit.

Internal Measure of Test Set (Generalization) Error

About 1/3 of the observations are excluded from each tree in the forest; these are called "out-of-bag (OOB)". Each tree has a different set of out-of-bag observations, so each OOB set constitutes an independent test sample.

To measure the generalization error of decision tree forests, the OOB set for each tree is run through the tree and the error rate of prediction is computed. The error rates for the trees in the forest are then averaged to obtain the overall generalization error rate for the decision tree forest model.

There are several advantages to this method of computing the generalization error:
(1) All observations are used to construct the model, and none have to be held back as a separate test set.
(2) The testing is fast because only one forest has to be constructed (compared to V-fold cross-validation, where additional trees have to be constructed).


— 25 —

2) Scoring: the workhorse of database marketing.

Model Deployment.


— 26 —

Scoring Recipe

• Model

– Formula

• Data Modifications

– Derived inputs

– Variable Transformations

– Missing value imputation

• Scoring Code

[Diagram: data plus scoring code yield scored data, without re-running the original computation algorithm.]


— 27 —

Scoring Recipe: example of scoring output generated by Alacart.

/* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES */
/* METHOD MISSCL ALACART TEST */
RETAIN ROOT 1;
IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12
THEN DO;
   NODE = '4_1';
   PRED = 0;
   /* % NODE IMPURITY = 0.0399 */
   /* BRANCH # = 1 */
   /* NODE FREQ = 81 */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12
THEN DO;
   NODE = '4_2';
   PRED = 1;
   /* % NODE IMPURITY = 0.4478 */
   /* BRANCH # = 2 */
   /* NODE FREQ = 212 */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36
THEN DO;
   NODE = '3_2';
   PRED = 0;
...
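Because the generated scoring code is plain if/else logic, it ports easily to other languages. A sketch of the same three rules in Python (thresholds and node labels copied from the SAS listing above; the remaining nodes are elided there, so the sketch returns None for them):

```python
# Score one observation with the first three ALACART rules shown above.
def score(currdue, pastdue):
    """Return (terminal node, predicted class), or None when the
    observation falls in a node elided from the listing."""
    if currdue <= 105.38 and pastdue <= 90.36 and currdue <= 12:
        return "4_1", 0
    if currdue <= 105.38 and pastdue <= 90.36 and currdue > 12:
        return "4_2", 1
    if currdue <= 105.38 and pastdue > 90.36:
        return "3_2", 0
    return None  # rules for CURRDUE > 105.38 are not shown above

print(score(10, 50))   # ('4_1', 0)
print(score(50, 50))   # ('4_2', 1)
print(score(50, 95))   # ('3_2', 0)
```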


— 28 —

Scorability

[Figure: training data on X1 and X2 (each in [0, 1]), the fitted tree, and a new case scored by the tree's rules.]

Scoring code classifier: if x1 < .47 & x2 < .18, or x1 > .47 & x2 > .29, then red.


— 30 —

1st Data set: Titanic.

Titanic survival data, available on the web: 1,313 observations, of which only 756 are complete due to missing "age" values. Below, the variables available for analysis (the "*" variables are transformations to "help" the logistic):

Variable Name   Variable Label   Length
AGE             Yrs of Age       8
AGESEX          Age * Sex        8
AGESQ           Age * Age        8
PASSCLASS1      First Class      8
PASSCLASS2      Second Class     8
PASSCLASS3      Third Class      8
SEX             Sex Female = 1   8
SURVIVED        Survived = 1     8

— 31 —

Original Data: SURVIVED, by age status, gender, and passenger class.

                          Did not Survive    Survived         Total
Age      Gender  Class    Count   % Obs     Count   % Obs     Count
Present  female  1st          5    0.38        96    7.31       101
                 2nd         10    0.76        75    5.71        85
                 3rd         56    4.27        46    3.50       102
         male    1st         82    6.25        43    3.27       125
                 2nd        106    8.07        21    1.60       127
                 3rd        184   14.01        32    2.44       216
         All                443   33.74       313   23.84       756
Missing  female  1st          4    0.30        38    2.89        42
                 2nd          3    0.23        19    1.45        22
                 3rd         76    5.79        34    2.59       110
         male    1st         38    2.89        16    1.22        54
                 2nd         42    3.20         4    0.30        46
                 3rd        257   19.57        26    1.98       283
         All                420   31.99       137   10.43       557
All                         863   65.73       450   34.27      1313



— 33 —

Complete Data W/O Missing Age.

Data Description: SURVIVED, by gender and passenger class: counts, % of total obs, % within gender, and mean age, for survivors and non-survivors.

[Table garbled in the transcript. Recoverable totals: of the 756 complete cases, 443 did not survive (58.60%, mean age 31.13) and 313 survived (41.40%, mean age 29.36); females account for 288 obs (38.10%) and males for 468 (61.90%). The slide also annotates Pr(Fem | Surv) and Pr(Surv & Fem).]


— 34 —

Logistic vs. Trees: Titanic, no missing values.

Consistency information: 756 obs ('informs.titanic_no_missing'); 7 variables: 3 continuous, 4 class.

Trees used 3 variables, Forest 5, while Stepwise used 5 plus the intercept.

Var Sel Type  Variable    Estimate       Pr > Chi-Square  Importance  # Rules
Forest        AGE                                          0.546         493
              PASSCLASS1                                   0.400          66
              PASSCLASS2                                   0.248          58
              PASSCLASS3                                   0.465          59
              SEX                                          1.000          97
Stepwise      AGE         -0.039177938    0.000
              PASSCLASS1   1.2919799232   0.000
              PASSCLASS3  -1.229467857    0.000
              SEX          2.631357225    0.000
              Intercept   -0.163634963    0.550
Trees         AGE                                          0.408           1
              PASSCLASS3                                   0.556           2
              SEX                                          1.000           1


— 35 —

Training rates: negative ('-') entries denote misclassification and mis-precision.

                               Predicted 0        Predicted 1       Overall
Actual    Model                Rate     Prec      Rate     Prec     Rate
0         FOREST               96.39    78.35     -3.61    -7.58    96.39
          LOGISTIC_STEPWISE    83.97    80.35    -16.03   -24.23    83.97
          TREES                96.61    77.12     -3.39    -7.46    96.61
1         FOREST              -37.70   -21.65     62.30    92.42    62.30
          LOGISTIC_STEPWISE   -29.07   -19.65     70.93    75.77    70.93
          TREES               -40.58   -22.88     59.42    92.54    59.42
Overall   FOREST                       78.35               92.42    82.28
          LOGISTIC_STEPWISE            80.35               75.77    78.57
          TREES                        77.12               92.54    81.22

Trees have the highest class-0 classification rate (96.61%) and an excellent class-1 precision rate (92.54%); Forest comes a close second.


— 36 —

Gains Table

Pctl  Model              Events   Cum Events  % Events  Cum % Events  Lift   Cum Lift  Brier*100
                         Rate     Rate        Captured  Captured
5     FOREST             100.000  100.000     12.141    12.141        2.415  2.415      0.511
      LOGISTIC_STEPWISE   94.737   94.737     11.502    11.502        2.288  2.288      5.111
      TREES               95.119   95.119     11.548    11.548        2.297  2.297      0.000
10    FOREST              97.368   98.684     11.821    23.962        2.352  2.384      0.936
      LOGISTIC_STEPWISE   97.368   96.053     11.821    23.323        2.352  2.320      1.023
      TREES               91.935   93.527     11.161    22.709        2.221  2.259      0.000
15    FOREST              90.789   96.053     11.022    34.984        2.193  2.320      5.166
      LOGISTIC_STEPWISE   89.474   93.860     10.863    34.185        2.161  2.267      2.837
      TREES               91.935   92.997     11.161    33.871        2.221  2.246      0.000
20    FOREST              93.421   95.395     11.342    46.326        2.256  2.304      3.859
      LOGISTIC_STEPWISE   86.842   92.105     10.543    44.728        2.098  2.225      3.951
      TREES               91.935   92.731     11.161    45.032        2.221  2.240      0.000
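The lift columns come from ranking observations by model score and cutting at a depth percentile; a sketch with hypothetical scores and labels (not the Titanic data):

```python
# Cumulative lift at depth d: event rate among the top d-fraction of
# score-ranked observations, divided by the overall event rate.
def cum_lift(scores, labels, depth):
    n = len(scores)
    k = max(1, int(round(depth * n)))
    ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
    top_rate = sum(ranked[:k]) / k
    overall = sum(labels) / n
    return top_rate / overall

# 10 observations, 4 events; the top half of scores captures 3 of them.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0,    0,    0]
print(cum_lift(scores, labels, 0.5))  # (3/5) / (4/10), i.e. about 1.5
```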


— 37 —

Comparing the results.

1) Trees required fewer variables than logistic and are thus easier to interpret. Forests by definition use all the variables.

2) Trees obtained slightly larger lift measures, but who can beat forests?

3) Trees and Forest determine the most important variable, female sex, immediately, at the top of the tree, while with logistic it is not very clear.


— 38 —

2nd Data set: Surendra Financial Data.

No information is available about meaning or measurement. All variables are called R1 - R84, plus one binary dependent variable, "Newgroup". There are no missing values; missing values have been somehow imputed, but this is not reported.


— 39 —

Data Mining Example: just fit a model.

Consistency information: 45,175 obs ('surendra.newsurendra'); 84 variables, all continuous, 0 class variables.

Financial information with target "newgroup" and variable names R1 - R84, without any information as to what anything means. Forest omitted from the exercise.


— 40 —

The larger 'p' (the number of variables) is, the more difficult the tree is to interpret.


— 41 —

Gains Table

Pctl  Model              Events   Cum Events  % Events  Cum % Events  Lift   Cum Lift  Brier*100
                         Rate     Rate        Captured  Captured
5     LOGISTIC_STEPWISE  92.873   92.873      37.391    37.391        7.477  7.477      5.043
      TREES              98.639   98.639      39.712    39.712        7.942  7.942      0.000
10    LOGISTIC_STEPWISE  49.270   71.071      19.836    57.227        3.967  5.722     25.486
      TREES              50.476   74.558      20.322    60.034        4.064  6.003      0.000
15    LOGISTIC_STEPWISE  30.235   57.459      12.173    69.399        2.434  4.626     20.962
      TREES              12.772   53.962       5.142    65.176        1.028  4.345      0.000
20    LOGISTIC_STEPWISE  19.088   47.869       7.681    77.081        1.537  3.854     15.217

Logistic selected 49 variables, Trees 12; 10 of the 12 were also used by Logistic.


— 42 —

Comparing the results.

1) Trees selected 12 and logistic 49 variables.

2) The initial split on R73 produced almost perfectly pure nodes. R73 was also selected by Stepwise, but Stepwise doesn’t stop fast enough.

3) Model performance, as evaluated by lift, favors Trees.


— 43 —

Very quick: Trees vs. Gradient Boosting.

Task: classify into '5' segments. Tools: trees and gradient boosting (different versions). Compare by classification, precision, and F1 rates.

Model descriptions:

_MODEL_                        STUDY NUMBER
tree_equal_prob_CV_10          1
tree_origl_prob_CV_10          2
tree_origl_CV_10_5_split       3
tree_origl_CV_10_2_split_dec   4
Boost simple                   5
Boost equal Probs              6
Custs_Boost dec matrix         7
Boost_equal_2nd_stage          8
Boost_orig_2nd_stage           9

Models 1 through 4 are TREES; 5 through 9 are BOOSTING.



— 47 —

Quick summary conclusions for multi-classification.

1) All boosting methods are good, and only one of the tree models competes with them, in one case.

2) The performance of the boosting methods was similar, thus not much model-specification search is required.

3) Boosting methods are very difficult to interpret.

— 48 —

4. Concluding Remarks


— 49 —

Different algorithms

1) Non-greedy algorithms and two- or three-step-ahead search.
2) Hybrid models, which combine regression and tree methods (not very popular after the 2000s).
3) Boosting or majority-voting methods, which generate a sequence of trees and classifications; the outcome is decided democratically.
4) Binned trees, in which split searches are conducted after discretizing all variables, thus allowing for possibly non-linear effect searches.
...


— 50 —

Avoid over-fitting / overtorture... because...

Instead, in Sherlock Holmes' words: "I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." (A Scandal in Bohemia)

We should not act as Mark Twain says: "Get your facts first, and then you can distort them as much as you please."


— 51 —

Let us not be in haste …

Method comparison by way of two examples does not imply general method superiority. There are many examples in the literature in which logistic regression performed better.


— 52 —

The End