Decision Trees

Page 1:

Decision Trees

A “divisive” method (splits)

Start with “root node” – all in one group

Get splitting rules

Response often binary

Result is a “tree”

Example: Loan Defaults

Example: Framingham Heart Study

Example: Automobile Accidents

Page 2:

Recursive Splitting

(Tree diagram: recursive splits on X1 = debt-to-income ratio and X2 = age; leaf probabilities Pr{default} = 0.0001, 0.003, 0.006, 0.008, and 0.012; leaves labeled “No default” / “Default”.)

Page 3:

Some Actual Data

Framingham Heart Study

First Stage Coronary Heart Disease P{CHD} = Function of:

» Age - no drug yet!

» Cholesterol

» Systolic BP

Import

Page 4:

Example of a “tree” Using Default Settings

Page 5:

Example of a “tree” Using Custom Settings

(Gini splits, N=4 leaves)

Page 6:

Contingency tables

DEPENDENT (effect)
              Heart Disease
              No     Yes    Total
  Low BP      95       5      100
  High BP     55      45      100
  Total      150      50      200

INDEPENDENT (no effect)
              Heart Disease
              No     Yes    Total
  Low BP      75      25      100
  High BP     75      25      100
  Total      150      50      200

How to make splits?

Page 7:

Contingency table, observed (expected) counts:

              Heart Disease
              No          Yes         Total
  Low BP      95 (75)      5 (25)       100
  High BP     55 (75)     45 (25)       100
  Total      150           50           200

How to make splits?

χ² = Σ over all cells of (Observed − Expected)² / Expected
   = 2(400/75) + 2(400/25) = 42.67

Compare to χ² tables – Significant! (Why do we say “Significant”???)

Page 8:

Framingham Conclusion: Sufficient evidence against the (null) hypothesis of no relationship.

H0: Innocence            H1: Guilt                                         Standard: beyond reasonable doubt
H0: No association       H1: BP and heart disease are associated           Standard: P < 0.05

P = 0.00000000064

Observed (expected) counts:
  Low BP    95 (75)    5 (25)
  High BP   55 (75)   45 (25)

Page 9:

Demo 1: Chi-Square for Titanic Survival (SAS)

               Alive   Dead
  Crew           212    696
  3rd Class      178    528
  2nd Class      118    167
  1st Class      202    123

“logworth” = -log10(p-value), computed over all possible splits:

  Split                        χ²        P-value        -log10(p-value)
  Crew vs. other               51.94     5.7 x 10^-13    12.24
  Crew & 3rd vs. 1st & 2nd    163.09     2.4 x 10^-37    36.62
  First vs. other             160.04     1.1 x 10^-36    35.96
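The logworth column can be reproduced outside Enterprise Miner with PROC FREQ; below is a minimal sketch for one candidate split, assuming the counts above are keyed in by hand (data set and variable names are made up).

  /* Hypothetical recreation of the Titanic table above */
  data titanic;
     input class $ status $ count;
     datalines;
  Crew   Alive 212
  Crew   Dead  696
  Third  Alive 178
  Third  Dead  528
  Second Alive 118
  Second Dead  167
  First  Alive 202
  First  Dead  123
  ;
  run;

  /* One candidate split: first class versus everyone else */
  data titanic;
     set titanic;
     first_vs_other = (class = 'First');
  run;

  proc freq data=titanic;
     weight count;                          /* cell counts, not raw rows        */
     tables first_vs_other*status / chisq;  /* chi-square test for the 2x2 split */
  run;

Repeating this for each candidate grouping and taking -log10 of each p-value gives the logworth values in the table.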

Page 10:

Demo 2: Titanic (EM)

Open Enterprise Miner, File, New Project, specify workspace.
(1) Startup Code – specify data library
(2) New Data Source: Titanic
(3) Data has class (1, 2, 3, crew) & M, W, C
(4) Bring in data, Decision Tree (defaults)

Page 11:

3 possible splits for M,W,C

10 possible splits for class (is that fair?)

Page 12:

Demo 3: Framingham (EM)

Open Enterprise Miner, File, New Project, specify workspace.
(1) Startup Code – specify data library
(2) New Data Source: Framingham
    (a) Firstchd = target, reject obs
    (b) Explore Firstchd (pie) vs. Age (histogram)
(3) Bring in data, Decision Tree (defaults)

Page 13:

Other Split Criteria

Gini Diversity Index
  (1) {A A A A B A B B C B}: Pick 2, Pr{different} = 1 − Pr{AA} − Pr{BB} − Pr{CC}
      = 1 − .25 − .16 − .01 = .58 (sampling with replacement)
  (2) {A A B C B A A B C C}: 1 − .16 − .09 − .09 = 0.66, so group (2) is more diverse (less pure)

Shannon Entropy: −Σi pi log2(pi); larger means more diverse (less pure)
  {0.5, 0.4, 0.1}  1.36
  {0.4, 0.2, 0.3}  1.51 (more diverse)
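A small SAS data step sketch that reproduces the Gini arithmetic for the two groups above and the first entropy value (the class proportions are the ones just listed):

  data purity;
     array p[3] (0.5 0.4 0.1);          /* group (1): A, B, C proportions */
     array q[3] (0.4 0.3 0.3);          /* group (2): A, B, C proportions */
     gini1 = 1; gini2 = 1; h1 = 0;
     do i = 1 to 3;
        gini1 = gini1 - p[i]**2;        /* 1 - Pr{AA} - Pr{BB} - Pr{CC}   */
        gini2 = gini2 - q[i]**2;
        h1    = h1 - p[i]*log2(p[i]);   /* -sum of p_i log2(p_i)          */
     end;
     put gini1= gini2= h1=;             /* 0.58, 0.66, about 1.36         */
  run;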

Page 14:

How to make splits?

Which variable to use?

Where to split? Cholesterol > ____ Systolic BP > _____

Idea – Pick BP cutoff to minimize p-value for χ²

Split point data-derived!

What does “significance” mean now?

Page 15:

Multiple testing

α = Pr{falsely reject hypothesis 1}

α = Pr{falsely reject hypothesis 2}

Pr{falsely reject one or the other} < 2α
Desired: 0.05 probability or less
Solution: Compare 2 × (p-value) to 0.05
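As a sketch, this adjustment (a Bonferroni-style correction) is one line of arithmetic; the raw p-value below is hypothetical.

  data adjust;
     m = 2;                      /* number of hypotheses (candidate splits) */
     p_raw  = 0.03;              /* smallest unadjusted p-value             */
     p_adj  = min(1, m*p_raw);   /* adjusted p-value                        */
     reject = (p_adj < 0.05);    /* compare m*(p-value) to 0.05             */
     put p_adj= reject=;
  run;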

Page 16:

Validation

Traditional stats – small dataset, need all observations to estimate parameters of interest.

Data mining – loads of data, can afford “holdout sample”

Variation: n-fold cross validation
  Randomly divide data into n sets
  Estimate on n−1, validate on 1
  Repeat n times, using each set as holdout
Titanic and Framingham examples did not use holdout.
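Outside the EM Data Partition node, a holdout sample can be drawn with PROC SURVEYSELECT; a sketch with placeholder data set names:

  /* 75% training / 25% holdout split of a data set named ALL */
  proc surveyselect data=all out=split samprate=0.75 outall seed=12345;
  run;
  /* OUTALL keeps every observation and adds Selected = 1 (training) or 0 (holdout) */
  data train holdout;
     set split;
     if selected then output train;
     else output holdout;
  run;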

Page 17:

Pruning

Grow bushy tree on the “fit data”

Classify validation (holdout) data

Likely the farthest-out branches do not improve, and may even hurt, the fit on validation data.

Prune non-helpful branches.

What is “helpful”? What is a good discriminator criterion?

Page 18:

Goals

Split (or keep split) if diversity in parent “node” > summed diversities in child nodes

Prune to optimize Estimates, Decisions, or Ranking in the validation data.

Page 19:

Assessment for:

Decisions – minimize incorrect decisions (model versus realized).

Estimates – Error Mean Square (average error).

Ranking – C (concordance) statistic = proportion concordant + ½ (proportion tied).

  Obs number:                    1     2     3     4     5
  Probability of 1 (from model): 0.2   0.3   0.3   0.6   0.7
  Actual:                        0     0     1     1     0

  » Concordant pairs: (1,3) (1,4) (2,4)
  » Discordant pairs: (3,5) (4,5)
  » Tied: (2,3)
  » 6 ways to get a pair with 2 different responses, so C = 3/6 + (1/2)(1/6) = 7/12 = 0.5833
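The C statistic arithmetic above can be checked with a short PROC SQL sketch (PREDS, P, and Y are hypothetical names; the five observations are the ones in the example):

  data preds;
     input p y;
     datalines;
  0.2 0
  0.3 0
  0.3 1
  0.6 1
  0.7 0
  ;
  run;

  /* Pair every 1 with every 0: score 1 if concordant, 0.5 if tied, 0 if discordant */
  proc sql;
     select avg( case when ones.p > zeros.p then 1
                      when ones.p = zeros.p then 0.5
                      else 0 end ) as c_statistic      /* 7/12 = 0.5833 */
     from (select p from preds where y=1) as ones,
          (select p from preds where y=0) as zeros;
  quit;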

Page 20:

Accounting for Costs

Pardon me (sir, ma’am) can you spare some change?

Say “sir” to male +$2.00

Say “ma’am” to female +$5.00

Say “sir” to female -$1.00 (balm for slapped face)

Say “ma’am” to male -$10.00 (nose splint)

Page 21:

Including Probabilities

Leaf has Pr(M) = .7, Pr(F) = .3. You say:

                       “Sir”          “Ma’am”
  True gender M        0.7 (+2)       0.7 (−10)
  True gender F        0.3 (−1)       0.3 (+5)

Expected profit is 2(0.7) − 1(0.3) = +$1.10 if I say “sir”.
Expected profit is −7 + 1.5 = −$5.50 (a loss) if I say “ma’am”.

Weight leaf profits by leaf size (# obsns.) and sum.

Prune (and split) to maximize profits.
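A data step sketch of the expected-profit calculation for this leaf, using the payoff matrix from the previous slide:

  data profit;
     prM = 0.7;  prF = 0.3;
     profit_sir  = prM*( 2)  + prF*(-1);   /*  2(0.7) - 1(0.3) =  1.10 */
     profit_maam = prM*(-10) + prF*( 5);   /* -7 + 1.5         = -5.50 */
     best = ifc(profit_sir > profit_maam, 'Sir', 'Maam');
     put profit_sir= profit_maam= best=;
  run;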

Page 22:

Regression Trees

Continuous response Y

Predicted response Pi constant in regions i=1, …, 5

(Diagram: the X1–X2 plane split into 5 rectangular regions, with predicted values 20, 50, 80, 100, and 130.)

Page 23:

Predict Pi in cell i. Yij = jth response in cell i.

Split to minimize Σi Σj (Yij − Pi)²

Regression Trees
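A PROC SQL sketch of scoring one candidate split by this criterion; TRAIN, X1, Y, and the cut value 3.2 are all hypothetical:

  proc sql;
     create table split_eval as
     select (x1 <= 3.2) as left_node,
            count(*)    as n,
            css(y)      as within_ss    /* sum over the node of (Y - node mean)**2 */
     from train
     group by 1;

     /* criterion: total within-node sum of squares; smaller is a better split */
     select sum(within_ss) as split_criterion
     from split_eval;
  quit;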

Page 24:

Real data example:

Traffic accidents in Portugal*

Y = injury-induced “cost to society”

* Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education

Help – I ran into a “tree”

Page 25:

Demo 4: Chemicals (EM). Hypothetical potency versus chemical content (proportions of A, B, C, D). New Data Source: Chemicals.

{Sum, Dead} rejected, Potency = Target. Partition node: training 75%, validation 25%. Connect and run the Tree node and look at the results. Tree node, Results, View, Model, Variable Importance.

Page 26:

Demo 4 (cont.)

  Variable   Number of         Importance   Validation   Ratio of Validation
  Name       Splitting Rules                Importance   to Training Importance
  B          2.0               1.0          1.0          1.0
  A          4.0               0.68         0.63         0.94
  C          0.0               0.0          0.0          NaN
  D          0.0               0.0          0.0          NaN

Only the A and B amounts contribute to splits. B splits reduce variation more; A reduces it by about 2/3 as much as B.

Properties panel, Exported data, Explore, 3D graph.

X=A Y=B Z=P_Potency

Page 27:

(Tree diagram: the splits use variables B and A.)

Page 28:


Page 29:

Logistic – another classifier

Older – “tried & true” method

Predict probability of response from input variables (“Features”)

Linear regression gives infinite range of predictions

0 < probability < 1 so not linear regression.

Logistic Regression

Page 30:

Three Logistic Curves: p = e^(a+bX) / (1 + e^(a+bX)), plotted against X.

Logistic Regression

Page 31:

Example: Seat Fabric Ignition

Flame exposure time = X

Y = 1 ignited, Y = 0 did not ignite
  Y = 0: X = 3, 5, 9, 10, 13, 16
  Y = 1: X = 7, 11, 12, 14, 15, 17, 25, 30

Q = (1−p1)(1−p2)(1−p3)(1−p4) p5 p6 (1−p7) p8 p9 (1−p10) p11 p12 p13

The p’s are all different: pi = f(a + bXi) = e^(a+bXi) / (1 + e^(a+bXi))

Find a,b to maximize Q(a,b)

Page 32:

Logistic idea:

Given X, compute L(X) = a + bX, then p = e^L / (1 + e^L)

p(i) = e^(a+bXi) / (1 + e^(a+bXi))

Write p(i) if response, 1-p(i) if not

Multiply all n of these together, find a,b to maximize this “likelihood”

Estimated L = −2.6 + 0.23 X
(a = −2.6 and b = 0.23 are the values that maximize Q(a, b).)
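A SAS sketch of fitting this likelihood with PROC LOGISTIC, keying in the ignition data listed earlier; assuming these are the same observations the slide used, the fitted intercept and slope should be near the −2.6 and 0.23 above.

  data ignite;
     input x y;          /* x = flame exposure time, y = 1 if ignited */
     datalines;
   3 0
   5 0
   9 0
  10 0
  13 0
  16 0
   7 1
  11 1
  12 1
  14 1
  15 1
  17 1
  25 1
  30 1
  ;
  run;

  /* Maximum likelihood estimates of a and b in L = a + bX */
  proc logistic data=ignite;
     model y(event='1') = x;
  run;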

Page 33:

Page 34:

Example: Shuttle Missions

O-rings failed in Challenger disaster

Prior flights “erosion” and “blowby” in O-rings (6 per mission)

Feature: Temperature at liftoff

Target: (1) - erosion or blowby vs. no problem (0)

L = 5.0850 − 0.1156 (temperature);  p = e^L / (1 + e^L)

Page 35:

Pr{2 or more} = 1 − (1 − pX)^6 − 6 pX (1 − pX)^5
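A data step sketch that evaluates the fitted curve and this probability at one temperature; 31 degrees F, the low forecast temperature often used in Challenger analyses, is only for illustration.

  data oring;
     temp = 31;                                /* liftoff temperature, deg F      */
     L    = 5.0850 - 0.1156*temp;              /* linear predictor                */
     p    = exp(L) / (1 + exp(L));             /* Pr{erosion or blowby}, one ring */
     pr2plus = 1 - (1-p)**6 - 6*p*(1-p)**5;    /* Pr{2 or more of the 6 rings}    */
     put p= pr2plus=;
  run;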

Page 36:

Neural Networks

Very flexible functions

“Hidden Layers”

“Multilayer Perceptron”

Output = Pr{1} is a logistic function of logistic functions** of the inputs.

** (note: hyperbolic tangent functions are just reparameterized logistic functions)

(Diagram: inputs X1, X2, X3, X4 feed hidden units H1 and H2, which feed an output in (0,1).)

Page 37:

Arrows on the right represent linear combinations of “basis functions,” e.g. hyperbolic tangents (reparameterized logistic curves), each ranging from −1 to 1 as a function of X.

Example: Y = a + b1 H1 + b2 H2 + b3 H3, e.g. Y = 0 + 9 H1 + 3 H2 + 5 H3
(“bias” a and “weights” b1, b2, b3)
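A data step sketch of the forward pass described here: tanh hidden units combined with the example output weights 9, 3, 5 and bias 0. The inputs and the hidden-layer weights are made-up numbers.

  data nnet;
     x1 = 0.4;  x2 = -1.2;                      /* hypothetical inputs          */
     h1 = tanh(-1.0 + 0.8*x1 - 0.4*x2);         /* hidden units: tanh of linear */
     h2 = tanh( 0.5 - 0.9*x1 + 0.3*x2);         /* combinations of the inputs   */
     h3 = tanh( 0.1 + 0.2*x1 + 0.7*x2);         /* (weights are illustrative)   */
     y  = 0 + 9*h1 + 3*h2 + 5*h3;               /* output = bias + weights * H  */
     put h1= h2= h3= y=;
  run;

For a binary target, this combination would itself be passed through a logistic function so the output stays in (0, 1), as on the previous slide.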

Page 38:

A Complex Neural Network Surface

(Diagram: a network with inputs X1 and X2, three hidden units with the weights and “biases” shown, and output P; the resulting surface P over the (X1, X2) plane is plotted.)

Page 39:

Demo 5 (EM): Framingham Neural Network. Note: no validation data used – surfaces are unnecessarily complex.

(Surface plots: a “slice” of the 4-D surface at age = 25, nonsmoker and smoker, over SBP 100–200 and cholesterol 130–280; predicted probabilities range from about .001 to .057. Code for the graphs is in NNFramingham.sas.)

Page 40:

Framingham Neural Network Surfaces: “slice” at age 55 – are these patterns real?

(Surface plots at age = 55, smoker and nonsmoker, over SBP 100–200 and cholesterol 130–280; predicted probabilities range from about .009 to .262 for smokers and .004 to .261 for nonsmokers.)

Page 41:

Handling Neural Net Complexity

(1) Use validation data; stop iterating when the fit gets worse on the validation data.
    Problems for Framingham (50-50 split): the fit gets worse at step 1, implying no effect of the inputs, and the algorithm fails to converge (likely reason: too many parameters).
(2) Use regression to omit predictor variables (“features”) and their parameters. This gives the previous graphs.
(3) Do both (1) and (2).

Page 42:

* Cumulative Lift Chart
  - Go from the leaf of most predicted response to the leaf of least predicted response.
  - Lift = (proportion responding in the first p%) / (overall population response rate).

(Chart: lift starts near 3.3 for the highest-predicted leaves and falls to 1 as the whole population is included; the horizontal axis runs from high to low predicted response.)
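A rough sketch of computing cumulative lift from a scored data set; SCORED, P_HAT, and Y are placeholder names.

  /* Sort by predicted probability, most likely responders first */
  proc sort data=scored out=ranked;
     by descending p_hat;
  run;

  proc sql noprint;
     select mean(y) into :overall_rate from scored;   /* overall response rate */
  quit;

  data lift;
     set ranked nobs=n;
     cum_resp + y;                          /* responders found so far           */
     depth    = _n_ / n;                    /* proportion of population selected */
     cum_rate = cum_resp / _n_;             /* response rate in the top depth%   */
     lift     = cum_rate / &overall_rate;   /* e.g. near 3.3 for the best leaf   */
  run;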

Page 43:

Receiver Operating Characteristic Curve

(Plot: distributions of the logits for the 1’s (red) and the 0’s (black) — or any model predictor ordering cases from most to least likely to respond — with cut point 1 marked.)

Select the most likely p1% according to the model; call these 1, the rest 0 (call almost everything 0).
Y = proportion of all 1’s correctly identified (Y near 0).
X = proportion of all 0’s incorrectly called 1’s (X near 0).

Page 44:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 2 marked.)

Select the most likely p2% according to the model; call these 1, the rest 0.
Y = proportion of all 1’s correctly identified.
X = proportion of all 0’s incorrectly called 1’s.

Page 45:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 3 marked.)

Select the most likely p3% according to the model; call these 1, the rest 0.
Y = proportion of all 1’s correctly identified.
X = proportion of all 0’s incorrectly called 1’s.

Page 46:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 3.5 marked.)

Select the most likely 50% according to the model; call these 1, the rest 0.
Y = proportion of all 1’s correctly identified.
X = proportion of all 0’s incorrectly called 1’s.

Page 47:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 4 marked.)

Select the most likely p5% according to the model; call these 1, the rest 0.
Y = proportion of all 1’s correctly identified.
X = proportion of all 0’s incorrectly called 1’s.

Page 48:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 5 marked.)

Select the most likely p6% according to the model; call these 1, the rest 0.
Y = proportion of all 1’s correctly identified.
X = proportion of all 0’s incorrectly called 1’s.

Page 49:

Receiver Operating Characteristic Curve

(Plot: the same logit distributions, with cut point 6 marked.)

Select the most likely p7% according to the model; call these 1, the rest 0 (call almost everything 1).
Y = proportion of all 1’s correctly identified (Y near 1).
X = proportion of all 0’s incorrectly called 1’s (X near 1).

Page 50:

Why is the area under the curve the same as (number of concordant pairs + (number of ties)/2)?

Tree leaves:   Leaf 1: 20 0’s, 60 1’s    Leaf 2: 20 0’s, 20 1’s    Leaf 3: 60 0’s, 20 1’s

Cut points 0, 1, 2, 3 lie between the leaves: left of the cut, call it 1; right of the cut, call it 0.
X = number of 0’s incorrectly called 1 = number of 0’s to the left of the cut point.
Y = number of 1’s correctly called 1 = number of 1’s to the left of the cut point.

(X, Y) on the ROC curve:
  Cut 0: X = 0 (all points called 0, so none wrongly called 1), Y = 0 (no 1 decisions)  →  (0, 0)
  Cut 1: (20)(60) = 1200 ties in leaf 1; 60(20+60) = 4800 concordant  →  (20, 60)

(Diagram: the areas under the steps are the number of tied pairs in leaf 1 plus the number of concordant pairs from 0’s in leaves 2 and 3; the ROC curve (line) splits the ties in two.)
Page 51:

Why is the area under the curve the same as (number of concordant pairs + (number of ties)/2)?

Tree leaves:   Leaf 1: 20 0’s, 60 1’s    Leaf 2: 20 0’s, 20 1’s    Leaf 3: 60 0’s, 20 1’s
Left of the cut, call it 1; right of the cut, call it 0.

  Cut point 2: 20 more 1’s are captured and 60 0’s remain to the right; 20 × 20 = 400 ties in leaf 2. Point on ROC curve: (40, 80).
  Cut point 3: call everything a 1. All 100 1’s are correctly called, and all 0’s are incorrectly called. Point on ROC curve: (100, 100).

The ROC curve is in terms of proportions, so the joined points are (0,0), (.2,.6), (.4,.8), and (1,1).

For Logistic Regression or Neural Nets, each point is a potential cut point for the ROC curve.

(Diagram: the area under the curve decomposes into concordant-pair rectangles plus half of each tied-pair rectangle.)
Page 52:

Framingham ROC for 4 models and the baseline 45-degree line; neural net highlighted, area 0.780.

  Selected   Model    Model             Misclassification   Avg. Squared   Area Under
  Model      Node     Description       Rate                Error          ROC
  Y          Neural   Neural Network    0.079257            0.066890       0.780
             Tree     Decision Tree     0.079257            0.069369       0.720
             Tree2    Gini Tree N=4     0.079257            0.069604       0.675
             Reg      Regression        0.080495            0.069779       0.734

Sensitivity: Pr{calling a 1 a 1} = Y coordinate.   Specificity: Pr{calling a 0 a 0} = 1 − X.

Page 53:

A Combined Example

Cell Phone Texting Locations

Black circle: Phone moved > 50 feet in first two minutes of texting.

Green dot: Phone moved < 50 feet.

Page 54:

Three Models: Tree, Neural Net, Logistic Regression

(For each model: training data lift charts, validation data lift charts, and the resulting surfaces.)

Page 55:

Demo 5: Three Breast Cancer Models

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

  1. Sample code number (id number)
  2. Clump Thickness (1 - 10)
  3. Uniformity of Cell Size (1 - 10)
  4. Uniformity of Cell Shape (1 - 10)
  5. Marginal Adhesion (1 - 10)
  6. Single Epithelial Cell Size (1 - 10)
  7. Bare Nuclei (1 - 10)
  8. Bland Chromatin (1 - 10)
  9. Normal Nucleoli (1 - 10)
  10. Mitoses (1 - 10)
  11. Class (2 for benign, 4 for malignant)

Page 56:

Use Decision Tree to Pick Variables

Page 57:

Decision Tree Needs Only BARE & SIZE

Page 58:

Can save scoring code from EM:

  data A;
     do bare = 0 to 10;
        do size = 0 to 10;
           /* (score code from EM) */
           output;
        end;
     end;
  run;

Cancer Screening Results: Model Comparison Node

  Selected  Model    Model            Train            Train Avg.      Train Area   Validation        Validation Avg.   Validation Area
  Model     Node     Description      Misclass. Rate   Squared Error   Under ROC    Misclass. Rate    Squared Error     Under ROC
  Y         Tree     Decision Tree    0.031429         0.029342        0.975        0.051576          0.048903          0.933
            Neural   Neural Network   0.031429         0.025755        0.994        0.054441          0.037838          0.985
            Reg      Regression       0.031429         0.029322        0.992        0.054441          0.040768          0.987

Page 59:

Page 60:

Association Analysis is just elementary probability with new names.

A: Purchase Milk     B: Purchase Cereal

Pr{A} = 0.5,  Pr{B} = 0.3,  Pr{A and B} = 0.2

(Venn diagram: A only = 0.3, A and B = 0.2, B only = 0.1, neither = 0.4; 0.3 + 0.2 + 0.1 + 0.4 = 1.0)

Page 61:

(Same Venn diagram: A only = 0.3, A and B = 0.2, B only = 0.1, neither = 0.4)

Rule B => A: “people who buy B will buy A” (here, Cereal => Milk)

Support: Support = Pr{A and B} = 0.2

Expected confidence: independence would mean Pr{A|B} = Pr{A} = 0.5, so 0.5 is the expected confidence if there is no relation to B.

Confidence: Confidence = Pr{A|B} = Pr{A and B}/Pr{B} = 2/3. (Is the confidence in B => A the same as the confidence in A => B? yes, no)

Lift: Lift = confidence / E{confidence} = (2/3) / (1/2) = 1.33, a gain of 33%. Marketing A to the 30% of people who buy B will result in 33% better sales than marketing to a random 30% of the people.
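The same support/confidence/lift arithmetic in a few data step lines:

  data assoc;
     prA  = 0.5;              /* Pr{buy milk}   */
     prB  = 0.3;              /* Pr{buy cereal} */
     prAB = 0.2;              /* Pr{buy both}   */
     support    = prAB;                    /* 0.20                       */
     confidence = prAB / prB;              /* Pr{A|B} = 2/3              */
     expected   = prA;                     /* 0.5 if A and B unrelated   */
     lift       = confidence / expected;   /* 1.33                       */
     put support= confidence= lift=;
  run;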

Page 62:

Example: Grocery cart items (hypothetical)

  item     cart
  bread    1
  milk     1
  soap     2
  meat     3
  bread    3
  bread    4
  cereal   4
  milk     4
  soup     5
  bread    5
  cereal   5
  milk     5

Association Report (Sort criterion: Confidence; Maximum items: 2; Minimum (single item) support: 20%)

  Relations  Expected        Confidence  Support  Lift  Transaction  Rule
             Confidence (%)  (%)         (%)            Count
  2          63.43           84.21       52.68    1.33  8605         cereal ==> milk
  2          62.56           83.06       52.68    1.33  8605         milk ==> cereal
  2          63.43           61.73       12.86    0.97  2100         meat ==> milk
  2          63.43           61.28       38.32    0.97  6260         bread ==> milk
  2          63.43           60.77       12.81    0.96  2093         soup ==> milk
  2          62.54           60.42       38.32    0.97  6260         milk ==> bread
  2          62.56           60.28       37.70    0.96  6158         bread ==> cereal
  2          62.54           60.26       37.70    0.96  6158         cereal ==> bread
  2          62.56           60.16       12.69    0.96  2072         soup ==> cereal
  2          21.08           20.28       12.69    0.96  2072         cereal ==> soup
  2          20.83           20.27       12.86    0.97  2100         milk ==> meat
  2          21.08           20.20       12.81    0.96  2093         milk ==> soup

Page 63:

Link Graph. Sort criterion: Confidence; Maximum items: 3 (e.g. milk & cereal => bread); Minimum support: 5%.

Slider bar at 62%

Page 64:

We have the “features” (predictors)

We do NOT have the response even on a training data set (UNsupervised)

Another name for clustering

In EM: a large number of clusters with k-means (k clusters); Ward’s method to combine (fewer clusters, say r < k); one more k-means run for the final r-cluster solution.

Unsupervised Learning
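A sketch of that k-means / Ward / k-means sequence using base SAS procedures instead of the EM Cluster node; data set names, variable lists, and the cluster counts 50 and 3 are placeholders.

  /* Stage 1: many k-means clusters (here 50) on the feature data */
  proc fastclus data=features maxclusters=50 outseed=seeds out=stage1;
     var x1-x9;
  run;

  /* Stage 2: Ward's method on the 50 cluster means to suggest a smaller r */
  proc cluster data=seeds method=ward outtree=tree;
     var x1-x9;
     freq _freq_;                 /* weight each mean by its cluster size */
  run;
  proc tree data=tree nclusters=3 out=wardgroups noprint;
  run;

  /* Stage 3: one more k-means with k = r (here 3) for the final solution */
  proc fastclus data=features maxclusters=3 out=final;
     var x1-x9;
  run;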

Page 65:

Example: Cluster the breast cancer data (disregard target)

Plot clusters and actual target values:

Page 66:

Segment Profiler

Page 67:

Text Mining

Hypothetical collection of news releases (“corpus”):

  release 1: Did the NCAA investigate the basketball scores and vote for sanctions?
  release 2: Republicans voted for and Democrats voted against it for the win.
  (etc.)

Compute word counts:

             NCAA  basketball  score  vote  Republican  Democrat  win
  Release 1   1     1           1      1     0           0         0
  Release 2   0     0           0      2     1           1         1

Page 68:

Text Mining Mini-Example: Word counts in 16 e-mails

(Columns: the 13 word-count variables — Basketball, NCAA, Tournament, Score_V, Score_N, Wins, Speech, Voters, Liar, Election, Republican, President, Democrat — listed here in an order lost in transcription.)

  Obs   word counts
   1    20  8 10 12  6  0  1  5  3  8 18 15 21
   2     5  6  9  5  4  2  0  9  0 12 12  9  0
   3     0  2  0 14  0  2 12  0 16  4 24 19 30
   4     8  9  7  0 12 14  2 12  3 15 22  8  2
   5     0  0  4 16  0  0 15  2 17  3  9  0  1
   6    10  6  9  5  5 19  5 20  0 18 13  9 14
   7     2  3  1 13  0  1 12 13 20  0  0  1  6
   8     4  1  4 16  2  4  9  0 12  9  3  0  0
   9    26 13  9  2 16 20  6 24  4 30  9 10 14
  10    19 22 10 11  9 12  0 14 10 22  3  1  0
  11     2  0  0 14  1  3 12  0 16 12 17 23  8
  12    16 19 21  0 13  9  0 16  4 12  0  0  2
  13    14 17 12  0 20 19  0 12  5  9  6  1  4
  14     1  0  4 21  3  6  9  3  8  0  3 10 20

Page 69:

Eigenvalues of the Correlation Matrix

        Eigenvalue    Difference    Proportion   Cumulative
   1    7.10954264    4.80499109    0.5469       0.5469
   2    2.30455155    1.30162837    0.1773       0.7242
   3    1.00292318    0.23404351    0.0771       0.8013
   4    0.76887967    0.21070080    0.0591       0.8605
   5    0.55817886    0.10084923    0.0429       0.9034
        (more)
  13    0.0008758                   0.0001       1.0000

55% of the variation in these 13-dimensional vectors occurs in one dimension.

  Variable      Prin1
  Basketball   -.320074
  NCAA         -.314093
  Tournament   -.277484
  Score_V      -.134625
  Score_N      -.120083
  Wins         -.080110
  Speech        0.273525
  Voters        0.294129
  Liar          0.309145
  Election      0.315647
  Republican    0.318973
  President     0.333439
  Democrat      0.336873

(Plot: the documents plotted on Prin 1 vs. Prin 2.)
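Output like the eigenvalue table and Prin1 loadings above is what PROC PRINCOMP produces; a sketch, assuming the word-count variables are named as in the loading table and stored in a data set called EMAILS:

  /* Principal components of the 13 word-count variables (correlation matrix by default) */
  proc princomp data=emails out=scores;
     var basketball ncaa tournament score_v score_n wins
         speech voters liar election republican president democrat;
  run;
  /* OUT=SCORES adds Prin1, Prin2, ...; Prin1 is the dimension that separates sports from politics */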

Page 70:

(Repeats the eigenvalue table and Prin1 loadings above, with the Prin 1 vs. Prin 2 plot.)

Page 71:

Sports Cluster

Politics Cluster

Page 72:

(Columns: Obs, CLUSTER, Prin1, then the 13 word-count variables; the headings were shown vertically in the original.)

  Obs  CLUSTER   Prin1       word counts
   3     1      -3.63815     0  2  0 14  0  2 12  0 16  4 24 19 30
  11     1      -3.02803     2  0  0 14  1  3 12  0 16 12 17 23  8
   5     1      -2.98347     0  0  4 16  0  0 15  2 17  3  9  0  1
  14     1      -2.48381     1  0  4 21  3  6  9  3  8  0  3 10 20
   7     1      -2.37638     2  3  1 13  0  1 12 13 20  0  0  1  6
   8     1      -1.79370     4  1  4 16  2  4  9  0 12  9  3  0  0
                (biggest gap)
   1     2      -0.00738    20  8 10 12  6  0  1  5  3  8 18 15 21
   2     2       0.48514     5  6  9  5  4  2  0  9  0 12 12  9  0
   6     2       1.54559    10  6  9  5  5 19  5 20  0 18 13  9 14
   4     2       1.59833     8  9  7  0 12 14  2 12  3 15 22  8  2
  10     2       2.49069    19 22 10 11  9 12  0 14 10 22  3  1  0
  13     2       3.16620    14 17 12  0 20 19  0 12  5  9  6  1  4
  12     2       3.48420    16 19 21  0 13  9  0 16  4 12  0  0  2
   9     2       3.54077    26 13  9  2 16 20  6 24  4 30  9 10 14

Cluster 1 (negative Prin1): Sports documents.    Cluster 2 (positive Prin1): Politics documents.

Page 73:

PROC CLUSTER (single linkage) agrees !

Cluster 2 Cluster 1

Page 74:

Memory Based Reasoning (optional)

Usual name: Nearest Neighbor Analysis

Probe Point

(Diagram: a probe point with its 16 nearest neighbors — 9 blue, 7 red.)

Classify as BLUE. Estimate Pr{Blue} as 9/16.
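Outside the EM MBR node, the same idea can be sketched with PROC DISCRIM's nonparametric nearest-neighbor method; every name below is a placeholder.

  /* 16-nearest-neighbor classification of probe points in data set PROBES */
  proc discrim data=train test=probes testout=probe_scores
               method=npar k=16;        /* k-nearest-neighbor rule       */
     class color;                       /* target: blue vs. red          */
     var x1 x2;                         /* coordinates used for distance */
  run;
  /* TESTOUT= contains posterior probabilities, e.g. Pr{Blue} = 9/16 style estimates */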

Page 75:

Note: the Enterprise Miner Exported Data Explorer chooses the colors.

Page 76:

(Three panels: the original data; the original data scored by the Enterprise Miner nearest-neighbor method; and a score data set (grid) scored by the same method.)

Page 77:

Fisher’s Linear Discriminant Analysis - an older method of classification

(optional) Assumes multivariate normal distribution of inputs (features)

Computes a “posterior probability” for each of several classes, given the observations.

Based on statistical theory

Page 78:

Example: One input (financial stress index), three populations (pay all credit card debt, pay part, default). Normal distributions with the same variance.

Classify a credit card holder by financial stress index: pay all (stress < 20), pay part (20 < stress < 27.5), default (stress > 27.5). Data are hypothetical. Means are 15, 25, 30.

Page 79:

Example: Differing variances. Not just the closest mean now.

Normal distributions with different variances. Classify a credit card holder by financial stress index: pay all (stress < 19.48), pay part (19.48 < stress < 26.40), default (26.40 < stress < 36.93), pay part (stress > 36.93). Data are hypothetical. Means are 15, 25, 30; standard deviations 3, 6, 3.

Page 80:

Example: 20% defaulters, 40% pay part, 40% pay all.

Normal distributions with the same variance and these “priors.” Classify a credit card holder by financial stress index: pay all (stress < 19.48), pay part (19.48 < stress < 28.33), default (28.33 < stress < 35.00), pay part (stress > 35.00). Data are hypothetical. Means are 15, 25, 30; standard deviations 3, 6, 3; population size ratios 2:2:1.

Page 81:

X ~ N(μ, σ²):

  f(X) = (1/(σ√(2π))) exp( −(X − μ)²/(2σ²) )
       = (1/(σ√(2π))) exp( −X²/(2σ²) ) · exp( (μX − μ²/2)/σ² )

with μ = 15, 25, or 30 and σ = 3 here.

The first two factors are the same for every population, since σ is common; the part that changes from one population to another is Fisher’s linear discriminant function. Here it is (μX − μ²/2)/σ², that is, (μX − μ²/2)/9. The bigger this is, the more likely it is that X came from the population with mean μ.

The three probability densities are in the same proportions as their values of exp((μX − μ²/2)/9).

Page 82:

Example: Stress index X = 21, with σ = 3 (so σ² = 9) and μ = 15, 25, or 30.

  μ                        15            25            30
  (μX − μ²/2)/9            22.5          23.61         20.00
  exp((μX − μ²/2)/9)       59.11×10^8    179.5×10^8    4.85×10^8
  probability              0.24          0.74          0.02

Classify as someone who pays only part of the credit card bill, because 21 is closer to 25 than to 15 or 30. The probabilities 0.24, 0.74, 0.02 give more detail.
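The 0.24 / 0.74 / 0.02 calculation above in a short data step:

  data fisher;
     x = 21;
     array mu[3]  (15 25 30);
     array d[3];  array prob[3];
     total = 0;
     do i = 1 to 3;
        d[i] = exp( (mu[i]*x - mu[i]**2/2) / 9 );   /* exp of the linear discriminant */
        total + d[i];
     end;
     do i = 1 to 3;
        prob[i] = d[i] / total;                     /* 0.24, 0.74, 0.02 */
     end;
     put prob1= prob2= prob3=;
  run;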

Page 83:

Fisher’s discriminant for the credit card data was (μX − μ²/2)/9, because the variance was 9 in every group:

  (μX − μ²/2)/9 = 1.67 X − 12.50  for mean 15
  (μX − μ²/2)/9 = 2.78 X − 34.72  for mean 25
  (μX − μ²/2)/9 = 3.33 X − 50.00  for mean 30

For 2 inputs it again has a linear form ___ + ___X1 + ___X2, where the coefficients come from the bivariate mean vector and the 2×2 covariance matrix that is assumed constant across populations.

When the variances differ, the formulas become more complicated, the discriminant becomes quadratic, and the boundaries are not linear.

Page 84:

Example: Two inputs, three populations. Multivariate normals with the same 2×2 covariance matrix.

Viewed from above, the boundaries appear to be linear.

Page 85:

(Plot: debt vs. income, with regions labeled Defaults, Pays part only, Pays in full.)

Page 86:

Generated sample: 1000 per group

(Scatterplot: debt index vs. income index.)

Page 87:

proc discrim data=a; class group; var X1 X2; run;

Constant = -.5 * Xbar_j' * inv(COV) * Xbar_j        Coefficient vector = inv(COV) * Xbar_j

Linear Discriminant Function for group

  Variable        1            2            3
  Constant    -20.09529     -0.31096     -1.73324
  X1           -1.91229      0.12222     -0.06775
  X2            2.13365      0.19059     -0.61763

Page 88:

Error Count Estimates for group

            1         2         3         Total
  Rate      0.0000    0.0930    0.1120    0.0683
  Priors    0.3333    0.3333    0.3333

Number of Observations and Percent Classified into group

  From group       1          2          3          Total
  1             1000          0          0          1000
               100.00       0.00       0.00        100.00
  2                0        907         93          1000
                 0.00      90.70       9.30        100.00
  3                0        112        888          1000
                 0.00      11.20      88.80        100.00
  Total         1000       1019        981          3000
               33.33      33.97      32.70         100.00
  Priors     0.33333    0.33333    0.33333

Page 89:

(Plot: debt index vs. income index, with regions labeled Defaults, Pays part only, Pays in full.)

Generated sample: 1000 per group. The sample discriminants differ somewhat from the theoretical ones.