Decision Trees
Content
- Examples
- What is a decision tree?
- How to build a decision tree?
- Stopping rule and tree pruning
- Confusion matrix (binary)
Classification Example: Fisher's Iris Data
- 3 species of iris flowers, 50 observations per species
- 4 predictor variables: petal length and width, sepal length and width
- Objective: to predict the species class from these measurements (a fitted tree is sketched below)
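The slides do not name a software package; as a minimal sketch, scikit-learn's DecisionTreeClassifier (an assumed tool, not one the slides prescribe) can fit such a tree to the iris data:

```python
# Fit a small classification tree to Fisher's iris data.
# scikit-learn is an assumption; the slides do not name a package.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Print the fitted splits and check accuracy on held-out data.
print(export_text(tree, feature_names=[
    "sepal length", "sepal width", "petal length", "petal width"]))
print("test accuracy:", tree.score(X_test, y_test))
```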
Classification Example: Stock Selection
- Objective: to predict whether a stock is underperformed or overperformed.
- Underperformed: its monthly return is less than the median stock return for the month.
- Otherwise, overperformed.
Classification Example: In-Patient Data
- 1,756,484 records of hospital in-patient statistics in NSW, Australia, in 1996-97
- Aim: identify risk factors for an adverse event (AE)
- An adverse event (AE) is an unintended injury or complication which results in disability, death, or prolongation of hospital stay, and is caused by health care management rather than the patient's disease. E.g., an accidental cut during surgery, or an incorrect drug dosage.
- Potential predictors: comorbidity (multiple diagnoses), procedures (multiple procedures), gender, insurance, psychiatric status, age, day only, readmitted, etc.
- AEs make up 3.4% of cases in the dataset
Classification Example: In-Patient Data (Model Performance)

Confusion matrix:

                   Predicted AE    Predicted no AE    Total
Actual AE              47,403           12,601          60,004
Actual no AE          542,874        1,153,606       1,696,480

Misclassification rate = (12,601 + 542,874) / 1,756,484 = 31.6%
Sensitivity = 47,403 / 60,004 = 79.0%
Specificity = 1,153,606 / 1,696,480 = 68.0%
Figure slides: Pen-Digits Data; Binary Tree.
Content
- Examples
- What is a decision tree?
- How to build a decision tree?
- Stopping rule and tree pruning
- Confusion matrix (binary)
What is a decision tree?
Variations of Decision Trees
- Classification tree: the target is discrete (binary, nominal). The leaves give the predicted class as well as the probability of class membership.
- Regression tree: the target is continuous. The leaves give the predicted value of the target.
- Trees may use binary splits or multiway splits.
Figure slides: Illustrating Classification Task; Example of a Decision Tree; Decision Tree Classification Task; Apply Model to Test Data.

Content
- Examples
- What is a decision tree?
- How to build a decision tree?
- Stopping rule and tree pruning
- Confusion matrix (binary)
How to build a decision tree?
Recursive partitioning is a top-down, greedy algorithm for fitting a decision tree to the data (a bare-bones sketch follows below).
- Top-down: starting at the root node, split the data into subgroups that are as homogeneous as possible with respect to the target.
- Greedy: always make the locally optimal choice in the hope that this will lead to a globally optimal solution.
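As an illustration of the idea, the following hypothetical helper implements greedy recursive partitioning with binary splits on continuous inputs, using the Gini impurity introduced later in the deck; it is a sketch, not any package's actual algorithm:

```python
# Bare-bones recursive partitioning (hypothetical helper): greedy
# binary splits on continuous inputs, minimising weighted Gini impurity.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search every (variable, cut-off) binary split."""
    best = None  # (weighted child impurity, column, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:       # candidate cut-offs
            left = X[:, j] <= t
            w = left.mean()
            score = w * gini(y[left]) + (1 - w) * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    # Stop when the node is pure, the depth limit is hit,
    # or no valid split exists; otherwise recurse greedily.
    split = None if depth == max_depth or gini(y) == 0 else best_split(X, y)
    if split is None:                            # make a leaf
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]         # majority class
    _, j, t = split
    left = X[:, j] <= t
    return {"var": j, "cut": t,
            "left": grow(X[left], y[left], depth + 1, max_depth),
            "right": grow(X[~left], y[~left], depth + 1, max_depth)}
```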
Figure slides: Root-Node Split; 1-Deep Space; Depth 2; 2-Deep Space.

Three Steps in Tree Construction
1. Selection of the best split: which input variable gives the 'best' split? 'Best' according to which splitting criterion?
2. Stop-splitting rule: when should the splitting stop?
3. Assignment of each leaf node to a class: predict the value of the target variable (discrete or continuous) at each leaf node.
Number of Possible Splits
Split on a nominal input with L distinct levels. The number of possible splits into B branches satisfies the recurrence

S(L, B) = B · S(L − 1, B) + S(L − 1, B − 1)
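This is the recurrence for Stirling numbers of the second kind. A small sketch, assuming the standard boundary conditions (which the slide omits):

```python
# Count splits of a nominal input with L levels into B branches using
# the slide's recurrence. Boundary conditions are the standard ones for
# Stirling numbers of the second kind (assumed; not on the slide).
from functools import lru_cache

@lru_cache(maxsize=None)
def S(L, B):
    if B < 1 or B > L:
        return 0
    if B == 1 or B == L:
        return 1
    return B * S(L - 1, B) + S(L - 1, B - 1)

# 4 nominal levels into 2 branches: 7 splits, matching 2^(4-1) - 1.
print(S(4, 2))
```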
Split on an ordinal input with L distinct levels. The number of possible splits into B branches is C(L − 1, B − 1): choose B − 1 cut points from the L − 1 gaps between adjacent levels.

Split on a continuous input: treat it as an ordinal input whose levels are its distinct observed values.
Selection of the Best Split
- Exhaustively examining all possible splits is time-consuming.
- By default, software packages use exhaustive search when the number of possible splits is below 5,000; otherwise, the levels of the input are clustered to limit the splits considered.
- An alternative is to consider binary splits only (B = 2): a nominal input then has 2^(L−1) − 1 possible splits, and an ordinal input has L − 1.
Splitting Criterion
After a set of candidate splits is determined, a splitting criterion is used to choose the best one.
Splitting Criterion for a Discrete Target
There are two approaches for a discrete target.
Method 1: a statistical test of independence between the input and target variables
- Chi-squared test
- Likelihood ratio test
The best split is the most significant one (i.e., the one with the smallest p-value).
Statistical Approach to Splitting
Any split in a classification tree can be arranged in a contingency table, with the target levels as rows and the branches as columns.
Test of independence between target (rows) and input (columns):
- Chi-squared test: X² = Σ (O − E)² / E
- Likelihood ratio test: G² = 2 Σ O ln(O / E)
where O = observed frequency and E = expected frequency. Both X² and G² follow a chi-square distribution with d.f. = (r − 1)(B − 1), where r = number of target levels and B = number of branches.
Example revisited:
X² = 266.67 (d.f. = 1); G² = 345.22 (d.f. = 1)
A smaller p-value implies a stronger association between input and target. The split with the smallest p-value, or equivalently the largest logworth = −log10(p-value), is chosen.
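As an illustration, both statistics can be computed directly. The contingency table below is hypothetical, since the slide's own example table is not reproduced in the transcript:

```python
# Chi-squared and likelihood-ratio tests on a contingency table whose
# rows are target levels and columns are branches of a candidate split.
# The counts are made-up; the slide's original table is not available.
import numpy as np
from scipy.stats import chi2

obs = np.array([[90, 30],      # target level 1 across B = 2 branches
                [30, 90]])     # target level 2
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
exp = row @ col / obs.sum()               # expected frequencies E

X2 = ((obs - exp) ** 2 / exp).sum()       # Pearson chi-squared
G2 = 2 * (obs * np.log(obs / exp)).sum()  # likelihood ratio
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)

for name, stat in (("X2", X2), ("G2", G2)):
    p = chi2.sf(stat, df)
    print(name, round(stat, 2), "p =", p, "logworth =", -np.log10(p))
```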
Pen-Digits Data: Chi-Squared Test (figure slide)
Splitting Criterion for a Discrete Target
Method 2: based on an impurity function of a node
- Gini index: 1 − Σj pj²
- Entropy: −Σj pj log2(pj), where log2(x) = ln(x) / ln(2)
- Misclassification error: 1 − maxj pj
The best split is the one that gives the maximum reduction in impurity (IP). In the slide's example (parent impurity 0.4; children of relative sizes 6/10 and 4/10 with impurities 0.33 and 0):
ΔIP = 0.4 − (6/10)(0.33) − (4/10)(0) = 0.202
Gini Index
The Gini index is a measure of diversity for discrete data.
- Gini = 1 − 2(3/8)² − 2(1/8)² = 0.69
- Gini = 1 − (6/7)² − (1/7)² = 0.24
Minimum G = 0 if one of the pj's equals 1; maximum G = 1 − 1/k if p1 = … = pk = 1/k.
Entropy (figure slide)

Impurity Function
Properties of an impurity function of a node:
- Nonnegative
- Decreases as the node becomes more "pure", i.e., as one class dominates

For node 1 (class proportions 0.5 / 0.5):
- Gini = 1 − 0.5² − 0.5² = 0.5
- Entropy = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
- Misclassification error = 1 − 0.5 = 0.5
For node 2 (class proportions 0.75 / 0.25):
- Gini = 1 − 0.75² − 0.25² = 0.375
- Entropy = −0.75 log2(0.75) − 0.25 log2(0.25) = 0.811
- Misclassification error = 1 − 0.75 = 0.25
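A quick check of these numbers (a sketch using NumPy, which is an assumption; the slides show only hand calculations):

```python
# Recompute the node-1 / node-2 impurity values from the slide.
import numpy as np

def gini(p):     return 1 - np.sum(np.asarray(p) ** 2)
def entropy(p):  p = np.asarray(p); return -np.sum(p * np.log2(p))
def miscl(p):    return 1 - np.max(p)

for name, p in (("node 1", [0.5, 0.5]), ("node 2", [0.75, 0.25])):
    print(name, round(gini(p), 3), round(entropy(p), 3), round(miscl(p), 3))
# node 1 -> Gini 0.5,   entropy 1.0,   misclassification 0.5
# node 2 -> Gini 0.375, entropy 0.811, misclassification 0.25
```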
Remarks
The process of selecting the best split on a node:
1) Select the best split on each input variable (i.e., choose the number of branches and the cut-off points).
2) Select the best of these.

Comparing splits on the same input variable: Gini, entropy, and misclassification error favour splits into greater numbers of branches (large B), so they are not appropriate for evaluating multiway splits. The p-values of the chi-squared and likelihood ratio tests automatically adjust for this bias through the degrees of freedom.
Problem with Impurity Reduction
Impurity reduction tends to prefer splits that result in a large number of partitions, each small but pure. For example, a customer-ID input has the highest information gain because the entropy of every child node is zero, yet it is useless for prediction.
Remarks
Comparing splits on different input variables: the p-values of the chi-squared and likelihood ratio tests tend to be smaller as the number of possible splits, m, increases. Kass (1980) proposed a Bonferroni adjustment of the p-values to account for this bias:

logworth = −log10(m · p-value)

If every candidate split has logworth < −log10(0.2), do not split. Otherwise, the split with the largest logworth is selected as the best split.
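A minimal sketch of the adjusted-logworth rule; the p-value and m below are made-up inputs:

```python
# Kass-style Bonferroni adjustment: multiply the p-value by the number
# of possible splits m before taking the logworth.
import math

def logworth(p_value, m):
    return -math.log10(m * p_value)

threshold = -math.log10(0.2)              # split only above ~0.699
print(logworth(0.001, m=15), threshold)   # 1.824 > 0.699, so split
```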
Which splitting criterion is the best? There is no single best choice; try each and compare the resulting trees.
P-Value Adjustments in Chi-Square Test
Splitting Criterion for a Continuous Target
Two approaches for a continuous target:
- Based on an impurity function of a node: the sample variance
- Based on a statistical test: the one-way ANOVA F test
Boston Housing Data (figure slide: example split on NOX)
The F test is preferable to (sample) variance reduction because its p-value adjusts for different numbers of branches. The F test is relatively robust to departures from the normality assumption; however, it is sensitive to departures from the constant-variance assumption.
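A sketch of the F test applied to one candidate split, using scipy.stats.f_oneway on hypothetical branch values:

```python
# One-way ANOVA F test as a splitting criterion for a continuous
# target: compare target values across the branches of a candidate
# split. The values below are made-up for illustration.
from scipy.stats import f_oneway

branch1 = [22.1, 19.4, 25.0, 21.3]   # target values falling in branch 1
branch2 = [31.7, 28.9, 35.2, 30.4]   # target values falling in branch 2
F, p = f_oneway(branch1, branch2)
print(F, p)                           # smaller p => stronger split
```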
Assignment of Each Leaf Node to a Class
For a classification tree: classify an observation in node t to the class j with the maximum posterior probability p(j|t), where p(j|t) is proportional to p(j) · p(t|j).
- p(j) is the prior probability of class j
- p(t|j) is the proportion of class-j observations going to node t
- If p(j) is set to the proportion of all observations belonging to class j, then p(j|t) reduces to the proportion of observations in node t belonging to class j
For a regression tree: predict an observation in a node by the sample mean of the target values in the node.
Example
- Prior probabilities: p(1) = p(7) = 364/1064, p(9) = 336/1064
- Conditional probabilities: p(t|1) = 285/364, p(t|7) = 143/364, p(t|9) = 41/336
- Exercise: show that the posterior probabilities are p(1|t) = 285/469, p(7|t) = 143/469, p(9|t) = 41/469
- Classify to class 1.
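A short check of this exercise using exact fractions:

```python
# Reproduce the slide's posterior probabilities from the priors and
# conditionals: p(j|t) is proportional to p(j) * p(t|j).
from fractions import Fraction as F

prior = {1: F(364, 1064), 7: F(364, 1064), 9: F(336, 1064)}
cond  = {1: F(285, 364),  7: F(143, 364),  9: F(41, 336)}

joint = {j: prior[j] * cond[j] for j in prior}
total = sum(joint.values())
post  = {j: joint[j] / total for j in joint}
print(post)                       # {1: 285/469, 7: 143/469, 9: 41/469}
print(max(post, key=post.get))    # classify to class 1
```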
Content
- Examples
- What is a decision tree?
- How to build a decision tree?
- Stopping rule and tree pruning
- Confusion matrix (binary)
Stop-Splitting Rule
A simple method: continue splitting until every node is pure or contains only one observation. Such a tree fits the training data perfectly but may predict poorly on new data.
Two approaches:
- Top-down stopping rules (pre-pruning)
- Bottom-up assessment criteria (post-pruning); one such method is sketched below
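The slides do not name a specific post-pruning method; one common choice is cost-complexity pruning, sketched here via scikit-learn (an assumption):

```python
# Post-pruning via cost-complexity pruning (one possible method; the
# slides do not prescribe it). Grow the full tree, then choose the
# pruning strength ccp_alpha by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best = max(path.ccp_alphas,
           key=lambda a: cross_val_score(
               DecisionTreeClassifier(random_state=0, ccp_alpha=a),
               X, y, cv=5).mean())
print("chosen ccp_alpha:", best)
```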
Advantages of Trees
- Easy to interpret: tree-structured presentation
- Allow mixed input data types: nominal, ordinal, interval
- Allow a discrete (binary or nominal) or continuous target (an ordinal target is not allowed)
- Robust to outliers in inputs
- No problem with missing values
- Automatically detect interactions (AID), accommodate nonlinearity, and select input variables
Disadvantages of Trees
- Most algorithms use univariate splits. Solution: linear combination splits (a1x1 + a2x2 < c?)
- Unstable fitted tree: often a small change in the data results in a very different series of splits. Solution: bagging
- Lack of smoothness (step function) in regression trees: splitting turns continuous input variables into discrete variables. Solution: tree-based regression
- Splitting uses a "greedy" algorithm: while each individual split is optimal, the overall tree need not be.
Content
- Examples
- What is a decision tree?
- How to build a decision tree?
- Stopping rule and tree pruning
- Confusion matrix (binary)
Confusion Matrix
- Misclassification rate = (false positives + false negatives) / (total cases)
- Accuracy (or correct classification rate) = (true positives + true negatives) / (total cases)
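These metrics, recomputed from the in-patient example earlier in the deck:

```python
# Confusion-matrix metrics for the in-patient example:
# TP = 47,403, FN = 12,601, FP = 542,874, TN = 1,153,606.
TP, FN, FP, TN = 47403, 12601, 542874, 1153606
total = TP + FN + FP + TN                      # 1,756,484 records

print("misclassification:", (FP + FN) / total) # ~ 0.316
print("accuracy:         ", (TP + TN) / total) # ~ 0.684
print("sensitivity:      ", TP / (TP + FN))    # ~ 0.790
print("specificity:      ", TN / (TN + FP))    # ~ 0.680
```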
Captured Response Curve or Target Concentration Curve
- Plots the proportion of all responders in the full sample that are captured in the top 10% (20%, …) of people as ranked by the model.
- Goal: locate all positive targets (all respondents).

Response Rate
- Response rate = true positives / total predicted positives
Gains Chart or Response Chart
- Plots the proportion of responders in the top 10% (20%, …) of people as ranked by the model.
- Lift chart: lift = response rate / baseline rate
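A sketch of the ingredients of these curves, on made-up scores and labels:

```python
# Captured response, response rate, and lift for the top decile of
# model-ranked cases. Scores and labels are synthetic illustrations.
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.10, size=1000)        # 10% baseline response
score = y * 0.3 + rng.random(1000)          # informative model score

order = np.argsort(-score)                  # rank by score, best first
top = order[: len(y) // 10]                 # top 10% of cases

baseline = y.mean()
response_rate = y[top].mean()               # responders among top decile
captured = y[top].sum() / y.sum()           # share of all responders
print("response rate:    ", response_rate)
print("captured response:", captured)
print("lift:             ", response_rate / baseline)
```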