DATA MINING DECISION TREE INDUCTION
Classification Techniques
Linear Models
Support Vector Machines
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Another Decision Tree Example

(Same training data as above.)

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

More than one tree may perfectly fit the data.
Decision Tree Classification Task

[Figure: the Training Set is fed to a tree induction algorithm, which learns a model (Induction); the resulting model, a decision tree, is then applied to the Test Set to predict the missing class labels (Deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree (the first example tree above) and, at each node, follow the branch that matches the test record:

1. Refund = No, so take the "No" branch to MarSt.
2. Marital Status = Married, so take the "Married" branch, which ends at the leaf NO.

Assign Cheat to "No".
Decision Tree Terminology

[Figure only: diagram labeling the parts of a decision tree, such as the root node, internal decision nodes, branches, and leaf nodes.]
Decision Tree Induction

Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT

John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical ID3 and C4.5 algorithms.
Decision Tree Classifier (Ross Quinlan)

[Figure: scatter plot of insects, with Antenna Length (1-10) on one axis and Abdomen Length (1-10) on the other, partitioned by the tree below.]

Abdomen Length > 7.1?
  yes -> Katydid
  no  -> Antenna Length > 6.0?
           yes -> Katydid
           no  -> Grasshopper
[Figure: a taxonomic key for insects, drawn as a decision tree:]

Antennae shorter than body?
  Yes -> Grasshopper
  No  -> 3 Tarsi?
           Yes -> Cricket
           No  -> Foretiba has ears?
                    Yes -> Katydids
                    No  -> Camel Cricket

Decision trees predate computers.
Definition

A decision tree is a classifier in the form of a tree structure:
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of a split on an attribute
– Path: a conjunction of tests leading to the final decision (the tree as a whole represents a disjunction of such paths)

Decision trees classify instances or examples by starting at the root of the tree and moving down it until a leaf node is reached.
Decision Tree Classification

• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• This can also be called supervised segmentation, which emphasizes that we are segmenting the instance space
– Tree pruning
• Identify and remove branches that reflect noise or outliers
Decision Tree Representation

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

outlook?
  sunny    -> humidity?
                high   -> no
                normal -> yes
  overcast -> yes
  rain     -> wind?
                strong -> no
                weak   -> yes
How do we Construct a Decision Tree?

Basic algorithm (a greedy algorithm):
Tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Why do we call this a greedy algorithm? Because it makes locally optimal decisions (at each node).
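To make the procedure concrete, here is a minimal Python sketch (not from the slides; the dict-based tree representation and attribute handling are my own simplifications). It grows the tree top-down, choosing at each node the attribute with the highest information gain; its stopping tests anticipate the criteria on the next slide.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum_i p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from a multi-way split on attr
    n = len(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attr], []).append(label)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Greedy top-down induction over categorical attributes (rows are dicts)
    if len(set(labels)) == 1:            # all samples belong to one class
        return labels[0]
    if not attrs:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"split_on": best, "children": {}}
    for value in set(row[best] for row in rows):   # partition and recurse
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return node

# Example call: build_tree(rows, labels, ["Refund", "MarSt"]), where each row
# is a dict like {"Refund": "No", "MarSt": "Married"} and labels are "Yes"/"No".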
When Do we Stop Partitioning?

All samples for a node belong to the same class
No remaining attributes (majority voting is used to assign the class)
No samples left
How to Pick Locally Optimal Split

Hunt's algorithm: recursively partition training records into successively purer subsets.

How to measure purity/impurity?
Entropy and the associated information gain
Gini
Classification error rate (never used in practice, but good for understanding and simple exercises)
How to Determine Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:

Own Car?
  Yes -> C0: 6, C1: 4
  No  -> C0: 4, C1: 6

Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7

Student ID?
  c1 ... c20 -> each child contains a single record (C0: 1, C1: 0 or C0: 0, C1: 1), so every child is pure

Which test condition is the best?
Why is Student ID a bad feature to use?
How to Determine Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.

Need a measure of node impurity:

C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
C0: 9, C1: 1  -> homogeneous, low degree of impurity
Information Theory

Think of playing "20 questions": I am thinking of an integer between 1 and 1,000. What is the first question you would ask? Why?

Entropy measures how much more information you need before you can identify the integer. Initially there are 1,000 possible values, which we assume are equally likely.

What is the maximum number of questions you need to ask? If each question halves the remaining range, at most ceil(log2(1000)) = 10 questions are needed.
Entropy

Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:

Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)

where p_1 is the fraction of positive examples in S and p_0 is the fraction of negatives.

If all examples are in one category, entropy is zero (we define 0 \log(0) = 0).
If examples are equally mixed (p_1 = p_0 = 0.5), entropy is a maximum of 1.

For multi-class problems with c categories, entropy generalizes to:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Entropy for Binary Classification

The entropy is 0 if the outcome is certain.
The entropy is maximum if we have no knowledge of the system (i.e., any outcome is equally possible).

[Figure: entropy of a 2-class problem plotted against the proportion of one of the two groups, rising from 0 to a maximum of 1 at a proportion of 0.5 and falling back to 0.]
Information Gain in Decision Tree Induction

• Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute.
• Assume that using attribute A, the current set will be partitioned into some number of child sets.
• The encoding information that would be gained by branching on A is:

Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})

The summation in the above formula is a bit misleading: when doing the summation, we weight each entropy by the fraction of the total examples in the particular child set. This applies to GINI and error rate as well.
Examples for Computing Entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

NOTE: p(j|t) is computed as the relative frequency of class j at node t.

Node with C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2(0) - 1 log2(1) = -0 - 0 = 0

Node with C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

Node with C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Node with C1: 3, C2: 3
P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
Entropy = -(1/2) log2(1/2) - (1/2) log2(1/2) = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1
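These values are easy to verify programmatically; a minimal sketch (mine, not from the slides):

import math

def entropy(counts):
    # Entropy from the class counts at a node; 0 * log2(0) is treated as 0.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # 0.650
print(entropy([2, 4]))  # 0.918
print(entropy([3, 3]))  # 1.0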
How to Calculate log2(x)

Many calculators only have buttons for log10(x) and loge(x) ("log" typically means log10).
You can calculate the log for any base b as follows:
logb(x) = logk(x) / logk(b)
Thus log2(x) = log10(x) / log10(2).
Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get the log base 2. For example, log2(6) = log10(6) / log10(2) ≈ 0.778 / 0.301 ≈ 2.585.
You can use this for HW if needed.
Splitting Based on INFO...

Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i (and n the total number of records).

Uses a weighted average of the child nodes, where the weight is based on the number of examples.
Used in the ID3 and C4.5 decision tree learners (WEKA's J48 is a Java version of C4.5).
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
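In code the weighted average is explicit; a small sketch (my own helper names, reusing the class-count entropy shown earlier):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in child_counts_list)
    return entropy(parent_counts) - weighted

# The "Own Car?" split from the earlier slide: parent (10, 10) is split
# into children (6, 4) and (4, 6):
print(gain_split([10, 10], [[6, 4], [4, 6]]))  # ~0.029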
How to Split on Continuous Attributes?

For continuous attributes:
Partition the continuous values of attribute A into a discrete set of intervals
Create a new boolean attribute A_c by looking for a threshold c:

A_c = true if A < c, false otherwise

One method is to try all possible splits. How to choose c?
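One common answer, sketched below (my own implementation, assuming candidate thresholds at midpoints between consecutive distinct sorted values), is to evaluate the information gain of every candidate cut and keep the best:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    # Returns (c, gain) maximizing the gain of the boolean test A < c.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = entropy([l for _, l in pairs])
    best_c, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                               # no boundary between ties
        c = (pairs[i - 1][0] + pairs[i][0]) / 2    # midpoint candidate
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain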
Person  Hair Length  Weight  Age  Class
Homer   0"           250     36   M
Marge   10"          150     34   F
Bart    2"           90      10   M
Lisa    6"           78      8    F
Maggie  4"           20      1    F
Abe     1"           170     70   M
Selma   8"           160     41   F
Otto    10"          180     38   M
Krusty  6"           200     45   M
Comic   8"           290     38   ?
Let us try splitting on Hair Length

Entropy(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Hair Length <= 5?
  yes -> 4 of 9 people (entropy 0.8113)
  no  -> 5 of 9 people (entropy 0.9710)

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight

Entropy(4F, 5M) = 0.9911

Weight <= 160?
  yes -> 5 of 9 people (entropy 0.7219)
  no  -> 4 of 9 people (entropy 0)

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age

Entropy(4F, 5M) = 0.9911

Age <= 40?
  yes -> 6 of 9 people (entropy 1)
  no  -> 3 of 9 people (entropy 0.9183)

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
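These three gains can be checked with a few lines of Python (a sketch; the per-branch class counts, written as [F, M], come from the table on the earlier slide):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

parent = [4, 5]                            # 4 F, 5 M
print(gain(parent, [[1, 3], [3, 2]]))      # Hair Length <= 5: 0.0911
print(gain(parent, [[4, 1], [0, 4]]))      # Weight <= 160:    0.5900
print(gain(parent, [[3, 3], [1, 2]]))      # Age <= 40:        0.0183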
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified... so we simply recurse!

Weight <= 160?
  yes -> Hair Length <= 2?
  no  -> (all male)

This time we find that we can split on Hair Length, and we are done!
Weight <= 160?
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
  no  -> Male

We don't need to keep the data around, just the test conditions.

How would these people be classified?
It is trivial to convert Decision Trees to rules...

Weight <= 160?
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
  no  -> Male

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female

Note: we could avoid the use of "else if" by specifying all test conditions on the path from the root to the corresponding leaf.
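The rule list transcribes directly into code; for instance (a trivial sketch):

def classify(weight, hair_length):
    # The learned tree, expressed as if/elif/else rules.
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=290, hair_length=8))  # the "Comic" row -> Male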
Once we have learned the decision tree, we don't even need a computer!

[Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

This decision tree is attached to a medical machine and is designed to help nurses make decisions about what type of doctor to call.
Wears green?
  Yes -> Male
  No  -> Female

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes?"...
GINI is Another Measure of Impurity

Gini for a given node t with classes j:

GINI(t) = 1 - \sum_j [p(j|t)]^2

NOTE: p(j|t) is again computed as the relative frequency of class j at node t.

Compute the best split by finding the partition that yields the lowest GINI, where we again take the weighted average of the children's GINI values.

C1: 0, C2: 6  ->  Gini = 0.000
C1: 1, C2: 5  ->  Gini = 0.278
C1: 2, C2: 4  ->  Gini = 0.444
C1: 3, C2: 3  ->  Gini = 0.500

Worst GINI = 0.5; best GINI = 0.0.
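A sketch of the computation (reproducing the four example nodes):

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.000
print(gini([1, 5]))  # 0.278
print(gini([2, 4]))  # 0.444
print(gini([3, 3]))  # 0.500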
Splitting Criteria Based on Classification Error

Classification error at a node t:

Error(t) = 1 - \max_i P(i|t)

Measures the misclassification error made by a node:
Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information. This is 1/2 for 2-class problems.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
Examples for Computing Error

Error(t) = 1 - \max_i P(i|t)

Equivalently: predict the majority class and determine the fraction of errors.

Node with C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

Node with C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

Node with C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
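And the corresponding one-liner (a sketch):

def classification_error(counts):
    # Error(t) = 1 - max_i P(i|t): the fraction not in the majority class.
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))  # 0.0
print(classification_error([1, 5]))  # 0.1667 (= 1/6)
print(classification_error([2, 4]))  # 0.3333 (= 1/3)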
Complete Example Using Error Rate

The initial sample has 3 C1 and 15 C2.
Based on one 3-way split, you get the three child nodes below:

C1: 0, C2: 6
C1: 2, C2: 4
C1: 1, C2: 5

What is the error rate initially?
What is it afterwards?
What is the decrease in error rate?
As usual you need to take the weighted average (but there is a shortcut).
Error Rate Example Continued

Error rate before: 3/18

Error rate after:

Shortcut:
Number of errors = 0 + 2 + 1 = 3, out of 18 examples
Error rate = 3/18

Weighted average method:
6/18 x 0 + 6/18 x 2/6 + 6/18 x 1/6, which simplifies to 2/18 + 1/18 = 3/18

So the split yields no decrease in error rate at all, even though the children are purer than the parent.
Comparison Among Splitting Criteria

For a 2-class problem:

[Figure: entropy, Gini, and misclassification error plotted against the fraction p of examples in one class; all three peak at p = 0.5 and are 0 at p = 0 and p = 1, with entropy highest and error rate lowest in between.]
Discussion

Error rate is often the metric used to evaluate a classifier (but not always), so it seems reasonable to use error rate to determine the best split. That is, why not just use a splitting metric that matches the ultimate evaluation metric?

But this is wrong! The reason is related to the fact that decision trees use a greedy strategy, so we need a splitting metric that leads to globally better results. The error-rate example above showed a 3-way split that leaves the children purer yet yields zero reduction in error rate; entropy and Gini still give such a split credit.

The other metrics will empirically outperform error rate, although there is no proof of this.
How to Specify Test Condition?

Depends on attribute type:
Nominal
Ordinal
Continuous

Depends on the number of ways to split:
2-way split
Multi-way split
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values.

CarType -> Family | Sports | Luxury

Binary split: divides values into two subsets; need to find the optimal partitioning.

CarType -> {Sports, Luxury} | {Family}   OR   CarType -> {Family, Luxury} | {Sports}
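For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary partitions. A sketch that enumerates them (my own helper, not from the slides):

from itertools import combinations

def binary_partitions(values):
    # Yield each two-subset partition of a set of nominal values exactly once.
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    # Fixing one value on the left side avoids mirror-image duplicates.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            yield left, set(values) - left

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs", sorted(right))
# ['Family'] vs ['Luxury', 'Sports']
# ['Family', 'Luxury'] vs ['Sports']
# ['Family', 'Sports'] vs ['Luxury']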
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values.

Size -> Small | Medium | Large

Binary split: divides values into two subsets; need to find the optimal partitioning.

Size -> {Small, Medium} | {Large}   OR   Size -> {Medium, Large} | {Small}

What about this split? Size -> {Small, Large} | {Medium}
It is normally disallowed, because grouping Small with Large violates the ordering of the values.
Splitting Based on Continuous Attributes

Different ways of handling:
Discretization to form an ordinal categorical attribute
  Static: discretize once at the beginning
  Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Binary decision: (A < v) or (A >= v)
  Consider all possible splits and find the best cut
  Can be more compute intensive
Splitting Based on Continuous Attributes

(i) Binary split:  Taxable Income > 80K?  (Yes / No)
(ii) Multi-way split:  Taxable Income?  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
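A sketch of the two static bucketing schemes named above, assuming numpy is available and using the taxable-income values from the earlier training table:

import numpy as np

incomes = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # in K

# Equal-interval bucketing: four bins with evenly spaced edges.
equal_width_edges = np.linspace(incomes.min(), incomes.max(), num=5)
print(equal_width_edges)          # [ 60. 100. 140. 180. 220.]

# Equal-frequency bucketing: edges at percentiles, so each bin holds
# roughly the same number of records.
equal_freq_edges = np.percentile(incomes, [0, 25, 50, 75, 100])
print(equal_freq_edges)           # [ 60.   77.5  92.5 115.  220. ]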
Data Fragmentation

The number of instances gets smaller as you traverse down the tree, and the number of instances at the leaf nodes could be too small to make a statistically significant decision.

Decision trees can suffer from data fragmentation. This is especially true if there are many features and not too many examples.

True or False: all classification methods may suffer from data fragmentation. False: not logistic regression or instance-based learning. Fragmentation only applies to divide-and-conquer methods.
Expressiveness

Expressiveness relates to the flexibility of the classifier in forming decision boundaries.
Linear models are not that expressive, since they can only form linear boundaries.
Decision tree models can form rectangular regions.

Which is more expressive and why? Decision trees, because they can form many regions; but decision trees do have the limitation of only forming axis-parallel boundaries.

Decision trees do not generalize well to certain types of functions (like parity, which depends on all features): for accurate modeling, you must have a complete tree.
They are also not expressive enough for modeling continuous variables, especially when more than one variable is involved at a time.
Decision Boundary

[Figure: unit square of points from two classes, with x and y axes running from 0 to 1, partitioned by the tree below into axis-parallel rectangles. Leaf labels give the counts of the two classes.]

x < 0.43?
  Yes -> y < 0.47?
           Yes -> (4, 0)
           No  -> (0, 4)
  No  -> y < 0.33?
           Yes -> (0, 3)
           No  -> (4, 0)

• The border line between two neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees

[Figure: a single oblique test, x + y < 1, splitting the plane diagonally between Class = + and the other class.]

This special type of decision tree avoids some weaknesses and increases the expressiveness of decision trees.
This is not what we mean when we refer to decision trees (e.g., on an exam).
Tree Replication

[Figure: a tree rooted at P with branches leading to Q and R, in which the same subtree (a test on S with leaves 0 and 1) appears in more than one branch.]

This can be viewed as a weakness of decision trees, but it is really a minor issue.
Pros and Cons of Decision Trees

Advantages:
Easy to understand: you can get a global view of what is going on and also explain individual decisions
Can generate rules from them
Fast to build and apply
Can handle redundant and irrelevant features and missing values

Disadvantages:
Limited expressive power
May suffer from overfitting, and a validation set may be necessary to avoid it
More to Come on Decision Trees

We have covered most of the essential aspects of decision trees except pruning.
We will cover pruning next and, more generally, overfitting avoidance.
We will also cover evaluation, which applies not just to decision trees but to all predictive models.