DATA MINING DECISION TREE INDUCTION
Classification Techniques
Linear Models
Support Vector Machines
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Another Decision Tree Example

(Same training data as above.)

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

More than one tree may perfectly fit the data.
Decision Tree Classification Task

[Figure: the Training Set is fed to a tree induction algorithm, which learns a model (Induction); the resulting model, a decision tree, is then applied to the Test Set to predict the missing class labels (Deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree (the first example tree above) and, at each node, follow the branch that matches the test record:

1. Refund = No, so take the "No" branch to MarSt.
2. Marital Status = Married, so take the "Married" branch, which ends at the leaf NO.

Assign Cheat to "No".
Decision Tree Terminology

[Figure only: diagram labeling the parts of a decision tree, such as the root node, internal decision nodes, branches, and leaf nodes.]
Decision Tree Induction

Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT

John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical ID3 and C4.5 algorithms.
Decision Tree Classifier (Ross Quinlan)

[Figure: scatter plot of insects, with Antenna Length (1-10) on one axis and Abdomen Length (1-10) on the other, partitioned by the tree below.]

Abdomen Length > 7.1?
  yes -> Katydid
  no  -> Antenna Length > 6.0?
           yes -> Katydid
           no  -> Grasshopper
[Figure: a taxonomic key for insects, drawn as a decision tree:]

Antennae shorter than body?
  Yes -> Grasshopper
  No  -> 3 Tarsi?
           Yes -> Cricket
           No  -> Foretiba has ears?
                    Yes -> Katydids
                    No  -> Camel Cricket

Decision trees predate computers.
Definition

A decision tree is a classifier in the form of a tree structure:
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of a split on an attribute
– Path: a conjunction of tests leading to the final decision (the tree as a whole represents a disjunction of such paths)

Decision trees classify instances or examples by starting at the root of the tree and moving down it until a leaf node is reached.
Decision Tree Classification

• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• This can also be called supervised segmentation, which emphasizes that we are segmenting the instance space
– Tree pruning
• Identify and remove branches that reflect noise or outliers
Decision Tree Representation

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

outlook?
  sunny    -> humidity?
                high   -> no
                normal -> yes
  overcast -> yes
  rain     -> wind?
                strong -> no
                weak   -> yes
How do we Construct a Decision Tree?

Basic algorithm (a greedy algorithm):
Tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Why do we call this a greedy algorithm? Because it makes locally optimal decisions (at each node).
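To make the procedure concrete, here is a minimal Python sketch (not from the slides; the dict-based tree representation and attribute handling are my own simplifications). It grows the tree top-down, choosing at each node the attribute with the highest information gain; its stopping tests anticipate the criteria on the next slide.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum_i p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from a multi-way split on attr
    n = len(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attr], []).append(label)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Greedy top-down induction over categorical attributes (rows are dicts)
    if len(set(labels)) == 1:            # all samples belong to one class
        return labels[0]
    if not attrs:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"split_on": best, "children": {}}
    for value in set(row[best] for row in rows):   # partition and recurse
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return node

# Example call: build_tree(rows, labels, ["Refund", "MarSt"]), where each row
# is a dict like {"Refund": "No", "MarSt": "Married"} and labels are "Yes"/"No".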
When Do we Stop Partitioning?

All samples for a node belong to the same class
No remaining attributes (majority voting is used to assign the class)
No samples left
How to Pick Locally Optimal Split

Hunt's algorithm: recursively partition training records into successively purer subsets.

How to measure purity/impurity?
Entropy and the associated information gain
Gini
Classification error rate (never used in practice, but good for understanding and simple exercises)
How to Determine Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:

Own Car?
  Yes -> C0: 6, C1: 4
  No  -> C0: 4, C1: 6

Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7

Student ID?
  c1 ... c20 -> each child contains a single record (C0: 1, C1: 0 or C0: 0, C1: 1), so every child is pure

Which test condition is the best?
Why is Student ID a bad feature to use?
How to Determine Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.

Need a measure of node impurity:

C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
C0: 9, C1: 1  -> homogeneous, low degree of impurity
Information Theory

Think of playing "20 questions": I am thinking of an integer between 1 and 1,000. What is the first question you would ask? Why?

Entropy measures how much more information you need before you can identify the integer. Initially there are 1,000 possible values, which we assume are equally likely.

What is the maximum number of questions you need to ask? If each question halves the remaining range, at most ceil(log2(1000)) = 10 questions are needed.
Entropy

Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:

Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)

where p_1 is the fraction of positive examples in S and p_0 is the fraction of negatives.

If all examples are in one category, entropy is zero (we define 0 \log(0) = 0).
If examples are equally mixed (p_1 = p_0 = 0.5), entropy is a maximum of 1.

For multi-class problems with c categories, entropy generalizes to:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Entropy for Binary Classification

The entropy is 0 if the outcome is certain.
The entropy is maximum if we have no knowledge of the system (i.e., any outcome is equally possible).

[Figure: entropy of a 2-class problem plotted against the proportion of one of the two groups, rising from 0 to a maximum of 1 at a proportion of 0.5 and falling back to 0.]
Information Gain in Decision Tree Induction

• Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute.
• Assume that using attribute A, the current set will be partitioned into some number of child sets.
• The encoding information that would be gained by branching on A is:

Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})

The summation in the above formula is a bit misleading: when doing the summation, we weight each entropy by the fraction of the total examples in the particular child set. This applies to GINI and error rate as well.
Examples for Computing Entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

NOTE: p(j|t) is computed as the relative frequency of class j at node t.

Node with C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2(0) - 1 log2(1) = -0 - 0 = 0

Node with C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

Node with C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Node with C1: 3, C2: 3
P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
Entropy = -(1/2) log2(1/2) - (1/2) log2(1/2) = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1
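These values are easy to verify programmatically; a minimal sketch (mine, not from the slides):

import math

def entropy(counts):
    # Entropy from the class counts at a node; 0 * log2(0) is treated as 0.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # 0.650
print(entropy([2, 4]))  # 0.918
print(entropy([3, 3]))  # 1.0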
How to Calculate log2(x)

Many calculators only have buttons for log10(x) and loge(x) ("log" typically means log10).
You can calculate the log for any base b as follows:
logb(x) = logk(x) / logk(b)
Thus log2(x) = log10(x) / log10(2).
Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get the log base 2. For example, log2(6) = log10(6) / log10(2) ≈ 0.778 / 0.301 ≈ 2.585.
You can use this for HW if needed.
Splitting Based on INFO...

Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i (and n the total number of records).

Uses a weighted average of the child nodes, where the weight is based on the number of examples.
Used in the ID3 and C4.5 decision tree learners (WEKA's J48 is a Java version of C4.5).
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
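In code the weighted average is explicit; a small sketch (my own helper names, reusing the class-count entropy shown earlier):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in child_counts_list)
    return entropy(parent_counts) - weighted

# The "Own Car?" split from the earlier slide: parent (10, 10) is split
# into children (6, 4) and (4, 6):
print(gain_split([10, 10], [[6, 4], [4, 6]]))  # ~0.029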
How to Split on Continuous Attributes?

For continuous attributes:
Partition the continuous values of attribute A into a discrete set of intervals
Create a new boolean attribute A_c by looking for a threshold c:

A_c = true if A < c, false otherwise

One method is to try all possible splits. How to choose c?
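One common answer, sketched below (my own implementation, assuming candidate thresholds at midpoints between consecutive distinct sorted values), is to evaluate the information gain of every candidate cut and keep the best:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    # Returns (c, gain) maximizing the gain of the boolean test A < c.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = entropy([l for _, l in pairs])
    best_c, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                               # no boundary between ties
        c = (pairs[i - 1][0] + pairs[i][0]) / 2    # midpoint candidate
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain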
Person  Hair Length  Weight  Age  Class
Homer   0"           250     36   M
Marge   10"          150     34   F
Bart    2"           90      10   M
Lisa    6"           78      8    F
Maggie  4"           20      1    F
Abe     1"           170     70   M
Selma   8"           160     41   F
Otto    10"          180     38   M
Krusty  6"           200     45   M
Comic   8"           290     38   ?
Let us try splitting on Hair Length

Entropy(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Hair Length <= 5?
  yes -> 4 of 9 people (entropy 0.8113)
  no  -> 5 of 9 people (entropy 0.9710)

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight

Entropy(4F, 5M) = 0.9911

Weight <= 160?
  yes -> 5 of 9 people (entropy 0.7219)
  no  -> 4 of 9 people (entropy 0)

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age

Entropy(4F, 5M) = 0.9911

Age <= 40?
  yes -> 6 of 9 people (entropy 1)
  no  -> 3 of 9 people (entropy 0.9183)

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
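These three gains can be checked with a few lines of Python (a sketch; the per-branch class counts, written as [F, M], come from the table on the earlier slide):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

parent = [4, 5]                            # 4 F, 5 M
print(gain(parent, [[1, 3], [3, 2]]))      # Hair Length <= 5: 0.0911
print(gain(parent, [[4, 1], [0, 4]]))      # Weight <= 160:    0.5900
print(gain(parent, [[3, 3], [1, 2]]))      # Age <= 40:        0.0183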
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified... so we simply recurse!

Weight <= 160?
  yes -> Hair Length <= 2?
  no  -> (all male)

This time we find that we can split on Hair Length, and we are done!
Weight <= 160?
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
  no  -> Male

We don't need to keep the data around, just the test conditions.

How would these people be classified?
It is trivial to convert Decision Trees to rules...

Weight <= 160?
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
  no  -> Male

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female

Note: we could avoid the use of "else if" by specifying all test conditions on the path from the root to the corresponding leaf.
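The rule list transcribes directly into code; for instance (a trivial sketch):

def classify(weight, hair_length):
    # The learned tree, expressed as if/elif/else rules.
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=290, hair_length=8))  # the "Comic" row -> Male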
Once we have learned the decision tree, we don't even need a computer!

[Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

This decision tree is attached to a medical machine and is designed to help nurses make decisions about what type of doctor to call.
Wears green?
  Yes -> Male
  No  -> Female

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes?"...
GINI is Another Measure of Impurity

Gini for a given node t with classes j:

GINI(t) = 1 - \sum_j [p(j|t)]^2

NOTE: p(j|t) is again computed as the relative frequency of class j at node t.

Compute the best split by finding the partition that yields the lowest GINI, where we again take the weighted average of the children's GINI values.

C1: 0, C2: 6  ->  Gini = 0.000
C1: 1, C2: 5  ->  Gini = 0.278
C1: 2, C2: 4  ->  Gini = 0.444
C1: 3, C2: 3  ->  Gini = 0.500

Worst GINI = 0.5; best GINI = 0.0.
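A sketch of the computation (reproducing the four example nodes):

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.000
print(gini([1, 5]))  # 0.278
print(gini([2, 4]))  # 0.444
print(gini([3, 3]))  # 0.500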
Splitting Criteria Based on Classification Error

Classification error at a node t:

Error(t) = 1 - \max_i P(i|t)

Measures the misclassification error made by a node:
Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information. This is 1/2 for 2-class problems.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
Examples for Computing Error

Error(t) = 1 - \max_i P(i|t)

Equivalently: predict the majority class and determine the fraction of errors.

Node with C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

Node with C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

Node with C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
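And the corresponding one-liner (a sketch):

def classification_error(counts):
    # Error(t) = 1 - max_i P(i|t): the fraction not in the majority class.
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))  # 0.0
print(classification_error([1, 5]))  # 0.1667 (= 1/6)
print(classification_error([2, 4]))  # 0.3333 (= 1/3)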
Complete Example Using Error Rate

The initial sample has 3 C1 and 15 C2.
Based on one 3-way split, you get the three child nodes below:

C1: 0, C2: 6
C1: 2, C2: 4
C1: 1, C2: 5

What is the error rate initially?
What is it afterwards?
What is the decrease in error rate?
As usual you need to take the weighted average (but there is a shortcut).
Error Rate Example Continued

Error rate before: 3/18

Error rate after:

Shortcut:
Number of errors = 0 + 2 + 1 = 3, out of 18 examples
Error rate = 3/18

Weighted average method:
6/18 x 0 + 6/18 x 2/6 + 6/18 x 1/6, which simplifies to 2/18 + 1/18 = 3/18

So the split yields no decrease in error rate at all, even though the children are purer than the parent.
Comparison Among Splitting Criteria

For a 2-class problem:

[Figure: entropy, Gini, and misclassification error plotted against the fraction p of examples in one class; all three peak at p = 0.5 and are 0 at p = 0 and p = 1, with entropy highest and error rate lowest in between.]
Discussion

Error rate is often the metric used to evaluate a classifier (but not always), so it seems reasonable to use error rate to determine the best split. That is, why not just use a splitting metric that matches the ultimate evaluation metric?

But this is wrong! The reason is related to the fact that decision trees use a greedy strategy, so we need a splitting metric that leads to globally better results. The error-rate example above showed a 3-way split that leaves the children purer yet yields zero reduction in error rate; entropy and Gini still give such a split credit.

The other metrics will empirically outperform error rate, although there is no proof of this.
How to Specify Test Condition?

Depends on attribute type:
Nominal
Ordinal
Continuous

Depends on the number of ways to split:
2-way split
Multi-way split
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values.

CarType -> Family | Sports | Luxury

Binary split: divides values into two subsets; need to find the optimal partitioning.

CarType -> {Sports, Luxury} | {Family}   OR   CarType -> {Family, Luxury} | {Sports}
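For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary partitions. A sketch that enumerates them (my own helper, not from the slides):

from itertools import combinations

def binary_partitions(values):
    # Yield each two-subset partition of a set of nominal values exactly once.
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    # Fixing one value on the left side avoids mirror-image duplicates.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            yield left, set(values) - left

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs", sorted(right))
# ['Family'] vs ['Luxury', 'Sports']
# ['Family', 'Luxury'] vs ['Sports']
# ['Family', 'Sports'] vs ['Luxury']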
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values.

Size -> Small | Medium | Large

Binary split: divides values into two subsets; need to find the optimal partitioning.

Size -> {Small, Medium} | {Large}   OR   Size -> {Medium, Large} | {Small}

What about this split? Size -> {Small, Large} | {Medium}
It is normally disallowed, because grouping Small with Large violates the ordering of the values.
Splitting Based on Continuous Attributes

Different ways of handling:
Discretization to form an ordinal categorical attribute
  Static: discretize once at the beginning
  Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Binary decision: (A < v) or (A >= v)
  Consider all possible splits and find the best cut
  Can be more compute intensive
Splitting Based on Continuous Attributes

(i) Binary split:  Taxable Income > 80K?  (Yes / No)
(ii) Multi-way split:  Taxable Income?  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
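A sketch of the two static bucketing schemes named above, assuming numpy is available and using the taxable-income values from the earlier training table:

import numpy as np

incomes = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # in K

# Equal-interval bucketing: four bins with evenly spaced edges.
equal_width_edges = np.linspace(incomes.min(), incomes.max(), num=5)
print(equal_width_edges)          # [ 60. 100. 140. 180. 220.]

# Equal-frequency bucketing: edges at percentiles, so each bin holds
# roughly the same number of records.
equal_freq_edges = np.percentile(incomes, [0, 25, 50, 75, 100])
print(equal_freq_edges)           # [ 60.   77.5  92.5 115.  220. ]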
Data Fragmentation

The number of instances gets smaller as you traverse down the tree, and the number of instances at the leaf nodes could be too small to make a statistically significant decision.

Decision trees can suffer from data fragmentation. This is especially true if there are many features and not too many examples.

True or False: all classification methods may suffer from data fragmentation. False: not logistic regression or instance-based learning. Fragmentation only applies to divide-and-conquer methods.
Expressiveness

Expressiveness relates to the flexibility of the classifier in forming decision boundaries.
Linear models are not that expressive, since they can only form linear boundaries.
Decision tree models can form rectangular regions.

Which is more expressive and why? Decision trees, because they can form many regions; but decision trees do have the limitation of only forming axis-parallel boundaries.

Decision trees do not generalize well to certain types of functions (like parity, which depends on all features): for accurate modeling, you must have a complete tree.
They are also not expressive enough for modeling continuous variables, especially when more than one variable is involved at a time.
Decision Boundary

[Figure: unit square of points from two classes, with x and y axes running from 0 to 1, partitioned by the tree below into axis-parallel rectangles. Leaf labels give the counts of the two classes.]

x < 0.43?
  Yes -> y < 0.47?
           Yes -> (4, 0)
           No  -> (0, 4)
  No  -> y < 0.33?
           Yes -> (0, 3)
           No  -> (4, 0)

• The border line between two neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees

[Figure: a single oblique test, x + y < 1, splitting the plane diagonally between Class = + and the other class.]

This special type of decision tree avoids some weaknesses and increases the expressiveness of decision trees.
This is not what we mean when we refer to decision trees (e.g., on an exam).
Tree Replication

[Figure: a tree rooted at P with branches leading to Q and R, in which the same subtree (a test on S with leaves 0 and 1) appears in more than one branch.]

This can be viewed as a weakness of decision trees, but it is really a minor issue.
Pros and Cons of Decision Trees

Advantages:
Easy to understand: you can get a global view of what is going on and also explain individual decisions
Can generate rules from them
Fast to build and apply
Can handle redundant and irrelevant features and missing values

Disadvantages:
Limited expressive power
May suffer from overfitting, and a validation set may be necessary to avoid it
More to Come on Decision Trees

We have covered most of the essential aspects of decision trees except pruning.
We will cover pruning next and, more generally, overfitting avoidance.
We will also cover evaluation, which applies not just to decision trees but to all predictive models.