
Page 1: Data Mining – Algorithms: Decision Trees - ID3

Data Mining – Algorithms: Decision Trees - ID3

Chapter 4, Section 4.3

Page 2: Data Mining – Algorithms: Decision Trees - ID3

Common Recursive Decision Tree Induction Pseudo Code

• If all instances have same classification
  – Stop, create leaf with that classification
• Else
  – Choose attribute for root
  – Make one branch for each possible value (or range if numeric attribute)
  – For each branch
    • Recursively call same method
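A minimal runnable sketch of this recursion in Python (my code, not the book's). choose_attribute is an assumed helper that picks the split attribute, for example using the information gain measure discussed later.

def build_tree(instances, attributes, choose_attribute):
    classes = [inst["class"] for inst in instances]
    # If all instances have the same classification: stop, create a leaf
    if len(set(classes)) == 1:
        return {"leaf": classes[0]}
    # If no attributes remain, fall back to the majority class
    if not attributes:
        return {"leaf": max(set(classes), key=classes.count)}
    # Else: choose an attribute for the root ...
    attr = choose_attribute(instances, attributes)
    node = {"attribute": attr, "branches": {}}
    remaining = [a for a in attributes if a != attr]
    # ... make one branch for each value seen, and recursively call the same method
    for value in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == value]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node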

Page 3: Data Mining – Algorithms: Decision Trees - ID3

Choosing an Attribute

• Key to how this works
• Main difference between many decision tree learning algorithms
• First, let's look at intuition

– We want a small tree

– Therefore, we prefer attributes that divide instances well

– A perfect attribute to use is one for which each value is associated with only one class (e.g. if all rainy days play=no and all sunny days play=yes and all overcast days play=yes)

– We don't often get a perfect attribute very high in the tree

– So, we need a way to measure relative purity

Page 4: Data Mining – Algorithms: Decision Trees - ID3

Relative Purity – My Weather

• Intuitively, it looks like outlook creates more purity, but computers can’t use intuition

• ID3 (a famous early algorithm) measured purity based on "information theory" (Shannon 1948)

Temperature split:
  Hot:  no, yes, no, no
  Mild: no, yes, no, yes, yes, no
  Cool: no, no, yes, yes

Outlook split:
  Sunny:    no, yes, yes, yes, yes
  Overcast: no, yes, yes, no
  Rainy:    no, no, no, no, no

Page 5: Data Mining – Algorithms: Decision Trees - ID3

Information Theory

• A piece of information is more valuable if it adds to what is already known
  – If all instances in a group were known to be all in the same class, then the information value of being told the class of a particular instance is Zero
  – If instances are evenly split between classes, then the information value of being told the class of a particular instance is Maximized
  – Hence more purity = less information
  – Information theory measures the value of information using "entropy", which is measured in "bits"

Page 6: Data Mining – Algorithms: Decision Trees - ID3

Entropy

• Derivation of formula is beyond our scope

• Calculation:

  entropy = (-cnt1·log2(cnt1) - cnt2·log2(cnt2) - cnt3·log2(cnt3) - ... + totalcnts·log2(totalcnts)) / totalcnts

  where cnt1, cnt2, ... are the counts of the number of instances in each class and totalcnts is the total of those counts

<see entropy.xls – show how it goes up with less purity>

Page 7: Data Mining – Algorithms: Decision Trees - ID3

Entropy calculated in Excel (empty class columns give #NUM! from log2 of a blank cell, so a "careful" row zeroes those terms out):

Example 1 – counts [1, 7], total 8:
  entropy = (-1·log2(1) - 7·log2(7) + 8·log2(8)) / 8 = (0 - 19.65 + 24) / 8 = 0.544 bits

Example 2 – counts [2, 3, 4], total 9:
  entropy = (-2·log2(2) - 3·log2(3) - 4·log2(4) + 9·log2(9)) / 9 = (-2 - 4.75 - 8 + 28.53) / 9 = 1.530 bits
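A small Python version of the count-based formula above (my own sketch, not code from the course), reproducing the two spreadsheet examples:

from math import log2

def entropy(counts):
    # counts = number of instances in each class, e.g. [1, 7]
    total = sum(counts)
    # classes with zero instances contribute nothing (log2(0) is undefined),
    # which is what the "careful" row in the spreadsheet handles
    terms = sum(-c * log2(c) for c in counts if c > 0)
    return (terms + total * log2(total)) / total

print(entropy([1, 7]))     # ~0.544 bits
print(entropy([2, 3, 4]))  # ~1.530 bits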

Page 8: Data Mining – Algorithms: Decision Trees - ID3

Information Gain

• The amount by which entropy is reduced as a result of dividing

• This is the deciding measure for ID3
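A sketch of that measure in Python, building on the entropy() function above; representing a split as a list of per-branch class-count lists is my assumption, not the book's.

def info_gain(parent_counts, split_counts):
    # parent_counts: class counts before dividing, e.g. [5, 8]
    # split_counts:  one list of class counts per branch after dividing,
    #                e.g. [[4, 1], [2, 2], [0, 4]] for outlook
    total = sum(parent_counts)
    weighted = sum(sum(branch) / total * entropy(branch)
                   for branch in split_counts)
    return entropy(parent_counts) - weighted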

Page 9: Data Mining – Algorithms: Decision Trees - ID3

Example: My Weather (Nominal)

Outlook   Temp  Humid   Windy  Play?
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   yes
overcast  hot   high    FALSE  no
rainy     mild  high    FALSE  no
rainy     cool  normal  FALSE  no
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  yes
sunny     cool  normal  FALSE  yes
rainy     mild  normal  FALSE  no
sunny     mild  normal  TRUE   yes
overcast  mild  high    TRUE   yes
overcast  hot   normal  FALSE  no
rainy     mild  high    TRUE   no
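For reference, the same table as Python data, in the shape the build_tree sketch above expects (an assumed in-memory representation; WEKA itself reads this as an ARFF file):

from collections import Counter

columns = ["outlook", "temp", "humid", "windy", "class"]
rows = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "yes"),
    ("overcast", "hot",  "high",   False, "no"),
    ("rainy",    "mild", "high",   False, "no"),
    ("rainy",    "cool", "normal", False, "no"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "yes"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "no"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "no"),
    ("rainy",    "mild", "high",   True,  "no"),
]
weather = [dict(zip(columns, r)) for r in rows]

print(Counter(r["class"] for r in weather))  # Counter({'no': 8, 'yes': 6})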

Page 10: Data Mining – Algorithms: Decision Trees - ID3

Let's make this a little more realistic than the book does

• this will be cross validated

• Normally 10-fold is used, but with 14 instances that is a little awkward

• For each fold, we will divide the data into training and test data

• This time through, let’s save the last record as a test

Page 11: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for all training instances (5 yes, 8 no) = .96

• Entropy for Outlook division = weighted average of Nodes created by division =

5/13 * .72 (entropy [4,1])

+ 4/13 * 1 (entropy [2,2])

+ 4/13 * 0 (entropy [0,4])

= .585

• Info Gain = .96 - .585 = .375
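Reproducing that arithmetic with the entropy() sketch from earlier (my check, not part of the slides):

weighted = (5/13 * entropy([4, 1])
            + 4/13 * entropy([2, 2])
            + 4/13 * entropy([0, 4]))
print(round(weighted, 3))                    # 0.585
print(round(entropy([5, 8]) - weighted, 3))  # 0.376 (the slide rounds .96 - .585 to .375)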

Page 12: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Temperature division = weighted average of Nodes created by division =

4/13 * .81 (entropy [1,3])

+ 5/13 * .97 (entropy [3,2])

+ 4/13 * 1.0 (entropy [2,2])

= .931

• Info Gain = .96 - .931 = .029

Page 13: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Humidity division = weighted average of Nodes created by division =

6/13 * 1 (entropy [3,3])

+ 7/13 * .98 (entropy [3,4])

= .992

• Info Gain = .96 - .992 = -0.032

Page 14: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Windy division = weighted average of Nodes created by division =

8/13 * .81 (entropy [2,6])

+ 5/13 * .72 (entropy [4,1])

= .777

• Info Gain = .96 - .777 = .183

• Biggest Gain is via Outlook
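A quick comparison of all four candidate splits using the info_gain() sketch from earlier; the class counts per branch are the ones on the preceding slides:

splits = {
    "outlook":     [[4, 1], [2, 2], [0, 4]],
    "temperature": [[1, 3], [3, 2], [2, 2]],
    "humidity":    [[3, 3], [3, 4]],
    "windy":       [[2, 6], [4, 1]],
}
gains = {name: info_gain([5, 8], branches) for name, branches in splits.items()}
print(gains)                      # outlook ~.38, temperature ~.03, humidity ~-.03, windy ~.18
print(max(gains, key=gains.get))  # outlook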

Page 15: Data Mining – Algorithms: Decision Trees - ID3

Recursive Tree Building

• On sunny instances, will consider other attributes

Page 16: Data Mining – Algorithms: Decision Trees - ID3

Example: My Weather (Nominal)

Outlook  Temp  Humid   Windy  Play?
sunny    hot   high    FALSE  no
sunny    hot   high    TRUE   yes
sunny    mild  high    FALSE  yes
sunny    cool  normal  FALSE  yes
sunny    mild  normal  TRUE   yes

Page 17: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for all sunny training instances (4 yes, 1 no) = .72

• Outlook does not have to be considered because it has already been used

Page 18: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Temperature division = weighted average of Nodes created by division =

2/5 * 1.0 (entropy [1,1])

+ 2/5 * 0.0 (entropy [2,0])

+ 1/5 * 0.0 (entropy [1,0])

= .400

• Info Gain = .72 - .4 = .32

Page 19: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Humidity division = weighted average of Nodes created by division =

3/5 * .918 (entropy [2,1])

+ 2/5 * .0 (entropy [2,0])

= .551

• Info Gain = .72 - .55 = 0.17

Page 20: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Windy division = weighted average of Nodes created by division =

3/5 * .918 (entropy [2,1])

+ 2/5 * 0 (entropy [2,0])

= .551

• Info Gain = .72 - .55 = .17

• Biggest Gain is via Temperature
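The same check on the 5 sunny training instances (4 yes, 1 no), reusing info_gain() from earlier:

print(info_gain([4, 1], [[1, 1], [2, 0], [1, 0]]))  # temperature ~0.32
print(info_gain([4, 1], [[2, 1], [2, 0]]))          # humidity    ~0.17
print(info_gain([4, 1], [[2, 1], [2, 0]]))          # windy       ~0.17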

Page 21: Data Mining – Algorithms: Decision Trees - ID3

Tree So Far

outlook = sunny
|  temp = hot:  1 yes, 1 no
|  temp = mild: 2 yes
|  temp = cool: 1 yes
outlook = overcast: 2 yes, 2 no
outlook = rainy: 4 no

Page 22: Data Mining – Algorithms: Decision Trees - ID3

Recursive Tree Building

• On sunny, hot instances, will consider other attributes

Outlook  Temp  Humid  Windy  Play?
sunny    hot   high   FALSE  no
sunny    hot   high   TRUE   yes

Page 23: Data Mining – Algorithms: Decision Trees - ID3

Tree So Far

outlook = sunny
|  temp = hot
|  |  windy = TRUE:  1 yes
|  |  windy = FALSE: 1 no
|  temp = mild: 2 yes
|  temp = cool: 1 yes
outlook = overcast: 2 yes, 2 no
outlook = rainy: 4 no

Page 24: Data Mining – Algorithms: Decision Trees - ID3

Recursive Tree Building

• On overcast instances, will consider other attributes

Page 25: Data Mining – Algorithms: Decision Trees - ID3

Example: My Weather (Nominal)

Outlook   Temp  Humid   Windy  Play?
overcast  hot   high    FALSE  no
overcast  cool  normal  TRUE   yes
overcast  mild  high    TRUE   yes
overcast  hot   normal  FALSE  no

Page 26: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for all overcast training instances (2 yes, 2 no) = 1.0

• Outlook does not have to be considered because it has already been used

Page 27: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Temperature division = weighted average of Nodes created by division =

2/4 * 0.0 (entropy [0,2])

+ 1/4 * 0.0 (entropy [1,0])

+ 1/4 * 0.0 (entropy [1,0])

= .000

• Info Gain = 1.0 - 0.0 = 1.0

Page 28: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Humidity division = weighted average of Nodes created by division =

2/4 * 1.0 (entropy [1,1])

+ 2/4 * 1.0 (entropy [1,1])

= 1.0

• Info Gain = 1.0 – 1.0 = 0.0

Page 29: Data Mining – Algorithms: Decision Trees - ID3

Using Entropy to Divide

• Entropy for Windy division = weighted average of Nodes created by division =

2/4 * 0.0 (entropy [0,2])

+ 2/4 * 0.0 (entropy [2,0])

= 0.0

• Info Gain = 1.0 – 0.0 = 1.0

• Biggest Gain is tie between Temperature and Windy
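And for the overcast subset (2 yes, 2 no), showing the tie with the info_gain() sketch from earlier: temperature and windy each reduce the entropy to zero, while humidity leaves it unchanged.

print(info_gain([2, 2], [[0, 2], [1, 0], [1, 0]]))  # temperature 1.0
print(info_gain([2, 2], [[1, 1], [1, 1]]))          # humidity    0.0
print(info_gain([2, 2], [[0, 2], [2, 0]]))          # windy       1.0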

Page 30: Data Mining – Algorithms: Decision Trees - ID3

Tree So Far

outlook = sunny
|  temp = hot
|  |  windy = TRUE:  1 yes
|  |  windy = FALSE: 1 no
|  temp = mild: 2 yes
|  temp = cool: 1 yes
outlook = overcast
|  windy = TRUE:  2 yes
|  windy = FALSE: 2 no
outlook = rainy: 4 no

Page 31: Data Mining – Algorithms: Decision Trees - ID3

Test Instance

• Top of the tree checks outlook

• Test instance value = rainy

• Branch right

• Reach a leaf

• Predict “No” (which is correct)

Page 32: Data Mining – Algorithms: Decision Trees - ID3

In a 14-fold cross validation, this would continue 13 more times

• Let’s run WEKA on this …

Page 33: Data Mining – Algorithms: Decision Trees - ID3

WEKA results – first look near the bottom

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        9    64.2857 %
Incorrectly Classified Instances      2    14.2857 %
Unclassified Instances                3

• On the cross validation it got 9 out of 14 tests correct (the unclassified instances are tough to understand without seeing all of the trees that were built; it surprises me. They may do some work to avoid overfitting.)

Page 34: Data Mining – Algorithms: Decision Trees - ID3

More Detailed Results

=== Confusion Matrix ===

a b <-- classified as

2 1 | a = yes

1 7 | b = no

• Here we see – the program 3 times predicted play=yes, on 2 of those it was correct

•The program 8 times predicted play = no, on 7 of those it was correct

•There were 3 instances whose actual value was play=yes, the program correctly predicted that on 2 of them

•There were 8 instances whose actual value was play=no, the program correctly predicted that on 7 of them

•All of the unclassified instances were actually play=yes

Page 35: Data Mining – Algorithms: Decision Trees - ID3

Part of our purpose is to have a take-home message for humans

• Not 14 take home messages!

• So instead of reporting each of the things learned on each of the 14 training sets …

• … The program runs again on all of the data and builds a pattern for that – a take home message

Page 36: Data Mining – Algorithms: Decision Trees - ID3

WEKA - Take-Home

outlook = sunny
|  temperature = hot
|  |  windy = TRUE: yes
|  |  windy = FALSE: no
|  temperature = mild: yes
|  temperature = cool: yes
outlook = overcast
|  temperature = hot: no
|  temperature = mild: yes
|  temperature = cool: yes
outlook = rainy: no

•This is a decision tree!! Can you see it?

• This is almost the same as the tree we generated; we broke a tie a different way

•This is a fairly simple classifier – not as simple as with OneR – but it could be the take home message from running this algorithm on this data – if you are satisfied with the results!

Page 37: Data Mining – Algorithms: Decision Trees - ID3

Let's Try WEKA ID3 on njcrimenominal

• Try 10-fold

unemploy = hi: bad
unemploy = med
|  education = hi: ok
|  education = med
|  |  twoparent = hi: null
|  |  twoparent = med: bad
|  |  twoparent = low: ok
|  education = low
|  |  pop = hi: null
|  |  pop = med: ok
|  |  pop = low: bad
unemploy = low: ok

=== Confusion Matrix ===

  a  b   <-- classified as
  5  2 |  a = bad
  3 22 |  b = ok

Seems to noticeably improve on our very simple methods on this slightly more interesting dataset

Page 38: Data Mining – Algorithms: Decision Trees - ID3

Another Thing or Two

• Using this method, if an attribute is essentially a primary key – uniquely identifying the instances – dividing based on it will give the maximum information gain, because no further information is needed to determine the class (the entropy of every branch will be 0; see the sketch below this list)

• However, branching based on a key is not interesting, nor is it useful for predicting future, un-trained-on instances

• The more general idea is that the entropy measure favors splitting using attributes with more possible values

• A common adjustment is to use a “gain ratio” …
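A small illustration with the info_gain() sketch from earlier: a hypothetical key-like attribute that puts each of the 13 training instances in its own branch.

# Every single-instance branch is pure, so the gain equals the entire
# parent entropy -- the maximum possible
singleton_branches = [[1, 0]] * 5 + [[0, 1]] * 8
print(info_gain([5, 8], singleton_branches))  # ~0.96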

Page 39: Data Mining – Algorithms: Decision Trees - ID3

Gain Ratio

• Take info gain and divide by the "intrinsic info" (split info) of the split
• E.g. top split above

Attribute     Wt Avg Entropy   Info Gain   Split Info               Gain Ratio
Outlook       .585             .375        info([5,4,4]) = 1.577    .238
Temperature   .931             .029        info([4,5,4]) = 1.577    .018
Humidity      .992             -.032       info([6,7])   = .996     -.032
Windy         .777             .183        info([8,5])   = .961     .190
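A sketch of the gain ratio computation, reusing the entropy() and info_gain() sketches from earlier (my code, not WEKA's):

def gain_ratio(parent_counts, split_counts):
    # split info ("intrinsic info") is the entropy of the branch sizes themselves
    branch_sizes = [sum(branch) for branch in split_counts]
    return info_gain(parent_counts, split_counts) / entropy(branch_sizes)

print(gain_ratio([5, 8], [[4, 1], [2, 2], [0, 4]]))  # outlook ~0.238
print(gain_ratio([5, 8], [[2, 6], [4, 1]]))          # windy   ~0.19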

Page 40: Data Mining – Algorithms: Decision Trees - ID3

Still not a slam dunk

• Even this gets adjusted in some approaches to avoid yet other problems (see p 97)

Page 41: Data Mining – Algorithms: Decision Trees - ID3

ID3 in context

• Quinlan (1986) published ID3, the first major successful decision tree learner in machine learning

• He continued to improve the algorithm
  – His C4.5 was published in a book, and is available as J48 in WEKA
    • Improvements included dealing with numeric attributes, missing values, noisy data, and generating rules from trees (see Section 6.1)
  – His further efforts were commercial and proprietary instead of published in the research literature

• Probably almost every graduate student in Machine Learning starts out by writing a version of ID3 so that they FULLY understand it

Page 42: Data Mining – Algorithms: Decision Trees - ID3

Class Exercise

• ID3 cannot run on the japanbank data since it includes some numeric attributes

• Let’s run WEKA J48 on japanbank

Page 43: Data Mining – Algorithms: Decision Trees - ID3

End Section 4.3