
Page 1

Decision Tree Problems

CSE-391: Artificial Intelligence
University of Pennsylvania

Matt Huenerfauth
April 2005

Page 2

Homework 7

• Perform some entropy and information gain calculations.
  – We'll also do some information gain ratio calculations in class today. You don't need to do these on the midterm, but you should understand generally how the ratio is calculated, and you should know when we should use this metric.
• Use the C4.5 decision tree learning software.
  – You'll learn trees to do word sense disambiguation.

• Read chapter 18.1 – 18.3.

Page 3

Looking at some data

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Small  Irregular  +
Yellow  Large  Irregular  +

Page 4

Calculate Entropy

• For many of the tree-building calculations we do today, we'll need to know the entropy of a data set.
  – Entropy is the degree to which a data set is mixed up: how much variety of classifications (+/-) is still in the set.
  – For example, a set that is still 50/50 +/- classified will have an entropy of 1.0.
  – A set that's all + or all - will have an entropy of 0.0.

Page 5

Entropy Calculations: I()

If we have a set with k different values in it, we can calculate the entropy as follows:

    entropy(Set) = I(Set) = - sum over i = 1..k of P(value_i) * log2( P(value_i) )

where P(value_i) is the probability of getting the i-th value when randomly selecting one from the set.

So, for the set R = {a,a,a,b,b,b,b,b}:

    entropy(R) = I(R) = -(3/8)*log2(3/8) - (5/8)*log2(5/8)
                          a-values          b-values
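As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the I() calculation; the function name entropy() and the use of collections.Counter are my own choices, not part of C4.5.

    from collections import Counter
    from math import log2

    def entropy(values):
        """I(Set) = -sum over the k distinct values of P(value_i) * log2(P(value_i))."""
        counts = Counter(values)           # how many times each distinct value occurs
        total = len(values)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    # The set R = {a,a,a,b,b,b,b,b} from this slide:
    R = list("aaabbbbb")
    print(entropy(R))                      # -(3/8)*log2(3/8) - (5/8)*log2(5/8) ≈ 0.9544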

Page 6

Looking at some data

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Small  Irregular  +
Yellow  Large  Irregular  +

Page 7

Entropy for our data set

• 16 instances: 9 positive, 7 negative.

• The entropy of this set:

    entropy(all_data) = I(all_data) = -(9/16)*log2(9/16) - (7/16)*log2(7/16) = 0.9887

• This makes sense: it's almost a 50/50 split, so the entropy should be close to 1.
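As a quick check with the entropy() sketch from the Page 5 example (the labels list below is just the Edible? column of the table, written out by hand):

    # 9 positive and 7 negative examples, as counted above.
    labels = ['+'] * 9 + ['-'] * 7
    print(entropy(labels))     # -(9/16)*log2(9/16) - (7/16)*log2(7/16) ≈ 0.9887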

Page 8

How do we use this?

• The computer needs a way to decide how to build a decision tree.
  – First decision: what attribute should it use to 'branch on' at the root?
  – Recursively: what attribute should it use to 'branch on' at all subsequent nodes?
• Guideline: always branch on the attribute that will divide the data into subsets with as low an entropy as possible (that are as unmixed +/- as possible).

Page 9

Information Gain Metric: G()

• When we select an attribute to use as our branching criterion at the root, we've effectively split our data into two sets: the set that goes down the left branch and the set that goes down the right.
• If we know the entropy before we started, and we then calculate the entropy of each of these resulting subsets, we can calculate the information gain.

Page 10

Information Gain Metric: G()

• Why is reducing entropy a good idea?
  – Eventually we'd like our tree to distinguish data items into groups that are fine-grained enough that we can label them as being either + or -.
  – In other words, we'd like to separate our data in such a way that each group is as 'unmixed' in terms of +/- classifications as possible.
  – So, the ideal attribute to branch on at the root would be the one that can separate the data into an entirely + group and an entirely - one.

Page 11

Visualizing Information Gain

Branching on Size splits the full data set (the 16-example table from Page 3; entropy of the set = 0.9887) into two subsets:

Small branch, entropy = 0.8113 (from 8 examples):

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Small  Irregular  +

Large branch, entropy = 0.9544 (from 8 examples):

Color   Size   Shape      Edible?
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Irregular  +

Page 12

Visualizing Information Gain

Branching on Size: the parent node holds all 16 examples (entropy = 0.9887); 8 examples with 'small' go down the left branch and 8 examples with 'large' go down the right.

The data set that goes down each branch of the tree has its own entropy value. We can calculate, for each possible attribute, its expected entropy: the entropy we would expect to be left in the children if we branch on this attribute. You add the entropies of the two children, weighted by the proportion of examples from the parent node that ended up at each child.

Entropy of the left child: I(size=small) = 0.8113 (8 examples)
Entropy of the right child: I(size=large) = 0.9544 (8 examples)

    expected_entropy(size) = I(size) = (8/16)*0.8113 + (8/16)*0.9544 = 0.8829
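A sketch of the same weighted-average calculation in Python, reusing the entropy() function from the Page 5 example. Representing each example as a dict keyed by column name, and the names berries and expected_entropy, are my own choices for illustration:

    # The 16 examples from the data table, one tuple per row.
    columns = ('Color', 'Size', 'Shape', 'Edible?')
    rows = [
        ('Yellow', 'Small', 'Round', '+'), ('Yellow', 'Small', 'Round', '-'),
        ('Green', 'Small', 'Irregular', '+'), ('Green', 'Large', 'Irregular', '-'),
        ('Yellow', 'Large', 'Round', '+'), ('Yellow', 'Small', 'Round', '+'),
        ('Yellow', 'Small', 'Round', '+'), ('Yellow', 'Small', 'Round', '+'),
        ('Green', 'Small', 'Round', '-'), ('Yellow', 'Large', 'Round', '-'),
        ('Yellow', 'Large', 'Round', '+'), ('Yellow', 'Large', 'Round', '-'),
        ('Yellow', 'Large', 'Round', '-'), ('Yellow', 'Large', 'Round', '-'),
        ('Yellow', 'Small', 'Irregular', '+'), ('Yellow', 'Large', 'Irregular', '+'),
    ]
    berries = [dict(zip(columns, row)) for row in rows]

    def expected_entropy(examples, attribute, target='Edible?'):
        """Entropy of the children after splitting on `attribute`, weighted by child size."""
        total = len(examples)
        result = 0.0
        for value in set(ex[attribute] for ex in examples):
            child = [ex for ex in examples if ex[attribute] == value]
            result += (len(child) / total) * entropy([ex[target] for ex in child])
        return result

    print(expected_entropy(berries, 'Size'))   # (8/16)*0.8113 + (8/16)*0.9544 ≈ 0.8829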

Page 13

G(attrib) = I(parent) - I(attrib)

We want to calculate the information gain (or entropy reduction). This is the reduction in 'uncertainty' from choosing 'size' as our first branch. We will represent information gain as "G."

Entropy of all the data at the parent node: I(parent) = 0.9887
Expected entropy of the children for the 'size' split: I(size) = 0.8829

    G(size) = I(parent) - I(size) = 0.9887 - 0.8829 = 0.1058

So, we have gained 0.1058 bits of information about the dataset by choosing 'size' as the first branch of our decision tree.
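In code, the gain is just this subtraction (a sketch reusing the entropy() and expected_entropy() helpers and the berries list from the earlier examples):

    def information_gain(examples, attribute, target='Edible?'):
        """G(attribute) = I(parent) - expected entropy after splitting on `attribute`."""
        parent = entropy([ex[target] for ex in examples])
        return parent - expected_entropy(examples, attribute, target)

    print(information_gain(berries, 'Size'))   # 0.9887 - 0.8829 ≈ 0.1058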

Page 14

Using Information Gain

• For each of the attributes we're thinking about branching on, and for all of the data that will reach this node (which is all of the data when at the root), do the following:
  – Calculate the Information Gain if we were to split the current data on this attribute.
• In the end, select the attribute with the greatest Information Gain to split on.
• Create two subsets of the data (one for each branch of the tree), and recurse on each branch.
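A compact sketch of the loop just described, reusing the helpers defined in the earlier examples. The nested-dict tree representation, the majority-label fallback, and the name build_tree are my own simplifications; this is not the C4.5 implementation:

    def build_tree(examples, attributes, target='Edible?'):
        """Recursively branch on the attribute with the greatest information gain."""
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:                  # pure node: all + or all -
            return labels[0]
        if not attributes:                         # no attributes left: use the majority label
            return max(set(labels), key=labels.count)
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        remaining = [a for a in attributes if a != best]
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            child = [ex for ex in examples if ex[best] == value]
            tree[best][value] = build_tree(child, remaining, target)
        return tree

    print(build_tree(berries, ['Color', 'Size', 'Shape']))   # branches on Size at the root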

Page 15

Showing the calculations

• Calculate the information gain for color, size, and shape.

• Select the one with the greatest info gain value as the attribute we’ll branch on at the root.

• Now imagine what our data set will look like on each side of the branch.

• We would then recurse on each of these data sets to select how to branch below.

Page 16

Our Data Table

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Small  Irregular  +
Yellow  Large  Irregular  +

Page 17

Sequence of Calculations

• Calculate I(parent). This is the entropy of the data set before the split. Since we're at the root, this is simply the entropy of all the data.

    I(all_data) = (-9/16)*log2(9/16) + (-7/16)*log2(7/16)

• Next, calculate I() for the subset of the data where color=green and for the subset where color=yellow.

    I(color=green)  = (-1/3)*log2(1/3) + (-2/3)*log2(2/3)
    I(color=yellow) = (-8/13)*log2(8/13) + (-5/13)*log2(5/13)

• Now calculate the expected entropy for 'color':

    I(color) = (3/16)*I(color=green) + (13/16)*I(color=yellow)

• Finally, the information gain for 'color':

    G(color) = I(parent) - I(color)
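The same sequence written as a short check, reusing entropy() and the berries list from the earlier sketches (values in the comments are rounded):

    parent = entropy([ex['Edible?'] for ex in berries])       # I(parent) ≈ 0.9887
    green = entropy([ex['Edible?'] for ex in berries
                     if ex['Color'] == 'Green'])              # I(color=green) ≈ 0.9183
    yellow = entropy([ex['Edible?'] for ex in berries
                      if ex['Color'] == 'Yellow'])            # I(color=yellow) ≈ 0.9612
    i_color = (3/16) * green + (13/16) * yellow               # I(color) ≈ 0.9532
    g_color = parent - i_color                                # G(color) ≈ 0.0355
    print(g_color)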

Page 18

Calculations

• I(all_data) = .9887

• I(size) = .8829;  G(size) = .1058
    size=small: +6, -2; I(size=small) = .8113
    size=large: +3, -5; I(size=large) = .9544

• I(color) = .9532;  G(color) = .0355
    color=green: +1, -2; I(color=green) = .9183
    color=yellow: +8, -5; I(color=yellow) = .9612

• I(shape) = .9528;  G(shape) = .0359
    shape=round: +6, -6; I(shape=round) = 1.0
    shape=irregular: +3, -1; I(shape=irregular) = .8113

Page 19

Visualizing the Recursive Step

Now that we have split on a particular feature, we delete that feature from the set considered at the next layer. Since this effectively gives us a 'new' smaller dataset, with one less feature, at each of these child nodes, we simply apply the same entropy calculation procedures recursively for each child.

Branching on Size gives two children; the Size column is removed from each:

Small child:
Color   Shape      Edible?
Yellow  Round      +
Yellow  Round      -
Green   Irregular  +
Yellow  Round      +
Yellow  Round      +
Yellow  Round      +
Green   Round      -
Yellow  Irregular  +

Large child:
Color   Shape      Edible?
Green   Irregular  -
Yellow  Round      +
Yellow  Round      -
Yellow  Round      +
Yellow  Round      -
Yellow  Round      -
Yellow  Round      -
Yellow  Irregular  +
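In the list-of-dicts sketch used earlier, 'deleting the feature' can be done by leaving that key out of each child's rows (a minimal illustration; the build_tree sketch from Page 14 gets the same effect by shrinking the attribute list instead of the rows):

    # Split the berries data on Size, then drop the Size column from each child.
    small_child = [{k: v for k, v in ex.items() if k != 'Size'}
                   for ex in berries if ex['Size'] == 'Small']
    large_child = [{k: v for k, v in ex.items() if k != 'Size'}
                   for ex in berries if ex['Size'] == 'Large']
    print(len(small_child), len(large_child))   # 8 8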

Page 20

Calculations

• Entropy of this whole set (+6, -2): 0.8113

• I(color) = .7375;  G(color) = .0738
    color=yellow: +5, -1; I(color=yellow) = 0.65
    color=green: +1, -1; I(color=green) = 1.0

• I(shape) = .6887;  G(shape) = .1226
    shape=round: +4, -2; I(shape=round) = .9183
    shape=irregular: +2, -0; I(shape=irregular) = 0

Data at this node:
Color   Shape      Edible?
Yellow  Round      +
Yellow  Round      -
Green   Irregular  +
Yellow  Round      +
Yellow  Round      +
Yellow  Round      +
Green   Round      -
Yellow  Irregular  +
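These gains can be checked with the information_gain() sketch by recursing on the small_child list built on Page 19:

    print(information_gain(small_child, 'Color'))   # 0.8113 - 0.7375 ≈ 0.0738
    print(information_gain(small_child, 'Shape'))   # 0.8113 - 0.6887 ≈ 0.1226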

Page 21

Binary Data

• Sometimes most of our attributes are binary-valued or have a low number of possible values (like the berry example).
  – In this case, the information gain metric is appropriate for selecting which attribute to use to branch at each node.
• When we have some attributes with very many values, there is another metric which is better to use.

Page 22

Information Gain Ratio: GR()

• The information gain metric has a bias toward branching on attributes that have very many possible values.

• To combat this bias, we use a different branching-attribute selection metric, which is called the "Information Gain Ratio": GR(size).

Page 23

Formula for Info Gain Ratio

• The formula for Information Gain Ratio:

    GR(attribute) = G(attribute) / ( - sum over v in Values(attribute) of P(v) * log2( P(v) ) )

• P(v) is the proportion of the values of this attribute that are equal to v.
  – Note: we're not counting +/- in this case. We're counting the values in the 'attribute' column.
• Let's use the information gain ratio metric to select the best attribute to branch on.
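A sketch of GR() built on the earlier helpers. The names split_info and gain_ratio are mine; note that the denominator is just the entropy of the attribute column itself:

    def split_info(examples, attribute):
        """-sum over values v of the attribute of P(v) * log2(P(v))."""
        return entropy([ex[attribute] for ex in examples])

    def gain_ratio(examples, attribute, target='Edible?'):
        """GR(attribute) = G(attribute) / split information of the attribute."""
        # Assumes the attribute takes more than one value here, so the denominator is non-zero.
        return information_gain(examples, attribute, target) / split_info(examples, attribute)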

Page 24

Calculation of GR()

• GR(size) = G(size) / Sum(…)

• GR(size) = .1058;  G(size) = .1058
    8 occurrences of size=small; 8 occurrences of size=large.
    Sum(…) = (-8/16)*log2(8/16) + (-8/16)*log2(8/16) = 1

• GR(color) = .0510;  G(color) = .0355
    13 occurrences of color=yellow; 3 occurrences of color=green.
    Sum(…) = (-3/16)*log2(3/16) + (-13/16)*log2(13/16) = .6962

• GR(shape) = .0442;  G(shape) = .0359
    12 occurrences of shape=round; 4 occurrences of shape=irregular.
    Sum(…) = (-12/16)*log2(12/16) + (-4/16)*log2(4/16) = .8113
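The same three numbers, computed with the gain_ratio() sketch on the berries list:

    for attribute in ('Size', 'Color', 'Shape'):
        print(attribute, round(gain_ratio(berries, attribute), 4))
    # Size 0.1058, Color 0.051, Shape 0.0442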

Page 25

Selecting the root

• Same as before, but now instead of selecting the attribute with the highest information gain, we select the one with the highest information gain ratio.

• We will use this attribute to branch at the root.

Page 26

Data Subsets / Recursive Step

• Same as before.
• After we select an attribute for the root, we partition the data set into subsets, and then remove that attribute from consideration for the subsets below its node.
• Now, we recurse. We calculate what each of our subsets will be down each branch.
  – We recursively calculate the info gain ratio for all the attributes on each of these data subsets in order to select how the tree will branch below the root.
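In the build_tree sketch from Page 14, only the attribute-selection line changes. A hypothetical variant that takes the scoring metric as a parameter (it reuses gain_ratio and berries from the earlier sketches):

    def build_tree_with(score, examples, attributes, target='Edible?'):
        """Same recursion as build_tree, but rank attributes with the supplied metric."""
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:
            return labels[0]
        if not attributes:
            return max(set(labels), key=labels.count)
        best = max(attributes, key=lambda a: score(examples, a, target))
        remaining = [a for a in attributes if a != best]
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            child = [ex for ex in examples if ex[best] == value]
            tree[best][value] = build_tree_with(score, child, remaining, target)
        return tree

    # Select branches by information gain ratio instead of information gain:
    print(build_tree_with(gain_ratio, berries, ['Color', 'Size', 'Shape']))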

Page 27

Recursively

• Entropy of this whole set (+6, -2): 0.8113

• G(color) = .0738;  GR(color) = .0909
    color=yellow: +5, -1; I(color=yellow) = 0.65
    color=green: +1, -1; I(color=green) = 1.0

• G(shape) = .1226;  GR(shape) = .1511
    shape=round: +4, -2; I(shape=round) = .9183
    shape=irregular: +2, -0; I(shape=irregular) = 0

Data at this node:
Color   Shape      Edible?
Yellow  Round      +
Yellow  Round      -
Green   Irregular  +
Yellow  Round      +
Yellow  Round      +
Yellow  Round      +
Green   Round      -
Yellow  Irregular  +
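And the gain ratios for this node, using the gain_ratio() sketch on the same small_child list:

    print(gain_ratio(small_child, 'Color'))   # 0.0738 / 0.8113 ≈ 0.0909
    print(gain_ratio(small_child, 'Shape'))   # 0.1226 / 0.8113 ≈ 0.1511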