Classification & Regression


Page 1: Classification & Regression

Page 2: Decision Trees

• Example of inductive learning
  – The process of learning by example, where a system tries to induce a general rule from a set of observed instances.
• Directed structure comprised of nodes
  – Each node specifies a test on an attribute
  – Each branch corresponds to an attribute value or condition
  – Leaves represent a class (or decision)
• Very wide application range


Page 3: Constructing Decision Trees

• Top-down, recursive, divide and conquer (a minimal code sketch follows this list):
  1. Select the best feature for the root node. Construct a branch for every possible value of that feature.
  2. Split the data into mutually exclusive subsets, one for each branch.
  3. Repeat this process recursively, using only the portion of the data arriving at each node.
  4. Stop when the training examples can be perfectly classified; create a leaf node with the class decision.
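A minimal sketch of this procedure in Python, assuming examples are (feature-dict, label) pairs and that a `score` function ranks candidate features (for example, the information gain defined later in these slides); the names `build_tree` and `score` are illustrative, not from the slides:

```python
from collections import Counter

def build_tree(rows, features, score):
    """Top-down, recursive, divide-and-conquer construction.

    rows     -- list of (feature_dict, label) training examples
    features -- feature names still available for splitting
    score    -- function(rows, feature) -> goodness of splitting on
                that feature (e.g. the information gain defined later)
    """
    labels = [y for _, y in rows]
    # Step 4: stop when the examples are perfectly classified (or no
    # features remain); create a leaf with the class decision.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the best feature for this node.
    best = max(features, key=lambda f: score(rows, f))
    # Steps 2-3: one branch per observed value, recursing on the
    # mutually exclusive subset of data arriving at each branch.
    remaining = [f for f in features if f != best]
    return {best: {v: build_tree([(x, y) for x, y in rows if x[best] == v],
                                 remaining, score)
                   for v in {x[best] for x, _ in rows}}}
```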


Page 4: How to Choose the Splitting Attribute?

• Information Gain (used in ID3, C4.5)

• Gain Ratio (used in C4.5)

• Gini Measure (used in CART)


Page 5: Determining the Best Split

• Greedy approach:
  – Choose the split whose child nodes have homogeneous class distributions.
  – Example: suppose we are analyzing a dataset to figure out whether people will wait outside a restaurant for food.


[Figure: two candidate splits of the restaurant data (classes Wait / Not wait) — "Rain outside" (Yes / No) versus "Type of food" (Chinese / Greek / Italian). The split with more homogeneous branches has a lower degree of impurity (lower entropy) and is the better attribute.]

Page 6: Weather Data


Outlook  | Temp | Humidity | Windy | Play?
---------|------|----------|-------|------
Sunny    | Hot  | High     | False | No
Sunny    | Hot  | High     | True  | No
Overcast | Hot  | High     | False | Yes
Rainy    | Mild | High     | False | Yes
Rainy    | Cool | Normal   | False | Yes
Rainy    | Cool | Normal   | True  | No
Overcast | Cool | Normal   | True  | Yes
Sunny    | Mild | High     | False | No
Sunny    | Cool | Normal   | False | Yes
Rainy    | Mild | Normal   | False | Yes
Sunny    | Mild | Normal   | True  | Yes
Overcast | Mild | High     | True  | Yes
Overcast | Hot  | Normal   | False | Yes
Rainy    | Mild | High     | True  | No

Page 7: Which Attribute to Select?


[Figure: class counts under each candidate split.]
outlook:     sunny → 2 Yes / 3 No;  overcast → 4 Yes / 0 No;  rainy → 3 Yes / 2 No
humidity:    high → 3 Yes / 4 No;   normal → 6 Yes / 1 No
windy:       false → 6 Yes / 2 No;  true → 3 Yes / 3 No
temperature: hot → 2 Yes / 2 No;    mild → 4 Yes / 2 No;  cool → 3 Yes / 1 No

Page 8: Information Gain

• Information gain (IG) measures how much "information" an attribute gives us about the class.
  – Attributes that partition the examples perfectly should give maximal information.
  – Unrelated attributes should give no information.
• It measures the reduction in entropy.
  – Entropy: the (im)purity of an arbitrary collection of examples.


Page 9: Aside on Entropy

• 𝑆 is a sample of training examples

• 𝑝⊕ is the proportion of positive examples in 𝑆

• 𝑝⊖ is the proportion of negative examples in 𝑆

• Entropy measures the impurity of 𝑆:

  $\mathrm{Entropy}(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$


Page 10: Aside on Entropy

• $\mathrm{Entropy}(S)$ = the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of 𝑆 (under the optimal, shortest-length code).
• Why?
  – Information theory: an optimal-length code assigns $-\log_2 p$ bits to a message having probability $p$.
  – So the expected number of bits to encode ⊕ or ⊖ for a random member of 𝑆 is:

    $p_{\oplus}(-\log_2 p_{\oplus}) + p_{\ominus}(-\log_2 p_{\ominus})$

    $\mathrm{Entropy}(S) = H(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$


Page 11: Aside on Entropy

• Minimum number of bits needed for 𝑐 different classes (general case):

  $H(Y) = -\sum_{i=1}^{c} p_i \log_2 p_i = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_c \log_2 p_c$

• Properties of entropy (illustrated in the sketch below):
  1. High entropy: a near-uniform class distribution.
  2. Low entropy: a skewed class distribution (more desirable for a node).
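As a quick check of these properties, here is a small entropy helper in Python (a sketch; the function name `entropy` is ours, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the classes in `labels`."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

print(entropy(["+", "-", "+", "-"]))       # uniform -> 1.0 (high entropy)
print(entropy(["+", "+", "+", "+", "-"]))  # skewed  -> ~0.72 (low entropy)
```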

Page 12: Conditional Entropy

• For an example drawn at random, the conditional entropy of 𝑌 (the class label) given the 𝑚 values $v_1, \dots, v_m$ taken by a feature $x_k$ is:

  $H(Y \mid x_k) = \sum_{j=1}^{m} P(x_k = v_j)\, H(Y \mid x_k = v_j)$

• The information gain of $x_k$ is the resulting reduction in entropy:

  $\mathrm{InfGain}(Y, x_k) = H(Y) - H(Y \mid x_k)$
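These two formulas translate directly into code. A sketch, reusing the `entropy` helper from the previous page (the names `conditional_entropy` and `info_gain` are ours):

```python
from collections import defaultdict

def conditional_entropy(values, labels):
    """H(Y | x_k) = sum_j P(x_k = v_j) * H(Y | x_k = v_j)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)  # labels of the examples where x_k = v
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(values, labels):
    """InfGain(Y, x_k) = H(Y) - H(Y | x_k)."""
    return entropy(labels) - conditional_entropy(values, labels)
```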

Page 13: Example


School | Likes football?
-------|----------------
ND     | Yes
MSU    | No
ND     | No
ND     | Yes
ND     | No
USC    | Yes
MSU    | No
USC    | Yes

Compute 𝑯(𝒀|𝑿):

v_j | P(x = v_j) | H(Y | x = v_j)
----|------------|---------------
MSU | 0.25       | 0
ND  | 0.50       | 1
USC | 0.25       | 0

$H(Y) = -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1$

$H(Y \mid X) = 0.5 \cdot 1 + 0.25 \cdot 0 + 0.25 \cdot 0 = 0.5$

$\mathrm{InfGain}(Y, X) = H(Y) - H(Y \mid X) = 1 - 0.5 = 0.5$
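The same numbers fall out of the `info_gain` sketch above:

```python
schools = ["ND", "MSU", "ND", "ND", "ND", "USC", "MSU", "USC"]
likes   = ["Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"]

print(info_gain(schools, likes))  # H(Y) - H(Y|X) = 1 - 0.5 = 0.5
```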

Page 14: Back to the Decision Tree


[Figure: the outlook split — sunny → 2 Yes / 3 No; overcast → 4 Yes / 0 No; rainy → 3 Yes / 2 No.]

• Information gain = (entropy before the split) − (weighted average entropy after the split).
• Information gain for outlook:

  $\mathrm{InfGain}(\text{outlook}) = H\!\left(\tfrac{9}{14},\tfrac{5}{14}\right) - \left[\tfrac{5}{14}H\!\left(\tfrac{2}{5},\tfrac{3}{5}\right) + \tfrac{4}{14}H\!\left(\tfrac{4}{4},\tfrac{0}{4}\right) + \tfrac{5}{14}H\!\left(\tfrac{3}{5},\tfrac{2}{5}\right)\right] = 0.940 - 0.693 = \mathbf{0.247}$

• For the other features:

  $\mathrm{InfGain}(\text{“temperature”}) = \mathbf{0.029}$
  $\mathrm{InfGain}(\text{“humidity”}) = \mathbf{0.152}$
  $\mathrm{InfGain}(\text{“windy”}) = \mathbf{0.048}$
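These four gains can be reproduced with the `info_gain` sketch from earlier, run over the weather table from page 6:

```python
data = [  # (outlook, temp, humidity, windy, play?)
    ("Sunny", "Hot", "High", False, "No"),       ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
labels = [row[4] for row in data]
for i, name in enumerate(["outlook", "temperature", "humidity", "windy"]):
    print(name, round(info_gain([row[i] for row in data], labels), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```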

Page 15: Continuing to Split


[Figure: splitting the outlook = sunny branch (2 Yes / 3 No) by each remaining feature —
temperature: hot → 0 Yes / 2 No; mild → 1 Yes / 1 No; cool → 1 Yes / 0 No
windy: false → 1 Yes / 2 No; true → 1 Yes / 1 No
humidity: high → 0 Yes / 3 No; normal → 2 Yes / 0 No]

$\mathrm{InfGain}(\text{“temperature”}) = 0.571$
$\mathrm{InfGain}(\text{“windy”}) = 0.020$
$\mathrm{InfGain}(\text{“humidity”}) = 0.971$

Page 16: Final Tree

• Note: leaves need not be pure, since instances with identical attribute values can still have different classes.

[Figure: the final tree.]
outlook = sunny    → humidity: high → No, normal → Yes
outlook = overcast → Yes
outlook = rainy    → windy: false → Yes, true → No

Page 17: Applying the Model to Test Data

Test data:

Outlook | Temp | Humidity | Windy | Play?
--------|------|----------|-------|------
Rainy   | Hot  | High     | False | ?

[Figure: the same final tree — the test instance follows outlook = rainy, then windy = false, and reaches a leaf.]

Predicted: Play? = Yes
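A sketch of this prediction step, with the final tree hand-encoded as nested dicts (the representation and names are ours, not from the slides):

```python
# The final tree as nested dicts: {attribute: {value: subtree-or-leaf}}.
tree = {"outlook": {
    "Sunny":    {"humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rainy":    {"windy": {False: "Yes", True: "No"}},
}}

def predict(tree, example):
    """Walk from the root, following the branch matching each tested attribute."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

test = {"outlook": "Rainy", "temp": "Hot", "humidity": "High", "windy": False}
print(predict(tree, test))  # rainy -> windy = False -> "Yes"
```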

Page 18: How to Specify the Test Condition

• Depends on:
  – Type of attributes/features:
    • Nominal
    • Continuous
  – Number of ways to split:
    • 2-way split
    • Multi-way split


Page 19: Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are values.
• Binary split: divide the values into two subsets.

[Figure: a multi-way split on Car Type (Luxury / Sports / Family) versus two binary splits: {Luxury, Sports} vs. Family, and Luxury vs. {Sports, Family}.]

Page 20: Splitting Based on Continuous Attributes

• Discretization: form an ordinal categorical feature.
  – Can be done once at the beginning (static, global) or at each level individually (dynamic, local).
• Binary decision: $(A < v)$ or $(A \geq v)$.
  – Considers all possible splits and chooses the best one (sketched below).
  – More computationally intensive.
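A sketch of the binary-decision option, assuming the `info_gain` helper from earlier; candidate thresholds are taken midway between consecutive observed values (the name `best_threshold` is ours):

```python
def best_threshold(values, labels):
    """Try every binary split (A < v) vs (A >= v) and keep the best one
    by information gain -- hence the extra computational cost."""
    points = sorted(set(values))
    best_v, best_gain = None, -1.0
    for lo, hi in zip(points, points[1:]):
        v = (lo + hi) / 2                  # midpoint between observed values
        two_way = [x < v for x in values]  # the binary decision as a feature
        gain = info_gain(two_way, labels)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```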


Page 21: Highly Branching Attributes

• Problematic: attributes with a large number of values (extreme case: an ID code).
• Subsets are more likely to be pure when an attribute has many values.
  – Information gain is therefore biased towards choosing attributes with a large number of values.
  – This may result in overfitting (selecting an attribute that is non-optimal for prediction).


Page 22: Highly Branching Attributes – Example


Day | Outlook  | Temp | Humidity | Windy | Play?
----|----------|------|----------|-------|------
D1  | Sunny    | Hot  | High     | False | No
D2  | Sunny    | Hot  | High     | True  | No
D3  | Overcast | Hot  | High     | False | Yes
D4  | Rainy    | Mild | High     | False | Yes
D5  | Rainy    | Cool | Normal   | False | Yes
D6  | Rainy    | Cool | Normal   | True  | No
D7  | Overcast | Cool | Normal   | True  | Yes
D8  | Sunny    | Mild | High     | False | No
D9  | Sunny    | Cool | Normal   | False | Yes
D10 | Rainy    | Mild | Normal   | False | Yes
D11 | Sunny    | Mild | Normal   | True  | Yes
D12 | Overcast | Mild | High     | True  | Yes
D13 | Overcast | Hot  | Normal   | False | Yes
D14 | Rainy    | Mild | High     | True  | No

Page 23: Highly Branching Attributes – Example


[Figure: splitting on Day creates one branch per value — D1 → No, D2 → No, D3 → Yes, D4 → Yes, …, D13 → Yes, D14 → No — with a single example in each leaf.]

• Entropy of the split = 0, so each leaf is "pure".
• Information gain is maximal for this feature.
• Is that good?
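It is not: a one-leaf-per-ID tree merely memorizes the training data, which is exactly the overfitting risk noted on page 21. The gain ratio mentioned on page 4 (used in C4.5) corrects for this by dividing the gain by the entropy of the split itself; a sketch, reusing `entropy` and `info_gain` from earlier:

```python
def gain_ratio(values, labels):
    """C4.5's correction: divide the gain by the split's own entropy,
    which penalizes attributes with many values (like the Day ID code)."""
    split_info = entropy(values)  # entropy of the attribute's values
    return info_gain(values, labels) / split_info if split_info else 0.0

days = [f"D{i}" for i in range(1, 15)]  # 14 distinct ID values
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
        "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(info_gain(days, play))   # 0.940 -- maximal; every leaf is pure
print(gain_ratio(days, play))  # ~0.247 -- heavily penalized
```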