CS B551: DECISION TREES


Page 1

CS B551: DECISION TREES

Page 2

AGENDA

Decision trees
Complexity
Learning curves
Combatting overfitting
Boosting

Page 3

RECAP

Still in supervised setting with logical attributes

Find a representation of CONCEPT in the form:

CONCEPT(x) ⇔ S(A,B, …)

where S(A,B,…) is a sentence built with the observable attributes, e.g.:

CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
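For instance, such a sentence is just a Boolean expression over the attribute values; a one-line sketch in Python (the dict encoding of x is mine, not from the slides):

```python
def concept(x):
    # x is a dict of boolean attribute values, e.g. {"A": True, "B": False, "C": True}
    return x["A"] and (not x["B"] or x["C"])

print(concept({"A": True, "B": False, "C": False}))  # True
print(concept({"A": True, "B": True, "C": False}))   # False
```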

Page 4

PREDICATE AS A DECISION TREE

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

A?
├─ True → B?
│    ├─ True → C?
│    │    ├─ True → True
│    │    └─ False → False
│    └─ False → True
└─ False → False

Example: A mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
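Such a tree is easy to encode and evaluate directly; here is a minimal sketch (the nested-tuple encoding and the names are mine, not from the slides):

```python
# Sketch: the decision tree above as nested tuples.
# An internal node is (attribute, true_branch, false_branch); a leaf is a bool.
TREE = ("A",
        ("B",
         ("C", True, False),   # yellow and big: poisonous iff spotted
         True),                # yellow and small: poisonous
        False)                 # not yellow: not poisonous

def classify(tree, example):
    """Walk the tree until a leaf (bool) is reached."""
    while not isinstance(tree, bool):
        attr, true_branch, false_branch = tree
        tree = true_branch if example[attr] else false_branch
    return tree

print(classify(TREE, {"A": True, "B": True, "C": False}))   # False
print(classify(TREE, {"A": True, "B": False, "C": False}))  # True
```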

Page 5

PREDICATE AS A DECISION TREE

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

(Same decision tree as on the previous slide.)

Example: A mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY

Page 6

TRAINING SET

Ex. # A B C D E CONCEPT

1 False False True False True False

2 False True False False False False

3 False True True True True False

4 False False True False False False

5 False False False True True False

6 True False True False False True

7 True False False True False True

8 True False True False True True

9 True True True False True True

10 True True True True True True

11 True True False False False False

12 True True False False True False

13 True False True True True True

Page 7

(The training set from the previous slide, repeated alongside the tree.)

POSSIBLE DECISION TREE

[Figure: a larger decision tree, with root D and further tests on E, A, C, and B, that is consistent with all 13 training examples; its logical form is spelled out on the next slide.]

Page 8

POSSIBLE DECISION TREE

(The larger decision tree from the previous slide, which represents:)

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

(Compare with the compact tree from slide 4, which represents:)

CONCEPT ⇔ A ∧ (¬B ∨ C)

Page 9

POSSIBLE DECISION TREE

(The large and compact decision trees from the previous slides, repeated for comparison.)

CONCEPT ⇔ A ∧ (¬B ∨ C)

KIS (keep it simple) bias: build the smallest decision tree
Computationally intractable problem ⇒ greedy algorithm

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

Page 10

TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return the majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates − A),
   - right branch is DTL(D−A, Predicates − A)

A?
├─ True → C?
│    ├─ True → True
│    └─ False → B?
│         ├─ True → False
│         └─ False → True
└─ False → False
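A minimal Python sketch of DTL as stated above (the (attribute-dict, label) example encoding and the helper names are mine; step 4's error-minimizing choice counts misclassifications of a majority-rule split):

```python
from collections import Counter

def majority(examples):
    """Most common label among (attributes_dict, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def split_errors(examples, attr):
    """Errors made by splitting on attr and predicting the majority on each branch."""
    errors = 0
    for value in (True, False):
        branch = [ex for ex in examples if ex[0][attr] == value]
        if branch:
            maj = majority(branch)
            errors += sum(1 for _, label in branch if label != maj)
    return errors

def dtl(examples, predicates):
    labels = {label for _, label in examples}
    if labels == {True}:
        return True
    if labels == {False}:
        return False
    if not predicates:
        return majority(examples)
    a = min(predicates, key=lambda p: split_errors(examples, p))
    pos = [ex for ex in examples if ex[0][a]]
    neg = [ex for ex in examples if not ex[0][a]]
    rest = [p for p in predicates if p != a]
    # Guard against empty branches by falling back to the majority rule.
    left = dtl(pos, rest) if pos else majority(examples)
    right = dtl(neg, rest) if neg else majority(examples)
    return (a, left, right)
```

Run on the 13-example training set, the greedy choice picks A at the root (2 errors), then C, then B, which reproduces the tree shown above.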

Page 11

LEARNABLE CONCEPTS

Some simple concepts cannot be represented compactly in DTs:
• Parity(x) = X1 xor X2 xor … xor Xn
• Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
Exponential size in # of attributes
Need an exponential # of examples to learn exactly
The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT

Page 12

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve

[Figure: typical learning curve — % correct on test set (up to 100) vs. size of training set.]

Page 13

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve

[Figure: typical learning curve, as on the previous slide.]

Some concepts are unrealizable within a machine’s capacity

Page 14

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set

[Figure: typical learning curve, as on the previous slides.]

Page 15

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning: terminate recursion when # errors / information gain is small

Page 16

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning: terminate recursion when # errors / information gain is small
The resulting decision tree + majority rule may not classify all examples in the training set correctly

Page 17

PERFORMANCE ISSUES

Assessing performance:
• Training set and test set
• Learning curve
• Overfitting
• Tree pruning
• Incorrect examples
• Missing data
• Multi-valued and continuous attributes

Page 18

USING INFORMATION THEORY

Rather than minimizing the probability of error, minimize the expected number of questions needed to decide if an object x satisfies CONCEPT
Use the information-theoretic quantity known as information gain
Split on the variable with the highest information gain

Page 19

ENTROPY / INFORMATION GAIN

Entropy: encodes the quantity of uncertainty in a random variable
H(X) = −Σx∈Val(X) P(x) log P(x)

Properties:
• H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
• H(X) > 0 if X is not known with certainty
• H(X) is maximal if P(X) is the uniform distribution

Information gain: measures the reduction in uncertainty in X given knowledge of Y
I(X,Y) = Ey[H(X) − H(X|Y=y)] = Σy P(y) Σx [P(x|y) log P(x|y) − P(x) log P(x)]

Properties:
• Always nonnegative
• = 0 if X and Y are independent
• If Y is a choice, maximizing IG ⇔ minimizing Ey[H(X|Y=y)]
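As a concrete sketch of these two definitions (the function names and the dict-of-probabilities encoding are mine, not from the slides):

```python
import math
from collections import defaultdict

def entropy(dist):
    """H(X) = -sum_x P(x) log2 P(x) for a dict {x: P(x)}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_gain(joint):
    """I(X,Y) = H(X) - E_y[H(X | Y=y)] for a dict {(x, y): P(x, y)}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    expected_cond = 0.0
    for y, p_y in py.items():
        if p_y > 0:
            conditional = {x: joint.get((x, y), 0.0) / p_y for x in px}
            expected_cond += p_y * entropy(conditional)
    return entropy(px) - expected_cond

# Independent X and Y: information gain is 0.
joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(information_gain(joint))  # ~0.0
```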

Page 20

MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES

Ey[H(X|Y=y)] = −Σy P(y) Σx P(x|y) log P(x|y)

Let n be the # of examples
Let n+, n− be the # of examples on the True/False branches of Y
Let p+, p− be the accuracy on the True/False branches of Y
P(correct) = (p+n+ + p−n−)/n, P(correct|Y) = p+, P(correct|¬Y) = p−

Ey[H(X|Y=y)] ∝ −( n+·[p+ log p+ + (1−p+) log(1−p+)] + n−·[p− log p− + (1−p−) log(1−p−)] )
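A small sketch of this split score in the slide's notation, where p+ and p− are the accuracies on the True/False branches (the helper names are mine); the printed example scores the split on attribute A for the 13-example training set:

```python
import math

def binary_entropy(p):
    """H(p) = -(p log2 p + (1-p) log2 (1-p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_conditional_entropy(n_pos, p_pos, n_neg, p_neg):
    """E_y[H(X|Y=y)] for a boolean split: weighted entropy of the two branches."""
    n = n_pos + n_neg
    return (n_pos / n) * binary_entropy(p_pos) + (n_neg / n) * binary_entropy(p_neg)

# Splitting the 13 mushroom examples on A: 8 go down the True branch (6 of them
# labeled True), 5 go down the False branch (all labeled False).
print(expected_conditional_entropy(8, 6 / 8, 5, 1.0))  # ~0.50 bits
```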

Page 21

CONTINUOUS ATTRIBUTES

Continuous attributes can be converted into logical ones via thresholds X => X<a

When considering splitting on X, pick the threshold a to minimize # of errors / entropy

[Figure: choosing the threshold a along the values of X; values shown: 7 7 6 5 6 5 4 5 4 3 4 5 4 5 6 7]
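A sketch of this threshold search by error count (the names and the toy data are mine; the rule predicts the majority label on each side of the split X < a):

```python
def best_threshold(values, labels):
    """Pick a threshold a minimizing errors of 'predict the majority on each side of X < a'."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = {(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1) if xs[i] != xs[i + 1]}
    best = (float("inf"), None)
    for a in candidates:
        left = [lab for x, lab in pairs if x < a]
        right = [lab for x, lab in pairs if x >= a]
        errors = 0
        for side in (left, right):
            if side:
                errors += min(side.count(True), side.count(False))
        best = min(best, (errors, a))
    return best  # (error count, threshold)

# Toy data: small values mostly negative, large values mostly positive.
print(best_threshold([1, 2, 3, 4, 5, 6, 7, 8],
                     [False, False, False, True, False, True, True, True]))  # (1, 3.5)
```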

Page 22

MULTI-VALUED ATTRIBUTES

Simple change: consider splits on all values A can take on

Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
• More values ⇒ the dataset is split into smaller example sets when picking attributes
• Smaller example sets ⇒ more likely to fit well to spurious noise

Page 23

STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE

There may be few training examples that match the path leading to a deep node in the decision tree
• More susceptible to choosing irrelevant/incorrect attributes when the sample is small
Idea:
• Make a statistical estimate of predictive power (which increases with larger samples)
• Prune branches with low predictive power
• Chi-squared pruning

Page 24

TOP-DOWN DT PRUNING

Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
At the k leaf nodes below X, the numbers of correct/incorrect examples are p1/n1, …, pk/nk
Chi-squared statistical significance test:
• Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant)
• Alternative hypothesis: examples are not randomly chosen (X is relevant)
Prune X if testing X is not statistically significant

Page 25

CHI-SQUARED TEST

Let Z = Σi [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]
where pi′ = p·(pi+ni)/(p+n) and ni′ = n·(pi+ni)/(p+n) are the expected numbers of correctly/incorrectly classified examples at leaf node i if the null hypothesis holds

Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom

Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
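A sketch of the pruning decision using SciPy's chi-squared distribution; `leaf_counts` holds the (pi, ni) pairs at the k leaves below X, and the function name is mine:

```python
from scipy.stats import chi2

def should_prune(leaf_counts, alpha=0.05):
    """leaf_counts: list of (p_i, n_i) correct/incorrect counts at the k leaves below X.
    Returns True if the split at X is not statistically significant."""
    p = sum(pi for pi, _ in leaf_counts)
    n = sum(ni for _, ni in leaf_counts)
    z = 0.0
    for pi, ni in leaf_counts:
        total_i = pi + ni
        pi_exp = p * total_i / (p + n)   # expected correct under the null hypothesis
        ni_exp = n * total_i / (p + n)   # expected incorrect under the null hypothesis
        if pi_exp > 0:
            z += (pi - pi_exp) ** 2 / pi_exp
        if ni_exp > 0:
            z += (ni - ni_exp) ** 2 / ni_exp
    p_value = chi2.sf(z, df=len(leaf_counts) - 1)
    return p_value > alpha

# A split whose leaves mirror the parent's 8:4 ratio exactly is pruned.
print(should_prune([(4, 2), (4, 2)]))  # True
```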

Page 26

ENSEMBLE LEARNING (BOOSTING)

Page 27

IDEA

It may be difficult to search for a single hypothesis that explains the data

Construct multiple hypotheses (ensemble), and combine their predictions

“Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988

Page 28

MOTIVATION

• 5 classifiers with 60% accuracy
• On a new example, run them all, and pick the prediction using majority voting
• If errors are independent, the majority vote is correct about 68% of the time, and the advantage grows with more classifiers
(In reality errors will not be independent, but we hope they will be mostly uncorrelated)
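This kind of figure comes straight from the binomial tail; a quick sketch (the function name is mine) computes the probability that a majority of independent classifiers is correct:

```python
from math import comb

def majority_vote_accuracy(n_classifiers, accuracy):
    """P(more than half of n independent classifiers are correct)."""
    need = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k) * accuracy**k * (1 - accuracy)**(n_classifiers - k)
               for k in range(need, n_classifiers + 1))

print(majority_vote_accuracy(5, 0.6))    # ~0.68
print(majority_vote_accuracy(101, 0.6))  # ~0.98: accuracy grows with more classifiers
```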

Page 29

BOOSTING

Main idea:
• If learner 1 fails to learn an example correctly, this example is more important for learner 2
• If learners 1 and 2 fail to learn an example correctly, this example is more important for learner 3
• …
Weighted training set: weights encode importance

Page 30

BOOSTING

Weighted training set

Ex. # Weight A B C D E CONCEPT

1 w1 False False True False True False

2 w2 False True False False False False

3 w3 False True True True True False

4 w4 False False True False False False

5 w5 False False False True True False

6 w6 True False True False False True

7 w7 True False False True False True

8 w8 True False True False True True

9 w9 True True True False True True

10 w10 True True True True True True

11 w11 True True False False False False

12 w12 True True False False True False

13 w13 True False True True True True

Page 31

BOOSTING

Start with uniform weights wi=1/N

Use learner 1 to generate hypothesis h1

Adjust weights to give higher importance to misclassified examples

Use learner 2 to generate hypothesis h2

…
Weight hypotheses according to performance, and return the weighted majority
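A sketch of this loop in the spirit of AdaBoost (the `learn` callback, the names, and the early-stopping guards are mine; the weight update multiplies correctly classified examples by error/(1 − error) and renormalizes, as in R&N):

```python
import math

def adaboost(examples, learn, rounds):
    """examples: list of (x, label) with boolean labels.
    learn(examples, weights) must return a hypothesis h with h(x) -> bool."""
    n = len(examples)
    weights = [1.0 / n] * n
    hypotheses, hyp_weights = [], []
    for _ in range(rounds):
        h = learn(examples, weights)
        error = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        if error >= 0.5:                      # no better than chance on the weighted set
            break
        error = max(error, 1e-12)             # guard for a perfect hypothesis
        # Down-weight the correctly classified examples, then renormalize.
        weights = [w * error / (1 - error) if h(x) == y else w
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append(h)
        hyp_weights.append(math.log((1 - error) / error))

    def weighted_majority(x):
        vote = sum(z if h(x) else -z for h, z in zip(hypotheses, hyp_weights))
        return vote > 0
    return weighted_majority
```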

Page 32

MUSHROOM EXAMPLE

“Decision stumps” - single attribute DT

Ex. # Weight A B C D E CONCEPT

1 1/13 False False True False True False

2 1/13 False True False False False False

3 1/13 False True True True True False

4 1/13 False False True False False False

5 1/13 False False False True True False

6 1/13 True False True False False True

7 1/13 True False False True False True

8 1/13 True False True False True True

9 1/13 True True True False True True

10 1/13 True True True True True True

11 1/13 True True False False False False

12 1/13 True True False False True False

13 1/13 True False True True True True

Page 33

MUSHROOM EXAMPLE

Pick C first, learns CONCEPT = C

Ex. # Weight A B C D E CONCEPT

1 1/13 False False True False True False

2 1/13 False True False False False False

3 1/13 False True True True True False

4 1/13 False False True False False False

5 1/13 False False False True True False

6 1/13 True False True False False True

7 1/13 True False False True False True

8 1/13 True False True False True True

9 1/13 True True True False True True

10 1/13 True True True True True True

11 1/13 True True False False False False

12 1/13 True True False False True False

13 1/13 True False True True True True

Page 34

MUSHROOM EXAMPLE

Pick C first, learns CONCEPT = C

Ex. # Weight A B C D E CONCEPT

1 1/13 False False True False True False

2 1/13 False True False False False False

3 1/13 False True True True True False

4 1/13 False False True False False False

5 1/13 False False False True True False

6 1/13 True False True False False True

7 1/13 True False False True False True

8 1/13 True False True False True True

9 1/13 True True True False True True

10 1/13 True True True True True True

11 1/13 True True False False False False

12 1/13 True True False False True False

13 1/13 True False True True True True

Page 35

MUSHROOM EXAMPLE

Update weights (precise formula given in R&N; a worked check follows the table below)

Ex. # Weight A B C D E CONCEPT

1 .125 False False True False True False

2 .056 False True False False False False

3 .125 False True True True True False

4 .125 False False True False False False

5 .056 False False False True True False

6 .056 True False True False False True

7 .125 True False False True False True

8 .056 True False True False True True

9 .056 True True True False True True

10 .056 True True True True True True

11 .056 True True False False False False

12 .056 True True False False True False

13 .056 True False True True True True
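As a quick check of the numbers above (a sketch, assuming the R&N-style update: multiply the weights of correctly classified examples by ε/(1 − ε), then renormalize): the stump CONCEPT = C misclassifies examples 1, 3, 4 and 7, so ε = 4/13, and the update gives 0.125 for the misclassified examples and ≈ 0.056 for the rest.

```python
# Weights after the first round, starting from the uniform 1/13.
# The stump CONCEPT = C gets examples 1, 3, 4 and 7 wrong (error = 4/13).
misclassified = {1, 3, 4, 7}
error = len(misclassified) / 13

weights = {i: 1 / 13 for i in range(1, 14)}
for i in weights:
    if i not in misclassified:                 # correctly classified: scale down
        weights[i] *= error / (1 - error)
total = sum(weights.values())
weights = {i: w / total for i, w in weights.items()}

print(round(weights[1], 3), round(weights[2], 3))  # 0.125 0.056
```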

Page 36

MUSHROOM EXAMPLE

Next try A, learn CONCEPT=A

Ex. # Weight A B C D E CONCEPT

1 .125 False False True False True False

2 .056 False True False False False False

3 .125 False True True True True False

4 .125 False False True False False False

5 .056 False False False True True False

6 .056 True False True False False True

7 .125 True False False True False True

8 .056 True False True False True True

9 .056 True True True False True True

10 .056 True True True True True True

11 .056 True True False False False False

12 .056 True True False False True False

13 .056 True False True True True True

Page 37

MUSHROOM EXAMPLE

Next try A, learn CONCEPT=A

Ex. # Weight A B C D E CONCEPT

1 .125 False False True False True False

2 .056 False True False False False False

3 .125 False True True True True False

4 .125 False False True False False False

5 .056 False False False True True False

6 .056 True False True False False True

7 .125 True False False True False True

8 .056 True False True False True True

9 .056 True True True False True True

10 .056 True True True True True True

11 .056 True True False False False False

12 .056 True True False False True False

13 .056 True False True True True True

Page 38

MUSHROOM EXAMPLE

Update weights

Ex. # Weight A B C D E CONCEPT

1 0.07 False False True False True False

2 0.03 False True False False False False

3 0.07 False True True True True False

4 0.07 False False True False False False

5 0.03 False False False True True False

6 0.03 True False True False False True

7 0.07 True False False True False True

8 0.03 True False True False True True

9 0.03 True True True False True True

10 0.03 True True True True True True

11 0.25 True True False False False False

12 0.25 True True False False True False

13 0.03 True False True True True True

Page 39

MUSHROOM EXAMPLE

Next try E, learn CONCEPT=E

Ex. # Weight A B C D E CONCEPT

1 0.07 False False True False True False

2 0.03 False True False False False False

3 0.07 False True True True True False

4 0.07 False False True False False False

5 0.03 False False False True True False

6 0.03 True False True False False True

7 0.07 True False False True False True

8 0.03 True False True False True True

9 0.03 True True True False True True

10 0.03 True True True True True True

11 0.25 True True False False False False

12 0.25 True True False False True False

13 0.03 True False True True True True

Page 40

MUSHROOM EXAMPLE

Next try E, learn CONCEPT=E

Ex. # Weight A B C D E CONCEPT

1 0.07 False False True False True False

2 0.03 False True False False False False

3 0.07 False True True True True False

4 0.07 False False True False False False

5 0.03 False False False True True False

6 0.03 True False True False False True

7 0.07 True False False True False True

8 0.03 True False True False True True

9 0.03 True True True False True True

10 0.03 True True True True True True

11 0.25 True True False False False False

12 0.25 True True False False True False

13 0.03 True False True True True True

Page 41

MUSHROOM EXAMPLE

Update Weights…

Ex. # Weight A B C D E CONCEPT

1 0.07 False False True False True False

2 0.03 False True False False False False

3 0.07 False True True True True False

4 0.07 False False True False False False

5 0.03 False False False True True False

6 0.03 True False True False False True

7 0.07 True False False True False True

8 0.03 True False True False True True

9 0.03 True True True False True True

10 0.03 True True True True True True

11 0.25 True True False False False False

12 0.25 True True False False True False

13 0.03 True False True True True True

Page 42

MUSHROOM EXAMPLE

Final classifier, order C, A, E, D, B
Weights on hypotheses determined by overall error
Weighted majority weights: A=2.1, B=0.9, C=0.8, D=1.4, E=0.09
100% accuracy on the training set

Page 43

BOOSTING STRATEGIES

The prior weighting strategy is the popular AdaBoost algorithm (see R&N p. 667)
Many other strategies exist
Typically, as the number of hypotheses increases, accuracy increases as well
Does this conflict with Occam's razor?

Page 44

ANNOUNCEMENTS

Next class: Neural networks & function learning (R&N 18.6–7)