
Page 1:

Lecture 2: Classification and Decision Trees

Sanjeev Arora, Elad Hazan

This lecture contains material from the T. Mitchell text "Machine Learning", and slides adapted from David Sontag, Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Page 2:

Admin

• Enrolling in the course – priorities
• Prerequisites reminder (calculus, linear algebra, discrete math, probability, data structures & graph algorithms)
• Movie tickets
• Slides
• Exercise 1 – theory – due in one week, in class

Page 3:

Agenda

This course: basic principles of how to design machines/programs that act "intelligently."

• Recall the crow/fox example.
• Start with a simpler task – classification, learning from examples
• Today: still naive AI. Next week: statistical learning theory

Page 4:

Classification

Goal: find the best mapping from domain (features) to output (labels)

• Given a document (email), classify it as spam or ham. Features = words, labels = {spam, ham}
• Given a picture, classify whether it contains a chair. Features = bits in a bitmap image, labels = {chair, no chair}

GOAL: an automatic machine that learns from examples

Terminology for learning from examples (a minimal protocol sketch follows below):
• Set aside a "training set" of examples; train a classification machine
• Test on a "test set", to see how well the machine performs on unseen examples
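To make this protocol concrete, here is a minimal sketch in Python; the toy spam/ham dataset, the 80/20 split, and the trivial majority-class learner are illustrative assumptions, not part of the lecture:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the labeled examples and set aside a held-out test set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

def majority_classifier(train):
    """A trivial learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

# Toy spam/ham data as (features, label) pairs.
data = [({"viagra": 1}, "spam"), ({"meeting": 1}, "ham"),
        ({"lottery": 1}, "spam"), ({"report": 1}, "ham"),
        ({"prize": 1}, "spam")]
train, test = train_test_split(data)
classify = majority_classifier(train)
print(sum(classify(x) == y for x, y in test) / len(test))  # test accuracy
```

Any learner can be dropped in place of the majority classifier; the point is only that accuracy is always measured on examples the machine never saw during training.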

Page 5:

Classifying fuel efficiency

From the UCI repository (thanks to Ross Quinlan)

• 40 data points
• Goal: predict MPG
• Need to find: f : X → Y
• Discrete data (for now)

mpg  cylinders displacement horsepower weight acceleration modelyear maker
good 4 low    low    low    high   75to78 asia
bad  6 medium medium medium medium 70to74 america
bad  4 medium medium medium low    75to78 europe
bad  8 high   high   high   low    70to74 america
bad  6 medium medium medium medium 70to74 america
bad  4 low    medium low    medium 70to74 asia
bad  4 low    medium low    low    70to74 asia
bad  8 high   high   high   low    75to78 america
:    : :      :      :      :      :      :
bad  8 high   high   high   low    70to74 america
good 8 high   medium high   high   79to83 america
bad  8 high   high   high   low    75to78 america
good 4 low    low    low    low    79to83 america
bad  6 medium medium medium high   75to78 america
good 4 medium low    low    low    79to83 america
good 4 low    low    medium high   79to83 america
bad  8 high   high   high   low    70to74 america
good 4 low    medium low    medium 75to78 europe
bad  5 medium medium medium medium 75to78 europe

(the mpg column is the label Y; the remaining attributes are the features X)

Page 6:

Decision trees for classification

• Why use decision trees?
• What is their expressive power?
• Can they be constructed automatically?
• How accurately can they classify?
• What makes a good decision tree, besides accuracy on the given examples?

Page 7:

Decision trees for classification

Some real examples (from Russell & Norvig, Mitchell):

• BP's GasOIL system for separating gas and oil on offshore platforms – decision trees replaced a hand-designed rule system with 2500 rules. The C4.5-based system outperformed human experts and saved BP millions. (1986)
• Learning to fly a Cessna on a flight simulator by watching human experts fly the simulator (1992)
• Can also learn to play tennis, analyze C-section risk, etc.

Page 8:

Decision trees for classification

• Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful (expressive)
• C4.5 and CART – from the "top 10 data mining methods" – very popular
• This Thursday: we'll see why not…

Page 9:

Decision trees: f : X → Y

• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the leaf's label y (see the sketch below)
• Can we construct a tree automatically?

[Figure: example tree – the root tests Cylinders (3, 4, 5, 6, 8); the 4-cylinder branch tests Maker (america, asia, europe) and the 8-cylinder branch tests Horsepower (low, med, high); each leaf predicts good or bad.]

Human interpretable!
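A minimal sketch of this traversal in Python; the node representation and the leaf labels of the example tree are assumptions for illustration, not the exact tree on the slide:

```python
class Node:
    """Internal node: tests one attribute, with one child per attribute value."""
    def __init__(self, attribute, children):
        self.attribute = attribute   # e.g. "cylinders"
        self.children = children     # dict: attribute value -> Node or leaf label

def classify(node, x):
    """Traverse from the root to a leaf; the leaf's label is the prediction."""
    while isinstance(node, Node):
        node = node.children[x[node.attribute]]
    return node  # a leaf is just a class label, e.g. "good" or "bad"

# A hand-built tree in the spirit of the slide (leaf labels are illustrative):
tree = Node("cylinders", {
    3: "bad", 5: "bad", 6: "bad",
    4: Node("maker", {"america": "bad", "asia": "good", "europe": "good"}),
    8: Node("horsepower", {"low": "good", "med": "bad", "high": "bad"}),
})
print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> "good"
```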

Page 10:

Expressive power of DT

What kind of functions can they potentially represent?
• Boolean functions?

F : {0,1}³ ↦ {0,1}

X1 X2 X3 F(X1,X2,X3)
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  0
1  1  0  0
1  1  1  0

[Figure: a complete depth-3 decision tree – the root tests x1, the next level tests x2, the bottom level tests x3, with 0/1 edge labels – whose eight leaves spell out the truth table above.]

Page 11:

Is simpler better?

What kind of functions can they potentially represent?
• Boolean functions?

F : {0,1}³ ↦ {0,1}

(same truth table as on Page 10)

[Figure: the same function represented by a single leaf labeled 0.]

Page 12:

What is the Simplest Tree?

(the MPG dataset from Page 5)

Is this a good tree?

[Figure: a single-node tree: predict mpg = bad, scoring [22+, 18−].]

[22+, 18−] means: correct on 22 examples, incorrect on 18 examples

Page 13:

Are all decision trees equal?

• Many trees can represent the same concept
• But not all trees will have the same size!
– e.g., ((A and B) or (not A and C))

[Figure: a compact tree – the root tests A; if true, test B; if false, test C; leaves are + / −.]

• Which tree do we prefer?

[Figure: an equivalent but larger tree – the root tests B, both branches test C, and A must still be tested on several lower branches before reaching the + / − leaves.]

Page 14:

Learning the simplest decision tree is NP-hard

• Formal justification – statistical learning theory (next lecture)
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
– Start from an empty decision tree
– Split on the next best attribute (feature)
– Recurse

Page 15:

Learning Algorithm for Decision Trees

Training set S = {(x^(1), y^(1)), …, (x^(N), y^(N))}, where each x = (x_1, …, x_d) ∈ {0,1}^d and y ∈ {0,1}

What happens if features are not binary? What about regression?

DT algorithms differ on this choice (the splitting rule)! (see the comparison sketch below)
• ID3
• C4.5
• CART
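For reference, a small sketch of two common split-scoring impurity measures: the lecture develops entropy below; Gini impurity, used by CART, is shown for comparison and is my addition, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the empirical label distribution (ID3's score;
    C4.5 normalizes it into a gain ratio)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 (the splitting score used by CART)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

mix = ["good"] * 22 + ["bad"] * 18      # the 22/18 label mix from the MPG slides
print(entropy(mix), gini(mix))          # both near their maximum for a ~50/50 mix
```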

Page 16:

A Decision Stump

Page 17:

Key idea: Greedily learn trees using recursion

Take the original dataset…

And partition it according to the value of the attribute we split on:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8

Page 18:

Recursive Step

Records in which cylinders = 4 → build tree from these records…
Records in which cylinders = 5 → build tree from these records…
Records in which cylinders = 6 → build tree from these records…
Records in which cylinders = 8 → build tree from these records…

Page 19:

Second level of tree

Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia.

(Similar recursion in the other cases)

Page 20:

A full tree

Page 21:

Splitting: choosing a good attribute

X1 X2 Y
T  T  T
T  F  T
T  T  T
T  F  T
F  T  T
F  F  F
F  T  F
F  F  F

Split on X1: X1=t → Y=t: 4, Y=f: 0;  X1=f → Y=t: 1, Y=f: 3
Split on X2: X2=t → Y=t: 3, Y=f: 1;  X2=f → Y=t: 2, Y=f: 2

Would we prefer to split on X1 or X2?

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!

Page 22:

Measuring uncertainty

• Good split if we are more certain about the classification after the split
– Deterministic: good (all true or all false)
– Uniform distribution: bad
– What about distributions in between?

P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4

P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8

Page 23:

Entropy

Entropy H(Y) of a random variable Y:
H(Y) = −Σᵢ P(Y = yᵢ) log₂ P(Y = yᵢ)

More uncertainty, more entropy!

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

[Figure: entropy of a coin flip as a function of the probability of heads – zero at p = 0 and p = 1, maximal (1 bit) at p = 1/2.]
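A quick numeric check of the definition (plain Python; the example distributions are the ones from Page 22):

```python
import math

def entropy(probs):
    """H(Y) = -sum_i p_i log2 p_i; terms with p = 0 or p = 1 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

print(entropy([0.5, 0.5]))                  # fair coin: 1.0 bit
print(entropy([1.0]))                       # deterministic outcome: 0 bits
print(entropy([0.25] * 4))                  # uniform over 4 values: 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))   # skewed example from Page 22: 1.75 bits
```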

Page 24:

High, Low Entropy

• "High entropy"
– Y is from a uniform-like distribution
– Flat histogram
– Values sampled from it are less predictable

• "Low entropy"
– Y is from a varied (peaks and valleys) distribution
– Histogram has many lows and highs
– Values sampled from it are more predictable

(Slide from Vibhav Gogate)

Page 25:

Entropy Example

X1 X2 Y
T  T  T
T  F  T
T  T  T
T  F  T
F  T  T
F  F  F

P(Y=t) = 5/6
P(Y=f) = 1/6

H(Y) = −5/6 log₂(5/6) − 1/6 log₂(1/6) ≈ 0.65

[Figure: entropy of a coin flip vs. probability of heads, as on Page 23.]

Page 26:

Conditional Entropy

Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:
H(Y|X) = Σᵥ P(X = v) · H(Y | X = v)

Example (same six examples as above):

P(X1=t) = 4/6
P(X1=f) = 2/6

Split on X1: X1=t → Y=t: 4, Y=f: 0;  X1=f → Y=t: 1, Y=f: 1

H(Y|X1) = −4/6 (1 log₂ 1 + 0 log₂ 0) − 2/6 (1/2 log₂ 1/2 + 1/2 log₂ 1/2) = 2/6 ≈ 0.33

Page 27:

Information gain

• Decrease in entropy (uncertainty) after splitting: IG(X) = H(Y) − H(Y|X)

In our running example (same six examples as above):

IG(X1) = H(Y) − H(Y|X1) = 0.65 − 0.33 = 0.32

IG(X1) > 0 → we prefer the split! (see the sketch below)
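Putting the last three slides together, a self-contained sketch that recomputes the running example's numbers (the function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) from a list of observed labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(xs)
    h = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        h += len(ys_v) / n * entropy(ys_v)
    return h

def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)

# The running example: columns X1, X2, Y from Page 25.
X1 = ["T", "T", "T", "T", "F", "F"]
X2 = ["T", "F", "T", "F", "T", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]

print(entropy(Y))                    # ~0.65
print(conditional_entropy(X1, Y))    # ~0.33  (= 2/6)
print(information_gain(X1, Y))       # ~0.32 -> split on X1
print(information_gain(X2, Y))       # ~0.19 -> X2 is the weaker split
```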

Page 28:

Learning decision trees

• Start from an empty decision tree
• Split on the next best attribute (feature)
– Use, for example, information gain to select the attribute: pick arg maxᵢ IG(Xᵢ)
• Recurse

Page 29:

Look at all the information gains…

Suppose we want to predict MPG

Page 30:

When to stop?

First split looks good! But when do we stop?

Page 31:

Base Case One

Don't split a node if all matching records have the same output value

Page 32:

Base Case Two

Don't split a node if the data points are identical on all remaining attributes

Page 33:

Base Cases: An idea

• Base Case One: If all records in the current data subset have the same output, then don't recurse
• Base Case Two: If all records have exactly the same set of input attributes, then don't recurse

Proposed Base Case 3: If all attributes have small information gain, then don't recurse

• This is not a good idea

Page 34:

The problem with proposed case 3

a b y
0 0 0
0 1 1
1 0 1
1 1 0

y = a XOR b

The information gains: IG(a) = 0 and IG(b) = 0.
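To see why: P(y = 1) = 1/2, so H(y) = 1 bit. But conditioned on either value of a, y is still a fair coin, so H(y|a) = 1 and IG(a) = H(y) − H(y|a) = 1 − 1 = 0; by symmetry IG(b) = 0 as well. Proposed case 3 would therefore refuse even the first split, although the two splits together classify perfectly.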

Page 35:

If we omit proposed case 3:

a b y
0 0 0
0 1 1
1 0 1
1 1 0

y = a XOR b

The resulting decision tree: [Figure: the tree splits on a and then on b, classifying XOR perfectly.]

Instead, perform pruning after building a tree

Page 36:

Non-Boolean Features

• Real-valued features?

Page 37:

Real → threshold

[Figure: a decision stump "WEIGHT > ?" with branches No / Yes.]

• Number of thresholds ≤ number of different values in the dataset
• Can choose the threshold based on information gain (sketch below)
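A sketch of that threshold search; the helper names and the toy weight/mpg values are mine, with one candidate threshold per gap between consecutive distinct values:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan candidate thresholds; return the one maximizing information gain."""
    n, h_y = len(labels), entropy(labels)
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive distinct values
        left  = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = h_y - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best  # (information gain, threshold)

weights = [2100, 2300, 2600, 3200, 3800, 4100]
mpg     = ["good", "good", "good", "bad", "bad", "bad"]
print(best_threshold(weights, mpg))  # -> (1.0, 2900.0): a perfect split
```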

Page 38:

Summary: Building Decision Trees

BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
• If all input values are the same, return a leaf node that says "predict the majority output"
• Else find the attribute X with the highest InfoGain
• Suppose X has n_X distinct values (i.e., X has arity n_X):
– Create a non-leaf node with n_X children.
– The i-th child should be built by calling BuildTree(DS_i, Output), where DS_i contains the records in DataSet where X = i-th value of X.
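A runnable sketch of BuildTree, under the assumption that records are dicts of discrete attributes; the dict-based node layout and the names are mine, not the lecture's:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """IG(attr) = H(Y) - H(Y | attr) on this subset of the data."""
    n, h = len(labels), entropy(labels)
    for v in set(ex[attr] for ex in examples):
        sub = [y for ex, y in zip(examples, labels) if ex[attr] == v]
        h -= len(sub) / n * entropy(sub)
    return h

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, attributes):
    # Base case one: all outputs identical -> predict that unique output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: no attributes left / all inputs identical -> majority output.
    if not attributes or all(ex == examples[0] for ex in examples):
        return majority(labels)
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {"attribute": best, "children": {}, "default": majority(labels)}
    remaining = [a for a in attributes if a != best]
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        node["children"][v] = build_tree([examples[i] for i in idx],
                                         [labels[i] for i in idx], remaining)
    return node

def classify(node, x):
    while isinstance(node, dict):
        node = node["children"].get(x[node["attribute"]], node["default"])
    return node

# Tiny illustration on four MPG-style records:
examples = [{"cylinders": 4, "maker": "asia"},   {"cylinders": 8, "maker": "america"},
            {"cylinders": 4, "maker": "europe"}, {"cylinders": 6, "maker": "america"}]
labels = ["good", "bad", "bad", "bad"]
tree = build_tree(examples, labels, ["cylinders", "maker"])
print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> "good"
```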

Page 39:

Machine Space Search

• ID3 / C4.5 / CART search for a succinct tree that perfectly fits the data.
• They are not going to find it in general (the problem is NP-hard).
• Entropy-guided splitting is a well-performing heuristic; others exist.
• Why should we search for a small tree at all?

Page 40:

MPG test set error

The test set error is much worse than the training set error…

…why?

Page 41:

Decision trees will overfit

Page 42:

Fitting a polynomial: t = sin(2πx) + ε

Figure from "Pattern Recognition and Machine Learning", Bishop

Page 43:

Fitting a polynomial: t = sin(2πx) + ε

Regression using a polynomial of degree M

Figure from "Pattern Recognition and Machine Learning", Bishop
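The same experiment in a few lines of numpy; the noise level, sample sizes, and the degrees M = 0, 1, 3, 9 are my choices, matching Bishop's figures only approximately:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # t = sin(2πx) + ε
    return x, t

x_train, t_train = sample(10)
x_test,  t_test  = sample(100)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, deg=M)   # degree-M least-squares fit
    train_err = np.sqrt(np.mean((np.polyval(coeffs, x_train) - t_train) ** 2))
    test_err  = np.sqrt(np.mean((np.polyval(coeffs, x_test)  - t_test) ** 2))
    print(f"M={M}: train RMSE {train_err:.2f}, test RMSE {test_err:.2f}")
# As M grows, training error falls toward 0 while test error typically
# grows large: the polynomial is overfitting the 10 noisy training points.
```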

Page 44:

t = sin(2πx) + ε

Figure from "Pattern Recognition and Machine Learning", Bishop

Page 45:

t = sin(2πx) + ε

Figure from "Pattern Recognition and Machine Learning", Bishop

Page 46:

t = sin(2πx) + ε

Figure from "Pattern Recognition and Machine Learning", Bishop

Page 47:

Figure from "Pattern Recognition and Machine Learning", Bishop

Page 48:

Overfitting

• Precise characterization – statistical learning theory
• Special techniques to prevent overfitting in DT learning
• Pruning the tree, e.g. "reduced error" pruning (sketch below):
Do until further pruning is harmful:
1. Evaluate the impact on the validation (test) set of pruning each possible node (and its subtree)
2. Greedily remove the one that most improves validation (test) error
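A compact sketch of this loop, assuming the dict-based trees from the BuildTree sketch on Page 38; the `classify` helper, the node layout, and the strict-improvement stopping rule are my assumptions:

```python
def classify(node, x):
    """Walk the dict-based tree; a pruned node (empty children) acts as a
    leaf predicting its stored majority label ("default")."""
    while isinstance(node, dict):
        node = node["children"].get(x[node["attribute"]], node["default"])
    return node

def accuracy(tree, data):
    """Fraction of (x, y) validation pairs the tree classifies correctly."""
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node, out=None):
    """Collect every internal (dict) node in the tree."""
    out = [] if out is None else out
    if isinstance(node, dict):
        out.append(node)
        for child in node["children"].values():
            internal_nodes(child, out)
    return out

def reduced_error_prune(tree, validation):
    """Do until further pruning is harmful: tentatively replace each node's
    subtree with a majority leaf, then commit the single replacement that
    most improves validation accuracy."""
    if not isinstance(tree, dict):
        return tree
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = node["children"]
            node["children"] = {}          # tentatively prune this subtree
            gain = accuracy(tree, validation) - base
            node["children"] = saved       # undo the tentative prune
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None:              # no prune improves validation error
            return tree
        best_node["children"] = {}         # commit the best prune and repeat
```

Here `validation` is a list of (x, y) pairs held out from training, matching the slide's advice to evaluate each candidate prune on data the tree was not grown on.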

Page 49:

Concepts we have encountered (and that will appear again in the course!)

• Greedy algorithms.
• Trying to find succinct machines.
• Computational efficiency of an algorithm (running time).
• Overfitting.

Page 50:

Questions

• Why are smaller trees preferable to large trees (theory)?
• When can a tree generalize to unseen examples, and how well?
• How many examples are needed to build a tree that generalizes well?
• How to identify and prevent overfitting?
• Are there other natural classification machines, and how do they compare?

We need to reason more generally about "what learning means" – to be continued on Thursday…