
Page 1: CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining

Decision Trees

Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1

Page 2: CPSC 340: Machine Learning and Data Mining

Admin
• Assignment 0 is due Wednesday at 9pm (in 2 days).
• Assignment 1 should be released Wednesday, due a week later.
– If you want to work with a partner, you both must request it BEFORE the a1 release.
– Instructions in the Homework Submission Instructions document.

• Important webpages:
– https://www.cs.ubc.ca/getacct/
– https://github.ugrad.cs.ubc.ca/CPSC340-2017W-T2/home
– https://piazza.com/class/j9uk5ecmb7e4ks

• Tutorials and office hours start this week.
– See the course homepage for tutorial topics and the office hours schedule.

• Auditing
– No room for official auditors.
– Unofficial auditors, please do not take seats if others are standing.

2

Page 3: CPSC 340: Machine Learning and Data Mining

Last Time: Data Representation and Exploration
• We discussed the object-feature representation:
– "Examples": another name we'll use for objects.

• We discussed summary statistics and visualizing data.

Age  Job?  City  Rating  Income
23   Yes   Van   A       22,000.00
23   Yes   Bur   BBB     21,000.00
22   No    Van   CC      0.00
25   Yes   Sur   AAA     57,000.00

3

Page 4: CPSC 340: Machine Learning and Data Mining

Motivating Example: Food Allergies
• You frequently start getting an upset stomach.
• You suspect an adult-onset food allergy.

4

Page 5: CPSC 340: Machine Learning and Data Mining

Motivating Example: Food Allergies
• To solve the mystery, you start a food journal:

• But it's hard to find the pattern:
– You can't isolate and only eat one food at a time.
– You may be allergic to more than one food.
– The quantity matters: a small amount may be ok.
– You may be allergic to specific interactions.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

5

Page 6: CPSC 340: Machine Learning and Data Mining

Supervised Learning
• We can formulate this as supervised learning:

• Input for an object (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
– Model predicts whether foods will make you sick (even with new combinations).

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

6

Page 7: CPSC 340: Machine Learning and Data Mining

Supervised Learning
• General supervised learning problem:
– Take features of objects and corresponding labels as inputs.
– Find a model that can accurately predict the labels of new objects.

• This is the most successful machine learning technique:
– Spam filtering, optical character recognition, Microsoft Kinect, speech recognition, classifying tumours, etc.

• We'll first focus on categorical labels, which is called "classification".
– The model is called a "classifier".

7

Page 8: CPSC 340: Machine Learning and Data Mining

Naïve Supervised Learning: "Predict Mode"

• A very naïve supervised learning method:
– Count how many times each label occurred in the data (4 vs. 1 in the table below).
– Always predict the most common label, the "mode" ("sick" below).

• This ignores the features, so it is only accurate if we have only 1 label.
• There is no unique "right" way to use the features (see the sketch after the table below).
– Today we'll consider a classic way known as decision tree learning.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

8
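A minimal sketch of the "predict mode" baseline, assuming NumPy (the variable names are illustrative, not from the slides):

import numpy as np

y = np.array([1, 1, 0, 1, 1])           # "Sick?" labels from the table above
labels, counts = np.unique(y, return_counts=True)
mode_label = labels[np.argmax(counts)]  # most common label: 1 ("sick", 4 vs. 1)
# "Predict mode" ignores the features and returns mode_label for every object.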

Page 9: CPSC 340: Machine Learning and Data Mining

Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of "if-else" decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.

• Example decision tree:

if (milk > 0.5) {
  return 'sick'
} else {
  if (egg > 1)
    return 'sick'
  else
    return 'not sick'
}

Can draw sequences of decisions as a tree:

9

Page 10: CPSC 340: Machine Learning and Data Mining

Supervised Learning as Writing A Program
• There are many possible decision trees.
– We're going to search for one that is good at our supervised learning problem.

• So our input is data and the output will be a program.
– This is called "training" the supervised learning model.
– Different than the usual input/output specification for writing a program.

• Supervised learning is useful when you have lots of labeled data BUT:
1. the problem is too complicated to write a program ourselves, or
2. a human expert can't explain why you assign certain labels, or
3. we don't have a human expert for the problem.

10

Page 11: CPSC 340: Machine Learning and Data Mining

Learning A Decision Stump
• We'll start with "decision stumps":
– A simple decision tree with 1 splitting rule based on thresholding 1 feature.

• How do we find the best "rule" (feature, threshold, and leaf labels)?
1. Define a 'score' for the rule.
2. Search for the rule with the best score.

11

Page 12: CPSC 340: Machine Learning and Data Mining

Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
– "If we use this rule, how many objects do we label correctly?"

• Computing classification accuracy for (egg > 1):
– Find the most common labels if we use this rule:
• When (egg > 1), we were "sick" both times.
• When (egg <= 1), we were "not sick" three out of four times.
– Compute accuracy:
• Rule (egg > 1) is correct on 5/6 objects.

• Scores of other rules (see the sketch after the table below):
– (milk > 0.5) obtains a lower accuracy of 4/6.
– (egg > 0) obtains the optimal accuracy of 6/6.
– () obtains the "baseline" accuracy of 3/6, as does (egg > 2).

Egg  Milk  Fish  …  Sick?
1    0.7   0        1
2    0.7   0        1
0    0     0        0
0    0.7   1.2      0
2    0     1.2      1
0    0     0        0

12
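A minimal sketch, assuming NumPy, of computing this accuracy score on the table above (the function name is illustrative):

import numpy as np

egg  = np.array([1, 2, 0, 0, 2, 0])
sick = np.array([1, 1, 0, 0, 1, 0])

def rule_accuracy(feature, y, threshold):
    # A stump predicts the most common label on each side of the split.
    satisfied = feature > threshold
    correct = 0
    for side in (satisfied, ~satisfied):
        if side.any():
            _, counts = np.unique(y[side], return_counts=True)
            correct += counts.max()          # the mode gets these objects right
    return correct / len(y)

print(rule_accuracy(egg, sick, 1))  # 5/6, as on the slide
print(rule_accuracy(egg, sick, 0))  # 6/6, the optimal rule
print(rule_accuracy(egg, sick, 2))  # 3/6, the baseline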

Page 13: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 1)
• Accuracy "score" evaluates the quality of a rule.
– Find the best rule by maximizing the score.

• Attempt 1 (exhaustive search):

• As you go, keep track of the highest score.
• Return the highest-scoring rule (variable, threshold, and leaf values).

Compute score of (egg > 0)      Compute score of (milk > 0)      …
Compute score of (egg > 0.01)   Compute score of (milk > 0.01)   …
Compute score of (egg > 0.02)   Compute score of (milk > 0.02)   …
Compute score of (egg > 0.03)   Compute score of (milk > 0.03)   …
…                               …                                …
Compute score of (egg > 99.99)  Compute score of (milk > 0.99)   …

13

Page 14: CPSC 340: Machine Learning and Data Mining

Supervised Learning Notation (MEMORIZE THIS)

• Feature matrix 'X' has rows as objects, columns as features.
– x_ij is feature 'j' for object 'i' (quantity of food 'j' on day 'i').
– x_i is the list of all features for object 'i' (all the quantities on day 'i').
– x^j is column 'j' of the matrix (the value of feature 'j' across all objects).

• Label vector 'y' contains the labels of the objects.
– y_i is the label of object 'i' (1 for "sick", 0 for "not sick").

(See the indexing sketch after the table below.)

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  Sick?
0    0.7   0     0.3    0          0        1
0.3  0.7   0     0.6    0          0.01     1
0    0     0     0.8    0          0        0
0.3  0.7   1.2   0      0.10       0.01     1
0.3  0     1.2   0.3    0.10       0.01     1

14
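A small sketch, assuming NumPy, of how this notation maps to array indexing (the variable names are illustrative):

import numpy as np

X = np.array([[0,   0.7, 0,   0.3, 0,    0   ],
              [0.3, 0.7, 0,   0.6, 0,    0.01],
              [0,   0,   0,   0.8, 0,    0   ],
              [0.3, 0.7, 1.2, 0,   0.10, 0.01],
              [0.3, 0,   1.2, 0.3, 0.10, 0.01]])
y = np.array([1, 1, 0, 1, 1])

i, j = 1, 3
x_ij = X[i, j]   # feature j of object i (wheat on day i)
x_i  = X[i, :]   # all features of object i
x_j  = X[:, j]   # feature j across all objects (column j)
y_i  = y[i]      # label of object i (1 = "sick")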

Page 15: CPSC 340: Machine Learning and Data Mining

SupervisedLearningNotation(MEMORIZETHIS)

• Trainingphase:– Use‘X’and‘y’tofinda‘model’(likeadecisionstump).

• Predictionphase:– Givenanobjectxi,usethe‘model’topredictalabel‘yhati’ (“sick”or“notsick”).

• Trainingerror:– Fractionoftimesourprediction‘yhati’doesnotequalthetrueyi label.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  Sick?
0    0.7   0     0.3    0          0        1
0.3  0.7   0     0.6    0          0.01     1
0    0     0     0.8    0          0        0
0.3  0.7   1.2   0      0.10       0.01     1
0.3  0     1.2   0.3    0.10       0.01     1

15
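For example, with NumPy arrays of true labels and predictions (the predictions here are made up for illustration):

import numpy as np
y    = np.array([1, 1, 0, 1, 1])
yhat = np.array([1, 0, 0, 1, 1])   # hypothetical predictions
error = np.mean(yhat != y)         # training error = 1/5 = 0.2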

Page 16: CPSC 340: Machine Learning and Data Mining

Decision Stump Learning Pseudo-Code
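A minimal Python sketch of decision stump learning, in the spirit of Attempt 2 below (candidate thresholds restricted to values in the data). Assumes NumPy; the function and dictionary names are illustrative, not from the slides:

import numpy as np

def mode(a):
    values, counts = np.unique(a, return_counts=True)
    return values[np.argmax(counts)]

def fit_stump(X, y):
    n, d = X.shape
    base = mode(y)
    # Start from the baseline rule (), which always predicts the mode.
    best = {"feature": None, "threshold": None,
            "label_yes": base, "label_no": base,
            "accuracy": np.mean(y == base)}
    for j in range(d):                       # every feature
        for t in np.unique(X[:, j]):         # every threshold seen in the data
            satisfied = X[:, j] > t
            if not satisfied.any() or satisfied.all():
                continue                     # equivalent to the baseline rule
            label_yes, label_no = mode(y[satisfied]), mode(y[~satisfied])
            yhat = np.where(satisfied, label_yes, label_no)
            accuracy = np.mean(yhat == y)
            if accuracy > best["accuracy"]:
                best = {"feature": j, "threshold": t,
                        "label_yes": label_yes, "label_no": label_no,
                        "accuracy": accuracy}
    return best

def predict_stump(stump, x):
    if stump["feature"] is None:             # baseline rule: always the mode
        return stump["label_yes"]
    if x[stump["feature"]] > stump["threshold"]:
        return stump["label_yes"]
    return stump["label_no"]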

16

Page 17: CPSC 340: Machine Learning and Data Mining

Cost of Decision Stumps (Attempt 1)
• How much does this cost?
• Assume we have:
– 'n' objects (days that we measured).
– 'd' features (foods that we measured).
– 'k' thresholds (> 0, > 0.01, > 0.02, …).

• Computing the score of one rule costs O(n):
– We need to go through all 'n' examples.
– See notes on the webpage for a review of "O(n)" notation.

• To compute scores for d*k rules, the total cost is O(ndk).
– But 'k' might be huge.

• Can we do better?

17

Page 18: CPSC 340: Machine Learning and Data Mining

Speeding up Rule Search
• We can ignore rules outside feature ranges:
– E.g., we never have (egg > 50) in our data.
– These rules can never improve accuracy.
– Restrict thresholds to the range of the features.

• Most of the thresholds give the same score.
– If we never have (0.5 < egg < 1) in the data,
• then (egg < 0.6) and (egg < 0.9) have the same score.
– Restrict thresholds to values in the data.

18

Page 19: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 2)
• Attempt 2 (search only over feature values in the data):

• Now at most 'n' thresholds for each feature.
• We only consider O(nd) rules instead of O(dk) rules:
– Total cost changes from O(ndk) to O(n²d).

Compute score of (eggs > 0)   Compute score of (milk > 0.5)   …
Compute score of (eggs > 1)   Compute score of (milk > 0.7)   …
Compute score of (eggs > 2)   Compute score of (milk > 1)     …
Compute score of (eggs > 3)   Compute score of (milk > 1.25)  …
Compute score of (eggs > 4)   …

19

Page 20: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 3)
• Do we have to compute the score from scratch?
– Rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).

• We can actually compute the best rule involving 'egg' in O(n log n):
– Sort the examples based on 'egg', and use these positions to re-arrange 'y'.
– Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rules.
– With these counts, it's easy to compute the classification accuracy (see bonus slides).

• Sorting costs O(n log n) per feature.
• Total cost of updating counts is O(n) per feature.
• Total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
– O(nd) is the size of the data, so O(nd log n) is the same as looking at the data, up to a log factor.
– We can apply this algorithm to huge datasets.

20

Page 21: CPSC 340: Machine Learning and Data Mining

(pause)

21

Page 22: CPSC 340: Machine Learning and Data Mining

Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.

• Decision trees allow sequences of splits based on multiple features.
– Very general class of models: can get very high accuracy.
– However, it's computationally infeasible to find the best decision tree.

• Most common decision tree learning algorithm in practice (sketched below):
– Greedy recursive splitting.

22
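A minimal sketch of greedy recursive splitting, reusing the hypothetical fit_stump helper from the pseudo-code sketch above:

def fit_tree(X, y, max_depth):
    stump = fit_stump(X, y)                  # best single split at this node
    node = {"stump": stump, "yes": None, "no": None}
    if max_depth <= 1 or stump["feature"] is None:
        return node                          # stop: depth limit, or no split beats the baseline
    satisfied = X[:, stump["feature"]] > stump["threshold"]
    # Recurse on the two smaller datasets created by the split.
    node["yes"] = fit_tree(X[satisfied], y[satisfied], max_depth - 1)
    node["no"]  = fit_tree(X[~satisfied], y[~satisfied], max_depth - 1)
    return node

def predict_tree(node, x):
    stump = node["stump"]
    if stump["feature"] is None:
        return stump["label_yes"]            # baseline leaf: predict the mode
    goes_yes = x[stump["feature"]] > stump["threshold"]
    child = node["yes"] if goes_yes else node["no"]
    if child is None:                        # leaf: use the stump's own labels
        return stump["label_yes"] if goes_yes else stump["label_no"]
    return predict_tree(child, x)

Splitting stops on its own once a leaf's labels are pure, since no stump there can score higher than the baseline rule.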

Page 23: CPSC 340: Machine Learning and Data Mining

Example of Greedy Recursive Splitting
• Start with the full dataset:

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
0    0        0
1    0.6      1
1    0        0
2    0.6      1
0    1        1
2    0        1
0    0.3      0
1    0.6      0
2    0        1

Find the decision stump with the best score:

Split into two smaller datasets based on the stump:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

23

Page 24: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We now have a decision stump and two datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Fit a decision stump to each leaf's data.

24

Page 25: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We now have a decision stump and two datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Fit a decision stump to each leaf's data. Then add these stumps to the tree.

25

Page 26: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
This gives a "depth 2" decision tree: it splits the two datasets into four datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Egg  Milk  …  Sick?
0    0        0
1    0        0
0    0.3      0

Egg  Milk  …  Sick?
2    0        1
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1

Egg  Milk  …  Sick?
1    0.6      0

26

Page 27: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We could try to split the four leaves to make a "depth 3" decision tree:

We might continue splitting until:
- The leaves each have only one label.
- We reach a user-defined maximum depth.

27

Page 28: CPSC 340: Machine Learning and Data Mining

Discussion of Decision Tree Learning
• Advantages:
– Interpretable.
– Fast to learn.
– Very fast to classify.

• Disadvantages:
– Hard to find the optimal set of rules.
– Greedy splitting is often not accurate, and requires very deep trees.

• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make a feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– Is accuracy the best score?
• No, there may be no split that increases accuracy. Alternative: information gain (bonus slides).
– What depth?

28

Page 29: CPSC 340: Machine Learning and Data Mining

Summary
• Supervised learning:
– Using data to write a program based on input/output examples.

• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: a simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.

29

Page 30: CPSC 340: Machine Learning and Data Mining

Other Considerations for Food Allergy Example
• What types of preprocessing might we do?
– Data cleaning: check for and fix missing/unreasonable values.
– Summary statistics:
• Can help identify "unclean" data.
• Correlation might reveal an obvious dependence ("sick" ↔ "peanuts").
– Data transformations:
• Convert everything to the same scale? (e.g., grams)
• Add foods from the day before? (maybe "sick" depends on multiple days)
• Add the date? (maybe what makes you "sick" changes over time)
– Data visualization: look at a scatterplot of each feature and the label.
• Maybe the visualization will show something weird in the features.
• Maybe the pattern is really obvious!

• What you do might depend on how much data you have:
– Very little data:
• Represent food by common allergenic ingredients (lactose, gluten, etc.)?
– Lots of data:
• Use more fine-grained features (bread from bakery vs. hamburger bun)?

30

Page 31: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
0    0        0
1    0.6      1
1    0        0
2    0.6      1
0    1        1
2    0        1
0    0.3      0
1    0.6      0
2    0        1

First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.

Now, we'll go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.

With these counts, the accuracy score is (sum of the most common label above and below) / n.

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

31

Page 32: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

Start with the baseline rule (), which is always "satisfied":
If satisfied, #sick = 5 and #not-sick = 6.
If not satisfied, #sick = 0 and #not-sick = 0.
This gives an accuracy of (6 + 0)/n = 6/11.

Next try the rule (milk > 0), and update the counts based on the first 4 rows:
If satisfied, #sick = 5 and #not-sick = 2.
If not satisfied, #sick = 0 and #not-sick = 4.
This gives an accuracy of (5 + 4)/n = 9/11, which is better.

Next try the rule (milk > 0.3), and update the counts based on this 1 row:
If satisfied, #sick = 5 and #not-sick = 1.
If not satisfied, #sick = 0 and #not-sick = 5.
This gives an accuracy of (5 + 5)/n = 10/11, which is better.
(and keep going until you get to the end…)

32

Page 33: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

Notice that for each row, updating the counts only costs O(1). Since there are O(n) rows, the total cost of updating the counts is O(n).

Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
- Updating the counts still costs O(n), since each row has one label.
- But computing the 'max' across the labels costs O(k), so the cost is O(kn).

With 'k' labels, you can decrease the cost using a "max-heap" data structure:
- Cost of getting the max is O(1); cost of updating the heap for a row is O(log k).
- But k <= n (each row has only one label).
- So the cost is in O(log n) for one row.

Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).

33
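A minimal sketch of this sorted scan for one feature with binary 0/1 labels, assuming NumPy (the function name is illustrative); it reproduces the worked example above:

import numpy as np

def best_threshold(feature, y):
    order = np.argsort(feature)               # O(n log n) sort
    f, ys = feature[order], y[order]
    n = len(ys)
    above = np.array([np.sum(ys == 0), np.sum(ys == 1)])  # counts satisfying the rule
    below = np.zeros(2, dtype=int)                        # counts not satisfying it
    best_acc, best_t = above.max() / n, None              # baseline rule ()
    i = 0
    while i < n:
        t = f[i]
        while i < n and f[i] == t:             # move rows with this value below
            above[ys[i]] -= 1
            below[ys[i]] += 1
            i += 1
        acc = (above.max() + below.max()) / n  # mode on each side; O(1) per row
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
print(best_threshold(milk, sick))  # (0.3, 10/11), matching the slides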

Page 34: CPSC 340: Machine Learning and Data Mining

Can decision trees re-visit a feature?
• Yes.

Knowing (ice cream > 0.3) makes small milk quantities relevant.

34

Page 35: CPSC 340: Machine Learning and Data Mining

Can decision trees have more complicated rules?

• Yes:

• But searching for the best rule can get expensive.

35

Page 36: CPSC 340: Machine Learning and Data Mining

Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.

• Consider that you are allergic to milk (and drink this often), and also get sick when you (rarely) combine Diet Coke with Mentos.

• Greedy method should first split on milk (helps accuracy the most):

36

Page 37: CPSC 340: Machine Learning and Data Mining

Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.

• Consider that you are allergic to milk (and drink this often), and also get sick when you (rarely) combine Diet Coke with Mentos.

• Greedy method should first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later):

37

Page 38: CPSC 340: Machine Learning and Data Mining

Which score function should a decision tree use?

• Shouldn't we just use the accuracy score?
– For leaves: yes, just maximize accuracy.
– For internal nodes: maybe not.
• There may be no simple rule like (egg > 0.5) that improves accuracy.

• Most common score in practice: information gain (sketched below).
– Choose the split that decreases the entropy ("randomness") of the labels the most.
– Motivation: try to make the split data "less random" or "more predictable".
• It might then be easier to find high-accuracy rules on the "less random" split data.

38
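A small sketch of entropy and information gain, assuming NumPy (the names are illustrative, not from the slides):

import numpy as np

def entropy(y):
    # Entropy (in bits) of a label vector.
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, satisfied):
    # Decrease in label entropy from splitting y by a boolean rule.
    n, n_yes = len(y), satisfied.sum()
    children = (n_yes / n) * entropy(y[satisfied]) \
             + ((n - n_yes) / n) * entropy(y[~satisfied])
    return entropy(y) - children

milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
print(information_gain(sick, milk > 0.3))  # entropy drop from this split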

Page 39: CPSC 340: Machine Learning and Data Mining

Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.

• E.g., if in the leaf node we have 5 "sick" objects and 1 "not sick":
– Return p(y = "sick" | x_i) = 5/6 and p(y = "not sick" | x_i) = 1/6.

• In general, a natural estimate of the probabilities at the leaf nodes:
– Let 'n_k' be the number of objects that arrive at leaf node 'k'.
– Let 'n_kc' be the number of times (y == c) in the objects at leaf node 'k'.
– The maximum likelihood estimate for this leaf is p(y = c | x_i) = n_kc / n_k.

39
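A tiny sketch of these maximum likelihood estimates, assuming NumPy:

import numpy as np

y_leaf = np.array(["sick"] * 5 + ["not sick"])   # labels reaching leaf k
labels, counts = np.unique(y_leaf, return_counts=True)
probs = dict(zip(labels, counts / len(y_leaf)))  # p(y = c | x_i) = n_kc / n_k
print(probs)  # {'not sick': 0.166…, 'sick': 0.833…}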

Page 40: CPSC 340: Machine Learning and Data Mining

Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.

• Rules based on a minimum sample size (sketched below):
– Don't split any node where the number of objects is less than some 'm'.
– Don't split any nodes that create children with fewer than 'm' objects.

• These types of rules try to make sure that you have enough data to justify decisions.

• Alternately, you can use a validation set (see next lecture):
– Don't split the node if it decreases an approximation of the test accuracy.
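A hypothetical tweak to the earlier fit_tree sketch implementing the first minimum-sample rule (the 'm' parameter is an assumption; mode and fit_stump are the helpers sketched earlier):

def fit_tree_min_samples(X, y, max_depth, m):
    if len(y) < m or max_depth < 1:
        base = mode(y)        # too little data to justify a split: predict the mode
        leaf = {"feature": None, "threshold": None,
                "label_yes": base, "label_no": base}
        return {"stump": leaf, "yes": None, "no": None}
    node = {"stump": fit_stump(X, y), "yes": None, "no": None}
    stump = node["stump"]
    if stump["feature"] is None:
        return node
    satisfied = X[:, stump["feature"]] > stump["threshold"]
    node["yes"] = fit_tree_min_samples(X[satisfied], y[satisfied], max_depth - 1, m)
    node["no"]  = fit_tree_min_samples(X[~satisfied], y[~satisfied], max_depth - 1, m)
    return node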

40