
Page 1: CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining

Decision Trees

Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1

Page 2: CPSC 340: Machine Learning and Data Mining

Admin
• Assignment 0 is due Wednesday at 9pm (in 2 days).
• Assignment 1 should be released Wednesday, due a week later.
– If you want to work with a partner, you both must request it BEFORE the a1 release.
– Instructions in the Homework Submission Instructions document.

• Important webpages:
– https://www.cs.ubc.ca/getacct/
– https://github.ugrad.cs.ubc.ca/CPSC340-2017W-T2/home
– https://piazza.com/class/j9uk5ecmb7e4ks

• Tutorials and office hours start this week.
– See the course homepage for tutorial topics and the office hours schedule.

• Auditing
– No room for official auditors.
– Unofficial auditors, please do not take seats if others are standing.

2

Page 3: CPSC 340: Machine Learning and Data Mining

Last Time: Data Representation and Exploration
• We discussed the object-feature representation:
– "Examples": another name we'll use for objects.

• We discussed summary statistics and visualizing data.

Age  Job?  City  Rating  Income
23   Yes   Van   A       22,000.00
23   Yes   Bur   BBB     21,000.00
22   No    Van   CC      0.00
25   Yes   Sur   AAA     57,000.00

3

Page 4: CPSC 340: Machine Learning and Data Mining

Motivating Example: Food Allergies
• You frequently start getting an upset stomach.
• You suspect an adult-onset food allergy.

4

Page 5: CPSC 340: Machine Learning and Data Mining

Motivating Example: Food Allergies
• To solve the mystery, you start a food journal:

• But it's hard to find the pattern:
– You can't isolate and only eat one food at a time.
– You may be allergic to more than one food.
– The quantity matters: a small amount may be ok.
– You may be allergic to specific interactions.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

5

Page 6: CPSC 340: Machine Learning and Data Mining

Supervised Learning
• We can formulate this as supervised learning:

• Input for an object (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
– Model predicts whether foods will make you sick (even with new combinations).

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

6

Page 7: CPSC 340: Machine Learning and Data Mining

Supervised Learning
• General supervised learning problem:
– Take features of objects and corresponding labels as inputs.
– Find a model that can accurately predict the labels of new objects.

• This is the most successful machine learning technique:
– Spam filtering, optical character recognition, Microsoft Kinect, speech recognition, classifying tumours, etc.

• We'll first focus on categorical labels, which is called "classification".
– The model is called a "classifier".

7

Page 8: CPSC 340: Machine Learning and Data Mining

Naïve Supervised Learning: "Predict Mode"

• A very naïve supervised learning method:
– Count how many times each label occurred in the data (4 vs. 1 in the table below).
– Always predict the most common label, the "mode" ("sick" below).

• This ignores the features, so it is only accurate if we have only 1 label.
• There is no unique "right" way to use the features (see the sketch after the table below).
– Today we'll consider a classic way known as decision tree learning.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
0    0.7   0     0.3    0          0           1
0.3  0.7   0     0.6    0          0.01        1
0    0     0     0.8    0          0           0
0.3  0.7   1.2   0      0.10       0.01        1
0.3  0     1.2   0.3    0.10       0.01        1

8
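A minimal sketch of the "predict mode" baseline, assuming NumPy (the variable names are illustrative, not from the slides):

import numpy as np

y = np.array([1, 1, 0, 1, 1])           # "Sick?" labels from the table above
labels, counts = np.unique(y, return_counts=True)
mode_label = labels[np.argmax(counts)]  # most common label: 1 ("sick", 4 vs. 1)
# "Predict mode" ignores the features and returns mode_label for every object.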

Page 9: CPSC 340: Machine Learning and Data Mining

Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of "if-else" decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.

• Example decision tree:

if (milk > 0.5) {
  return 'sick'
} else {
  if (egg > 1)
    return 'sick'
  else
    return 'not sick'
}

Can draw sequences of decisions as a tree:

9

Page 10: CPSC 340: Machine Learning and Data Mining

Supervised Learning as Writing A Program
• There are many possible decision trees.
– We're going to search for one that is good at our supervised learning problem.

• So our input is data and the output will be a program.
– This is called "training" the supervised learning model.
– Different than the usual input/output specification for writing a program.

• Supervised learning is useful when you have lots of labeled data BUT:
1. the problem is too complicated to write a program ourselves, or
2. a human expert can't explain why you assign certain labels, or
3. we don't have a human expert for the problem.

10

Page 11: CPSC 340: Machine Learning and Data Mining

Learning A Decision Stump
• We'll start with "decision stumps":
– A simple decision tree with 1 splitting rule based on thresholding 1 feature.

• How do we find the best "rule" (feature, threshold, and leaf labels)?
1. Define a 'score' for the rule.
2. Search for the rule with the best score.

11

Page 12: CPSC 340: Machine Learning and Data Mining

Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
– "If we use this rule, how many objects do we label correctly?"

• Computing classification accuracy for (egg > 1):
– Find the most common labels if we use this rule:
• When (egg > 1), we were "sick" both times.
• When (egg <= 1), we were "not sick" three out of four times.
– Compute accuracy:
• Rule (egg > 1) is correct on 5/6 objects.

• Scores of other rules (see the sketch after the table below):
– (milk > 0.5) obtains a lower accuracy of 4/6.
– (egg > 0) obtains the optimal accuracy of 6/6.
– () obtains the "baseline" accuracy of 3/6, as does (egg > 2).

Egg  Milk  Fish  …  Sick?
1    0.7   0        1
2    0.7   0        1
0    0     0        0
0    0.7   1.2      0
2    0     1.2      1
0    0     0        0

12
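A minimal sketch, assuming NumPy, of computing this accuracy score on the table above (the function name is illustrative):

import numpy as np

egg  = np.array([1, 2, 0, 0, 2, 0])
sick = np.array([1, 1, 0, 0, 1, 0])

def rule_accuracy(feature, y, threshold):
    # A stump predicts the most common label on each side of the split.
    satisfied = feature > threshold
    correct = 0
    for side in (satisfied, ~satisfied):
        if side.any():
            _, counts = np.unique(y[side], return_counts=True)
            correct += counts.max()          # the mode gets these objects right
    return correct / len(y)

print(rule_accuracy(egg, sick, 1))  # 5/6, as on the slide
print(rule_accuracy(egg, sick, 0))  # 6/6, the optimal rule
print(rule_accuracy(egg, sick, 2))  # 3/6, the baseline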

Page 13: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 1)
• Accuracy "score" evaluates the quality of a rule.
– Find the best rule by maximizing the score.

• Attempt 1 (exhaustive search):

• As you go, keep track of the highest score.
• Return the highest-scoring rule (variable, threshold, and leaf values).

Compute score of (egg > 0)      Compute score of (milk > 0)      …
Compute score of (egg > 0.01)   Compute score of (milk > 0.01)   …
Compute score of (egg > 0.02)   Compute score of (milk > 0.02)   …
Compute score of (egg > 0.03)   Compute score of (milk > 0.03)   …
…                               …                                …
Compute score of (egg > 99.99)  Compute score of (milk > 0.99)   …

13

Page 14: CPSC 340: Machine Learning and Data Mining

Supervised Learning Notation (MEMORIZE THIS)

• Feature matrix 'X' has rows as objects, columns as features.
– x_ij is feature 'j' for object 'i' (quantity of food 'j' on day 'i').
– x_i is the list of all features for object 'i' (all the quantities on day 'i').
– x^j is column 'j' of the matrix (the value of feature 'j' across all objects).

• Label vector 'y' contains the labels of the objects.
– y_i is the label of object 'i' (1 for "sick", 0 for "not sick").

(See the indexing sketch after the table below.)

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  Sick?
0    0.7   0     0.3    0          0        1
0.3  0.7   0     0.6    0          0.01     1
0    0     0     0.8    0          0        0
0.3  0.7   1.2   0      0.10       0.01     1
0.3  0     1.2   0.3    0.10       0.01     1

14
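A small sketch, assuming NumPy, of how this notation maps to array indexing (the variable names are illustrative):

import numpy as np

X = np.array([[0,   0.7, 0,   0.3, 0,    0   ],
              [0.3, 0.7, 0,   0.6, 0,    0.01],
              [0,   0,   0,   0.8, 0,    0   ],
              [0.3, 0.7, 1.2, 0,   0.10, 0.01],
              [0.3, 0,   1.2, 0.3, 0.10, 0.01]])
y = np.array([1, 1, 0, 1, 1])

i, j = 1, 3
x_ij = X[i, j]   # feature j of object i (wheat on day i)
x_i  = X[i, :]   # all features of object i
x_j  = X[:, j]   # feature j across all objects (column j)
y_i  = y[i]      # label of object i (1 = "sick")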

Page 15: CPSC 340: Machine Learning and Data Mining

SupervisedLearningNotation(MEMORIZETHIS)

• Trainingphase:– Use‘X’and‘y’tofinda‘model’(likeadecisionstump).

• Predictionphase:– Givenanobjectxi,usethe‘model’topredictalabel‘yhati’ (“sick”or“notsick”).

• Trainingerror:– Fractionoftimesourprediction‘yhati’doesnotequalthetrueyi label.

Egg  Milk  Fish  Wheat  Shellfish  Peanuts  Sick?
0    0.7   0     0.3    0          0        1
0.3  0.7   0     0.6    0          0.01     1
0    0     0     0.8    0          0        0
0.3  0.7   1.2   0      0.10       0.01     1
0.3  0     1.2   0.3    0.10       0.01     1

15
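For example, with NumPy arrays of true labels and predictions (the predictions here are made up for illustration):

import numpy as np
y    = np.array([1, 1, 0, 1, 1])
yhat = np.array([1, 0, 0, 1, 1])   # hypothetical predictions
error = np.mean(yhat != y)         # training error = 1/5 = 0.2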

Page 16: CPSC 340: Machine Learning and Data Mining

Decision Stump Learning Pseudo-Code
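A minimal Python sketch of decision stump learning, in the spirit of Attempt 2 below (candidate thresholds restricted to values in the data). Assumes NumPy; the function and dictionary names are illustrative, not from the slides:

import numpy as np

def mode(a):
    values, counts = np.unique(a, return_counts=True)
    return values[np.argmax(counts)]

def fit_stump(X, y):
    n, d = X.shape
    base = mode(y)
    # Start from the baseline rule (), which always predicts the mode.
    best = {"feature": None, "threshold": None,
            "label_yes": base, "label_no": base,
            "accuracy": np.mean(y == base)}
    for j in range(d):                       # every feature
        for t in np.unique(X[:, j]):         # every threshold seen in the data
            satisfied = X[:, j] > t
            if not satisfied.any() or satisfied.all():
                continue                     # equivalent to the baseline rule
            label_yes, label_no = mode(y[satisfied]), mode(y[~satisfied])
            yhat = np.where(satisfied, label_yes, label_no)
            accuracy = np.mean(yhat == y)
            if accuracy > best["accuracy"]:
                best = {"feature": j, "threshold": t,
                        "label_yes": label_yes, "label_no": label_no,
                        "accuracy": accuracy}
    return best

def predict_stump(stump, x):
    if stump["feature"] is None:             # baseline rule: always the mode
        return stump["label_yes"]
    if x[stump["feature"]] > stump["threshold"]:
        return stump["label_yes"]
    return stump["label_no"]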

16

Page 17: CPSC 340: Machine Learning and Data Mining

Cost of Decision Stumps (Attempt 1)
• How much does this cost?
• Assume we have:
– 'n' objects (days that we measured).
– 'd' features (foods that we measured).
– 'k' thresholds (> 0, > 0.01, > 0.02, …).

• Computing the score of one rule costs O(n):
– We need to go through all 'n' examples.
– See notes on the webpage for a review of "O(n)" notation.

• To compute scores for d*k rules, the total cost is O(ndk).
– But 'k' might be huge.

• Can we do better?

17

Page 18: CPSC 340: Machine Learning and Data Mining

Speeding up Rule Search
• We can ignore rules outside feature ranges:
– E.g., we never have (egg > 50) in our data.
– These rules can never improve accuracy.
– Restrict thresholds to the range of the features.

• Most of the thresholds give the same score.
– If we never have (0.5 < egg < 1) in the data,
• then (egg < 0.6) and (egg < 0.9) have the same score.
– Restrict thresholds to values in the data.

18

Page 19: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 2)
• Attempt 2 (search only over feature values in the data):

• Now at most 'n' thresholds for each feature.
• We only consider O(nd) rules instead of O(dk) rules:
– Total cost changes from O(ndk) to O(n²d).

Compute score of (eggs > 0)   Compute score of (milk > 0.5)   …
Compute score of (eggs > 1)   Compute score of (milk > 0.7)   …
Compute score of (eggs > 2)   Compute score of (milk > 1)     …
Compute score of (eggs > 3)   Compute score of (milk > 1.25)  …
Compute score of (eggs > 4)   …

19

Page 20: CPSC 340: Machine Learning and Data Mining

Decision Stump: Rule Search (Attempt 3)
• Do we have to compute the score from scratch?
– Rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).

• We can actually compute the best rule involving 'egg' in O(n log n):
– Sort the examples based on 'egg', and use these positions to re-arrange 'y'.
– Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rules.
– With these counts, it's easy to compute the classification accuracy (see bonus slides).

• Sorting costs O(n log n) per feature.
• Total cost of updating counts is O(n) per feature.
• Total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
– O(nd) is the size of the data, so O(nd log n) is the same as looking at the data, up to a log factor.
– We can apply this algorithm to huge datasets.

20

Page 21: CPSC 340: Machine Learning and Data Mining

(pause)

21

Page 22: CPSC 340: Machine Learning and Data Mining

Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.

• Decision trees allow sequences of splits based on multiple features.
– Very general class of models: can get very high accuracy.
– However, it's computationally infeasible to find the best decision tree.

• Most common decision tree learning algorithm in practice (sketched below):
– Greedy recursive splitting.

22
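A minimal sketch of greedy recursive splitting, reusing the hypothetical fit_stump helper from the pseudo-code sketch above:

def fit_tree(X, y, max_depth):
    stump = fit_stump(X, y)                  # best single split at this node
    node = {"stump": stump, "yes": None, "no": None}
    if max_depth <= 1 or stump["feature"] is None:
        return node                          # stop: depth limit, or no split beats the baseline
    satisfied = X[:, stump["feature"]] > stump["threshold"]
    # Recurse on the two smaller datasets created by the split.
    node["yes"] = fit_tree(X[satisfied], y[satisfied], max_depth - 1)
    node["no"]  = fit_tree(X[~satisfied], y[~satisfied], max_depth - 1)
    return node

def predict_tree(node, x):
    stump = node["stump"]
    if stump["feature"] is None:
        return stump["label_yes"]            # baseline leaf: predict the mode
    goes_yes = x[stump["feature"]] > stump["threshold"]
    child = node["yes"] if goes_yes else node["no"]
    if child is None:                        # leaf: use the stump's own labels
        return stump["label_yes"] if goes_yes else stump["label_no"]
    return predict_tree(child, x)

Splitting stops on its own once a leaf's labels are pure, since no stump there can score higher than the baseline rule.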

Page 23: CPSC 340: Machine Learning and Data Mining

Example of Greedy Recursive Splitting
• Start with the full dataset:

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
0    0        0
1    0.6      1
1    0        0
2    0.6      1
0    1        1
2    0        1
0    0.3      0
1    0.6      0
2    0        1

Find the decision stump with the best score:

Split into two smaller datasets based on the stump:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

23

Page 24: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We now have a decision stump and two datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Fit a decision stump to each leaf's data.

24

Page 25: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We now have a decision stump and two datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Fit a decision stump to each leaf's data. Then add these stumps to the tree.

25

Page 26: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
This gives a "depth 2" decision tree: it splits the two datasets into four datasets:

Egg  Milk  …  Sick?
0    0        0
1    0        0
2    0        1
0    0.3      0
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1
0    1        1
1    0.6      0

Egg  Milk  …  Sick?
0    0        0
1    0        0
0    0.3      0

Egg  Milk  …  Sick?
2    0        1
2    0        1

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
1    0.6      1
2    0.6      1

Egg  Milk  …  Sick?
1    0.6      0

26

Page 27: CPSC 340: Machine Learning and Data Mining

Greedy Recursive Splitting
We could try to split the four leaves to make a "depth 3" decision tree:

We might continue splitting until:
- The leaves each have only one label.
- We reach a user-defined maximum depth.

27

Page 28: CPSC 340: Machine Learning and Data Mining

Discussion of Decision Tree Learning
• Advantages:
– Interpretable.
– Fast to learn.
– Very fast to classify.

• Disadvantages:
– Hard to find the optimal set of rules.
– Greedy splitting is often not accurate, and requires very deep trees.

• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make a feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– Is accuracy the best score?
• No, there may be no split that increases accuracy. Alternative: information gain (bonus slides).
– What depth?

28

Page 29: CPSC 340: Machine Learning and Data Mining

Summary
• Supervised learning:
– Using data to write a program based on input/output examples.

• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: a simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.

29

Page 30: CPSC 340: Machine Learning and Data Mining

Other Considerations for Food Allergy Example
• What types of preprocessing might we do?
– Data cleaning: check for and fix missing/unreasonable values.
– Summary statistics:
• Can help identify "unclean" data.
• Correlation might reveal an obvious dependence ("sick" ↔ "peanuts").
– Data transformations:
• Convert everything to the same scale? (e.g., grams)
• Add foods from the day before? (maybe "sick" depends on multiple days)
• Add the date? (maybe what makes you "sick" changes over time)
– Data visualization: look at a scatterplot of each feature and the label.
• Maybe the visualization will show something weird in the features.
• Maybe the pattern is really obvious!

• What you do might depend on how much data you have:
– Very little data:
• Represent food by common allergenic ingredients (lactose, gluten, etc.)?
– Lots of data:
• Use more fine-grained features (bread from bakery vs. hamburger bun)?

30

Page 31: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

Egg  Milk  …  Sick?
0    0.7      1
1    0.7      1
0    0        0
1    0.6      1
1    0        0
2    0.6      1
0    1        1
2    0        1
0    0.3      0
1    0.6      0
2    0        1

First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.

Now, we'll go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.

With these counts, the accuracy score is (sum of the most common label above and below) / n.

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

31

Page 32: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

Start with the baseline rule (), which is always "satisfied":
If satisfied, #sick = 5 and #not-sick = 6.
If not satisfied, #sick = 0 and #not-sick = 0.
This gives an accuracy of (6 + 0)/n = 6/11.

Next try the rule (milk > 0), and update the counts based on the first 4 rows:
If satisfied, #sick = 5 and #not-sick = 2.
If not satisfied, #sick = 0 and #not-sick = 4.
This gives an accuracy of (5 + 4)/n = 9/11, which is better.

Next try the rule (milk > 0.3), and update the counts based on this 1 row:
If satisfied, #sick = 5 and #not-sick = 1.
If not satisfied, #sick = 0 and #not-sick = 5.
This gives an accuracy of (5 + 5)/n = 10/11, which is better.
(and keep going until you get to the end…)

32

Page 33: CPSC 340: Machine Learning and Data Mining

How do we fit stumps in O(nd log n)?

Milk  Sick?
0     0
0     0
0     0
0     0
0.3   0
0.6   1
0.6   1
0.6   0
0.7   1
0.7   1
1     1

Notice that for each row, updating the counts only costs O(1). Since there are O(n) rows, the total cost of updating the counts is O(n).

Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
- Updating the counts still costs O(n), since each row has one label.
- But computing the 'max' across the labels costs O(k), so the cost is O(kn).

With 'k' labels, you can decrease the cost using a "max-heap" data structure:
- Cost of getting the max is O(1); cost of updating the heap for a row is O(log k).
- But k <= n (each row has only one label).
- So the cost is in O(log n) for one row.

Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).

33
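A minimal sketch of this sorted scan for one feature with binary 0/1 labels, assuming NumPy (the function name is illustrative); it reproduces the worked example above:

import numpy as np

def best_threshold(feature, y):
    order = np.argsort(feature)               # O(n log n) sort
    f, ys = feature[order], y[order]
    n = len(ys)
    above = np.array([np.sum(ys == 0), np.sum(ys == 1)])  # counts satisfying the rule
    below = np.zeros(2, dtype=int)                        # counts not satisfying it
    best_acc, best_t = above.max() / n, None              # baseline rule ()
    i = 0
    while i < n:
        t = f[i]
        while i < n and f[i] == t:             # move rows with this value below
            above[ys[i]] -= 1
            below[ys[i]] += 1
            i += 1
        acc = (above.max() + below.max()) / n  # mode on each side; O(1) per row
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
print(best_threshold(milk, sick))  # (0.3, 10/11), matching the slides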

Page 34: CPSC 340: Machine Learning and Data Mining

Can decision trees re-visit a feature?
• Yes.

Knowing (ice cream > 0.3) makes small milk quantities relevant.

34

Page 35: CPSC 340: Machine Learning and Data Mining

Can decision trees have more complicated rules?

• Yes:

• But searching for the best rule can get expensive.

35

Page 36: CPSC 340: Machine Learning and Data Mining

Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.

• Consider that you are allergic to milk (and drink this often), and also get sick when you (rarely) combine Diet Coke with Mentos.

• Greedy method should first split on milk (helps accuracy the most):

36

Page 37: CPSC 340: Machine Learning and Data Mining

Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.

• Consider that you are allergic to milk (and drink this often), and also get sick when you (rarely) combine Diet Coke with Mentos.

• Greedy method should first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later):

37

Page 38: CPSC 340: Machine Learning and Data Mining

Which score function should a decision tree use?

• Shouldn't we just use the accuracy score?
– For leaves: yes, just maximize accuracy.
– For internal nodes: maybe not.
• There may be no simple rule like (egg > 0.5) that improves accuracy.

• Most common score in practice: information gain (sketched below).
– Choose the split that decreases the entropy ("randomness") of the labels the most.
– Motivation: try to make the split data "less random" or "more predictable".
• It might then be easier to find high-accuracy rules on the "less random" split data.

38
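A small sketch of entropy and information gain, assuming NumPy (the names are illustrative, not from the slides):

import numpy as np

def entropy(y):
    # Entropy (in bits) of a label vector.
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, satisfied):
    # Decrease in label entropy from splitting y by a boolean rule.
    n, n_yes = len(y), satisfied.sum()
    children = (n_yes / n) * entropy(y[satisfied]) \
             + ((n - n_yes) / n) * entropy(y[~satisfied])
    return entropy(y) - children

milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
print(information_gain(sick, milk > 0.3))  # entropy drop from this split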

Page 39: CPSC 340: Machine Learning and Data Mining

Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.

• E.g., if in the leaf node we have 5 "sick" objects and 1 "not sick":
– Return p(y = "sick" | x_i) = 5/6 and p(y = "not sick" | x_i) = 1/6.

• In general, a natural estimate of the probabilities at the leaf nodes:
– Let 'n_k' be the number of objects that arrive at leaf node 'k'.
– Let 'n_kc' be the number of times (y == c) in the objects at leaf node 'k'.
– The maximum likelihood estimate for this leaf is p(y = c | x_i) = n_kc / n_k.

39
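A tiny sketch of these maximum likelihood estimates, assuming NumPy:

import numpy as np

y_leaf = np.array(["sick"] * 5 + ["not sick"])   # labels reaching leaf k
labels, counts = np.unique(y_leaf, return_counts=True)
probs = dict(zip(labels, counts / len(y_leaf)))  # p(y = c | x_i) = n_kc / n_k
print(probs)  # {'not sick': 0.166…, 'sick': 0.833…}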

Page 40: CPSC 340: Machine Learning and Data Mining

Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.

• Rules based on a minimum sample size (sketched below):
– Don't split any node where the number of objects is less than some 'm'.
– Don't split any nodes that create children with fewer than 'm' objects.

• These types of rules try to make sure that you have enough data to justify decisions.

• Alternately, you can use a validation set (see next lecture):
– Don't split the node if it decreases an approximation of the test accuracy.
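A hypothetical tweak to the earlier fit_tree sketch implementing the first minimum-sample rule (the 'm' parameter is an assumption; mode and fit_stump are the helpers sketched earlier):

def fit_tree_min_samples(X, y, max_depth, m):
    if len(y) < m or max_depth < 1:
        base = mode(y)        # too little data to justify a split: predict the mode
        leaf = {"feature": None, "threshold": None,
                "label_yes": base, "label_no": base}
        return {"stump": leaf, "yes": None, "no": None}
    node = {"stump": fit_stump(X, y), "yes": None, "no": None}
    stump = node["stump"]
    if stump["feature"] is None:
        return node
    satisfied = X[:, stump["feature"]] > stump["threshold"]
    node["yes"] = fit_tree_min_samples(X[satisfied], y[satisfied], max_depth - 1, m)
    node["no"]  = fit_tree_min_samples(X[~satisfied], y[~satisfied], max_depth - 1, m)
    return node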

40