lecture 16 ml dt - github pages · lecture 16: intro to ml and decision trees theodoros rekatsinas...

49
CS639: Data Management for Data Science Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by Ankur Goswami many slides from David Sontag) 1

Upload: others

Post on 21-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

CS639:DataManagementfor

DataScienceLecture16:IntrotoMLandDecisionTrees

TheodorosRekatsinas(lecturebyAnkur Goswami manyslidesfromDavidSontag)

1

Page 2: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Today’sLecture

1. IntrotoMachineLearning

2. TypesofMachineLearning

3. DecisionTrees

2

Page 3: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

1. IntrotoMachineLearning

3

Page 4: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

WhatisMachineLearning?

• “Learningisanyprocessbywhichasystemimprovesperformancefromexperience”– HerbertSimon

• DefinitionbyTomMitchell(1998):MachineLearningisthestudyofalgorithmsthat• ImprovetheirperformanceP• atsometaskT• withexperienceEAwell-definedlearningtaskisgivenby<P,T,E>.

Page 5: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

WhatisMachineLearning?

MachineLearningisthestudyofalgorithmsthat• ImprovetheirperformanceP• atsometaskT• withexperienceE

Awell-definedlearningtaskisgivenby<P,T,E>.

Experience:data-driventask,thusstatistics,probabilityExample:useheightandweighttopredictgender

Page 6: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Whendoweusemachinelearning?

MLisusedwhen:• Humanexpertisedoesnotexist(navigatingonMars)• Humanscan’texplaintheirexpertise(speechrecognition)• Modelsmustbecustomized(personalizedmedicine)• Modelsarebasedonhugeamountsofdata(genomics)

Page 7: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Ataskthatrequiresmachinelearning

Whatmakesahanddrawingbe2?

Page 8: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Modernmachinelearning:Autonomouscars

Page 9: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Modernmachinelearning:SceneLabeling

Page 10: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Modernmachinelearning:SpeechRecognition

Page 11: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

2.TypesofMachineLearning

11

Page 12: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

TypesofLearning

• Supervised(inductive)learning• Given:trainingdata+desiredoutputs(labels)

• Unsupervisedlearning• Given:trainingdata(withoutdesiredoutputs)

• Semi-supervisedlearning• Given:trainingdata+afewdesiredoutputs

• Reinforcementlearning• Rewardsfromsequenceofactions

Page 13: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

SupervisedLearning:Regression

• Given• Learnafunctionf(x)topredictygivenx• yisreal-valued==regression

Page 14: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

SupervisedLearning:Classification

• Given• Learnafunctionf(x)topredictygivenx• yiscategorical==regression

Page 15: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

SupervisedLearning:Classification

• Given• Learnafunctionf(x)topredictygivenx• yiscategorical==regression

Page 16: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

SupervisedLearning

• Value xcanbemulti-dimensional.• Eachdimensioncorrespondstoanattribute

Page 17: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

TypesofLearning

• Supervised(inductive)learning• Given:trainingdata+desiredoutputs(labels)

• Unsupervisedlearning• Given:trainingdata(withoutdesiredoutputs)

• Semi-supervisedlearning• Given:trainingdata+afewdesiredoutputs

• Reinforcementlearning• Rewardsfromsequenceofactions

Wewillcoverlaterintheclass

Page 18: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

3.DecisionTrees

18

Page 19: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Alearningproblem:predictfuelefficiency

Page 20: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Hypotheses:decisiontreesf:X→Y

InformalAhypothesisisacertainfunctionthatwebelieve(orhope)issimilartothetruefunction,the targetfunction thatwewanttomodel.

Page 21: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

WhatfunctionscanDecisionTreesrepresent?

Page 22: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Spaceofpossibledecisiontrees

• Howwillwechoosethebestone?• Letsfirstlookathowtosplitnodes,thenconsiderhowtofindthebesttree

Page 23: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Whatisthesimplesttree?

• Alwayspredictmpg=bad• Wejusttakethemajorityclass

• Isthisagoodtree?• Weneedtoevaluateitsperformance

• Performance: Wearecorrecton22examplesandincorrecton18examples

Page 24: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Adecisionstump

Page 25: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Recursivestep

Page 26: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Recursivestep

Page 27: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Secondleveloftree

Page 28: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro
Page 29: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Arealldecisiontreesequal?

• Manytreescanrepresentthesameconcept• But,notalltreeswillhavethesamesize!• e.g., φ = ( A∧ B)∨(¬A∧ C) -- ((A and B) or ( not A and C))

• Whichtreedoweprefer?

Page 30: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Learningdecisiontreesishard

• Learningthesimplest(smallest)decisiontreeisanNP-completeproblem[Hyafil &Rivest ’76]• Resorttoagreedyheuristic:• Startfromemptydecisiontree• Splitonnextbestattribute(feature)• Recurse

Page 31: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Splitting:choosingagoodattribute

Page 32: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Measuringuncertainty

• Goodsplitifwearemorecertainaboutclassificationaftersplit• Deterministicgood(alltrueorallfalse)• Uniformdistributionbad• Whataboutdistributionsinbetween?

Page 33: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Entropy

Page 34: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

High,LowEntropy

Page 35: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

EntropyExample

Page 36: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

ConditionalEntropy

Page 37: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Informationgain

Page 38: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Learningdecisiontrees

Page 39: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro
Page 40: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Adecisionstump

Page 41: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro
Page 42: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro
Page 43: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro
Page 44: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

BaseCases:AnIdea

• BaseCaseOne: Ifallrecordsincurrentdatasubsethavethesameoutputthendonotrecurse• BaseCaseTwo: Ifallrecordshaveexactlythesamesetofinputattributesthendonotrecurse

Page 45: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

TheproblemwithBaseCase3

Page 46: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

IfweomitBaseCase3

Page 47: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Summary:BuildingDecisionTrees

Page 48: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Fromcategoricaltoreal-valuedattributes

Page 49: Lecture 16 ML DT - GitHub Pages · Lecture 16: Intro to ML and Decision Trees Theodoros Rekatsinas (lecture by AnkurGoswamimany slides from David Sontag) 1. Today’s Lecture 1. Intro

Whatyouneedtoknowaboutdecisiontrees

• DecisiontreesareoneofthemostpopularMLtools• Easytounderstand,implement,anduse• Computationallycheap(tosolveheuristically)

• Informationgaintoselectattributes• Presentedforclassificationbutcanbeusedforregressionanddensityestimationtoo• Decisiontreeswilloverfit!!!• Wewillseethedefinitionofoverfittingandrelatedconceptslaterinclass.