Lecture 4: Supervised Learning


Page 1

Lecture 4: Supervised Learning

Page 2

Quick Recap

• Graphical models
• Inference
• Decoding for models of structure

• Finally, we get to learning.
  – Today, assume a collection of N pairs (x, y); supervised learning with complete data.

Page 3

Loss

• Let h be a hypothesis (an instantiated, predictive model).

• loss(x, y; h) = a measure of how badly h performs on input x if y is the correct output.

• How to decide what "loss" should be?
  1. computational expense
  2. knowledge of actual costs of errors
  3. formal foundations enabling theoretical guarantees

Page 4

Risk

• There is some true distribution p* over input/output pairs (X, Y).

• Under that distribution, what do we expect h's loss to be?

• We don't have p*, but we have the empirical distribution, giving empirical risk:

  (1/N) Σ_{i=1..N} loss(x_i, y_i; h)

Page 5

Empirical Risk Minimization

• Provides a criterion to decide on h:

  h* ∈ argmin_h (1/N) Σ_{i=1..N} loss(x_i, y_i; h)

• Background preferences over h can be included in regularized empirical risk minimization:

  h* ∈ argmin_h (1/N) Σ_{i=1..N} loss(x_i, y_i; h) + R(h)
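Below is a minimal sketch of these two objectives in Python. The function names (`empirical_risk`, `regularized_objective`) and the squared-error example are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def empirical_risk(w, data, loss_fn):
    """Average loss of the predictor parameterized by w over N pairs (x, y)."""
    return sum(loss_fn(x, y, w) for x, y in data) / len(data)

def regularized_objective(w, data, loss_fn, regularizer, lam=0.1):
    """Regularized empirical risk: average loss plus a penalty R(w)."""
    return empirical_risk(w, data, loss_fn) + lam * regularizer(w)

# Illustrative choices: squared error as the loss, squared L2 norm as R(w).
squared_error = lambda x, y, w: (np.dot(w, x) - y) ** 2
l2_penalty = lambda w: np.dot(w, w)
```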

Page 6

Parametric Assumptions

• Typically we do not move in "h-space," but rather in the space of continuously parameterized predictors.

Page 7

Three Kinds of Loss Functions

• Error
  – Could be zero-one, or task-specific.
  – Mean squared error makes sense for continuous predictions and is used in regression.

• Log loss
  – Probabilistic interpretation ("likelihood")

• Hinge loss
  – Geometric interpretation ("margin")

Page 8

Log Loss (First Version)

• Maximum likelihood estimation: R(w) is 0 for models in the family, +∞ for other models.

• Maximum a posteriori (MAP) estimation: R(w) is −log p(w).

• Often called generative modeling.

loss(x, y; h_w) = −log p_w(x, y)

Page 9

Log Loss (First Version)

Examples:
• N-gram language models
• Supervised HMM taggers
• Charniak, Collins, and Stanford parsers

Page 10

Log Loss (First Version)

Computationally...
• Convex and differentiable.
• Closed form for directed, multinomial-based models p_w.
  – Count and normalize! (see the sketch below)
• In other cases, requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).
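As a concrete illustration of "count and normalize," here is a small sketch of the closed-form MLE for one multinomial in a generative tagger, the emission distribution p(word | tag); the corpus format is an assumption made for the example:

```python
from collections import Counter, defaultdict

def count_and_normalize(tagged_corpus):
    """Closed-form MLE for a multinomial emission model p(word | tag):
    count each (tag, word) event, then normalize the counts within each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[tag][word] += 1
    return {tag: {word: c / sum(ctr.values()) for word, c in ctr.items()}
            for tag, ctr in counts.items()}

# count_and_normalize([("the", "DET"), ("dog", "NOUN"), ("a", "DET")])
# -> {"DET": {"the": 0.5, "a": 0.5}, "NOUN": {"dog": 1.0}}
```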

Page 11

Log Loss (First Version)

Error...
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples.

Page 12

Log Loss (First Version)

Guarantees...
• Consistency: if the true model is in the right family, enough data will lead you to it.

Page 13

Log Loss (First Version)

Different parameterizations...

• Multinomials (BN-like):
• Global log-linear (MN-like):
• Locally normalized log-linear:

Page 14

Reflections on Generative Models

• Most early solutions are generative.
• Most unsupervised approaches are generative.
• Some people only believe in generative models.
• Sometimes estimators are not as easy as they seem ("deficiency").
• Start here if there's a sensible generative story.
  – You can always use a "better" loss function with the same linear model later on.

Page 15

A Diatribe on "Deficiency"

• We use the term "deficiency" to refer to the ability of generative models to leak probability mass onto ill-formed outcomes.
  – Words on top of each other in IBM MT models.
• If our models were unable to generate ill-formed outcomes, we'd have solved NLP!
  – "Ill-formed" is in the eye of the beholder.
• The problem is when you add "filtering" steps in the generative story and don't account for them in estimation.
  – It's your estimator that is "deficient," not your model.

Page 16

Zero-One Loss

loss(x, y; h) = 0 if h(x) = y, and 1 otherwise

Page 17

Zero-One Loss

Computationally:
• Piecewise constant.

Error: built in (the loss is the error itself).

Guarantees: none.

Page 18

Error as Loss

Generalizes zero-one, with the same difficulties.
Example: BLEU-score maximization in machine translation, with "MERT" line search.

Page 19

Comparison

                   Generative (Log Loss)     Error as Loss
Computation        Convex optimization.      Optimizing a piecewise-constant function.
Error-awareness    None.                     Built in.
Guarantees         Consistency.              None.

Page 20

Discriminative Learning

• Various loss functions between log loss and error.

• Three commonly used in NLP:
  – Conditional log loss ("maxent," CRFs)
  – Hinge loss (structural SVMs)
  – Perceptron's loss

• We'll discuss each, compare, and unify.

Page 21

Log Loss (Second Version)

• Can be understood as a generative model over Y, but does not model X.
  – "Conditional" model

loss(x, y; h_w) = −log p_w(y | x)

Page 22

Log Loss (Second Version)

Examples:
• Logistic regression (for classification)
• MEMMs
• CRFs

loss(x, y; h_w) = −log p_w(y | x)

Page 23

Log Loss (Second Version)

Computationally...
• Convex and differentiable.
• Requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).

loss(x, y; h_w) = −log p_w(y | x)

Page 24

Log Loss (Second Version)

Error...
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples' correct outputs.

loss(x, y; h_w) = −log p_w(y | x)

Page 25

Log Loss (Second Version)

Guarantees...
• Consistency: if the true conditional model is in the right family, enough data will lead you to it.

loss(x, y; h_w) = −log p_w(y | x)

Page 26

Log Loss (Second Version)

Different parameterizations...

• Global log-linear (CRF):

  loss(x, y; h_w) = −w⊤g(x, y) + log Σ_{y′} exp( w⊤g(x, y′) )

• Locally normalized log-linear (MEMM):

loss(x, y; h_w) = −log p_w(y | x)
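A small sketch of the global log-linear (CRF-style) loss above, computed by brute-force enumeration of the candidate outputs. The feature function `g` (returning a NumPy vector) and the explicit output set are assumptions for illustration; a structured model would compute the log partition function with dynamic programming instead:

```python
import numpy as np

def conditional_log_loss(w, g, x, y, all_outputs):
    """-w.g(x,y) + log sum_{y'} exp(w.g(x,y')), by enumerating candidates."""
    scores = np.array([w @ g(x, y_prime) for y_prime in all_outputs])
    log_z = np.logaddexp.reduce(scores)  # log of the partition function
    return -(w @ g(x, y)) + log_z
```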

Page 27

Comparing the Two Log Losses

                    −log p_w(x, y)                     −log p_w(y | x)
Parameterization    Usually multinomials (BN-like).    Almost always log-linear (MN-like).

Under the usual parameterization...

Computation         Count and normalize.               Convex optimization.
Error-awareness     None.                              Aware of the Y-prediction task (approximates zero-one).
Guarantees          Consistency of joint.              Consistency of conditional.

Page 28

Hinge Loss

• Penalizes the model for letting competitors get close to the correct answer y.
  – Can penalize them in proportion to their error.

loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]

Page 29

Hinge Loss

Examples...
• Perceptron (including Collins' structured version; see the sketch below)
  – Classic version ignores the error term
• SVM and some structured variants:
  – Max-margin Markov networks (Taskar et al.)
  – MIRA (1-best, k-best)

loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]

Page 30

Hinge Loss

Computationally...
• Convex, not everywhere differentiable.
  – Many specialized techniques now available.
• Requires MAP or "cost-augmented" MAP inference.
• Linear decoding.

loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]

Page 31

Hinge Loss

Error...
• Built in.
• Most convenient when the error function factors similarly to the features g.

loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]

Page 32

Hinge Loss

Guarantees...
• Generalization bounds.
  – Not clear how seriously to take these in NLP; they may not be tight enough to be meaningful.
• Often you will find convergence guarantees for optimization techniques.

loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]

Page 33

They Are All Related

                                           β    γ
Conditional log loss                       1    0
Perceptron's hinge loss                    ∞    0
Structural SVM's hinge loss                ∞    > 0
Softmax-margin (Gimpel and Smith, 2010)    1    1

loss(x, y; h_w) = (1/β) log Σ_{y′} exp( β [ w⊤( g(x, y′) − g(x, y) ) + γ error(y′, y) ] )
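A sketch of this unified loss as a function of β and γ, again by enumeration (interfaces assumed as in the earlier sketches). Setting β = 1, γ = 0 gives conditional log loss and β = 1, γ = 1 gives softmax-margin; the hinge losses are the limit as β grows large, so a finite β here only approximates them:

```python
import numpy as np

def unified_loss(w, g, x, y, all_outputs, error, beta=1.0, gamma=0.0):
    """(1/beta) * log sum_{y'} exp(beta * [w.(g(x,y') - g(x,y)) + gamma * error(y', y)])."""
    terms = np.array([beta * (w @ (g(x, y_prime) - g(x, y)) + gamma * error(y_prime, y))
                      for y_prime in all_outputs])
    return np.logaddexp.reduce(terms) / beta
```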

Page 34

CRFs, Max Margin, or Perceptron?

• For supervised problems, we do not expect large differences.

• Perceptron is easiest to implement.
  – With cost-augmented inference, it should get better and begins to approach MIRA and M3Ns.

• CRFs are best for probability fetishists.
  – Probably most appropriate if you are extending with latent variables; the jury is out.

• Not yet "plug and play."

Page 35

R(w)

• Regularization term: avoid overfitting
  – Usually means "avoid large magnitudes in w"

• (Log) prior: respect background beliefs about the predictor h_w

Page 36

R(w)

• Usual starting point: squared L2 norm
  – Computationally convenient (it's strongly convex, it is its own Fenchel conjugate, ...)
  – Probabilistic view: Gaussian prior on weights (Chen and Rosenfeld, 2000)
  – Geometric view: Euclidean distance (the original regularization method in SVMs)
  – Only one hyperparameter

R(w) = λ‖w‖₂² = λ Σ_j w_j²

Page 37

R(w)

• Another option: the L1 norm
  – Computationally less convenient (not everywhere differentiable)
  – Probabilistic view: Laplacian prior on weights (originally proposed as the "lasso" in regression)
  – Sparsity inducing ("free" feature selection)

R(w) = λ‖w‖₁ = λ Σ_j |w_j|
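The two regularizers side by side, as a tiny sketch (λ is the hyperparameter; the function names are illustrative):

```python
import numpy as np

def l2_regularizer(w, lam):
    """Squared L2 norm: lam * sum_j w_j^2 (Gaussian prior view)."""
    return lam * np.sum(w ** 2)

def l1_regularizer(w, lam):
    """L1 norm: lam * sum_j |w_j| (Laplacian prior view; induces sparsity)."""
    return lam * np.sum(np.abs(w))
```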

Page 38

R(w)

• Lots of attention to this in machine learning.
• "Structured sparsity"
  – Want groups of features to go to zero, or group-internal sparsity, or ...
• Interpolation between L1 and L2: the "elastic net"
  – Sparsity, but maybe better behaved
• This is not yet "plug and play."
  – The optimization algorithm is heavily affected.

Page 39

MAP Learning is Inference

• Seeking the "most probable explanation" of the data, in terms of w.
  – Explain the data: p(x, y | w)
  – Not too surprising: p(w)

• If we think of "W" as another random variable, MAP learning is MAP inference.
  – Looks very different from decoding!
  – But at a high level of abstraction, it is the same.

Page 40

MAP Learning as a Graphical Model

[Figure: graphical model with nodes w, Y, and X; exp(−R(w)) = p(w), with factors p_w(Y) and p_w(X | Y).]

• This is a view of learning a "noisy channel" model.

Page 41

MAP Learning as a Graphical Model

[Figure: graphical model with nodes w, Y, and X; exp(−R(w)) = p(w), with a single factor p_w(Y | X).]

• This is a view of learning in a CRF.

Page 42

MAP Estimation for CRFs

max_w p(w | x, y), which is MAP inference

Iterate to obtain the gradient: sufficient statistics from p(y | x, w), obtained by posterior inference.
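A sketch of the gradient this slide alludes to, for a global log-linear model with an enumerable output set: the gradient of −log p_w(y | x) with respect to w is the expected feature vector under the model minus the observed feature vector (the sufficient statistics). Enumeration here stands in for posterior inference such as forward-backward; the interfaces are assumptions for illustration:

```python
import numpy as np

def crf_loss_gradient(w, g, x, y, all_outputs):
    """Gradient of -log p_w(y | x): E_{p_w(y'|x)}[g(x, y')] - g(x, y).
    (The gradient of the regularizer R(w) would be added separately.)"""
    scores = np.array([w @ g(x, y_prime) for y_prime in all_outputs])
    probs = np.exp(scores - np.logaddexp.reduce(scores))  # posterior p_w(y' | x)
    expected_features = sum(p * g(x, y_prime)
                            for p, y_prime in zip(probs, all_outputs))
    return expected_features - g(x, y)
```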

Page 43

How to Think About Optimization

• Depending on your choice of loss and R, different approaches become available.
  – Learning algorithms can interact with inference/decoding algorithms, too.

• In NLP today, it is probably more important to focus on the features, error function, and prior knowledge.
  – Decide what you want, and then use the best available optimization technique.

Page 44

Key Techniques

• Quasi-Newton: batch methods for differentiable loss functions
  – L-BFGS; OWL-QN when using L1 regularization

• Stochastic subgradient ascent: online (see the sketch below)
  – Generalizes perceptron, MIRA, stochastic gradient ascent
  – Sometimes sensitive to step size
  – Can often use "mini-batches" to speed up convergence

• For error minimization: randomization
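A minimal sketch of the online option, written as stochastic subgradient descent on a regularized loss (equivalently, ascent on its negation). The `subgradient` callback, the fixed epoch count, and the simple decaying step size are assumptions for illustration, not prescriptions from the lecture:

```python
import numpy as np

def stochastic_subgradient_train(data, subgradient, dim, step=0.1, epochs=5):
    """Online learning: one noisy subgradient step per training example.
    With the structured hinge loss (no regularizer, unit step) this family
    includes the perceptron update as a special case."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            w -= step * subgradient(w, x, y)  # subgradient of loss(x,y;w) + R(w)
        step *= 0.9  # simple step-size decay
    return w
```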

Page 45

Pitfalls

• Engineering online learning procedures is tempting and may help you get better performance.
  – Without at least some analysis in terms of loss, error, and regularization, it's unlikely to be useful outside your problem/dataset.

• When randomization is involved, look at variance across runs (Clark et al., 2011).

• Always tune hyperparameters (e.g., regularization strength) on development data!

Page 46

Major Topics in Current Work

• Coping with approximate inference
• Exploiting incomplete data
  – Semisupervised learning
  – Creating features from raw text
  – Latent variable models (discussed tomorrow)
• Feature management
  – Structured sparsity (R)