
Page 1: Probability Theory for Machine Learning

Chris Cremer, September 2015

Page 2: Outline

• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
• Least Squares Demo

Page 3: Material

• Pattern Recognition and Machine Learning, Christopher M. Bishop
• All of Statistics, Larry Wasserman
• Wolfram MathWorld
• Wikipedia

Page 4: Motivation

• Uncertainty arises through:
  • Noisy measurements
  • Finite size of data sets
  • Ambiguity: the word "bank" can mean (1) a financial institution, (2) the side of a river, or (3) tilting an airplane. Which meaning was intended, based on the words that appear nearby?
  • Limited model complexity
• Probability theory provides a consistent framework for the quantification and manipulation of uncertainty
• It allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous

Page 5: Sample Space

• The sample space Ω is the set of possible outcomes of an experiment. Points ω in Ω are called sample outcomes, realizations, or elements. Subsets of Ω are called events.
• Example: if we toss a coin twice, then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = ∅
  • Example: the first flip being heads and the first flip being tails

Page 6: Probability

• We will assign a real number P(A) to every event A, called the probability of A.
• To qualify as a probability, P must satisfy three axioms:
  • Axiom 1: P(A) ≥ 0 for every A
  • Axiom 2: P(Ω) = 1
  • Axiom 3: If A1, A2, ... are disjoint, then $P\big(\bigcup_{i=1}^{\infty} A_i\big) = \sum_{i=1}^{\infty} P(A_i)$

Page 7: Joint and Conditional Probabilities

• Joint probability: P(X, Y), the probability of X and Y
• Conditional probability: P(X | Y), the probability of X given Y

Page 8: Independent and Conditional Probabilities

• Assuming that P(B) > 0, the conditional probability of A given B is P(A | B) = P(AB) / P(B)
• Product rule: P(AB) = P(A | B) P(B) = P(B | A) P(A)
• Two events A and B are independent if P(AB) = P(A) P(B), i.e., the joint is the product of the marginals
• Two events A and B are conditionally independent given C if they are independent after conditioning on C: P(AB | C) = P(B | AC) P(A | C) = P(B | C) P(A | C)

If disjoint, are events A and B also independent?

Page 9: Example

• 60% of ML students pass the final, and 45% of ML students pass both the final and the midterm*
• What percent of students who passed the final also passed the midterm?

*These are made-up values.

Page 10: Example

• 60% of ML students pass the final, and 45% of ML students pass both the final and the midterm*
• What percent of students who passed the final also passed the midterm?
• Reworded: what percent of students passed the midterm, given that they passed the final?
• P(M | F) = P(M, F) / P(F) = 0.45 / 0.60 = 0.75

*These are made-up values.

Page 11: Marginalization and Law of Total Probability

• Marginalization (sum rule): $P(X) = \sum_{Y} P(X, Y)$
• Law of total probability: if $A_1, A_2, \dots$ partition Ω, then $P(B) = \sum_{i} P(B \mid A_i)\, P(A_i)$

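For example, with a made-up 2 × 2 joint table, both rules are just sums over one axis of a matrix of probabilities (a minimal NumPy sketch):

```python
import numpy as np

# Joint distribution P(X, Y) as a table: rows index X, columns index Y.
# Hypothetical values; any non-negative table summing to 1 works.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Sum rule (marginalization): P(X) = sum_Y P(X, Y)
p_x = joint.sum(axis=1)   # [0.30, 0.70]
p_y = joint.sum(axis=0)   # [0.40, 0.60]

# Conditional: P(X | Y) = P(X, Y) / P(Y), one column per value of Y
p_x_given_y = joint / p_y

# Law of total probability: P(X=x) = sum_y P(X=x | Y=y) P(Y=y)
p_x_total = (p_x_given_y * p_y).sum(axis=1)
assert np.allclose(p_x, p_x_total)
print(p_x, p_y)
```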

Page 12: Bayes' Rule

$P(A \mid B) = \frac{P(AB)}{P(B)}$  (conditional probability)
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$  (product rule)
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{\sum_{A'} P(B \mid A')\,P(A')}$  (law of total probability)

Page 13: Bayes' Rule

Page 14: Example

• Suppose you have tested positive for a disease; what is the probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease.
• P(T=1 | D=1) = 0.95 (true positive rate)
• P(T=1 | D=0) = 0.10 (false positive rate)
• P(D=1) = 0.01 (prior)
• P(D=1 | T=1) = ?

Page 15: Example

• P(T=1 | D=1) = 0.95 (true positive rate)
• P(T=1 | D=0) = 0.10 (false positive rate)
• P(D=1) = 0.01 (prior)

Law of total probability:
$P(T{=}1) = \sum_{D} P(T{=}1 \mid D)\,P(D) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.1085$

Bayes' rule:
$P(D{=}1 \mid T{=}1) = \frac{P(T{=}1 \mid D{=}1)\,P(D{=}1)}{P(T{=}1)} = \frac{0.95 \times 0.01}{0.1085} \approx 0.087$

The probability that you have the disease given you tested positive is 8.7%.
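The same computation as a short Python sketch (values taken from the slide):

```python
# Posterior probability of disease given a positive test.
p_t_given_d1 = 0.95   # true positive rate,  P(T=1 | D=1)
p_t_given_d0 = 0.10   # false positive rate, P(T=1 | D=0)
p_d1 = 0.01           # prior,               P(D=1)

# Law of total probability: P(T=1)
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1)   # 0.1085

# Bayes' rule: P(D=1 | T=1)
posterior = p_t_given_d1 * p_d1 / p_t
print(posterior)   # ~0.0876, i.e. about 8.7%
```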

Page 16: Random Variable

• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to each outcome ω
• Example: flip a coin ten times, and let X(ω) be the number of heads in the sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
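In code, the mapping is literally a function from outcomes to numbers (a minimal Python sketch of the slide's example):

```python
# A random variable is just a function from outcomes to real numbers.
# Here X maps a sequence of coin flips to its number of heads.
def X(omega: str) -> int:
    return omega.count("H")

print(X("HHTHHTHHTT"))  # 6, as in the slide's example
```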

Page 17: Discrete vs Continuous Random Variables

• Discrete: can only take a countable number of values
  • Example: number of heads
  • Distribution defined by a probability mass function (pmf)
  • Marginalization: $p(x) = \sum_{y} p(x, y)$
• Continuous: can take uncountably many values (real numbers)
  • Example: time taken to accomplish a task
  • Distribution defined by a probability density function (pdf)
  • Marginalization: $p(x) = \int p(x, y)\, dy$

Page 18: Probability Distribution Statistics

• Mean: $E[x] = \mu$ (the first moment)
  • Univariate discrete random variable: $E[x] = \sum_{x} x\, p(x)$
  • Univariate continuous random variable: $E[x] = \int x\, p(x)\, dx$
• Variance: $\mathrm{Var}(X) = E\big[(X - \mu)^2\big] = E[X^2] - E[X]^2$
• Nth moment: $E[X^n]$
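A quick way to build intuition is to estimate these statistics from samples (a minimal NumPy sketch; the parameters mu = 2, sigma = 3 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # samples with mu=2, sigma=3

# Sample estimates of the first moment, variance, and a raw moment
mean = x.mean()                      # ~2.0  (E[x])
var = ((x - mean) ** 2).mean()       # ~9.0  (E[(x - mu)^2])
second_moment = (x ** 2).mean()      # ~13.0 (E[x^2] = var + mean^2)
print(mean, var, second_moment)
```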

Page 19: Bernoulli Distribution

• RV: x ∈ {0, 1}
• Parameter: μ
• PMF: $\mathrm{Bern}(x \mid \mu) = \mu^{x}(1 - \mu)^{1-x}$
• Mean = E[x] = μ
• Variance = μ(1 − μ)

Discrete Distribution

Example: the probability of flipping heads (x = 1) with an unfair coin with μ = 0.6 is
$\mathrm{Bern}(1 \mid 0.6) = 0.6^{1}(1 - 0.6)^{1-1} = 0.6$
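A minimal check with scipy.stats (assuming SciPy is available; μ = 0.6 as in the example):

```python
from scipy import stats

mu = 0.6
coin = stats.bernoulli(mu)

print(coin.pmf(1))    # 0.6   -> mu^1 (1-mu)^0
print(coin.pmf(0))    # 0.4
print(coin.mean())    # 0.6   (mu)
print(coin.var())     # 0.24  (mu(1-mu))
```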

Page 20: Binomial Distribution

• RV: m = number of successes
• Parameters: N = number of trials, μ = probability of success
• PMF: $\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m}(1 - \mu)^{N-m}$
• Mean = E[m] = Nμ
• Variance = Nμ(1 − μ)

Discrete Distribution

Example: the probability of flipping heads m times out of 15 independent flips with success probability 0.2
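The same example with scipy.stats (a minimal sketch; the choice m = 3 is an arbitrary illustration):

```python
from scipy import stats

N, mu = 15, 0.2
heads = stats.binom(N, mu)

print(heads.pmf(3))    # P(m=3) ~= 0.250
print(heads.mean())    # 3.0  (N*mu)
print(heads.var())     # 2.4  (N*mu*(1-mu))
```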

Page 21: Multinomial Distribution

• The multinomial distribution is a generalization of the binomial distribution to k categories instead of just binary (success/fail)
• For n independent trials, each of which leads to a success for exactly one of k categories, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories
• Example: rolling a die N times

Discrete Distribution

Page 22: Multinomial Distribution

• RVs: m1, ..., mK (counts)
• Parameters: N = number of trials; μ = (μ1, ..., μK), the probability of success for each category, with Σk μk = 1
• PMF: $p(m_1, \dots, m_K \mid N, \boldsymbol{\mu}) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$
• Mean of mk: Nμk
• Variance of mk: Nμk(1 − μk)

Discrete Distribution

Page 23: Multinomial Distribution

Example: rolling a 2 on a fair die exactly 5 times out of N = 10 rolls, with μ = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6].

Lumping the other five faces together, the number of 2s is binomial:
$P(\text{five 2s in 10 rolls}) = \binom{10}{5} \left(\tfrac{1}{6}\right)^{5} \left(\tfrac{5}{6}\right)^{5} \approx 0.013$
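A minimal sketch with scipy.stats (assuming a SciPy version that provides stats.multinomial); note that the count vector passed to the multinomial pmf must sum to N:

```python
from scipy import stats

# Multinomial: 10 rolls of a fair die; counts per face must sum to N.
N = 10
mu = [1/6] * 6
die = stats.multinomial(N, mu)
print(die.pmf([1, 5, 1, 1, 1, 1]))   # one particular outcome with five 2s

# "Exactly five 2s out of 10 rolls" lumps the other faces together,
# so the count of 2s is binomial with success probability 1/6:
print(stats.binom(10, 1/6).pmf(5))   # ~= 0.013
```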

Page 24: Gaussian Distribution

• Aka the normal distribution
• Widely used model for the distribution of continuous variables
• In the case of a single variable x, the Gaussian distribution can be written in the form
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(x - \mu)^2}{2\sigma^2}\Big\}$
• where μ is the mean and σ² is the variance

Continuous Distribution

Page 25: Gaussian Distribution

The factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ is the normalization constant; the factor $e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is an exponentiated quadratic function of x.
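A quick numerical check that the formula and a library implementation agree (a minimal sketch using scipy.stats; the point x = 0.5 is arbitrary):

```python
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0
x = 0.5

# Evaluate the Gaussian density directly from the formula...
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# ...and via scipy; the two agree
pdf_scipy = stats.norm(loc=mu, scale=sigma).pdf(x)
print(pdf_manual, pdf_scipy)   # both ~= 0.3521
```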

Page 26: Gaussian Distribution

• Gaussians with different means and variances

Page 27: Multivariate Gaussian Distribution

• For a D-dimensional vector x, the multivariate Gaussian distribution takes the form
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\Big\}$
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ
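A minimal sketch with scipy.stats (the mean vector and covariance matrix here are made-up illustrations):

```python
import numpy as np
from scipy import stats

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])          # must be symmetric positive definite

mvn = stats.multivariate_normal(mean=mu, cov=Sigma)
print(mvn.pdf([0.5, -0.5]))             # density at a single 2-D point
```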

Page 28: Inferring Parameters

• We have data X and we assume it comes from some distribution
• How do we figure out the parameters that 'best' fit that distribution?
• Maximum Likelihood Estimation (MLE)
• Maximum a Posteriori (MAP)

See 'Gibbs Sampling for the Uninitiated' for a straightforward introduction to parameter estimation: http://www.umiacs.umd.edu/~resnik/pubs/LAMP-TR-153.pdf

Page 29: I.I.D.

• Random variables are independent and identically distributed (i.i.d.) if each has the same probability distribution as the others and all are mutually independent.
• Example: coin flips are assumed to be i.i.d.

Page 30: MLE for Parameter Estimation

• The parameters of a Gaussian distribution are the mean (μ) and the variance (σ²)
• We'll estimate the parameters using MLE
• Given observations x1, ..., xN, the likelihood of those observations for a certain μ and σ² (assuming i.i.d.) is

Likelihood = $p(x_1, \dots, x_N \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

Recall: if i.i.d., P(ABC) = P(A) P(B) P(C)

Page 31: MLE for Parameter Estimation

What's the distribution's mean and variance?

Likelihood = $\prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

Page 32: MLE for Gaussian Parameters

• Now we want to maximize this function w.r.t. μ
• Instead of maximizing the product, we take the log of the likelihood, so the product becomes a sum
• We can do this because log is monotonically increasing, meaning the log-likelihood attains its maximum at the same parameter values as the likelihood

Likelihood = $\prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

Log-likelihood = $\ln \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \sigma^2)$

Page 33: MLE for Gaussian Parameters

• The log-likelihood simplifies to:
$\ln p(x_1, \dots, x_N \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi)$
• Now we want to maximize this function w.r.t. μ
• How?

To see proofs for these derivations: http://www.statlect.com/normal_distribution_maximum_likelihood.htm

Page 34: MLE for Gaussian Parameters

• The log-likelihood simplifies to:
$\ln p(x_1, \dots, x_N \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi)$
• Now we want to maximize this function w.r.t. μ
• Take the derivative, set it to 0, and solve for μ:
$\frac{\partial}{\partial \mu} \ln p = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu) = 0 \;\;\Rightarrow\;\; \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$

To see proofs for these derivations: http://www.statlect.com/normal_distribution_maximum_likelihood.htm
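A minimal NumPy sketch of the closed-form estimates (the true parameters mu = 5, sigma = 2 are made up; with enough samples the ML estimates land close to them):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # "observed" data

# Closed-form ML estimates for a Gaussian
mu_ml = x.sum() / len(x)                 # (1/N) sum_n x_n
var_ml = ((x - mu_ml) ** 2).mean()       # (1/N) sum_n (x_n - mu_ml)^2

print(mu_ml, var_ml)   # close to the true mu=5, sigma^2=4
```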

Page 35: Maximum Likelihood and Least Squares

• Suppose that you are presented with a sequence of data points (X1, T1), ..., (Xn, Tn), and you are asked to find the "best fit" line passing through those points.
• In order to answer this, you need to know precisely how to tell whether one line is "fitter" than another
• A common measure of fitness is the squared error

For a good discussion of maximum likelihood estimators and least squares, see http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf

Page 36: Maximum Likelihood and Least Squares

y(x, w) is estimating the target t

• The error/loss/cost/objective function measures the squared error:
$L(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2$
• Least squares regression: minimize L(w) w.r.t. w

(Figure: the red line is the fitted curve y(x, w); the green lines mark the errors between the curve and the targets.)

Page 37: Maximum Likelihood and Least Squares

• Now we approach curve fitting from a probabilistic perspective
• We can express our uncertainty over the value of the target variable using a probability distribution
• We assume that, given the value of x, the corresponding value of t has a Gaussian distribution with mean equal to the value y(x, w):
$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)$

β is the precision parameter (inverse variance)

Page 38: Maximum Likelihood and Least Squares

Page 39: Maximum Likelihood and Least Squares

• We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood:
$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big)$
• Log-likelihood:
$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2 + \frac{N}{2}\ln \beta - \frac{N}{2}\ln(2\pi)$

Page 40: Maximum Likelihood and Least Squares

• Log-likelihood:
$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2 + \frac{N}{2}\ln \beta - \frac{N}{2}\ln(2\pi)$
• Maximize the log-likelihood with respect to w
• Since the last two terms don't depend on w, they can be omitted
• Also, scaling the log-likelihood by the positive constant β/2 does not alter the location of the maximum with respect to w, so it can be ignored
• Result: maximize $-\frac{1}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2$

Page 41: Maximum Likelihood and Least Squares

• MLE: maximize $-\frac{1}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2$
• Least squares: minimize $\frac{1}{2}\sum_{n=1}^{N}\big\{y(x_n, \mathbf{w}) - t_n\big\}^2$
• Therefore, maximizing the likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function
• Significance: the sum-of-squares error function arises as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution
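A numerical illustration of this equivalence (a minimal sketch; the synthetic line t = 2x + 1 plus Gaussian noise is made up). Fitting w by least squares and by maximizing the Gaussian log-likelihood gives the same coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
t = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)  # noisy line

# Least squares: minimize the sum-of-squares error directly
w_ls = np.polyfit(x, t, deg=1)

# MLE: maximize the Gaussian log-likelihood (minimize its negative;
# the terms that don't depend on w are dropped)
def neg_log_lik(w, beta=1.0):
    residuals = np.polyval(w, x) - t
    return 0.5 * beta * np.sum(residuals ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
print(w_ls, w_mle)   # same coefficients, up to numerical tolerance
```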

Page 42: Matlab Linear Regression Demo

Page 43: Training Set

Page 44: Training Set and Validation Set (Held-Out Data)

Page 45: Training Set and Validation Set (Held-Out Data)

Model                    Training Set Error    Validation Set Error
Linear                   ++++                  +++++
Quadratic                +++                   ++++++
Cubic                    ++                    +++++++
4th degree polynomial    +                     ++++++++

(more +'s = higher error)

Page 46

How well your model generalizes to new data is what matters!
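A minimal Python sketch of this kind of experiment (the sine-wave data and noise level are made-up stand-ins for the demo's data): training error keeps falling as the polynomial degree grows, while held-out error can start rising as the model overfits.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, t_train = make_data(10)
x_val, t_val = make_data(100)          # held-out data

for degree in [1, 2, 3, 4]:
    w = np.polyfit(x_train, t_train, degree)
    train_err = np.mean((np.polyval(w, x_train) - t_train) ** 2)
    val_err = np.mean((np.polyval(w, x_val) - t_val) ** 2)
    print(degree, round(train_err, 4), round(val_err, 4))
```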

Page 47: Multivariate Gaussian Distribution

• For a D-dimensional vector x, the multivariate Gaussian distribution takes the form
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\Big\}$
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ

Page 48: Covariance Matrix
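For reference, the covariance matrix of a random vector x collects the pairwise covariances, $\Sigma_{ij} = \mathrm{cov}(x_i, x_j) = E\big[(x_i - \mu_i)(x_j - \mu_j)\big]$, with the variances on the diagonal. A minimal NumPy sketch (made-up parameters) estimating it from samples:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.multivariate_normal(mean=[0, 0],
                               cov=[[1.0, 0.8], [0.8, 2.0]],
                               size=50_000)

# Empirical covariance matrix; rows of `data` are observations
print(np.cov(data, rowvar=False))   # close to [[1.0, 0.8], [0.8, 2.0]]
```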

Page 49: Questions?