TRANSCRIPT
Probability Theory for Machine Learning
Chris Cremer, September 2015
Outline
• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
• Least Squares Demo
Material
• Pattern Recognition and Machine Learning – Christopher M. Bishop
• All of Statistics – Larry Wasserman
• Wolfram MathWorld
• Wikipedia
Motivation
• Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity: The word bank can mean (1) a financial institution, (2) the side of a river, or (3) tilting an airplane. Which meaning was intended, based on the words that appear nearby?
• Limited model complexity
• Probability theory provides a consistent framework for the quantification and manipulation of uncertainty
• Allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous
Sample Space
• The sample space Ω is the set of possible outcomes of an experiment. Points ω in Ω are called sample outcomes, realizations, or elements. Subsets of Ω are called events.
• Example: If we toss a coin twice then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = {}
• Example: first flip being heads and first flip being tails
Probability
• We will assign a real number P(A) to every event A, called the probability of A.
• To qualify as a probability, P must satisfy three axioms:
• Axiom 1: P(A) ≥ 0 for every A
• Axiom 2: P(Ω) = 1
• Axiom 3: If A1, A2, ... are disjoint then P(⋃i Ai) = Σi P(Ai)
Joint and Conditional Probabilities
• Joint probability
• P(X, Y)
• Probability of X and Y
• Conditional probability
• P(X|Y)
• Probability of X given Y
Independent and Conditional Probabilities
• Assuming that P(B) > 0, the conditional probability of A given B:
• P(A|B) = P(AB)/P(B)
• Product Rule: P(AB) = P(A|B)P(B) = P(B|A)P(A)
• Two events A and B are independent if
• P(AB) = P(A)P(B)
• Joint = product of marginals
• Two events A and B are conditionally independent given C if they are independent after conditioning on C
• P(AB|C) = P(B|AC)P(A|C) = P(B|C)P(A|C)
If disjoint, are events A and B also independent? (No: if A and B are disjoint with P(A), P(B) > 0, then P(AB) = 0 ≠ P(A)P(B).)
Example
• 60% of ML students pass the final and 45% of ML students pass both the final and the midterm*
• What percent of students who passed the final also passed the midterm?
• Reworded: What percent of students passed the midterm given they passed the final?
• P(M|F) = P(M, F)/P(F) = .45/.60 = .75
*These are made-up values.
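The arithmetic above can be checked in a few lines of Python; the values are the slide's made-up numbers.

```python
# Conditional probability from the joint and the marginal:
# P(M|F) = P(M, F) / P(F)
p_final = 0.60          # P(F): fraction of students who pass the final
p_mid_and_final = 0.45  # P(M, F): fraction who pass both

p_mid_given_final = p_mid_and_final / p_final
print(round(p_mid_given_final, 2))  # ≈ 0.75
```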
Marginalization and Law of Total Probability
• Marginalization (Sum Rule): P(X) = Σ_Y P(X, Y)
• Law of Total Probability: P(X) = Σ_Y P(X|Y) P(Y)
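As a small illustration (the numbers are made up), both rules applied to a 2×2 joint distribution stored as a matrix:

```python
# Joint distribution P(X, Y): rows = values of X, columns = values of Y.
joint = [
    [0.10, 0.20],   # P(X=0, Y=0), P(X=0, Y=1)
    [0.30, 0.40],   # P(X=1, Y=0), P(X=1, Y=1)
]

# Marginalization (sum rule): P(X=x) = sum over y of P(X=x, Y=y)
p_x = [sum(row) for row in joint]            # ≈ [0.3, 0.7]
p_y = [sum(col) for col in zip(*joint)]      # ≈ [0.4, 0.6]

# Law of total probability: P(X=x) = sum over y of P(X=x | Y=y) P(Y=y)
p_x_given_y = [[joint[x][y] / p_y[y] for y in range(2)] for x in range(2)]
p_x_total = [sum(p_x_given_y[x][y] * p_y[y] for y in range(2)) for x in range(2)]
# p_x_total agrees with p_x, as it must.
```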
Bayes' Rule
P(A|B) = P(AB)/P(B)   (Conditional Probability)
P(A|B) = P(B|A)P(A)/P(B)   (Product Rule)
P(A|B) = P(B|A)P(A) / Σ_A P(B|A)P(A)   (Law of Total Probability)
Example
• Suppose you have tested positive for a disease; what is the probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease.
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
• P(D=1|T=1) = ?
Example
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
• Law of Total Probability: P(T=1) = Σ_D P(T=1|D)P(D) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0) = .95 × .01 + .10 × .99 = .1085
• Bayes' Rule: P(D=1|T=1) = P(T=1|D=1)P(D=1)/P(T=1) = (.95 × .01)/.1085 ≈ .088
The probability that you have the disease given you tested positive is only about 8.8%.
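The same computation in Python, using the slide's numbers:

```python
# Bayes' rule for the disease-test example.
p_t_given_d1 = 0.95   # P(T=1 | D=1), true positive rate
p_t_given_d0 = 0.10   # P(T=1 | D=0), false positive rate
p_d1 = 0.01           # P(D=1), prior

# Law of total probability: P(T=1)
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1)

# Bayes' rule: P(D=1 | T=1)
p_d_given_t = p_t_given_d1 * p_d1 / p_t
print(round(p_t, 4), round(p_d_given_t, 3))  # 0.1085, ≈ 0.088
```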
Random Variable
• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to each outcome ω
• Example: Flip a coin ten times. Let X(ω) be the number of heads in the sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
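A random variable really is just a function from outcomes to real numbers; the coin-flip example above can be written directly:

```python
# X maps a sequence of coin flips (an outcome omega) to its number of heads.
def X(omega: str) -> int:
    """Number of heads in the outcome sequence omega."""
    return omega.count("H")

print(X("HHTHHTHHTT"))  # 6
```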
Discrete vs Continuous Random Variables
• Discrete: can take only a countable number of values
• Example: number of heads
• Distribution defined by a probability mass function (pmf)
• Marginalization: P(x) = Σ_y P(x, y)
• Continuous: can take uncountably many values (real numbers)
• Example: time taken to accomplish a task
• Distribution defined by a probability density function (pdf)
• Marginalization: p(x) = ∫ p(x, y) dy
Probability Distribution Statistics
• Mean: E[x] = μ = first moment = Σ_x x P(x) (univariate discrete random variable) = ∫ x p(x) dx (univariate continuous random variable)
• Variance: Var(X) = E[(X − μ)²] = E[X²] − E[X]²
• Nth moment = E[Xⁿ]
Bernoulli Distribution (discrete)
• RV: x ∈ {0, 1}
• Parameter: μ
• pmf: Bern(x|μ) = μ^x (1 − μ)^(1−x)
• Mean = E[x] = μ
• Variance = μ(1 − μ)
• Example: Probability of flipping heads (x = 1) with an unfair coin with μ = .6: Bern(1|.6) = .6¹ (1 − .6)⁰ = .6
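A minimal sketch of the Bernoulli pmf in Python (the function name is mine, not from the slides):

```python
# Bernoulli pmf: mu is the probability of success (x = 1).
def bernoulli_pmf(x: int, mu: float) -> float:
    return mu**x * (1 - mu)**(1 - x)

mu = 0.6
print(bernoulli_pmf(1, mu))   # 0.6, heads with the unfair coin
print(bernoulli_pmf(0, mu))   # 0.4, tails
variance = mu * (1 - mu)      # 0.24
```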
Binomial Distribution (discrete)
• RV: m = number of successes
• Parameters: N = number of trials, μ = probability of success
• pmf: Bin(m|N, μ) = (N choose m) μ^m (1 − μ)^(N−m)
• Mean = E[m] = Nμ
• Variance = Nμ(1 − μ)
• Example: Probability of flipping heads m times out of 15 independent flips with success probability 0.2
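The binomial pmf is easy to evaluate with `math.comb`; here it is applied to the 15-flip example (the helper name is mine):

```python
from math import comb

# Binomial pmf: probability of m successes in N independent trials,
# each succeeding with probability mu.
def binomial_pmf(m: int, N: int, mu: float) -> float:
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Probability of exactly 3 heads in 15 flips with success probability 0.2
p3 = binomial_pmf(3, 15, 0.2)   # ≈ 0.250

# Sanity check: the pmf sums to 1 over m = 0..N
total = sum(binomial_pmf(m, 15, 0.2) for m in range(16))
```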
Multinomial Distribution (discrete)
• The multinomial distribution is a generalization of the binomial distribution to k categories instead of just binary (success/fail)
• For n independent trials, each of which leads to a success for exactly one of k categories, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories
• Example: Rolling a die N times
Multinomial Distribution (discrete)
• RVs: m1 ... mK (counts)
• Parameters: N = number of trials; μ = (μ1 ... μK), the probability of success for each category, with Σk μk = 1
• pmf: Mult(m1 ... mK | N, μ) = (N! / (m1! ... mK!)) μ1^m1 ... μK^mK
• Mean of mk: Nμk
• Variance of mk: Nμk(1 − μk)
• Example: rolling a 2 on a fair die (μ = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]) five times out of N = 10 rolls. The marginal distribution of a single count mk is Binomial(N, μk), so P(m2 = 5) = (10 choose 5)(1/6)⁵(5/6)⁵ ≈ .013
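A sketch of the multinomial pmf and the die example (function names are mine; note that a full count vector must sum to N, which is why the "five 2s" question is answered with the binomial marginal):

```python
from math import factorial, comb, prod

# Multinomial pmf for a count vector m = (m1 ... mK) with sum(m) = N trials
# and category probabilities mu (which sum to 1).
def multinomial_pmf(m, mu):
    N = sum(m)
    coef = factorial(N)
    for mk in m:
        coef //= factorial(mk)
    return coef * prod(p**k for p, k in zip(mu, m))

mu = [1 / 6] * 6
# Probability that 10 rolls of a fair die come out as exactly five 2s
# and five 6s (a complete count vector summing to N = 10):
p_vec = multinomial_pmf([0, 5, 0, 0, 0, 5], mu)

# Marginal of a single count is binomial:
# P(exactly five 2s in 10 rolls) = C(10,5) (1/6)^5 (5/6)^5
p_five_twos = comb(10, 5) * (1 / 6) ** 5 * (5 / 6) ** 5   # ≈ 0.013
```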
Gaussian Distribution (continuous)
• Aka the normal distribution
• Widely used model for the distribution of continuous variables
• In the case of a single variable x, the Gaussian distribution can be written in the form
  N(x|μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
• where μ is the mean and σ² is the variance
• The factor 1/√(2πσ²) is the normalization constant; the rest is e raised to a quadratic function of x
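The density above translates directly into code (a minimal sketch; the function name is mine):

```python
from math import sqrt, pi, exp

# Univariate Gaussian density N(x | mu, sigma^2).
def gaussian_pdf(x: float, mu: float, var: float) -> float:
    norm_const = 1.0 / sqrt(2 * pi * var)            # normalization constant
    return norm_const * exp(-((x - mu) ** 2) / (2 * var))

# Density of a standard normal at its mean: 1/sqrt(2*pi) ≈ 0.3989
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))
```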
Gaussian Distribution
• Gaussians with different means and variances (figure)
Multivariate Gaussian Distribution
• For a D-dimensional vector x, the multivariate Gaussian distribution takes the form
  N(x|μ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ
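A direct transcription of the multivariate density, assuming NumPy is available (the function name is mine; for real use, a Cholesky-based implementation is numerically preferable to an explicit inverse):

```python
import numpy as np

# Multivariate Gaussian density N(x | mu, Sigma) for a D-dimensional x.
def mvn_pdf(x, mu, Sigma):
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.zeros(2)
Sigma = np.eye(2)
# With identity covariance, the 2-D density at the mean factorizes into
# two standard normals: (1/sqrt(2*pi))^2 = 1/(2*pi) ≈ 0.1592
print(round(float(mvn_pdf(np.zeros(2), mu, Sigma)), 4))
```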
Inferring Parameters
• We have data X and we assume it comes from some distribution
• How do we figure out the parameters that 'best' fit that distribution?
• Maximum Likelihood Estimation (MLE)
• Maximum a Posteriori (MAP)
See 'Gibbs Sampling for the Uninitiated' for a straightforward introduction to parameter estimation: http://www.umiacs.umd.edu/~resnik/pubs/LAMP-TR-153.pdf
I.I.D.
• Random variables are independent and identically distributed (i.i.d.) if each has the same probability distribution as the others and all are mutually independent.
• Example: coin flips are assumed to be i.i.d.
MLE for Parameter Estimation
• The parameters of a Gaussian distribution are the mean (μ) and variance (σ²)
• We'll estimate the parameters using MLE
• Given observations x1, ..., xN, the likelihood of those observations for a certain μ and σ² (assuming IID) is
Likelihood = p(x1, ..., xN | μ, σ²) = ∏n N(xn | μ, σ²)
• Recall: if IID, P(ABC) = P(A)P(B)P(C)
MLE for Parameter Estimation
• What's the distribution's mean and variance?
Likelihood = ∏n N(xn | μ, σ²)
MLE for Gaussian Parameters
• Now we want to maximize this function w.r.t. μ
• Instead of maximizing the product, we take the log of the likelihood so the product becomes a sum
• We can do this because log is monotonically increasing, meaning argmax L(μ) = argmax log L(μ)
Likelihood = ∏n N(xn | μ, σ²)
Log Likelihood = Σn log N(xn | μ, σ²)
MLE for Gaussian Parameters
• Log likelihood simplifies to:
  log p(x | μ, σ²) = −(1/(2σ²)) Σn (xn − μ)² − (N/2) log σ² − (N/2) log(2π)
• Now we want to maximize this function w.r.t. μ
• How? Take the derivative, set it to 0, and solve for μ: μ_ML = (1/N) Σn xn
To see proofs for these derivations: http://www.statlect.com/normal_distribution_maximum_likelihood.htm
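The closed-form estimators can be checked numerically (a sketch; the sample data are made up):

```python
# MLE for a univariate Gaussian:
# mu_ML  = (1/N) sum_n x_n
# var_ML = (1/N) sum_n (x_n - mu_ML)^2   (note: divides by N, so it is biased)
def gaussian_mle(xs):
    N = len(xs)
    mu = sum(xs) / N
    var = sum((x - mu) ** 2 for x in xs) / N
    return mu, var

xs = [1.0, 2.0, 3.0, 4.0]
mu_ml, var_ml = gaussian_mle(xs)
print(mu_ml, var_ml)  # 2.5 1.25
```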
Maximum Likelihood and Least Squares
• Suppose that you are presented with a sequence of data points (X1, T1), ..., (Xn, Tn), and you are asked to find the "best fit" line passing through those points.
• In order to answer this you need to know precisely how to tell whether one line is "fitter" than another
• A common measure of fitness is the squared error
For a good discussion of maximum likelihood estimators and least squares see http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf
Maximum Likelihood and Least Squares
• y(x, w) is estimating the target t
• The error/loss/cost/objective function measures the squared error:
  L(w) = ½ Σn (y(xn, w) − tn)²
• Least squares regression: minimize L(w) w.r.t. w
(Figure: red line = the fitted curve y(x, w); green lines = the errors between predictions and targets)
Maximum Likelihood and Least Squares
• Now we approach curve fitting from a probabilistic perspective
• We can express our uncertainty over the value of the target variable using a probability distribution
• We assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w):
  p(t | x, w, β) = N(t | y(x, w), β⁻¹)
• β is the precision parameter (inverse variance)
Maximum Likelihood and Least Squares
• We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood:
  p(t | x, w, β) = ∏n N(tn | y(xn, w), β⁻¹)
• Log likelihood:
  log p(t | x, w, β) = −(β/2) Σn (y(xn, w) − tn)² + (N/2) log β − (N/2) log(2π)
Maximum Likelihood and Least Squares
• Log likelihood: log p(t | x, w, β) = −(β/2) Σn (y(xn, w) − tn)² + (N/2) log β − (N/2) log(2π)
• Maximize the log likelihood w.r.t. w
• Since the last two terms don't depend on w, they can be omitted
• Also, scaling the log likelihood by a positive constant β/2 does not alter the location of the maximum with respect to w, so it can be ignored
• Result: maximize −½ Σn (y(xn, w) − tn)²
Maximum Likelihood and Least Squares
• MLE: maximize −½ Σn (y(xn, w) − tn)²
• Least squares: minimize ½ Σn (y(xn, w) − tn)²
• Therefore, maximizing the likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function
• Significance: the sum-of-squares error function arises as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution
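For a straight line y(x, w) = w0 + w1·x, the least-squares (equivalently, maximum-likelihood-under-Gaussian-noise) solution has a standard closed form; a sketch with made-up, noise-free data:

```python
# Ordinary least squares for a line y(x, w) = w0 + w1*x:
# the minimizer of the sum-of-squares error above.
def fit_line(xs, ts):
    N = len(xs)
    x_mean = sum(xs) / N
    t_mean = sum(ts) / N
    num = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    den = sum((x - x_mean) ** 2 for x in xs)
    w1 = num / den
    w0 = t_mean - w1 * x_mean
    return w0, w1

# Noise-free data on the line t = 1 + 2x recovers the parameters exactly.
w0, w1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w0, w1)  # 1.0 2.0
```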
Matlab Linear Regression Demo
(Figures: training set; validation set / held-out data)
                        Training Set Error   Validation Set Error
Linear                  ++++                 +++++
Quadratic               +++                  ++++++
Cubic                   ++                   +++++++
4th degree polynomial   +                    ++++++++
(more +'s = larger error)
How well your model generalizes to new data is what matters!
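The Matlab demo itself is not reproduced in the transcript; the Python sketch below (made-up data, NumPy assumed available) shows the same effect: a high-degree polynomial drives training error toward zero while validation error on held-out points gets worse than a simple line's.

```python
import numpy as np

# Made-up 1-D data near the line t = x, with noise.
# Train on the even-x points, validate on the held-out odd-x points.
x_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
t_train = np.array([1.5, 0.8, 5.3, 4.6, 9.1])
x_val = np.array([1.0, 3.0, 5.0, 7.0])
t_val = np.array([1.2, 2.9, 5.3, 6.8])

def mse(w, x, t):
    """Mean squared error of polynomial coefficients w on data (x, t)."""
    return float(np.mean((np.polyval(w, x) - t) ** 2))

w_linear = np.polyfit(x_train, t_train, 1)    # straight line
w_quartic = np.polyfit(x_train, t_train, 4)   # interpolates all 5 points

# Quartic: training error ~0, but it oscillates between the training points,
# so its validation error exceeds the line's.
print(mse(w_linear, x_train, t_train), mse(w_quartic, x_train, t_train))
print(mse(w_linear, x_val, t_val), mse(w_quartic, x_val, t_val))
```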
Covariance Matrix
Questions?