Introduction to Advanced Probability for Graphical Models
CSC 412, by Elliot Creager
Thursday, January 11, 2018
Presented by Jonathan Lorraine
*Many slides based on Kaustav Kundu's, Kevin Swersky's, Inmar Givoni's, Danny Tarlow's, and Jasper Snoek's slides, Sam Roweis's review of probability, Bishop's book, and some images from Wikipedia



Outline

• Basics
• Probability rules
• Exponential family models
• Maximum likelihood
• Conjugate Bayesian inference (time permitting)

Why Represent Uncertainty?

• The world is full of uncertainty
  – "What will the weather be like today?"
  – "Will I like this movie?"
  – "Is there a person in this image?"

• We're trying to build systems that understand and (possibly) interact with the real world

• We often can't prove something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation

• Sometimes probability gives a concise description of an otherwise complex phenomenon.

Why Use Probability to Represent Uncertainty?

• Write down simple, reasonable criteria that you'd want from a system of uncertainty (common-sense stuff), and you always get probability.

• Cox Axioms (Cox 1946); see Bishop, Section 1.2.3

• We will restrict ourselves to a relatively informal discussion of probability theory.

Notation

• A random variable X represents outcomes or states of the world.
• We will write p(x) to mean Probability(X = x).
• Sample space: the space of all possible outcomes (maybe discrete, continuous, or mixed)
• p(x) is the probability mass (density) function
  – Assigns a number to each point in sample space
  – Non-negative, sums (integrates) to 1
  – Intuitively: how often does x occur, how much do we believe in x.

Joint Probability Distribution
• Prob(X = x, Y = y)
  – "Probability of X = x and Y = y"
  – p(x, y)

Conditional Probability Distribution
• Prob(X = x | Y = y)
  – "Probability of X = x given Y = y"
  – p(x|y) = p(x, y) / p(y)

Marginal Probability Distribution
• Prob(X = x), Prob(Y = y)
  – "Probability of X = x"
  – p(x) = Σ_y p(x, y) = Σ_y p(x|y) p(y)
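The three definitions above can be sketched with a tiny discrete joint distribution stored as a dictionary. This is a minimal illustration, not from the slides; the weather/mood outcomes and their probabilities are made up.

```python
# A tiny discrete joint distribution p(x, y) as a dict; the outcomes
# and probabilities are illustrative assumptions.
joint = {
    ("rain", "sad"): 0.3, ("rain", "happy"): 0.1,
    ("sun",  "sad"): 0.1, ("sun",  "happy"): 0.5,
}

def marginal_x(x):
    # p(x) = sum over y of p(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(x, y):
    # p(x | y) = p(x, y) / p(y)
    p_y = sum(p for (_, yi), p in joint.items() if yi == y)
    return joint[(x, y)] / p_y

print(marginal_x("rain"))          # 0.3 + 0.1 = 0.4
print(conditional("sun", "happy")) # 0.5 / 0.6 ≈ 0.833
```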

The Rules of Probability

• Sum Rule (marginalization / summing out):

p(x) = Σ_y p(x, y)

p(x_1) = Σ_{x_2} Σ_{x_3} … Σ_{x_N} p(x_1, x_2, …, x_N)

• Product/Chain Rule:

p(x, y) = p(y|x) p(x)

p(x_1, …, x_N) = p(x_1) p(x_2|x_1) … p(x_N|x_1, …, x_{N−1})

Bayes' Rule

• One of the most important formulas in probability theory

• This gives us a way of "reversing" conditional probabilities

• Read as "posterior = likelihood × prior / evidence"

p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x'} p(y|x') p(x')
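Bayes' rule can be exercised on the classic diagnostic-test example. The numbers below (1% prevalence, 90% sensitivity, 5% false-positive rate) are illustrative assumptions, not from the slides.

```python
# Bayes' rule: p(disease | positive) from prevalence and test behavior.
# All numbers are made-up illustrative values.
p_disease = 0.01
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.05

# evidence: p(pos) = sum over x' of p(pos | x') p(x')
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# posterior = likelihood * prior / evidence
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))  # ≈ 0.15: most positives are false
```

Note how a highly accurate test still yields a low posterior when the prior (prevalence) is small; this is exactly the "readjustment of prior beliefs" the slides describe later.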

Independence

• Two random variables are said to be independent iff their joint distribution factors:

p(x, y) = p(y|x) p(x) = p(x|y) p(y) = p(x) p(y)

• Two random variables are conditionally independent given a third if they are independent after conditioning on the third:

p(x, y|z) = p(x|y, z) p(y|z) = p(x|z) p(y|z)  ∀z

Continuous Random Variables

• Outcomes are real values. Probability density functions define distributions.
  – E.g., p(x|µ, σ) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}

• Continuous joint distributions: replace sums with integrals, and everything holds.
  – E.g., marginalization and conditional probability:

P(x, z) = ∫_y P(x, y, z) dy = ∫_y P(x, z|y) P(y) dy

Summarizing Probability Distributions

• It is often useful to give summaries of distributions without defining the whole distribution (e.g., mean and variance)

• Mean:

E[x] = x̄ = ∫ x · p(x) dx

• Variance:

var(x) = ∫ (x − x̄)² · p(x) dx = E[x²] − E[x]²
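These integral definitions can be checked numerically by discretizing the integrals on a grid. A minimal sketch for a Gaussian; the parameters µ = 1, σ = 2 and the grid bounds are arbitrary choices.

```python
import math

# Numerically approximate E[x] and var(x) for a Gaussian by a Riemann
# sum over a wide grid (mu = 1, sigma = 2 are arbitrary choices).
mu, sigma = 1.0, 2.0
dx = 1e-3
xs = [-20.0 + i * dx for i in range(int(40.0 / dx) + 1)]

def pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mean = sum(x * pdf(x) * dx for x in xs)
var = sum((x - mean) ** 2 * pdf(x) * dx for x in xs)
print(round(mean, 3), round(var, 3))  # close to mu = 1 and sigma^2 = 4
```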

Exponential Family

• Family of probability distributions
• Many of the standard distributions belong to this family
  – Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), beta/Dirichlet, …
• Share many important properties
  – e.g., they have a conjugate prior (we'll get to that later; important for Bayesian statistics)

Definition

• The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form

p(x|η) = h(x) g(η) exp{η^T u(x)}

• x – scalar/vector, discrete/continuous
• η – 'natural parameters'
• u(x) – some function of x (the sufficient statistic)
• g(η) – normalizer
• h(x) – base measure (often constant)

• The normalizer satisfies g(η) ∫ h(x) exp{η^T u(x)} dx = 1

Sufficient Statistics

• Vague definition: called so because they completely summarize a distribution.

• Less vague: they are the only part of the data that interacts with the parameters, and they are therefore sufficient to estimate the parameters.

• Perhaps the number of times a coin came up heads, or the sum of the observed values' magnitudes.

Example 1: Bernoulli

• Binary random variable: X ∈ {0, 1}
• p(heads) = µ, with µ ∈ [0, 1]
• Coin toss

p(x|µ) = µ^x (1 − µ)^(1−x)

Example 1: Bernoulli

p(x|µ) = µ^x (1 − µ)^(1−x)
       = exp{x ln µ + (1 − x) ln(1 − µ)}
       = (1 − µ) exp{x ln(µ/(1 − µ))}

Matching this against the general form p(x|η) = h(x) g(η) exp{η^T u(x)}:

h(x) = 1
u(x) = x
η = ln(µ/(1 − µ))  ⇒  µ = σ(η) = 1/(1 + e^(−η))
g(η) = σ(−η) = 1 − σ(η)

so that

p(x|η) = σ(−η) exp(ηx)
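As a quick sanity check (not on the slides), the exponential-family form σ(−η) exp(ηx) can be compared numerically against the standard pmf µ^x (1 − µ)^(1−x); the value µ = 0.3 is an arbitrary choice.

```python
import math

# Check that the exponential-family form of the Bernoulli,
# p(x|eta) = sigma(-eta) * exp(eta * x), matches mu^x (1-mu)^(1-x).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

mu = 0.3                       # arbitrary choice of p(heads)
eta = math.log(mu / (1 - mu))  # natural parameter

for x in (0, 1):
    standard = mu ** x * (1 - mu) ** (1 - x)
    exp_family = sigmoid(-eta) * math.exp(eta * x)
    assert math.isclose(standard, exp_family)
    print(x, standard, exp_family)
```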

Example 2: Multinomial

• p(value k) = µ_k, with µ_k ∈ [0, 1] and Σ_{k=1}^M µ_k = 1

• For a single observation (die toss), sometimes called Categorical:

P(x_1, …, x_M | µ) = Π_{k=1}^M µ_k^(x_k)

• For multiple observations, i.e., integer counts on N trials with Σ_{k=1}^M x_k = N
  – Prob(1 came out 3 times, 2 came out once, …, 6 came out 7 times if I tossed a die 20 times):

P(x_1, …, x_M | µ) = (N! / (x_1! ⋯ x_M!)) Π_{k=1}^M µ_k^(x_k)

Example 2: Multinomial (1 observation)

P(x_1, …, x_M | µ) = Π_{k=1}^M µ_k^(x_k) = exp{Σ_{k=1}^M x_k ln µ_k}

Matching this against the general form p(x|η) = h(x) g(η) exp{η^T u(x)}:

h(x) = 1, g(η) = 1, u(x) = x, η_k = ln µ_k

so that

p(x|η) = exp(η^T x)

The parameters are not independent due to the constraint of summing to 1; there is a slightly more involved notation to address that, see Bishop 2.4.

Example 3: Normal (Gaussian) Distribution

• Gaussian (Normal):

p(x|µ, σ) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}

Example 3: Normal (Gaussian) Distribution

p(x|µ, σ) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}

• µ is the mean
• σ² is the variance
• Can verify these by computing integrals. E.g.,

∫_{−∞}^{∞} x · (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} dx = µ

Example 3: Normal (Gaussian) Distribution

• Multivariate Gaussian:

p(x|µ, Σ) = |2πΣ|^(−1/2) exp{−(1/2)(x − µ)^T Σ^(−1) (x − µ)}

• x is now a vector
• µ is the mean vector
• Σ is the covariance matrix

Important Properties of Gaussians

• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussian densities is again Gaussian (up to normalization)

• Even the sum of two independent Gaussian RVs is a Gaussian.

• Beyond the scope of this tutorial, but very important: marginalization and conditioning rules for multivariate Gaussians.

Gaussian marginalization visualization

[figure omitted]

Exponential Family Representation

p(x|µ, σ²) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}
           = (1/√(2πσ²)) exp{−x²/(2σ²) + µx/σ² − µ²/(2σ²)}
           = h(x) g(η) exp{η^T u(x)}

with

h(x) = (2π)^(−1/2)
u(x) = [x, x²]^T
η = [η_1, η_2]^T = [µ/σ², −1/(2σ²)]^T
g(η) = (−2η_2)^(1/2) exp(η_1²/(4η_2))
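The representation above can be verified numerically by evaluating both forms of the density at a point. A minimal sketch; the values µ = 0.5, σ = 1.3, x = 1.7 are arbitrary.

```python
import math

# Verify the exponential-family representation of the 1D Gaussian:
# h(x) = (2*pi)^(-1/2), u(x) = [x, x^2],
# eta = [mu/sigma^2, -1/(2*sigma^2)],
# g(eta) = sqrt(-2*eta2) * exp(eta1^2 / (4*eta2)).
mu, sigma = 0.5, 1.3   # arbitrary parameters
x = 1.7                # arbitrary evaluation point

standard = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

eta1, eta2 = mu / sigma ** 2, -1.0 / (2 * sigma ** 2)
h = (2 * math.pi) ** -0.5
g = math.sqrt(-2 * eta2) * math.exp(eta1 ** 2 / (4 * eta2))
exp_family = h * g * math.exp(eta1 * x + eta2 * x ** 2)

assert math.isclose(standard, exp_family)
print(standard, exp_family)
```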

Example: Maximum Likelihood For a 1D Gaussian

• Suppose we are given a dataset of samples of a Gaussian random variable X, D = {x_1, …, x_N}, and told that the variance of the data is σ²

What is our best guess of µ?
*Need to assume the data are independent and identically distributed (i.i.d.)

[figure: data points x_1, x_2, …, x_N on a line]

Example: Maximum Likelihood For a 1D Gaussian

What is our best guess of µ?
• We can write down the likelihood function:

p(D|µ) = Π_{i=1}^N p(x_i|µ, σ) = Π_{i=1}^N (1/√(2πσ²)) exp{−(x_i − µ)²/(2σ²)}

• We want to choose the µ that maximizes this expression
  – Take log, then basic calculus: differentiate w.r.t. µ, set the derivative to 0, solve for µ to get the sample mean:

µ_ML = (1/N) Σ_{i=1}^N x_i
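The closed-form answer can be checked directly against the log-likelihood: at µ_ML no nearby value of µ scores higher. A small sketch with made-up data and an assumed known variance.

```python
import math

# mu_ML = (1/N) sum_i x_i, checked against the log-likelihood directly.
data = [1.2, 0.7, 2.1, 1.5, 0.9]  # made-up sample
sigma = 1.0                        # assumed known standard deviation

def log_likelihood(mu):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

mu_ml = sum(data) / len(data)
# The log-likelihood is a concave quadratic in mu, so the sample mean
# should beat any perturbed value.
assert all(log_likelihood(mu_ml) >= log_likelihood(mu_ml + d)
           for d in (-0.1, -0.01, 0.01, 0.1))
print(mu_ml)  # 1.28
```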

Example: Maximum Likelihood For a 1D Gaussian

[figure: data points x_1, …, x_N with the fitted Gaussian N(µ_ML, σ_ML²)]

Maximum Likelihood

ML estimation of model parameters for the Exponential Family:

p(D|η) = p(x_1, …, x_N) = (Π_n h(x_n)) g(η)^N exp{η^T Σ_n u(x_n)}

∂ ln p(D|η)/∂η = …, set to 0, solve for ∇ ln g(η):

−∇ ln g(η_ML) = (1/N) Σ_{n=1}^N u(x_n)

• Can in principle be solved to get an estimate for η.
• The solution for the ML estimator depends on the data only through the sum over u, which is therefore called the sufficient statistic.
  – What we need to store in order to estimate parameters.
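For the Bernoulli written in its exponential-family form, −∇ ln g(η) works out to σ(η) = µ, so the general condition above reduces to matching µ to the sample mean of the sufficient statistic u(x) = x. A sketch with made-up coin flips, checking the gradient condition numerically.

```python
import math

# Bernoulli in exponential-family form: g(eta) = sigma(-eta), so
# -d/d(eta) ln g(eta) = sigma(eta) = mu. The ML condition
# -grad ln g(eta_ML) = (1/N) sum_n u(x_n) says: set mu to the sample
# mean of the sufficient statistic u(x) = x.
data = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up coin flips

suff_stat = sum(data)             # all the estimator needs to store
mu_ml = suff_stat / len(data)     # 5/8
eta_ml = math.log(mu_ml / (1 - mu_ml))

def ln_g(eta):
    return math.log(1.0 / (1.0 + math.exp(eta)))  # ln sigma(-eta)

# central finite difference of ln g at eta_ml
eps = 1e-6
grad = (ln_g(eta_ml + eps) - ln_g(eta_ml - eps)) / (2 * eps)
assert math.isclose(-grad, mu_ml, rel_tol=1e-5)
print(mu_ml)  # 0.625
```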

Bayesian Probabilities

p(θ|d) = p(d|θ) p(θ) / p(d),  where p(d) = ∫ p(d|θ) p(θ) dθ

• p(d|θ) is the likelihood function
• p(θ) is the prior probability of (or our prior belief over) θ
  – our beliefs over what models are likely or not before seeing any data
• p(d) is the normalization constant or partition function
• p(θ|d) is the posterior distribution
  – Readjustment of our prior beliefs in the face of data

Example: Bayesian Inference For a 1D Gaussian

• Suppose we have a prior belief that the mean of some random variable X is µ_0, and the variance of our belief is σ_0²

• We are then given a dataset of samples of X, d = {x_1, …, x_N}, and somehow know that the variance of the data is σ²

What is the posterior distribution over (our belief about the value of) µ?

Example: Bayesian Inference For a 1D Gaussian

[figure: data points x_1, …, x_N]

[figure: data points x_1, …, x_N with the prior belief N(µ_0, σ_0²)]

Example: Bayesian Inference For a 1D Gaussian

• Remember from earlier:

p(µ|d) = p(d|µ) p(µ) / p(d)

• p(d|µ) is the likelihood function:

p(d|µ) = Π_{i=1}^N p(x_i|µ, σ) = Π_{i=1}^N (1/√(2πσ²)) exp{−(x_i − µ)²/(2σ²)}

• p(µ) is the prior probability of (or our prior belief over) µ:

p(µ|µ_0, σ_0) = (1/√(2πσ_0²)) exp{−(µ − µ_0)²/(2σ_0²)}

Example: Bayesian Inference For a 1D Gaussian

p(µ|D) ∝ p(D|µ) p(µ)
p(µ|D) = Normal(µ|µ_N, σ_N)

where

µ_N = (σ²/(Nσ_0² + σ²)) µ_0 + (Nσ_0²/(Nσ_0² + σ²)) µ_ML

1/σ_N² = 1/σ_0² + N/σ²
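The update formulas above are simple enough to evaluate directly. A minimal sketch; the dataset, known variance, and prior parameters are all made up.

```python
# Conjugate posterior update for the mean of a 1D Gaussian with known
# variance, following the formulas above. All numbers are made up.
data = [1.0, 2.0, 3.0]
sigma2 = 1.0               # known data variance
mu0, sigma0_2 = 0.0, 1.0   # prior belief N(mu0, sigma0^2)

n = len(data)
mu_ml = sum(data) / n      # maximum-likelihood estimate, 2.0

# mu_N interpolates between the prior mean and mu_ML, weighted by N
mu_n = (sigma2 / (n * sigma0_2 + sigma2)) * mu0 \
     + (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_ml
# precisions add: 1/sigma_N^2 = 1/sigma0^2 + N/sigma^2
sigma_n2 = 1.0 / (1.0 / sigma0_2 + n / sigma2)

print(mu_n, sigma_n2)  # 1.5 0.25
```

Note the limiting behavior: as N grows, µ_N approaches µ_ML and σ_N² shrinks toward 0, so the data eventually overwhelms the prior.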

Example: Bayesian Inference For a 1D Gaussian

[figure: data points with the prior belief N(µ_0, σ_0²)]

[figure: data points with the prior belief and the maximum-likelihood fit N(µ_ML, σ_ML²)]

[figure: data points with the prior belief, the maximum-likelihood fit, and the posterior distribution N(µ_N, σ_N²)]

Conjugate Priors

• Notice in the Gaussian parameter estimation example that the functional form of the posterior was that of the prior (Gaussian)

• Priors that lead to that form are called 'conjugate priors'

• For any member of the exponential family there exists a conjugate prior that can be written like

p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}

• Multiply by the likelihood to obtain a posterior (up to normalization) of the form

p(η|D, χ, ν) ∝ g(η)^(N+ν) exp{η^T (Σ_{n=1}^N u(x_n) + νχ)}

• Notice the addition to the sufficient statistic
• ν is the effective number of pseudo-observations.

Conjugate Priors – Examples

• Beta for Bernoulli/binomial
• Dirichlet for categorical/multinomial
• Normal for mean of Normal
• And many more...

• What are some properties of the conjugate prior for the covariance (or precision) matrix of a normal distribution?
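The Beta-Bernoulli pair from the list above makes the pseudo-observation idea concrete: a Beta(a, b) prior on µ behaves like having already seen a heads and b tails, and observing h heads and t tails gives a Beta(a + h, b + t) posterior in closed form. A sketch with made-up counts.

```python
# Beta-Bernoulli conjugacy sketch: Beta(a, b) prior on the Bernoulli
# parameter mu updates in closed form. All numbers are made up.
a, b = 2.0, 2.0                # prior pseudo-counts (nu-like role)
flips = [1, 1, 0, 1, 0, 1, 1]  # observed coin flips

h = sum(flips)                 # heads observed: 5
t = len(flips) - h             # tails observed: 2
a_post, b_post = a + h, b + t  # posterior is Beta(7, 4)

prior_mean = a / (a + b)
posterior_mean = a_post / (a_post + b_post)
print(prior_mean, posterior_mean)  # 0.5 then 7/11
```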