Probabilistic Graphical Models


Page 1: Probabilistic Graphical Models

Probabilistic Graphical Models

A tool for representing complex systems and performing sophisticated reasoning tasks.

Fundamental notion: modularity. Complex systems are built by combining simpler parts.

Why have a model?
- Compact and modular representation of complex systems
- Ability to execute complex reasoning patterns
- Make predictions
- Generalize from a particular problem

Page 2: Probabilistic Graphical Models

Probability Theory

A probability distribution P over (Ω, S) is a mapping from events in S to real values such that:
- P(α) ≥ 0 for all α ∈ S
- P(Ω) = 1
- If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)

Conditional probability: P(α | β) = P(α ∩ β) / P(β)

Chain rule: P(α ∩ β) = P(α) P(β | α)

Bayes rule: P(α | β) = P(β | α) P(α) / P(β)

Conditional independence: P(α ∩ β | γ) = P(α | γ) P(β | γ)
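As a quick illustration of these identities (not part of the original slides), here is a small Python sketch that checks the chain rule and Bayes rule on a toy joint distribution; the event names and probabilities are made up for the example.

# Toy joint distribution over two binary events A and B (made-up numbers).
P = {('a1', 'b1'): 0.12, ('a1', 'b0'): 0.18,
     ('a0', 'b1'): 0.28, ('a0', 'b0'): 0.42}

def marginal_A(a):
    return sum(p for (ai, _), p in P.items() if ai == a)

def marginal_B(b):
    return sum(p for (_, bi), p in P.items() if bi == b)

def cond_A_given_B(a, b):
    # Conditional probability: P(A=a | B=b) = P(A=a, B=b) / P(B=b)
    return P[(a, b)] / marginal_B(b)

# Chain rule: P(a, b) = P(a) P(b | a)
chain = marginal_A('a1') * (P[('a1', 'b1')] / marginal_A('a1'))
assert abs(chain - P[('a1', 'b1')]) < 1e-12

# Bayes rule: P(a | b) = P(b | a) P(a) / P(b)
bayes = (P[('a1', 'b1')] / marginal_A('a1')) * marginal_A('a1') / marginal_B('b1')
assert abs(bayes - cond_A_given_B('a1', 'b1')) < 1e-12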

Page 3: Probabilistic Graphical Models

Random Variables & Notation

Random variable: a function from Ω to a value (categorical / ordinal / continuous)
Val(X) – the set of possible values of RV X
Upper case letters denote RVs (e.g., X, Y, Z)
Upper case bold letters denote sets of RVs (e.g., X, Y)
Lower case letters denote RV values (e.g., x, y, z)
Lower case bold letters denote values of RV sets (e.g., x)
Values for a categorical RV with |Val(X)| = k are written x1, x2, …, xk

Marginal distribution over X: P(X)

Conditional independence: X is independent of Y given Z if

P(X=x | Y=y, Z=z) = P(X=x | Z=z) in P, for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

Page 4: Probabilistic Graphical Models

Expectation

Discrete RVs: E_P[X] = Σ_{x ∈ Val(X)} x P(x)

Continuous RVs: E_P[X] = ∫ x p(x) dx

Linearity of expectation: E[X+Y] = E[X] + E[Y]

E[X+Y] = Σ_{x,y} (x + y) P(x,y)
       = Σ_{x,y} x P(x,y) + Σ_{x,y} y P(x,y)
       = Σ_x x P(x) + Σ_y y P(y)
       = E[X] + E[Y]

Expectation of products (when X ⊥ Y in P): E[XY] = E[X] E[Y]

E[XY] = Σ_{x,y} x y P(x,y)
      = Σ_{x,y} x y P(x) P(y)    (independence assumption)
      = ( Σ_x x P(x) ) ( Σ_y y P(y) )
      = E[X] E[Y]
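A small Python check of these two identities on a made-up pair of independent discrete RVs (not from the slides):

# Two independent discrete RVs with made-up distributions.
P_X = {0: 0.3, 1: 0.7}
P_Y = {1: 0.5, 2: 0.2, 3: 0.3}

# Joint under independence: P(x, y) = P(x) P(y)
P_XY = {(x, y): px * py for x, px in P_X.items() for y, py in P_Y.items()}

E_X = sum(x * p for x, p in P_X.items())
E_Y = sum(y * p for y, p in P_Y.items())
E_sum = sum((x + y) * p for (x, y), p in P_XY.items())
E_prod = sum(x * y * p for (x, y), p in P_XY.items())

assert abs(E_sum - (E_X + E_Y)) < 1e-12   # linearity of expectation
assert abs(E_prod - E_X * E_Y) < 1e-12    # product rule requires independence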

Page 5: Probabilistic Graphical Models

Variance

Variance of an RV: Var_P[X] = E_P[(X - E_P[X])²]

Var[X] = E[(X - E[X])²]
       = E[X² - 2 X E[X] + E[X]²]
       = E[X²] - 2 E[X] E[X] + E[X]²
       = E[X²] - E[X]²

If X and Y are independent: Var[X+Y] = Var[X] + Var[Y]

Var[aX+b] = a² Var[X]

Page 6: Probabilistic Graphical Models

Information Theory

Entropy: H_P(X) = -Σ_x P(x) log P(x)

We use log base 2 to interpret entropy as bits of information. The entropy of X is a lower bound on the average number of bits needed to encode values of X. For any distribution P(X): 0 ≤ H_P(X) ≤ log |Val(X)|

Conditional entropy: H_P(X|Y) = -Σ_{x,y} P(x,y) log P(x|y) = H_P(X,Y) - H_P(Y)

Information only helps: H_P(X|Y) ≤ H_P(X)

Mutual information: I_P(X;Y) = Σ_{x,y} P(x,y) log [ P(x|y) / P(x) ] = H_P(X) - H_P(X|Y)

0 ≤ I_P(X;Y) ≤ H_P(X)
Symmetry: I_P(X;Y) = I_P(Y;X)
I_P(X;Y) = 0 iff X and Y are independent

Chain rule of entropies: H_P(X1,...,Xn) = H_P(X1) + H_P(X2|X1) + ... + H_P(Xn|X1,...,Xn-1)
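A minimal Python sketch (not from the slides) computing these quantities for a toy joint distribution; the probability values are made up:

from math import log2

# Toy joint distribution P(X, Y) with made-up values.
P = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
     ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}

P_X, P_Y = {}, {}
for (x, y), p in P.items():
    P_X[x] = P_X.get(x, 0.0) + p
    P_Y[y] = P_Y.get(y, 0.0) + p

H_X = -sum(p * log2(p) for p in P_X.values())    # H(X)
H_Y = -sum(p * log2(p) for p in P_Y.values())    # H(Y)
H_XY = -sum(p * log2(p) for p in P.values())     # H(X,Y)
H_X_given_Y = H_XY - H_Y                         # H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y                         # I(X;Y) = H(X) - H(X|Y)

print(H_X, H_X_given_Y, I_XY)                    # information only helps: H(X|Y) <= H(X)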

Page 7: Probabilistic Graphical Models

Distances Between Distributions

Relative entropy (KL divergence): D(P‖Q) = Σ_x P(x) log [ P(x) / Q(x) ]

D(P‖Q) ≥ 0, and D(P‖Q) = 0 iff P = Q. It is not a distance metric (it is not symmetric and does not satisfy the triangle inequality).

L1 distance: ‖P - Q‖₁ = Σ_x |P(x) - Q(x)|

L2 distance: ‖P - Q‖₂ = ( Σ_x (P(x) - Q(x))² )^(1/2)

L∞ distance: ‖P - Q‖∞ = max_x |P(x) - Q(x)|
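A short Python sketch (not from the slides) computing these four distances for two made-up distributions over the same support:

from math import log2, sqrt

# Two made-up distributions over the same support.
P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
Q = {'a': 0.4, 'b': 0.4, 'c': 0.2}

kl = sum(P[x] * log2(P[x] / Q[x]) for x in P)    # D(P||Q), assumes Q(x) > 0 wherever P(x) > 0
l1 = sum(abs(P[x] - Q[x]) for x in P)            # L1 distance
l2 = sqrt(sum((P[x] - Q[x]) ** 2 for x in P))    # L2 distance
linf = max(abs(P[x] - Q[x]) for x in P)          # L-infinity distance

print(kl, l1, l2, linf)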

Page 8: Probabilistic Graphical Models

Independent Random Variables

Two variables X and Y are independent if P(X=x | Y=y) = P(X=x) for all values x, y. Equivalently, knowing Y does not change predictions of X.

If X and Y are independent then: P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)

If X1,…,Xn are independent then: P(X1,…,Xn) = P(X1) … P(Xn)
- O(n) parameters
- All 2^n probabilities are implicitly defined
- Cannot represent many types of distributions

Page 9: Probabilistic Graphical Models

Conditional Independence

X and Y are conditionally independent given Z if P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z. Equivalently, if we know Z, then knowing Y does not change predictions of X.

Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)

Page 10: Probabilistic Graphical Models

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}
I = Intelligence, Val(I) = {i0, i1}

Joint parameterization P(I,S):
I   S   P(I,S)
i0  s0  0.665
i0  s1  0.035
i1  s0  0.06
i1  s1  0.24

Conditional parameterization P(S|I):
I    s0    s1
i0   0.95  0.05
i1   0.2   0.8

P(I):
i0   i1
0.7  0.3

P(I,S) = P(S|I) P(I)

Joint parameterization: 3 parameters. Conditional parameterization: 3 parameters.

Alternative parameterization: P(S) and P(I|S)
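A quick Python check (not in the slides) that the conditional parameterization reproduces the joint table above:

# Tables from the slide.
P_I = {'i0': 0.7, 'i1': 0.3}
P_S_given_I = {('s0', 'i0'): 0.95, ('s1', 'i0'): 0.05,
               ('s0', 'i1'): 0.2,  ('s1', 'i1'): 0.8}

# Reconstruct the joint: P(I=i, S=s) = P(S=s | I=i) P(I=i)
P_IS = {(i, s): P_S_given_I[(s, i)] * P_I[i] for i in P_I for s in ('s0', 's1')}

print(P_IS)  # matches the joint table (0.665, 0.035, 0.06, 0.24) up to floating point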

Page 11: Probabilistic Graphical Models

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}
I = Intelligence, Val(I) = {i0, i1}
G = Grade, Val(G) = {g0, g1, g2}
Assume that G and S are independent given I.

Joint parameterization: 2·2·3 = 12 entries, so 12 - 1 = 11 independent parameters.

Conditional parameterization: P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)
- P(I) – 1 independent parameter
- P(S|I) – 2·1 = 2 independent parameters
- P(G|I) – 2·2 = 4 independent parameters
In total, 7 independent parameters.

Page 12: Probabilistic Graphical Models

Biased Coin Toss Example

The coin can land in two positions: Head or Tail.

Estimation task: given toss examples x[1],...,x[m], estimate P(H) = θ and P(T) = 1 - θ.

Assumption: i.i.d. samples
- Tosses are controlled by an (unknown) parameter θ
- Tosses are sampled from the same distribution
- Tosses are independent of each other

Page 13: Probabilistic Graphical Models

Biased Coin Toss Example

Goal: find θ ∈ [0,1] that predicts the data well.

“Predicts the data well” = likelihood of the data given θ:

L(D:θ) = P(D|θ) = Π_{i=1..m} P(x[i] | x[1],...,x[i-1], θ) = Π_{i=1..m} P(x[i] | θ)

Example: probability of the sequence H,T,T,H,H

L(H,T,T,H,H : θ) = P(H|θ) P(T|θ) P(T|θ) P(H|θ) P(H|θ) = θ³ (1-θ)²

[Plot: L(D:θ) as a function of θ over [0,1]]
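A short Python sketch (not from the slides) evaluating this likelihood on a grid and locating its maximum:

# Likelihood of the sequence H,T,T,H,H as a function of theta.
def likelihood(theta, heads=3, tails=2):
    return theta ** heads * (1 - theta) ** tails

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(best)  # ~0.6, i.e. MH / (MH + MT)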

Page 14: Probabilistic Graphical Models

Maximum Likelihood Estimator

The MLE is the parameter θ̂ that maximizes L(D:θ).

In our example, θ = 0.6 maximizes the likelihood of the sequence H,T,T,H,H.

[Plot: L(D:θ) over [0,1], peaking at θ = 0.6]

Page 15: Probabilistic Graphical Models

Maximum Likelihood Estimator

General case: the observations are MH heads and MT tails. Find θ maximizing the likelihood, which is equivalent to maximizing the log-likelihood:

L(MH, MT : θ) = θ^MH (1-θ)^MT

l(MH, MT : θ) = MH log θ + MT log(1-θ)

Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is:

θ̂ = MH / (MH + MT)

Page 16: Probabilistic Graphical Models

Sufficient Statistics

For computing the parameter θ of the coin toss example we only needed MH and MT, since

L(D:θ) = θ^MH (1-θ)^MT

MH and MT are sufficient statistics.

Page 17: Probabilistic Graphical Models

Sufficient Statistics

A function s(D) is a sufficient statistic, mapping instances to a vector in R^k, if for any two datasets D and D' and any θ we have

Σ_{x[i] ∈ D} s(x[i]) = Σ_{x'[i] ∈ D'} s(x'[i])   ⟹   L(D:θ) = L(D':θ)

[Diagram: many datasets map to the same sufficient statistics]

Page 18: Probabilistic Graphical Models

Sufficient Statistics for Multinomial

A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts <M1,...,Mk>, where Mi is the number of times that Y=yi appears in D.

Sufficient statistic: define s(x[i]) as a tuple of dimension k, s(x[i]) = (0,...,0,1,0,...,0), with the single 1 in the position corresponding to the observed value (positions 1,...,i-1 and i+1,...,k are 0).

L(D:θ) = Π_{i=1..k} θ_i^{M_i},   where M_i = Σ_{x[j] ∈ D} s_i(x[j])

Page 19: Probabilistic Graphical Models

Sufficient Statistic for Gaussian

Gaussian distribution: P(X) ~ N(μ, σ²) if

p(x) = (1 / (√(2π) σ)) exp( -(x-μ)² / (2σ²) )

Rewrite as

p(x) = (1 / (√(2π) σ)) exp( -x²/(2σ²) + μx/σ² - μ²/(2σ²) )

so the exponent is linear in <1, x, x²>. Sufficient statistics for the Gaussian: s(x[i]) = <1, x, x²>

Page 20: Probabilistic Graphical Models

Maximum Likelihood Estimation

MLE principle: choose θ̂ that maximizes L(D:θ)

Multinomial MLE: θ̂_i = M_i / Σ_j M_j

Gaussian MLE:
μ̂ = (1/M) Σ_m x[m]
σ̂ = sqrt( (1/M) Σ_m (x[m] - μ̂)² )
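A small Python sketch (not part of the slides) of both estimators on made-up data:

from collections import Counter
from math import sqrt

# Multinomial MLE from counts (made-up categorical data).
data = ['a', 'b', 'a', 'c', 'a', 'b']
counts = Counter(data)
theta_hat = {y: m / len(data) for y, m in counts.items()}   # theta_i = M_i / sum_j M_j

# Gaussian MLE (made-up continuous data).
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat = sum(xs) / len(xs)                                       # ML mean
sigma_hat = sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))   # ML std (divides by M, not M-1)

print(theta_hat, mu_hat, sigma_hat)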

Page 21: Probabilistic Graphical Models

MLE for Bayesian Networks

Network: X → Y

Parameters:
θ_x0, θ_x1
θ_y0|x0, θ_y1|x0, θ_y0|x1, θ_y1|x1

P(X):
x0   x1
0.7  0.3

P(Y|X):
X    y0    y1
x0   0.95  0.05
x1   0.2   0.8

Data instance: tuple <x[m], y[m]>

Likelihood:

L(D:θ) = Π_{m=1..M} P(x[m], y[m] : θ)
       = Π_{m=1..M} P(x[m] : θ) P(y[m] | x[m] : θ)
       = ( Π_{m=1..M} P(x[m] : θ) ) ( Π_{m=1..M} P(y[m] | x[m] : θ) )

Likelihood decomposes into two separate terms, one for each variable

Page 22: Probabilistic Graphical Models

MLE for Bayesian Networks

Terms further decompose by CPDs:

Π_{m=1..M} P(y[m] | x[m] : θ_{Y|X})
  = Π_{m: x[m]=x0} P(y[m] | x[m] : θ_{Y|x0}) · Π_{m: x[m]=x1} P(y[m] | x[m] : θ_{Y|x1})

By sufficient statistics:

Π_{m: x[m]=x0} P(y[m] | x[m] : θ_{Y|x0}) = θ_{y0|x0}^{M[x0,y0]} · θ_{y1|x0}^{M[x0,y1]}

where M[x0,y0] is the number of data instances in which X takes the value x0 and Y takes the value y0.

MLE:

θ̂_{y0|x0} = M[x0,y0] / (M[x0,y0] + M[x0,y1]) = M[x0,y0] / M[x0]

Page 23: Probabilistic Graphical Models

MLE for Bayesian Networks

Likelihood for a Bayesian network:

L(D:θ) = Π_m P(x1[m],...,xn[m] : θ)
       = Π_m Π_i P(xi[m] | Pa_i[m] : θ)
       = Π_i [ Π_m P(xi[m] | Pa_i[m] : θ_i) ]
       = Π_i L_i(D : θ_i)

If the parameter sets θ_{Xi|Pa(Xi)} are disjoint, then the MLE can be computed by maximizing each local likelihood separately.

Page 24: Probabilistic Graphical Models

MLE for Table CPD Bayesian Networks

Multinomial CPD:

L(D : θ_{Y|X}) = Π_{x ∈ Val(X)} Π_{y ∈ Val(Y)} θ_{y|x}^{M[x,y]}

For each value x ∈ Val(X) we get an independent multinomial problem, where the MLE is

θ̂_{y|x} = M[x,y] / M[x]
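A minimal Python sketch (not from the slides) of table-CPD MLE by counting, for the X → Y network above; the dataset is made up:

from collections import Counter

# Made-up data instances <x[m], y[m]> for the network X -> Y.
data = [('x0', 'y0'), ('x0', 'y0'), ('x0', 'y1'),
        ('x1', 'y1'), ('x1', 'y1'), ('x1', 'y0')]

M_xy = Counter(data)                       # joint counts M[x, y]
M_x = Counter(x for x, _ in data)          # marginal counts M[x]

# MLE for P(X): theta_x = M[x] / M
theta_X = {x: m / len(data) for x, m in M_x.items()}

# MLE for P(Y|X): theta_{y|x} = M[x, y] / M[x]
theta_Y_given_X = {(y, x): M_xy[(x, y)] / M_x[x] for (x, y) in M_xy}

print(theta_X, theta_Y_given_X)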

Page 25: Probabilistic Graphical Models

Limitations of MLE

Two teams play 10 times, and the first team wins 7 of the 10 matches. MLE: probability of the first team winning = 0.7.

A coin is tossed 10 times and comes up heads in 7 of the 10 tosses. MLE: probability of heads = 0.7.

Would you place the same bet on the next game as you would on the next coin toss?

We need to incorporate prior knowledge. Prior knowledge should only be used as a guide.

Page 26: Probabilistic Graphical Models

Bayesian Inference

Assumptions:
- Given a fixed θ, tosses are independent
- If θ is unknown, tosses are not marginally independent: each toss tells us something about θ

The following network captures our assumptions:

[Network: θ is a parent of X[1], X[2], ..., X[M]]

P(x[m] | θ) = θ if x[m] = x1, and 1 - θ if x[m] = x0

Page 27: Probabilistic Graphical Models

Bayesian Inference

Joint probabilistic model:

P(x[1],...,x[M], θ) = P(x[1],...,x[M] | θ) P(θ) = P(θ) Π_{i=1..M} P(x[i] | θ) = P(θ) θ^MH (1-θ)^MT

Posterior probability over θ:

P(θ | x[1],...,x[M]) = P(x[1],...,x[M] | θ) P(θ) / P(x[1],...,x[M])

i.e., likelihood × prior / normalizing factor.

For a uniform prior, the posterior is the normalized likelihood.

Page 28: Probabilistic Graphical Models

Bayesian Prediction

Predict the next data instance from the previous ones:

P(x[M+1] | x[1],...,x[M]) = ∫ P(x[M+1], θ | x[1],...,x[M]) dθ
  = ∫ P(x[M+1] | θ, x[1],...,x[M]) P(θ | x[1],...,x[M]) dθ
  = ∫ P(x[M+1] | θ) P(θ | x[1],...,x[M]) dθ

Solving for a uniform prior P(θ) = 1 and a binomial variable:

P(x[M+1] = H | x[1],...,x[M]) = (MH + 1) / (MH + MT + 2)
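A short Python comparison (not from the slides) of the MLE and the Bayesian prediction under a uniform prior:

def mle(heads, tails):
    return heads / (heads + tails)

def bayesian_uniform_prior(heads, tails):
    # Posterior-predictive P(next = H) under a uniform prior: (MH + 1) / (MH + MT + 2)
    return (heads + 1) / (heads + tails + 2)

print(mle(4, 1))                      # 0.8
print(bayesian_uniform_prior(4, 1))   # 5/7 ~= 0.714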

Page 29: Probabilistic Graphical Models

Example: Binomial Data

Data: (MH, MT) = (4, 1)

Prior: uniform for θ in [0,1], i.e., P(θ) = 1. The posterior P(θ|D) is then proportional to the likelihood L(D:θ):

P(θ | x[1],...,x[M]) ∝ P(x[1],...,x[M] | θ)

The MLE for P(X=H) is 4/5 = 0.8. The Bayesian prediction is

P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.7142

[Plot: posterior over θ for (MH, MT) = (4, 1)]

Page 30: Probabilistic Graphical Models

Dirichlet Priors

A Dirichlet prior is specified by a set of (non-negative) hyperparameters α1,...,αk, so that θ ~ Dirichlet(α1,...,αk) if

P(θ) = (1/Z) Π_{i=1..k} θ_i^{α_i - 1}

where Z = Π_{i=1..k} Γ(α_i) / Γ(Σ_{i=1..k} α_i) and Γ(x) = ∫_0^∞ t^{x-1} e^{-t} dt

Intuitively, the hyperparameters correspond to the number of imaginary examples we saw before starting the experiment.

Page 31: Probabilistic Graphical Models

Dirichlet Priors – Example

[Plot: Dirichlet density over θ ∈ [0,1] for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5)]

Page 32: Probabilistic Graphical Models

Dirichlet Priors

Dirichlet priors have the property that the posterior is also Dirichlet:
- Data counts are M1,...,Mk
- Prior is Dirichlet(α1,...,αk)
- Posterior is Dirichlet(α1+M1,...,αk+Mk)

The hyperparameters α1,…,αk can be thought of as “imaginary” counts from our prior experience.

Equivalent sample size = α1+…+αk. The larger the equivalent sample size, the more confident we are in our prior.
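A tiny Python sketch (not from the slides) of this conjugate update and the resulting posterior-mean prediction; the hyperparameters and counts are made up:

# Dirichlet-multinomial conjugate update: posterior hyperparameters are alpha_i + M_i.
alpha = {'y1': 5.0, 'y2': 5.0}          # prior hyperparameters ("imaginary" counts)
counts = {'y1': 1, 'y2': 4}             # observed data counts M_i

posterior = {y: alpha[y] + counts[y] for y in alpha}
total = sum(posterior.values())
predictive = {y: a / total for y, a in posterior.items()}   # P(next = y | D)

print(posterior, predictive)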

Page 33: Probabilistic Graphical Models

Effect of Priors

[Plot: prediction of P(X=H) after seeing data with MH = 1/4 MT, as a function of the sample size. One panel: different prior strengths αH + αT with a fixed ratio αH / αT. Other panel: fixed strength αH + αT with different ratios αH / αT.]

Page 34: Probabilistic Graphical Models

Effect of Priors (cont.)

[Plot: P(X = 1 | D) as a function of the number of samples N, for the MLE and for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, shown together with the individual toss results.]

In real data, Bayesian estimates are less sensitive to noise in the data.

Page 35: Probabilistic Graphical Models

General Formulation

Joint distribution over D, θ:

P(D, θ) = P(D | θ) P(θ)

Posterior distribution over parameters:

P(θ | D) = P(D | θ) P(θ) / P(D)

P(D) is the marginal likelihood of the data:

P(D) = ∫ P(D | θ) P(θ) dθ

As we saw, the likelihood can be described compactly using sufficient statistics. We want conditions under which the posterior is also compact.

Page 36: Probabilistic Graphical Models

Conjugate Families

A family of priors P(θ:α) is conjugate to a model P(x|θ) if, for any possible dataset D of i.i.d. samples from P(x|θ) and any choice of hyperparameters α for the prior over θ, there are hyperparameters α' that describe the posterior, i.e.,

P(θ:α') ∝ P(D|θ) P(θ:α)

The posterior has the same parametric form as the prior. The Dirichlet prior is a conjugate family for the multinomial likelihood.

Conjugate families are useful since:
- Many distributions can be represented with hyperparameters
- They allow for sequential updates within the same representation
- In many cases we have closed-form solutions for prediction

Page 37: Probabilistic Graphical Models

Parameter Estimation Summary

Estimation relies on sufficient statistics. For multinomials these are the counts M[xi, pai].

Parameter estimation:

MLE:                  θ̂_{xi|pai} = M[xi, pai] / M[pai]
Bayesian (Dirichlet): P(xi | pai, D) = (M[xi, pai] + α_{xi,pai}) / (M[pai] + α_{pai}),  with α_{pai} = Σ_{xi} α_{xi,pai}

Bayesian methods also require a choice of priors. MLE and Bayesian estimation are asymptotically equivalent. Both can be implemented in an online manner by accumulating sufficient statistics.

Page 38: Probabilistic Graphical Models

This Week’s Assignment

Compute P(S):
- Decompose as a Markov model of order k
- Collect sufficient statistics (see the sketch below)
- Use the ratio to the genome background

Evaluation & deliverable:
- Test-set likelihood ratio against random locations & sequences
- ROC analysis (ranking)
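A minimal Python sketch of the sufficient-statistics and scoring steps, assuming DNA-like sequences over {A,C,G,T}; the function names, the order k, the pseudocount, and the example sequences are illustrative assumptions, not part of the assignment specification:

from collections import Counter
from math import log

def markov_counts(sequences, k):
    """Sufficient statistics for an order-k Markov model: counts of (k-mer context, next symbol)."""
    ctx_counts, pair_counts = Counter(), Counter()
    for s in sequences:
        for i in range(k, len(s)):
            ctx, nxt = s[i - k:i], s[i]
            ctx_counts[ctx] += 1
            pair_counts[(ctx, nxt)] += 1
    return ctx_counts, pair_counts

def log_likelihood(seq, ctx_counts, pair_counts, k, alphabet='ACGT', pseudo=1.0):
    """log P(S) under the order-k model, with a pseudocount so unseen contexts do not give -inf."""
    ll = 0.0
    for i in range(k, len(seq)):
        ctx, nxt = seq[i - k:i], seq[i]
        num = pair_counts[(ctx, nxt)] + pseudo
        den = ctx_counts[ctx] + pseudo * len(alphabet)
        ll += log(num / den)
    return ll

# Score = log-likelihood ratio of a candidate sequence under the signal model vs. the background model.
signal_stats = markov_counts(['ACGTACGT', 'ACGTTT'], k=2)
background_stats = markov_counts(['GGGGCCCCAAAATTTT'], k=2)
s = 'ACGTAC'
score = log_likelihood(s, *signal_stats, k=2) - log_likelihood(s, *background_stats, k=2)
print(score)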

Page 39: Probabilistic Graphical Models

Hidden Markov Model

A special case of Dynamic Bayesian network:
- Single (hidden) state variable
- Single (observed) observation variable
- Transition probability P(S'|S) is assumed to be sparse
- Usually encoded by a state transition graph

[Diagram: two-slice template network with edges S → S' and S' → O'; unrolled network S0 → S1 → S2 → S3, with each state St emitting an observation Ot]

Page 40: Probabilistic Graphical Models

Hidden Markov Model

A special case of Dynamic Bayesian network:
- Single (hidden) state variable
- Single (observed) observation variable
- Transition probability P(S'|S) is assumed to be sparse
- Usually encoded by a state transition graph

State transition representation, P(S'|S):

      s1   s2   s3   s4
s1    0.2  0.8  0    0
s2    0    0    1    0
s3    0.4  0    0    0.6
s4    0    0.5  0    0.5

[Diagram: state transition graph over states S1–S4 with these edge probabilities]
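To make the sparse-transition idea concrete, here is a small Python sketch (not from the slides) that stores this transition table and samples a state sequence from it:

import random

# P(S'|S) from the slide's state transition table; zero entries are simply omitted (sparsity).
transitions = {
    's1': {'s1': 0.2, 's2': 0.8},
    's2': {'s3': 1.0},
    's3': {'s1': 0.4, 's4': 0.6},
    's4': {'s2': 0.5, 's4': 0.5},
}

def sample_states(start, length, rng=random.Random(0)):
    """Sample a state sequence by repeatedly drawing S' ~ P(S'|S)."""
    states = [start]
    for _ in range(length - 1):
        nxt = transitions[states[-1]]
        states.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return states

print(sample_states('s1', 10))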