graphical models - disi, university of...
TRANSCRIPT
Probability distributions
Bernoulli distribution
Two possible values (outcomes): 1 (success), 0 (failure).
Parameters: p, the probability of success.
Probability mass function:

P(x; p) = p       if x = 1
          1 − p   if x = 0

Example: tossing a coin
Head (success) and tail (failure) are the possible outcomes
p is the probability of head
Graphical Models
Probability distributions
Multinomial distribution (one sample)
Models the probability of a certain outcome for an event with m possible outcomes {v1, . . . , vm}
Parameters: p1, . . . , pm, the probability of each outcome
Probability mass function:

P(vi; p1, . . . , pm) = pi

Example: tossing a die
m is the number of faces
pi is the probability of obtaining face i
Graphical Models
Probability distributions
Gaussian (or normal) distribution
Bell-shaped curve.
Parameters: µ mean, σ² variance.
Probability density function:

p(x; µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )

Standard normal distribution: N(0, 1)
Standardization of a normal distribution N(µ, σ²):

z = (x − µ) / σ
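A minimal Python sketch of these distributions (the function names and example values are ours, not from the slides):

```python
import math

def bernoulli_pmf(x, p):
    """P(x; p): p if x = 1, 1 - p if x = 0."""
    return p if x == 1 else 1 - p

def multinomial_pmf(i, probs):
    """P(v_i; p_1, ..., p_m) = p_i (one sample, e.g. one die toss)."""
    return probs[i]

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def standardize(x, mu, sigma):
    """z-score mapping an N(mu, sigma^2) value to N(0, 1)."""
    return (x - mu) / sigma

print(bernoulli_pmf(1, 0.5))              # fair coin, head
print(multinomial_pmf(2, [1 / 6] * 6))    # fair die, face 3
print(gaussian_pdf(0.0, 0.0, 1.0), standardize(7.0, 5.0, 2.0))
```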
Graphical Models
Conditional probabilities
conditional probability: the probability of x once y is observed

P(x|y) = P(x, y) / P(y)

statistical independence: variables X and Y are statistically independent iff

P(x, y) = P(x)P(y)

implying:

P(x|y) = P(x)        P(y|x) = P(y)
Graphical Models
Basic rules
law of total probability: the marginal distribution of a variable is obtained from a joint distribution summing over all possible values of the other variable (sum rule)

P(x) = ∑_{y∈Y} P(x, y)        P(y) = ∑_{x∈X} P(x, y)

product rule: the conditional probability definition implies that

P(x, y) = P(x|y)P(y) = P(y|x)P(x)

Bayes' rule

P(y|x) = P(x|y)P(y) / P(x)
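These three rules can be checked numerically on a small joint distribution; a sketch (the joint table below is hypothetical, chosen only for illustration):

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
P_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_x(x):            # sum rule: P(x) = sum_y P(x, y)
    return sum(p for (xi, y), p in P_xy.items() if xi == x)

def marginal_y(y):            # sum rule: P(y) = sum_x P(x, y)
    return sum(p for (x, yi), p in P_xy.items() if yi == y)

def cond_x_given_y(x, y):     # conditional probability definition
    return P_xy[(x, y)] / marginal_y(y)

def bayes_y_given_x(y, x):    # Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)

# product rule check: P(x, y) = P(x|y) P(y)
assert abs(P_xy[(1, 1)] - cond_x_given_y(1, 1) * marginal_y(1)) < 1e-12
print(bayes_y_given_x(1, 0))
```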
Graphical Models
Playing with probabilities
Use rules!
Basic rules allow us to compute a certain probability given knowledge of some related ones
All our manipulations will be applications of the three basic rules
Basic rules apply to any number of variables:

P(y) = ∑_x ∑_z P(x, y, z)                                 (sum rule)
     = ∑_x ∑_z P(y|x, z)P(x, z)                           (product rule)
     = ∑_x ∑_z P(x|y, z)P(y|z)P(x, z) / P(x|z)            (Bayes' rule)
Graphical Models
Playing with probabilities
Example

P(y|x, z) = P(x, z|y)P(y) / P(x, z)                       (Bayes' rule)
          = P(x, z|y)P(y) / (P(x|z)P(z))                  (product rule)
          = P(x|z, y)P(z|y)P(y) / (P(x|z)P(z))            (product rule)
          = P(x|z, y)P(z, y) / (P(x|z)P(z))               (product rule)
          = P(x|z, y)P(y|z)P(z) / (P(x|z)P(z))            (product rule)
          = P(x|z, y)P(y|z) / P(x|z)
Graphical Models
Graphical models
Why
All probabilistic inference and learning amounts to repeated applications of the sum and product rules
Probabilistic graphical models are graphical representations of the qualitative aspects of probability distributions, allowing us to:
visualize the structure of a probabilistic model in a simple and intuitive way
discover properties of the model, such as conditional independencies, by inspecting the graph
express complex computations for inference and learning in terms of graphical manipulations
represent multiple probability distributions with the same graph, abstracting from their quantitative aspects (e.g. discrete vs continuous distributions)
Graphical Models
Bayesian Networks (BN)
Description
Directed graphical models representing joint distributions

p(a, b, c) = p(c|a, b)p(a, b)
           = p(c|a, b)p(b|a)p(a)

Fully connected graphs contain no independency assumption

[Figure: fully connected directed graph over nodes a, b, c]

Joint probabilities can be decomposed in many different ways by the product rule
Assuming an ordering of the variables, a probability distribution over K variables can be decomposed into:

p(x1, . . . , xK) = p(xK|x1, . . . , xK−1) · · · p(x2|x1)p(x1)
Graphical Models
Bayesian Networks
Modeling independencies
A graph that is not fully connected contains independency assumptions.
The joint probability of a Bayesian network with K nodes is computed as the product of the probabilities of the nodes given their parents:

p(x) = ∏_{i=1}^{K} p(xi | pai)

[Figure: BN over x1, . . . , x7; x4 has parents x1, x2, x3; x5 has parents x1, x3; x6 has parent x4; x7 has parents x4, x5]

The joint probability for the BN in the figure is written as

p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5)
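As a small illustration, the factorization above can be generated mechanically from the parent sets; a sketch (the dictionary simply transcribes the structure described in the figure):

```python
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

def factorization(parents):
    """Build the product of p(x_i | pa_i) terms as a string."""
    factors = []
    for var in sorted(parents):
        pa = parents[var]
        factors.append(f"p({var}|{','.join(pa)})" if pa else f"p({var})")
    return "".join(factors)

print(factorization(parents))
# p(x1)p(x2)p(x3)p(x4|x1,x2,x3)p(x5|x1,x3)p(x6|x4)p(x7|x4,x5)
```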
Graphical Models
Conditional independence
Introduction
Two variables a, b are conditionally independent (written a ⊥⊥ b | ∅) if:

p(a, b) = p(a)p(b)

Two variables a, b are conditionally independent given c (written a ⊥⊥ b | c) if:

p(a, b|c) = p(a|c)p(b|c)

Independency assumptions can be verified by repeated applications of sum and product rules
Graphical models allow us to directly verify them through the d-separation criterion
Graphical Models
d-separation
Tail-to-tail
Joint distribution:

p(a, b, c) = p(a|c)p(b|c)p(c)

a and b are not conditionally independent (written a ⊥̸⊥ b | ∅):

p(a, b) = ∑_c p(a|c)p(b|c)p(c) ≠ p(a)p(b)

[Figure: node c with arrows to a and b]

a and b are conditionally independent given c:

p(a, b|c) = p(a, b, c) / p(c) = p(a|c)p(b|c)

c is tail-to-tail wrt the path a → b as it is connected to the tails of the two arrows
Graphical Models
d-separation
Head-to-tail
Joint distribution:

p(a, b, c) = p(b|c)p(c|a)p(a) = p(b|c)p(a|c)p(c)

a and b are not conditionally independent:

p(a, b) = p(a) ∑_c p(b|c)p(c|a) ≠ p(a)p(b)

[Figure: chain a → c → b]

a and b are conditionally independent given c:

p(a, b|c) = p(b|c)p(a|c)p(c) / p(c) = p(b|c)p(a|c)

c is head-to-tail wrt the path a → b as it is connected to the head of one arrow and to the tail of the other
Graphical Models
d-separation
Head-to-head
Joint distribution:

p(a, b, c) = p(c|a, b)p(a)p(b)

a and b are conditionally independent:

p(a, b) = ∑_c p(c|a, b)p(a)p(b) = p(a)p(b)

[Figure: a and b with arrows into c]

a and b are not conditionally independent given c:

p(a, b|c) = p(c|a, b)p(a)p(b) / p(c) ≠ p(a|c)p(b|c)

c is head-to-head wrt the path a → b as it is connected to the heads of the two arrows
Graphical Models
d-separation
General Head-to-head
Let a descendant of a node x be any node which can be reached from x with a path following the direction of the arrows
A head-to-head node c unblocks the dependency path between its parents if either itself or any of its descendants receives evidence
Graphical Models
Example of head-to-head connection
A toy regulatory network: CPTs

Genes A and B have independent prior probabilities:

gene  value     P(value)
A     active    0.3
A     inactive  0.7

gene  value     P(value)
B     active    0.3
B     inactive  0.7

Gene C can be enhanced by both A and B:

              A active                A inactive
              B active   B inactive   B active   B inactive
C active      0.9        0.6          0.7        0.1
C inactive    0.1        0.4          0.3        0.9
Graphical Models
Example of head-to-head connection
Probability of A active (1)
Prior:
P(A = 1) = 1− P(A = 0) = 0.3
Posterior after observing an active C:

P(A = 1|C = 1) = P(C = 1|A = 1)P(A = 1) / P(C = 1) ≈ 0.514

Note
The probability that A is active increases after observing that its regulated gene C is active
Graphical Models
Example of head-to-head connection
Derivation

P(C = 1|A = 1) = ∑_{B∈{0,1}} P(C = 1, B|A = 1)
               = ∑_{B∈{0,1}} P(C = 1|B, A = 1)P(B|A = 1)
               = ∑_{B∈{0,1}} P(C = 1|B, A = 1)P(B)

P(C = 1) = ∑_{B∈{0,1}} ∑_{A∈{0,1}} P(C = 1, B, A)
         = ∑_{B∈{0,1}} ∑_{A∈{0,1}} P(C = 1|B, A)P(B)P(A)
Graphical Models
Example of head-to-head connection
Probability of A active

Posterior after observing that B is also active:

P(A = 1|C = 1, B = 1) = P(C = 1|A = 1, B = 1)P(A = 1|B = 1) / P(C = 1|B = 1) ≈ 0.355

Note
The probability that A is active decreases after observing that B is also active
The B condition explains away the observation that C is active
The probability is still greater than the prior one (0.3), because the observation that C is active still gives some evidence in favour of an active A
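A short sketch that reproduces the two posteriors above (≈ 0.514 and ≈ 0.355) directly from the CPTs of the toy regulatory network, using only the sum and product rules (variable and function names are ours):

```python
from itertools import product

# CPTs of the toy regulatory network (1 = active, 0 = inactive)
P_A = {1: 0.3, 0: 0.7}
P_B = {1: 0.3, 0: 0.7}
P_C1 = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.1}   # P(C=1 | A, B)

def p_c1(a=None, b=None):
    """P(C=1), optionally conditioned on A and/or B, summing out free parents."""
    total = 0.0
    for av, bv in product((0, 1), repeat=2):
        if (a is not None and av != a) or (b is not None and bv != b):
            continue
        w = (1.0 if a is not None else P_A[av]) * (1.0 if b is not None else P_B[bv])
        total += P_C1[(av, bv)] * w
    return total

# P(A=1 | C=1) = P(C=1 | A=1) P(A=1) / P(C=1)
print(p_c1(a=1) * P_A[1] / p_c1())          # about 0.514

# P(A=1 | C=1, B=1): explaining away (A and B independent a priori)
print(P_C1[(1, 1)] * P_A[1] / p_c1(b=1))    # about 0.355
```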
Graphical Models
General d-separation criterion
d-separation definition
Given a generic Bayesian network
Given A, B, C arbitrary nonintersecting sets of nodes
The sets A and B are d-separated by C if:

All paths from any node in A to any node in B are blocked

A path is blocked if it includes at least one node such that either:
the arrows on the path meet tail-to-tail or head-to-tail at the node, and the node is in C, or
the arrows on the path meet head-to-head at the node, and neither the node nor any of its descendants is in C

d-separation implies conditional independency

The sets A and B are independent given C (A ⊥⊥ B | C) if they are d-separated by C.
Graphical Models
Example of general d-separation
a ⊥̸⊥ b | c

Nodes a and b are not d-separated by c:
Node f is tail-to-tail and not observed
Node e is head-to-head and its child c is observed

[Figure: graph over nodes a, b, c, e, f, with c shaded (observed)]

a ⊥⊥ b | f

Nodes a and b are d-separated by f:
Node f is tail-to-tail and observed

[Figure: same graph, with f shaded (observed)]
Graphical Models
Example of i.i.d. samples
Maximum-likelihood

We are given a set of instances D = {x1, . . . , xN} drawn from a univariate Gaussian with unknown mean µ
All paths between xi and xj are blocked if we condition on µ
The examples are independent of each other given µ:

p(D|µ) = ∏_{i=1}^{N} p(xi|µ)

[Figure: BN with µ pointing to x1, . . . , xN, and the equivalent plate notation with node xn inside a plate labelled N]

A set of nodes with the same variable type and connections can be compactly represented using the plate notation
Graphical Models
Markov blanket (or boundary)
Definition
Given a directed graph with D nodes
The Markov blanket of node xi is the minimal set of nodes making xi independent of the rest of the graph:

p(xi | x_{j≠i}) = p(x1, . . . , xD) / p(x_{j≠i})
              = p(x1, . . . , xD) / ∫ p(x1, . . . , xD) dxi
              = ∏_{k=1}^{D} p(xk|pak) / ∫ ∏_{k=1}^{D} p(xk|pak) dxi

All components which do not include xi will cancel between numerator and denominator
The only remaining components are:
p(xi|pai), the probability of xi given its parents
p(xj|paj) where paj includes xi ⇒ the children of xi with their co-parents
Graphical Models
Markov blanket (or boundary)
d-separation
Each parent xj of xi will be head-to-tail or tail-to-tail in the path between xi and any of xj's other neighbours ⇒ blocked
Each child xj of xi will be head-to-tail in the path between xi and any of xj's children ⇒ blocked

[Figure: node xi with its Markov blanket: parents, children and co-parents]

Each co-parent xk of a child xj of xi will be head-to-tail or tail-to-tail in the path between xj and any of xk's other neighbours ⇒ blocked
Graphical Models
Markov Networks (MN) or Markov random fields
Undirected graphical models
Graphically model the joint probability of a set of variables encoding independency assumptions (as in BN)
Do not assign a directionality to pairwise interactions

[Figure: undirected graph over nodes A, B, C, D]

Do not assume that local interactions are proper conditional probabilities (more freedom in parametrizing them)
Graphical Models
Markov Networks (MN)
[Figure: undirected graph with node sets A, B, C]

d-separation definition
Given A, B, C arbitrary nonintersecting sets of nodes
The sets A and B are d-separated by C if:

All paths from any node in A to any node in B are blocked

A path is blocked if it includes at least one node from C (much simpler than in BN)
The sets A and B are independent given C (A ⊥⊥ B | C) if they are d-separated by C.
Graphical Models
Markov Networks (MN)
Markov blanket
The Markov blanket of a variable is simply the set of its immediate neighbours
Graphical Models
Markov Networks (MN)
Factorization properties
We need to decompose the joint probability into factors (like P(xi|pai) in BN) in a way consistent with the independency assumptions.
Two nodes xi and xj not connected by a link are independent given the rest of the network (any path between them contains at least another node) ⇒ they should stay in separate factors

Factors are associated with cliques in the network (i.e. fully connected subnetworks)
A possible option is to associate factors only with maximal cliques

[Figure: undirected graph over x1, x2, x3, x4 with its cliques]
Graphical Models
Markov Networks (MN)
Joint probability (1)

p(x) = (1/Z) ∏_C Ψ_C(x_C)

Ψ_C(x_C) : Val(x_C) → IR+ is a positive function of the value of the variables in clique x_C (called clique potential)
The joint probability is the (normalized) product of clique potentials for all (maximal) cliques in the network
The partition function Z is a normalization constant assuring that the result is a probability (i.e. ∑_x p(x) = 1):

Z = ∑_x ∏_C Ψ_C(x_C)
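A small sketch of this definition for a tiny network over three binary variables with two hypothetical clique potentials, showing the role of the partition function Z:

```python
from itertools import product

def psi12(x1, x2):
    return 2.0 if x1 == x2 else 1.0      # made-up positive potential values

def psi23(x2, x3):
    return 3.0 if x2 == x3 else 0.5

def unnormalized(x1, x2, x3):
    return psi12(x1, x2) * psi23(x2, x3)

# partition function: sum of the product of potentials over all configurations
Z = sum(unnormalized(*x) for x in product((0, 1), repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(sum(p(*x) for x in product((0, 1), repeat=3)))   # 1.0
```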
Graphical Models
Markov Networks (MN)
Joint probability (2)
In order to guarantee that potential functions are positive, they are typically represented as exponentials:

Ψ_C(x_C) = exp(−E(x_C))

E(x_C) is called the energy function: low energy means a highly probable configuration
The joint probability becomes the exponential of a (negated) sum of energies:

p(x) = (1/Z) exp( −∑_C E(x_C) )
Graphical Models
Markov Networks (MN)
Comparison to BN
Advantage:
More freedom in defining the clique potentials, as they don't need to represent a probability distribution

Problem:
The partition function Z has to be computed summing over all possible values of the variables, which is intractable in general

Solutions:
Efficient algorithms exist for certain types of models (e.g. linear chains)
Otherwise Z is approximated
Graphical Models
MN Example: image de-noising
Description
We are given an image as a grid of binary pixels yi ∈ {−1, +1}
The image is a noisy version of the original one, where each pixel was flipped with a certain (low) probability
The aim is to design a model capable of recovering the clean version
Graphical Models
MN Example: image de-noising
[Figure: grid Markov network with latent pixels xi, each connected to its observed noisy version yi]

Markov network
We represent the true image as a grid of points xi, and connect each point to its observed noisy version yi
The MN has two maximal clique types:
{xi, yi} for the true/observed pixel pair
{xi, xj} for neighbouring pixels
The MN has one unobserved non-maximal clique type: the true pixel {xi}
Graphical Models
MN Example: image de-noising
Energy functions
We assume low noise → the true pixel should usually have the same sign as the observed one:

E(xi, yi) = −α xi yi        α > 0

Same-sign pixels have lower energy → higher probability
We assume neighbouring pixels should usually have the same sign (locality principle, e.g. background colour):

E(xi, xj) = −β xi xj        β > 0

We assume the image can have a preference for a certain pixel value (either positive or negative):

E(xi) = γ xi
Graphical Models
MN Example: image de-noising
Joint probability
The joint probability becomes:

p(x, y) = (1/Z) exp( −∑_C E((x, y)_C) )
        = (1/Z) exp( α ∑_i xi yi + β ∑_{i,j} xi xj − γ ∑_i xi )
α, β, γ are parameters of the model.
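A sketch of the corresponding energy computation for a small binary image stored as a dictionary (the grid layout and the parameter values are only illustrative); p(x, y) is then proportional to exp(−energy):

```python
import math

# Energy of a configuration for the de-noising MN, assuming the image is a
# dict {(i, j): +1 or -1} and a 4-neighbourhood for the {x_i, x_j} cliques.
def energy(x, y, alpha, beta, gamma):
    e = 0.0
    for (i, j), xi in x.items():
        e += -alpha * xi * y[(i, j)]        # E(x_i, y_i) = -alpha x_i y_i
        e += gamma * xi                     # E(x_i) = gamma x_i
        for di, dj in ((1, 0), (0, 1)):     # count each neighbouring pair once
            if (i + di, j + dj) in x:
                e += -beta * xi * x[(i + di, j + dj)]   # E(x_i, x_j) = -beta x_i x_j
    return e

# toy 2x2 image; unnormalized probability of (x, y) is exp(-energy(x, y, ...))
y = {(0, 0): 1, (0, 1): 1, (1, 0): -1, (1, 1): 1}
x = dict(y)                                 # start from the observed image
print(math.exp(-energy(x, y, alpha=1.0, beta=2.0, gamma=0.1)))
```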
Graphical Models
MN Example: image de-noising
Note
All cliques of the same type share the same parameter (e.g. β for cliques {xi, xj})
This parameter sharing (or tying) allows us to:
reduce the number of parameters, simplifying learning
build a network template which will work on images of different sizes.
Graphical Models
MN Example: image de-noising
Inference
Assuming parameters are given, the network can be used to produce a clean version of the image:
1 we replace yi with its observed value for each i
2 we compute the most probable configuration for the true pixels given the observed ones
The second computation is complex and can be done in approximate or exact ways (we'll see) with results of different quality.
Graphical Models
MN Example: image de-noising
Graphical Models
Relationship between directed and undirected graphs
Independency assumptions
Directed and undirected graphs are different ways of representing conditional independency assumptions
Some assumptions can be perfectly represented in both formalisms, some in only one, some in neither of the two
However there is a tight connection between the two
It is rather simple to convert a specific directed graphical model into an undirected one (but different directed graphs can be mapped to the same undirected graph)
The converse is more complex because of normalization issues
Graphical Models
From a directed to an undirected model

[Figure: (a) directed chain x1 → x2 → · · · → xN−1 → xN; (b) undirected chain x1 − x2 − · · · − xN−1 − xN]

Linear chain
The joint probability in the directed model (a) is:

p(x) = p(x1)p(x2|x1)p(x3|x2) · · · p(xN|xN−1)

The joint probability in the undirected model (b) is:

p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) · · · ψN−1,N(xN−1, xN)

In order for the latter to represent the same distribution as the former, we set:

ψ1,2(x1, x2) = p(x1)p(x2|x1)    ψ2,3(x2, x3) = p(x3|x2)    · · ·    ψN−1,N(xN−1, xN) = p(xN|xN−1)    Z = 1
Graphical Models
From a directed to an undirected model
[Figure: node x4 with parents x1, x2, x3, and its moralized undirected version in which the parents are linked]

Multiple parents
If a node has multiple parents, we need to construct a clique containing all of them, in order for its potential to represent the conditional probability:
We add an undirected link between each pair of parents of the node
The process is called moralization (marrying the parents! old fashioned..)
Graphical Models
From a directed to an undirected model
Generic graph
1 Build the moral graph connecting the parents in the directed graph and dropping all arrows
2 Initialize all clique potentials to 1
3 For each conditional probability, choose a clique containing all its variables, and multiply its potential with the conditional probability
4 Set Z = 1
Graphical Models
Inference in graphical models
Description
Assume we have evidence e on the state of a subset E of the variables in the model
Inference amounts to computing the posterior probability of a subset X of the non-observed variables given the observations:

p(X | E = e)

Note
When we need to distinguish between variables and their values, we will indicate random variables with uppercase letters, and their values with lowercase ones.
Graphical Models
Inference in graphical models
Efficiency
We can always compute the posterior probability as the ratio of two joint probabilities:

p(X | E = e) = p(X, E = e) / p(E = e)

The problem consists of estimating such joint probabilities when dealing with a large number of variables
Directly working on the full joint probabilities requires time exponential in the number of variables
For instance, if all N variables are discrete and take one of K possible values, a joint probability table has K^N entries
We would like to exploit the structure in graphical models to do inference more efficiently.
Graphical Models
Inference in graphical models
Inference on a chain (1)

p(x) = (1/Z) ψ1,2(x1, x2) · · · ψN−2,N−1(xN−2, xN−1) ψN−1,N(xN−1, xN)

The marginal probability of an arbitrary xn is:

p(xn) = ∑_{x1} ∑_{x2} · · · ∑_{xn−1} ∑_{xn+1} · · · ∑_{xN} p(x)

Only ψN−1,N(xN−1, xN) is involved in the last summation, which can be computed first:

µβ(xN−1) = ∑_{xN} ψN−1,N(xN−1, xN)

giving a function of xN−1 which can be used to compute the successive summation:

µβ(xN−2) = ∑_{xN−1} ψN−2,N−1(xN−2, xN−1) µβ(xN−1)
Graphical Models
Inference in graphical models
Inference on a chain (2)
The same procedure can be applied starting from the other end of the chain, giving

µα(x2) = ∑_{x1} ψ1,2(x1, x2)

up to µα(xn)
The marginal probability is thus computed from the normalized contribution coming from both ends as:

p(xn) = (1/Z) µα(xn) µβ(xn)

The partition function Z can now be computed by just summing µα(xn)µβ(xn) over all possible values of xn
Graphical Models
Inference in graphical models
[Figure: chain x1, . . . , xn−1, xn, xn+1, . . . , xN with forward messages µα and backward messages µβ]

Inference as message passing
We can think of µα(xn) as a message passing from xn−1 to xn:

µα(xn) = ∑_{xn−1} ψn−1,n(xn−1, xn) µα(xn−1)

We can think of µβ(xn) as a message passing from xn+1 to xn:

µβ(xn) = ∑_{xn+1} ψn,n+1(xn, xn+1) µβ(xn+1)

Each outgoing message is obtained by multiplying the incoming message by the clique potential, and summing over the node values
Graphical Models
Inference in graphical models
Full message passing
Suppose we want to know marginal probabilities for a number of different variables xi:
1 We send a message from µα(x1) up to µα(xN)
2 We send a message from µβ(xN) down to µβ(x1)
If all nodes store messages, we can compute any marginal probability as

p(xi) = (1/Z) µα(xi) µβ(xi)

for any i, having sent just double the number of messages wrt a single marginal computation
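A sketch of the two message passes on a chain of K-state variables with hypothetical random pairwise potentials (numpy is used only for compact array arithmetic):

```python
import numpy as np

K, N = 3, 5
rng = np.random.default_rng(0)
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[n][x_n, x_{n+1}]

# forward messages: mu_alpha[n][x_n] arriving at node n from the left
mu_alpha = [np.ones(K) for _ in range(N)]
for n in range(1, N):
    mu_alpha[n] = psi[n - 1].T @ mu_alpha[n - 1]

# backward messages: mu_beta[n][x_n] arriving at node n from the right
mu_beta = [np.ones(K) for _ in range(N)]
for n in range(N - 2, -1, -1):
    mu_beta[n] = psi[n] @ mu_beta[n + 1]

# marginal of any node n: p(x_n) = mu_alpha(x_n) mu_beta(x_n) / Z
n = 2
unnorm = mu_alpha[n] * mu_beta[n]
Z = unnorm.sum()          # the same value is obtained for any choice of n
print(unnorm / Z)
```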
Graphical Models
Inference in graphical models
Adding evidence
If some nodes xe are observed, we simply use their observed values instead of summing over all possible values when computing their messages
After normalization by the partition function Z, this gives the conditional probability given the evidence

Joint distribution for clique variables
It is easy to compute the joint probability for variables in the same clique:

p(xn−1, xn) = (1/Z) µα(xn−1) ψn−1,n(xn−1, xn) µβ(xn)
Graphical Models
Inference
Directed models
The procedure applies straightforwardly to a directed model, replacing clique potentials with conditional probabilities
The marginal p(Xn) does not need to be normalized as it is already a correct distribution
When adding evidence, the message passing procedure computes the joint probability of the variable and the evidence, and it has to be normalized to obtain the conditional probability:

p(Xn | Xe = xe) = p(Xn, Xe = xe) / p(Xe = xe) = p(Xn, Xe = xe) / ∑_{Xn} p(Xn, Xe = xe)
Graphical Models
Inference
[Figure: (a) undirected tree, (b) directed tree, (c) directed polytree]

Inference on trees
Efficient inference can be computed for the broader family of tree-structured models:
undirected trees (a): undirected graphs with a single path for each pair of nodes
directed trees (b): directed graphs with a single node (the root) with no parents, and all other nodes with a single parent
directed polytrees (c): directed graphs with multiple parents per node and multiple roots, but still a single (undirected) path between each pair of nodes
Graphical Models
Factor graphs
Description
Efficient inference algorithms can be better explained using an alternative graphical representation called factor graph
A factor graph is a graphical representation of a (directed or undirected) graphical model highlighting its factorization (conditional probabilities or clique potentials)
The factor graph has one node for each node in the original graph
The factor graph has one additional node (of a different type) for each factor
A factor node has undirected links to each of the node variables in the factor
Graphical Models
Factor graphs: examples
[Figure: a BN over x1, x2, x3 with joint p(x3|x1, x2)p(x1)p(x2), and two factor graphs for it:
one with a single factor f(x1, x2, x3) = p(x3|x1, x2)p(x1)p(x2);
one with factors fc(x1, x2, x3) = p(x3|x1, x2), fa(x1) = p(x1), fb(x2) = p(x2)]
Graphical Models
Factor graphs: examples
[Figure: an undirected graph over x1, x2, x3 (a single clique), and two factor graphs for it: one with a single factor f, one with factors fa (over a maximal clique) and fb (over a non-maximal clique)]

Note
f is associated to the whole clique potential
fa is associated to a maximal clique
fb is associated to a non-maximal clique
Graphical Models
Inference
The sum-product algorithm
The sum-product algorithm is an efficient algorithm for exact inference on tree-structured graphs
It is a message passing algorithm, like its simpler version for chains
We will present it on factor graphs, thus unifying directed and undirected models
We will assume a tree-structured graph, giving rise to a factor graph which is a tree
Graphical Models
Inference
Computing marginals
We want to compute the marginal probability of x:

p(x) = ∑_{x\x} p(x)        (summing the joint over all variables except x)

Generalizing the message passing scheme seen for chains, this can be computed as the product of messages coming from all neighbouring factors fs:

p(x) = (1/Z) ∏_{s∈ne(x)} µ_{fs→x}(x)

[Figure: node x with neighbouring factors fs, each sending a message µ_{fs→x}(x) summarizing its subtree Fs(x, Xs)]
Graphical Models
Inference
[Figure: factor fs with neighbouring nodes x, x1, . . . , xM; each xm sends a message µ_{xm→fs}(xm) summarizing its subtree Gm(xm, Xsm)]

Factor messages
Each factor message is the product of messages coming from nodes other than x, times the factor, summed over all possible values of the factor variables other than x (x1, . . . , xM):

µ_{fs→x}(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m∈ne(fs)\x} µ_{xm→fs}(xm)
Graphical Models
Inference
[Figure: node xm with neighbouring factors fs, fl, . . . , fL; each fl sends a message µ_{fl→xm}(xm) summarizing its subtree Fl(xm, Xml)]

Node messages
Each message from node xm to factor fs is the product of the factor messages to xm coming from factors other than fs:

µ_{xm→fs}(xm) = ∏_{l∈ne(xm)\fs} µ_{fl→xm}(xm)
Graphical Models
Inference
Initialization
Message passing starts from leaves, either factors or nodes
Messages from leaf factors are initialized to the factor itself (there will be no xm different from the destination on which to sum over):

µ_{f→x}(x) = f(x)

Messages from leaf nodes are initialized to 1:

µ_{x→f}(x) = 1
Graphical Models
Inference
Message passing scheme
The node x whose marginal has to be computed is designated as root.
Messages are sent from all leaves to their neighbours
Each internal node sends its message towards the root as soon as it has received messages from all other neighbours
Once the root has collected all messages, the marginal can be computed as the product of them
Graphical Models
Inference
Full message passing scheme
In order to be able to compute marginals for any node, messages need to pass in all directions:
1 Choose an arbitrary node as root
2 Collect messages for the root starting from leaves
3 Send messages from the root down to the leaves
All messages are passed in all directions using only twice the number of computations used for a single marginal
Graphical Models
Inference example

[Figure: factor graph with variable nodes x1, x2, x3, x4 and factors fa(x1, x2), fb(x2, x3), fc(x2, x4)]
Consider the unnormalized distribution
p̃(x) = fa(x1, x2)fb(x2, x3)fc(x2, x4)
Graphical Models
Inference example
Choose x3 as root
Graphical Models
Inference example

Send initial messages from the leaves:

µ_{x1→fa}(x1) = 1        µ_{x4→fc}(x4) = 1
Graphical Models
Inference example

Send messages from the factor nodes to x2:

µ_{fa→x2}(x2) = ∑_{x1} fa(x1, x2)
µ_{fc→x2}(x2) = ∑_{x4} fc(x2, x4)
Graphical Models
Inference example

Send the message from x2 to factor node fb:

µ_{x2→fb}(x2) = µ_{fa→x2}(x2) µ_{fc→x2}(x2)
Graphical Models
Inference example

Send the message from fb to x3:

µ_{fb→x3}(x3) = ∑_{x2} fb(x2, x3) µ_{x2→fb}(x2)
Graphical Models
Inference example

Send the message from the root x3:

µ_{x3→fb}(x3) = 1
Graphical Models
Inference example

Send the message from fb to x2:

µ_{fb→x2}(x2) = ∑_{x3} fb(x2, x3)
Graphical Models
Inference example

Send messages from x2 to the factor nodes:

µ_{x2→fa}(x2) = µ_{fb→x2}(x2) µ_{fc→x2}(x2)
µ_{x2→fc}(x2) = µ_{fb→x2}(x2) µ_{fa→x2}(x2)
Graphical Models
Inference example

Send messages from the factor nodes to the leaves:

µ_{fa→x1}(x1) = ∑_{x2} fa(x1, x2) µ_{x2→fa}(x2)
µ_{fc→x4}(x4) = ∑_{x2} fc(x2, x4) µ_{x2→fc}(x2)
Graphical Models
Inference example

Compute for instance the marginal for x2:

p̃(x2) = µ_{fa→x2}(x2) µ_{fb→x2}(x2) µ_{fc→x2}(x2)
      = [∑_{x1} fa(x1, x2)] [∑_{x3} fb(x2, x3)] [∑_{x4} fc(x2, x4)]
      = ∑_{x1} ∑_{x3} ∑_{x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
      = ∑_{x1} ∑_{x3} ∑_{x4} p̃(x)
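The whole worked example can be verified numerically; a sketch with hypothetical binary factors, comparing the message-passing result for p̃(x2) against brute-force summation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
fa = rng.random((2, 2))   # fa[x1, x2]
fb = rng.random((2, 2))   # fb[x2, x3]
fc = rng.random((2, 2))   # fc[x2, x4]

# leaf node messages are 1, so each factor-to-x2 message sums its factor
mu_fa_x2 = fa.sum(axis=0)          # sum over x1
mu_fb_x2 = fb.sum(axis=1)          # sum over x3
mu_fc_x2 = fc.sum(axis=1)          # sum over x4

p_tilde_x2 = mu_fa_x2 * mu_fb_x2 * mu_fc_x2

# brute-force: sum the unnormalized joint over x1, x3, x4
brute = np.zeros(2)
for x1, x2, x3, x4 in product((0, 1), repeat=4):
    brute[x2] += fa[x1, x2] * fb[x2, x3] * fc[x2, x4]

print(np.allclose(p_tilde_x2, brute))   # True
```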
Graphical Models
Inference
Normalization

p(x) = (1/Z) ∏_{s∈ne(x)} µ_{fs→x}(x)

If the factor graph came from a directed model, a correct probability is already obtained as the product of messages (Z = 1)
If the factor graph came from an undirected model, we simply compute the normalization Z summing the product of messages over all possible values of the target variable:

Z = ∑_x ∏_{s∈ne(x)} µ_{fs→x}(x)
Graphical Models
Inference
Factor marginals

p(xs) = (1/Z) fs(xs) ∏_{m∈ne(fs)} µ_{xm→fs}(xm)

Factor marginals can be computed from the product of the neighbouring node messages and the factor itself

Adding evidence
If some nodes xe are observed, we simply use their observed values instead of summing over all possible values when computing their messages
After normalization by the partition function Z, this gives the conditional probability given the evidence
Graphical Models
Inference example
[Figure: BN over A, B, C, D with CPDs P(A), P(D), P(C|D), P(B|A, C), and its factor graph with joint P(B|A, C)P(A)P(C|D)P(D)]

Directed model
Take a directed graphical model
Build a factor graph representing it
Compute the marginal for a variable (e.g. B)
Graphical Models
Inference example
Compute the marginal for B

The leaf factor nodes send their messages:

µ_{fA→A}(A) = P(A)
µ_{fD→D}(D) = P(D)
Graphical Models
Inference example
Compute the marginal for B
A and D send their messages:

µ_{A→fA,B,C}(A) = µ_{fA→A}(A) = P(A)
µ_{D→fC,D}(D) = µ_{fD→D}(D) = P(D)
Graphical Models
Inference example
Compute the marginal for B

fC,D sends its message:

µ_{fC,D→C}(C) = ∑_D P(C|D) µ_{fD→D}(D) = ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B

C sends its message:

µ_{C→fA,B,C}(C) = µ_{fC,D→C}(C) = ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B

fA,B,C sends its message:

µ_{fA,B,C→B}(B) = ∑_A ∑_C P(B|A, C) µ_{C→fA,B,C}(C) µ_{A→fA,B,C}(A)
              = ∑_A ∑_C P(B|A, C)P(A) ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B
The desired marginal is obtained:

P(B) = µ_{fA,B,C→B}(B) = ∑_A ∑_C P(B|A, C)P(A) ∑_D P(C|D)P(D)
     = ∑_A ∑_C ∑_D P(B|A, C)P(A)P(C|D)P(D)
     = ∑_A ∑_C ∑_D P(A, B, C, D)
Graphical Models
Inference example
Compute the joint for A, B, C
Send messages to factor node fA,B,C:

P(A, B, C) = fA,B,C(A, B, C) · µ_{A→fA,B,C}(A) · µ_{C→fA,B,C}(C) · µ_{B→fA,B,C}(B)
           = P(B|A, C) · P(A) · ∑_D P(C|D)P(D) · 1
           = ∑_D P(B|A, C)P(A)P(C|D)P(D)
           = ∑_D P(A, B, C, D)
Graphical Models
Inference
Finding the most probable configuration
Given a joint probability distribution p(x)
We wish to find the configuration of the variables x having the highest probability:

x^max = argmax_x p(x)

for which the probability is:

p(x^max) = max_x p(x)

Note
We want the configuration which is jointly maximal for all variables
We cannot simply compute p(xi) for each i (using the sum-product algorithm) and maximize it
Graphical Models
Inference
The max-product algorithm

p(x^max) = max_x p(x) = max_{x1} · · · max_{xM} p(x)

As for the sum-product algorithm, we can exploit the distribution factorization to efficiently compute the maximum
It suffices to replace sum with max in the sum-product algorithm

Linear chain

max_x p(x) = max_{x1} · · · max_{xN} [ (1/Z) ψ1,2(x1, x2) · · · ψN−1,N(xN−1, xN) ]
           = (1/Z) max_{x1} [ ψ1,2(x1, x2) [ · · · max_{xN} ψN−1,N(xN−1, xN) ] ]

Graphical Models
Inference
Message passing
As for the sum-product algorithm, the max-product can be seen as message passing over the graph.
The algorithm is thus easily applied to tree-structured graphs via their factor trees:

µ_{f→x}(x) = max_{x1,...,xM} f(x, x1, . . . , xM) ∏_{m∈ne(f)\x} µ_{xm→f}(xm)

µ_{x→f}(x) = ∏_{l∈ne(x)\f} µ_{fl→x}(x)
Graphical Models
Inference
Recovering the maximal configuration
Messages are passed from the leaves to an arbitrarily chosen root xr
The probability of the maximal configuration is readily obtained as:

p(x^max) = max_{xr} ∏_{l∈ne(xr)} µ_{fl→xr}(xr)

The maximal configuration for the root is obtained as:

x^max_r = argmax_{xr} ∏_{l∈ne(xr)} µ_{fl→xr}(xr)

We need to recover the maximal configuration for the other variables
Graphical Models
Inference
Recovering the maximal configuration
When sending a message towards x, each factor node should store the configuration of the other variables which gave the maximum:

φ_{f→x}(x) = argmax_{x1,...,xM} f(x, x1, . . . , xM) ∏_{m∈ne(f)\x} µ_{xm→f}(xm)

When the maximal configuration for the root node xr has been obtained, it can be used to retrieve the maximal configuration for the variables in neighbouring factors from:

x^max_1, . . . , x^max_M = φ_{f→xr}(x^max_r)

The procedure can be repeated back-tracking to the leaves, retrieving maximal values for all variables
Graphical Models
Recovering the maximal configuration

Example for a linear chain

x^max_N = argmax_{xN} µ_{fN−1,N→xN}(xN)
x^max_{N−1} = φ_{fN−1,N→xN}(x^max_N)
x^max_{N−2} = φ_{fN−2,N−1→xN−1}(x^max_{N−1})
...
x^max_1 = φ_{f1,2→x2}(x^max_2)
Graphical Models
Recovering the maximal configuration

[Figure: trellis diagram with the K states k = 1, 2, 3 (rows) of variables xn−2, . . . , xn+1 (columns)]

Trellis for linear chain
A trellis or lattice diagram shows the K possible states of each variable xn, one per row
For each state k of a variable xn, φ_{fn−1,n→xn}(xn) defines a unique (maximal) previous state, linked by an edge in the diagram
Once the maximal state for the last variable xN is chosen, the maximal states for the other variables are recovered by following the edges backward.
Graphical Models
Inference
Underflow issues
The max-product algorithm relies on products (no summation)
Products of many small probabilities can lead to underflow problems
This can be addressed by computing the logarithm of the probability instead
The logarithm is monotonic, thus the proper maximal configuration is recovered:

log( max_x p(x) ) = max_x log p(x)

The effect is replacing products with sums (of logs) in the max-product algorithm, giving the max-sum algorithm
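A compact sketch of the resulting max-sum recursion on a chain (log-domain max-product with back-pointers, Viterbi style), with hypothetical random potentials:

```python
import numpy as np

K, N = 3, 5
rng = np.random.default_rng(0)
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[n][x_n, x_{n+1}]
log_psi = [np.log(p) for p in psi]

score = np.zeros(K)   # best log-score of configurations of x_1..x_n ending in x_n
phi = []              # back-pointers: phi[n-1][x_n] = argmax over the previous state
for n in range(1, N):
    cand = score[:, None] + log_psi[n - 1]      # cand[x_{n-1}, x_n]
    phi.append(cand.argmax(axis=0))
    score = cand.max(axis=0)

# back-track from the maximal state of the last variable
states = [int(score.argmax())]
for back in reversed(phi):
    states.append(int(back[states[-1]]))
states.reverse()
print("max log p~ =", score.max(), "argmax configuration:", states)
```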
Graphical Models
Inference
Instances of inference algorithms
The sum-product and max-product algorithms are generic algorithms for exact inference in tree-structured models.
Inference algorithms for certain specific types of models can be seen as special cases of them
We will see examples (forward-backward as sum-product and Viterbi as max-product) for Hidden Markov Models and Conditional Random Fields
Graphical Models
Inference
Exact inference on general graphs

The sum-product and max-product algorithms can be applied to tree-structured graphs
Many practical applications require graphs with loops
An extension of these algorithms to generic graphs can be achieved with the junction tree algorithm
The algorithm does not work on factor graphs, but on junction trees: tree-structured graphs with nodes containing clusters of variables of the original graph
A message passing scheme analogous to the sum-product and max-product algorithms is run on the junction tree

Problem
The complexity of the algorithm is exponential in the maximal number of variables in a cluster, making it intractable for large complex graphs.
Graphical Models
Inference
Approximate inference
In cases in which exact inference is intractable, we resort to approximate inference techniques
A number of techniques for approximate inference exist:
loopy belief propagation: message passing on the original graph even if it contains loops
variational methods: deterministic approximations, assuming the posterior probability (given the evidence) factorizes in a particular way
sampling methods: the approximate posterior is obtained by sampling from the network
Graphical Models
Inference
Loopy belief propagation

Apply the sum-product algorithm even if it is not guaranteed to provide an exact solution
We assume all nodes are in condition of sending messages (i.e. they have already received a constant 1 message from all neighbours)
A message passing schedule is chosen in order to decide which nodes start sending messages (e.g. flooding: all nodes send messages in all directions at each time step)
Information flows many times around the graph (because of the loops); each message on a link replaces the previous one and is only based on the most recent messages received from the other neighbours
The algorithm can eventually converge (no more changes in the messages passing through any link), depending on the specific model over which it is applied
Graphical Models
Approximate inference
Sampling methods
Given the joint probability distribution p(X )
A sample from the distribution is an instantiation of all the variables X according to the probability p.
Samples can be used to approximate the probability of a certain assignment Y = y for a subset Y ⊂ X of the variables:
We simply count the fraction of samples which areconsistent with the desired assignment
We usually need to sample from a posterior distributiongiven some evidence E = e
Graphical Models
Sampling methods
Markov Chain Monte Carlo (MCMC)
We usually cannot directly sample from the posterior distribution (too expensive)
We can instead build a random process which gradually samples from distributions closer and closer to the posterior
The state of the process is an instantiation of all the variables
The process can be seen as randomly traversing the graph of states, moving from one state to another with a certain probability.
After enough time, the probability of being in any particular state is the desired posterior.
The random process is a Markov chain
Graphical Models
Learning graphical models
Parameter estimation
We assume the structure of the model is given
We are given a dataset of examples D = {x(1), . . . , x(N)}
Each example x(i) is a configuration for all (complete data) or some (incomplete data) of the variables in the model
We need to estimate the parameters of the model (CPD or weights of clique potentials) from the data
The simplest approach consists of learning the parameters maximizing the likelihood of the data:

θ^max = argmax_θ p(D|θ) = argmax_θ L(D, θ)
Graphical Models
Learning Bayesian Networks
[Figure: BN replicated over the N examples, with variables X1(i), X2(i), X3(i) and shared parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

p(D|θ) = ∏_{i=1}^{N} p(x(i)|θ)                                examples independent given θ
       = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ)           factorization for BN
Graphical Models
Learning Bayesian Networks
[Figure: same BN replicated over the N examples with shared parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

p(D|θ) = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ)           factorization for BN
       = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ_{Xj|paj})  disjoint CPD parameters
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
The parameters of each CPD can be estimated independently:

θ^max_{Xj|Paj} = argmax_{θ_{Xj|Paj}} ∏_{i=1}^{N} p(xj(i)|paj(i), θ_{Xj|Paj})

where the product is the local likelihood L(θ_{Xj|Paj}, D)

A discrete CPD P(X|U) can be represented as a table, with:
a number of rows equal to the number Val(X) of configurations for X
a number of columns equal to the number Val(U) of configurations for its parents U
each table entry θ_{x|u} indicating the probability of a specific configuration of X = x and its parents U = u
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
Replacing p(x(i)|pa(i)) with θ_{x(i)|u(i)}, the local likelihood of a single CPD becomes:

L(θ_{X|Pa}, D) = ∏_{i=1}^{N} p(x(i)|pa(i), θ_{X|Pa})
             = ∏_{i=1}^{N} θ_{x(i)|u(i)}
             = ∏_{u∈Val(U)} ∏_{x∈Val(X)} θ_{x|u}^{N_{u,x}}

where N_{u,x} is the number of times the specific configuration X = x, U = u was found in the data
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
A column in the CPD table contains a multinomial distribution over the values of X for a certain configuration of the parents U
Thus each column should sum to one: ∑_x θ_{x|u} = 1
Parameters of different columns can be estimated independently
For each multinomial distribution, zeroing the gradient of the maximum likelihood and considering the normalization constraint, we obtain:

θ^max_{x|u} = N_{u,x} / ∑_x N_{u,x}

The maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
gene  value     counts
A     active    4
A     inactive  8

gene  value     counts
B     active    3
B     inactive  9
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
gene  value     counts
A     active    4/12
A     inactive  8/12

gene  value     counts
B     active    3/12
B     inactive  9/12
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
gene  value     counts
A     active    0.33
A     inactive  0.67

gene  value     counts
B     active    0.25
B     inactive  0.75
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          2            1          0
C inactive    0          1            1          6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1/1        2/3          1/2        0/6
C inactive    0/1        1/3          1/2        6/6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          0.67         0.5        0
C inactive    0          0.33         0.5        1
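A sketch reproducing the tables above by counting and column-wise normalization (the tuples are the twelve training examples listed above, with 'in' short for inactive):

```python
from collections import Counter

data = [
    ("act", "act", "act"), ("act", "in", "act"), ("act", "in", "act"),
    ("act", "in", "in"), ("in", "act", "act"), ("in", "act", "in"),
] + [("in", "in", "in")] * 6          # (A, B, C) tuples

# priors of A and B: fraction of examples in which the gene is active
n = len(data)
print("P(A=act) =", sum(a == "act" for a, _, _ in data) / n)   # 4/12
print("P(B=act) =", sum(b == "act" for _, b, _ in data) / n)   # 3/12

# CPT of C given its parents A, B: normalize counts within each parent column
counts = Counter(data)
for a in ("act", "in"):
    for b in ("act", "in"):
        col = counts[(a, b, "act")] + counts[(a, b, "in")]
        print(f"P(C=act | A={a}, B={b}) =", counts[(a, b, "act")] / col)
```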
Graphical Models
Learning graphical models
Adding priors
ML estimation tends to overfit the training set
Configurations not appearing in the training set will receive zero probability
A common approach consists of combining ML with a prior probability on the parameters, achieving a maximum-a-posteriori estimate:

θ^max = argmax_θ p(D|θ)p(θ)
Graphical Models
Learning graphical models
Dirichlet priors
The conjugate (read: natural) prior for a multinomial distribution is a Dirichlet distribution with parameters α_{x|u} for each possible value of x
The resulting maximum-a-posteriori estimate is:

θ^max_{x|u} = (N_{u,x} + α_{x|u}) / ∑_x (N_{u,x} + α_{x|u})

The prior is like having observed α_{x|u} imaginary samples with configuration X = x, U = u
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with imaginary counts (priors)
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          1            1          1
C inactive    1          1            1          1
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Add counts from training data
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1+1        1+2          1+1        1+0
C inactive    1+0        1+1          1+1        1+6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      2/3        3/4          2/4        1/8
C inactive    1/3        2/4          2/4        7/8
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      0.67       0.75         0.5        0.125
C inactive    0.33       0.25         0.5        0.875
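The smoothed CPT above can be reproduced by adding the imaginary counts before normalizing; a sketch (one imaginary sample per configuration, i.e. α = 1, as in the example):

```python
from collections import Counter

data = [
    ("act", "act", "act"), ("act", "in", "act"), ("act", "in", "act"),
    ("act", "in", "in"), ("in", "act", "act"), ("in", "act", "in"),
] + [("in", "in", "in")] * 6
counts = Counter(data)

alpha = 1                       # Dirichlet pseudocount for every configuration
for a in ("act", "in"):
    for b in ("act", "in"):
        n_act = counts[(a, b, "act")] + alpha
        n_in = counts[(a, b, "in")] + alpha
        print(f"P(C=act | A={a}, B={b}) =", n_act / (n_act + n_in))
# gives 2/3, 3/4, 2/4 and 1/8, matching the smoothed CPT above
```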
Graphical Models
Learning graphical models
Incomplete data
With incomplete data, some of the examples miss evidence on some of the variables
Counts of occurrences of different configurations cannot be computed if not all data are observed
The full Bayesian approach of integrating over missing variables is often intractable in practice
We need approximate methods to deal with the problem
Graphical Models
Learning with missing data: Expectation-Maximization
E-M for Bayesian nets in a nutshell
Sufficient statistics (counts) cannot be computed (missing data)
Fill in the missing data, inferring them using the current parameters (solve an inference problem to get expected counts)
Compute the parameters maximizing the likelihood (or posterior) of such expected counts
Iterate the procedure to improve the quality of the parameters
Graphical Models
Learning in Markov Networks
Maximum likelihood estimation
For general MN, the likelihood function cannot be decomposed into independent pieces for each potential, because of the global normalization (Z):

L(E, D) = ∏_{i=1}^{N} (1/Z) exp( −∑_C E_C(x_C(i)) )

However the likelihood is concave in E_C, the energy functionals whose parameters have to be estimated
The problem is an unconstrained concave maximization problem, solved by gradient ascent (or second order methods)
Graphical Models
Learning in Markov Networks
Maximum likelihood estimationFor each configuration uc ∈ Val(xc) we have a parameterEc,uc ∈ IR (ignoring parameter tying)The partial derivative of the log-likelihood wrt Ec,uc is:
∂ logL(E ,D)
∂Ec,uc
=N∑
i=1
[δ(xc(i),uc)− p(X c = uc |E)]
= Nuc − N p(X c = uc |E)
The derivative is zero when the counts of the datacorrespond to the expected counts predicted by the modelIn order to compute p(X c = uc |E), inference has to beperformed on the Markov network, making learning quiteexpensive
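A toy sketch of this update for a Markov network with a single binary clique, where p(X_c = u_c | E) can be computed by full enumeration (the data and learning rate are made up for illustration); each step moves the expected counts N·p(u) towards the empirical counts N_u, which is where the derivative above vanishes:

```python
import math
from itertools import product

# single binary clique {x1, x2}: one energy parameter E[u] per configuration u
data = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
N = len(data)
E = {u: 0.0 for u in product((0, 1), repeat=2)}

def model_probs(E):
    unnorm = {u: math.exp(-E[u]) for u in E}
    Z = sum(unnorm.values())
    return {u: v / Z for u, v in unnorm.items()}

for _ in range(1000):                       # plain gradient ascent on log L
    p = model_probs(E)
    for u in E:
        # move E[u] so that N p(u) approaches the empirical count N_u
        E[u] -= 0.1 * (data.count(u) - N * p[u])

print(model_probs(E))   # approaches the empirical frequencies 2/6, 1/6, 1/6, 2/6
```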
Graphical Models
Learning structure of graphical models
Approaches
constraint-based: test conditional independencies on the data and construct a model satisfying them
score-based: assign a score to each possible structure, define a search procedure looking for the structure maximizing the score
model-averaging: assign a prior probability to each structure, and average predictions over all possible structures weighted by their probabilities (full Bayesian, intractable)
Graphical Models