graphical models - disi, university of...
TRANSCRIPT
Probability distributions
Bernoulli distribution
Two possible values (outcomes): 1 (success), 0 (failure).
Parameters: p, the probability of success.
Probability mass function:

P(x; p) = p       if x = 1
          1 − p   if x = 0

Example: tossing a coin
Head (success) and tail (failure) are the possible outcomes
p is the probability of head
Graphical Models
Probability distributions
Multinomial distribution (one sample)
Models the probability of a certain outcome for an event with m possible outcomes {v1, . . . , vm}
Parameters: p1, . . . , pm, the probability of each outcome
Probability mass function:

P(vi; p1, . . . , pm) = pi

Example: tossing a die
m is the number of faces
pi is the probability of obtaining face i
Graphical Models
Probability distributions
Gaussian (or normal) distribution
Bell-shaped curve.
Parameters: µ mean, σ² variance.
Probability density function:

p(x; µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )

Standard normal distribution: N(0, 1)
Standardization of a normal distribution N(µ, σ²):

z = (x − µ) / σ
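A minimal Python sketch of these distributions (the function names and example values are ours, not from the slides):

```python
import math

def bernoulli_pmf(x, p):
    """P(x; p): p if x = 1, 1 - p if x = 0."""
    return p if x == 1 else 1 - p

def multinomial_pmf(i, probs):
    """P(v_i; p_1, ..., p_m) = p_i (one sample, e.g. one die toss)."""
    return probs[i]

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def standardize(x, mu, sigma):
    """z-score mapping an N(mu, sigma^2) value to N(0, 1)."""
    return (x - mu) / sigma

print(bernoulli_pmf(1, 0.5))              # fair coin, head
print(multinomial_pmf(2, [1 / 6] * 6))    # fair die, face 3
print(gaussian_pdf(0.0, 0.0, 1.0), standardize(7.0, 5.0, 2.0))
```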
Graphical Models
Conditional probabilities
conditional probability: the probability of x once y is observed

P(x|y) = P(x, y) / P(y)

statistical independence: variables X and Y are statistically independent iff

P(x, y) = P(x)P(y)

implying:

P(x|y) = P(x)        P(y|x) = P(y)
Graphical Models
Basic rules
law of total probability: the marginal distribution of a variable is obtained from a joint distribution summing over all possible values of the other variable (sum rule)

P(x) = ∑_{y∈Y} P(x, y)        P(y) = ∑_{x∈X} P(x, y)

product rule: the conditional probability definition implies that

P(x, y) = P(x|y)P(y) = P(y|x)P(x)

Bayes' rule

P(y|x) = P(x|y)P(y) / P(x)
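These three rules can be checked numerically on a small joint distribution; a sketch (the joint table below is hypothetical, chosen only for illustration):

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
P_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_x(x):            # sum rule: P(x) = sum_y P(x, y)
    return sum(p for (xi, y), p in P_xy.items() if xi == x)

def marginal_y(y):            # sum rule: P(y) = sum_x P(x, y)
    return sum(p for (x, yi), p in P_xy.items() if yi == y)

def cond_x_given_y(x, y):     # conditional probability definition
    return P_xy[(x, y)] / marginal_y(y)

def bayes_y_given_x(y, x):    # Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)

# product rule check: P(x, y) = P(x|y) P(y)
assert abs(P_xy[(1, 1)] - cond_x_given_y(1, 1) * marginal_y(1)) < 1e-12
print(bayes_y_given_x(1, 0))
```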
Graphical Models
Playing with probabilities
Use rules!
Basic rules allow us to compute a certain probability given knowledge of some related ones
All our manipulations will be applications of the three basic rules
Basic rules apply to any number of variables:

P(y) = ∑_x ∑_z P(x, y, z)                                 (sum rule)
     = ∑_x ∑_z P(y|x, z)P(x, z)                           (product rule)
     = ∑_x ∑_z P(x|y, z)P(y|z)P(x, z) / P(x|z)            (Bayes' rule)
Graphical Models
Playing with probabilities
Example

P(y|x, z) = P(x, z|y)P(y) / P(x, z)                       (Bayes' rule)
          = P(x, z|y)P(y) / (P(x|z)P(z))                  (product rule)
          = P(x|z, y)P(z|y)P(y) / (P(x|z)P(z))            (product rule)
          = P(x|z, y)P(z, y) / (P(x|z)P(z))               (product rule)
          = P(x|z, y)P(y|z)P(z) / (P(x|z)P(z))            (product rule)
          = P(x|z, y)P(y|z) / P(x|z)
Graphical Models
Graphical models
Why
All probabilistic inference and learning amounts to repeated applications of the sum and product rules
Probabilistic graphical models are graphical representations of the qualitative aspects of probability distributions, allowing us to:
visualize the structure of a probabilistic model in a simple and intuitive way
discover properties of the model, such as conditional independencies, by inspecting the graph
express complex computations for inference and learning in terms of graphical manipulations
represent multiple probability distributions with the same graph, abstracting from their quantitative aspects (e.g. discrete vs continuous distributions)
Graphical Models
Bayesian Networks (BN)
Description
Directed graphical models representing joint distributions

p(a, b, c) = p(c|a, b)p(a, b)
           = p(c|a, b)p(b|a)p(a)

Fully connected graphs contain no independency assumption

[Figure: fully connected directed graph over nodes a, b, c]

Joint probabilities can be decomposed in many different ways by the product rule
Assuming an ordering of the variables, a probability distribution over K variables can be decomposed into:

p(x1, . . . , xK) = p(xK|x1, . . . , xK−1) · · · p(x2|x1)p(x1)
Graphical Models
Bayesian Networks
Modeling independencies
A graph that is not fully connected contains independency assumptions.
The joint probability of a Bayesian network with K nodes is computed as the product of the probabilities of the nodes given their parents:

p(x) = ∏_{i=1}^{K} p(xi | pai)

[Figure: BN over x1, . . . , x7; x4 has parents x1, x2, x3; x5 has parents x1, x3; x6 has parent x4; x7 has parents x4, x5]

The joint probability for the BN in the figure is written as

p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5)
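As a small illustration, the factorization above can be generated mechanically from the parent sets; a sketch (the dictionary simply transcribes the structure described in the figure):

```python
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

def factorization(parents):
    """Build the product of p(x_i | pa_i) terms as a string."""
    factors = []
    for var in sorted(parents):
        pa = parents[var]
        factors.append(f"p({var}|{','.join(pa)})" if pa else f"p({var})")
    return "".join(factors)

print(factorization(parents))
# p(x1)p(x2)p(x3)p(x4|x1,x2,x3)p(x5|x1,x3)p(x6|x4)p(x7|x4,x5)
```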
Graphical Models
Conditional independence
Introduction
Two variables a, b are conditionally independent (written a ⊥⊥ b | ∅) if:

p(a, b) = p(a)p(b)

Two variables a, b are conditionally independent given c (written a ⊥⊥ b | c) if:

p(a, b|c) = p(a|c)p(b|c)

Independency assumptions can be verified by repeated applications of sum and product rules
Graphical models allow us to directly verify them through the d-separation criterion
Graphical Models
d-separation
Tail-to-tail
Joint distribution:

p(a, b, c) = p(a|c)p(b|c)p(c)

a and b are not conditionally independent (written a ⊥̸⊥ b | ∅):

p(a, b) = ∑_c p(a|c)p(b|c)p(c) ≠ p(a)p(b)

[Figure: node c with arrows to a and b]

a and b are conditionally independent given c:

p(a, b|c) = p(a, b, c) / p(c) = p(a|c)p(b|c)

c is tail-to-tail wrt the path a → b as it is connected to the tails of the two arrows
Graphical Models
d-separation
Head-to-tail
Joint distribution:

p(a, b, c) = p(b|c)p(c|a)p(a) = p(b|c)p(a|c)p(c)

a and b are not conditionally independent:

p(a, b) = p(a) ∑_c p(b|c)p(c|a) ≠ p(a)p(b)

[Figure: chain a → c → b]

a and b are conditionally independent given c:

p(a, b|c) = p(b|c)p(a|c)p(c) / p(c) = p(b|c)p(a|c)

c is head-to-tail wrt the path a → b as it is connected to the head of one arrow and to the tail of the other
Graphical Models
d-separation
Head-to-head
Joint distribution:

p(a, b, c) = p(c|a, b)p(a)p(b)

a and b are conditionally independent:

p(a, b) = ∑_c p(c|a, b)p(a)p(b) = p(a)p(b)

[Figure: a and b with arrows into c]

a and b are not conditionally independent given c:

p(a, b|c) = p(c|a, b)p(a)p(b) / p(c) ≠ p(a|c)p(b|c)

c is head-to-head wrt the path a → b as it is connected to the heads of the two arrows
Graphical Models
d-separation
General Head-to-head
Let a descendant of a node x be any node which can be reached from x with a path following the direction of the arrows
A head-to-head node c unblocks the dependency path between its parents if either itself or any of its descendants receives evidence
Graphical Models
Example of head-to-head connection
A toy regulatory network: CPTs

Genes A and B have independent prior probabilities:

gene  value     P(value)
A     active    0.3
A     inactive  0.7

gene  value     P(value)
B     active    0.3
B     inactive  0.7

Gene C can be enhanced by both A and B:

              A active                A inactive
              B active   B inactive   B active   B inactive
C active      0.9        0.6          0.7        0.1
C inactive    0.1        0.4          0.3        0.9
Graphical Models
Example of head-to-head connection
Probability of A active (1)
Prior:
P(A = 1) = 1− P(A = 0) = 0.3
Posterior after observing an active C:

P(A = 1|C = 1) = P(C = 1|A = 1)P(A = 1) / P(C = 1) ≈ 0.514

Note
The probability that A is active increases after observing that its regulated gene C is active
Graphical Models
Example of head-to-head connection
Derivation

P(C = 1|A = 1) = ∑_{B∈{0,1}} P(C = 1, B|A = 1)
               = ∑_{B∈{0,1}} P(C = 1|B, A = 1)P(B|A = 1)
               = ∑_{B∈{0,1}} P(C = 1|B, A = 1)P(B)

P(C = 1) = ∑_{B∈{0,1}} ∑_{A∈{0,1}} P(C = 1, B, A)
         = ∑_{B∈{0,1}} ∑_{A∈{0,1}} P(C = 1|B, A)P(B)P(A)
Graphical Models
Example of head-to-head connection
Probability of A active

Posterior after observing that B is also active:

P(A = 1|C = 1, B = 1) = P(C = 1|A = 1, B = 1)P(A = 1|B = 1) / P(C = 1|B = 1) ≈ 0.355

Note
The probability that A is active decreases after observing that B is also active
The B condition explains away the observation that C is active
The probability is still greater than the prior one (0.3), because the observation that C is active still gives some evidence in favour of an active A
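A short sketch that reproduces the two posteriors above (≈ 0.514 and ≈ 0.355) directly from the CPTs of the toy regulatory network, using only the sum and product rules (variable and function names are ours):

```python
from itertools import product

# CPTs of the toy regulatory network (1 = active, 0 = inactive)
P_A = {1: 0.3, 0: 0.7}
P_B = {1: 0.3, 0: 0.7}
P_C1 = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.1}   # P(C=1 | A, B)

def p_c1(a=None, b=None):
    """P(C=1), optionally conditioned on A and/or B, summing out free parents."""
    total = 0.0
    for av, bv in product((0, 1), repeat=2):
        if (a is not None and av != a) or (b is not None and bv != b):
            continue
        w = (1.0 if a is not None else P_A[av]) * (1.0 if b is not None else P_B[bv])
        total += P_C1[(av, bv)] * w
    return total

# P(A=1 | C=1) = P(C=1 | A=1) P(A=1) / P(C=1)
print(p_c1(a=1) * P_A[1] / p_c1())          # about 0.514

# P(A=1 | C=1, B=1): explaining away (A and B independent a priori)
print(P_C1[(1, 1)] * P_A[1] / p_c1(b=1))    # about 0.355
```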
Graphical Models
General d-separation criterion
d-separation definition
Given a generic Bayesian network
Given A, B, C arbitrary nonintersecting sets of nodes
The sets A and B are d-separated by C if:

All paths from any node in A to any node in B are blocked

A path is blocked if it includes at least one node such that either:
the arrows on the path meet tail-to-tail or head-to-tail at the node, and the node is in C, or
the arrows on the path meet head-to-head at the node, and neither the node nor any of its descendants is in C

d-separation implies conditional independency

The sets A and B are independent given C (A ⊥⊥ B | C) if they are d-separated by C.
Graphical Models
Example of general d-separation
a ⊥̸⊥ b | c

Nodes a and b are not d-separated by c:
Node f is tail-to-tail and not observed
Node e is head-to-head and its child c is observed

[Figure: graph over nodes a, b, c, e, f, with c shaded (observed)]

a ⊥⊥ b | f

Nodes a and b are d-separated by f:
Node f is tail-to-tail and observed

[Figure: same graph, with f shaded (observed)]
Graphical Models
Example of i.i.d. samples
Maximum-likelihood

We are given a set of instances D = {x1, . . . , xN} drawn from a univariate Gaussian with unknown mean µ
All paths between xi and xj are blocked if we condition on µ
The examples are independent of each other given µ:

p(D|µ) = ∏_{i=1}^{N} p(xi|µ)

[Figure: BN with µ pointing to x1, . . . , xN, and the equivalent plate notation with node xn inside a plate labelled N]

A set of nodes with the same variable type and connections can be compactly represented using the plate notation
Graphical Models
Markov blanket (or boundary)
Definition
Given a directed graph with D nodes
The Markov blanket of node xi is the minimal set of nodes making xi independent of the rest of the graph:

p(xi | x_{j≠i}) = p(x1, . . . , xD) / p(x_{j≠i})
              = p(x1, . . . , xD) / ∫ p(x1, . . . , xD) dxi
              = ∏_{k=1}^{D} p(xk|pak) / ∫ ∏_{k=1}^{D} p(xk|pak) dxi

All components which do not include xi will cancel between numerator and denominator
The only remaining components are:
p(xi|pai), the probability of xi given its parents
p(xj|paj) where paj includes xi ⇒ the children of xi with their co-parents
Graphical Models
Markov blanket (or boundary)
d-separation
Each parent xj of xi will be head-to-tail or tail-to-tail in the path between xi and any of xj's other neighbours ⇒ blocked
Each child xj of xi will be head-to-tail in the path between xi and any of xj's children ⇒ blocked

[Figure: node xi with its Markov blanket: parents, children and co-parents]

Each co-parent xk of a child xj of xi will be head-to-tail or tail-to-tail in the path between xj and any of xk's other neighbours ⇒ blocked
Graphical Models
Markov Networks (MN) or Markov random fields
Undirected graphical models
Graphically model the joint probability of a set of variables encoding independency assumptions (as in BN)
Do not assign a directionality to pairwise interactions

[Figure: undirected graph over nodes A, B, C, D]

Do not assume that local interactions are proper conditional probabilities (more freedom in parametrizing them)
Graphical Models
Markov Networks (MN)
[Figure: undirected graph with node sets A, B, C]

d-separation definition
Given A, B, C arbitrary nonintersecting sets of nodes
The sets A and B are d-separated by C if:

All paths from any node in A to any node in B are blocked

A path is blocked if it includes at least one node from C (much simpler than in BN)
The sets A and B are independent given C (A ⊥⊥ B | C) if they are d-separated by C.
Graphical Models
Markov Networks (MN)
Markov blanket
The Markov blanket of a variable is simply the set of its immediate neighbours
Graphical Models
Markov Networks (MN)
Factorization properties
We need to decompose the joint probability into factors (like P(xi|pai) in BN) in a way consistent with the independency assumptions.
Two nodes xi and xj not connected by a link are independent given the rest of the network (any path between them contains at least another node) ⇒ they should stay in separate factors

Factors are associated with cliques in the network (i.e. fully connected subnetworks)
A possible option is to associate factors only with maximal cliques

[Figure: undirected graph over x1, x2, x3, x4 with its cliques]
Graphical Models
Markov Networks (MN)
Joint probability (1)

p(x) = (1/Z) ∏_C Ψ_C(x_C)

Ψ_C(x_C) : Val(x_C) → IR+ is a positive function of the value of the variables in clique x_C (called clique potential)
The joint probability is the (normalized) product of clique potentials for all (maximal) cliques in the network
The partition function Z is a normalization constant assuring that the result is a probability (i.e. ∑_x p(x) = 1):

Z = ∑_x ∏_C Ψ_C(x_C)
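A small sketch of this definition for a tiny network over three binary variables with two hypothetical clique potentials, showing the role of the partition function Z:

```python
from itertools import product

def psi12(x1, x2):
    return 2.0 if x1 == x2 else 1.0      # made-up positive potential values

def psi23(x2, x3):
    return 3.0 if x2 == x3 else 0.5

def unnormalized(x1, x2, x3):
    return psi12(x1, x2) * psi23(x2, x3)

# partition function: sum of the product of potentials over all configurations
Z = sum(unnormalized(*x) for x in product((0, 1), repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(sum(p(*x) for x in product((0, 1), repeat=3)))   # 1.0
```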
Graphical Models
Markov Networks (MN)
Joint probability (2)
In order to guarantee that potential functions are positive, they are typically represented as exponentials:

Ψ_C(x_C) = exp(−E(x_C))

E(x_C) is called the energy function: low energy means a highly probable configuration
The joint probability becomes the exponential of a (negated) sum of energies:

p(x) = (1/Z) exp( −∑_C E(x_C) )
Graphical Models
Markov Networks (MN)
Comparison to BN
Advantage:
More freedom in defining the clique potentials, as they don't need to represent a probability distribution

Problem:
The partition function Z has to be computed summing over all possible values of the variables, which is intractable in general

Solutions:
Efficient algorithms exist for certain types of models (e.g. linear chains)
Otherwise Z is approximated
Graphical Models
MN Example: image de-noising
Description
We are given an image as a grid of binary pixels yi ∈ {−1, +1}
The image is a noisy version of the original one, where each pixel was flipped with a certain (low) probability
The aim is to design a model capable of recovering the clean version
Graphical Models
MN Example: image de-noising
[Figure: grid Markov network with latent pixels xi, each connected to its observed noisy version yi]

Markov network
We represent the true image as a grid of points xi, and connect each point to its observed noisy version yi
The MN has two maximal clique types:
{xi, yi} for the true/observed pixel pair
{xi, xj} for neighbouring pixels
The MN has one unobserved non-maximal clique type: the true pixel {xi}
Graphical Models
MN Example: image de-noising
Energy functions
We assume low noise → the true pixel should usually have the same sign as the observed one:

E(xi, yi) = −α xi yi        α > 0

Same-sign pixels have lower energy → higher probability
We assume neighbouring pixels should usually have the same sign (locality principle, e.g. background colour):

E(xi, xj) = −β xi xj        β > 0

We assume the image can have a preference for a certain pixel value (either positive or negative):

E(xi) = γ xi
Graphical Models
MN Example: image de-noising
Joint probability
The joint probability becomes:

p(x, y) = (1/Z) exp( −∑_C E((x, y)_C) )
        = (1/Z) exp( α ∑_i xi yi + β ∑_{i,j} xi xj − γ ∑_i xi )
α, β, γ are parameters of the model.
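A sketch of the corresponding energy computation for a small binary image stored as a dictionary (the grid layout and the parameter values are only illustrative); p(x, y) is then proportional to exp(−energy):

```python
import math

# Energy of a configuration for the de-noising MN, assuming the image is a
# dict {(i, j): +1 or -1} and a 4-neighbourhood for the {x_i, x_j} cliques.
def energy(x, y, alpha, beta, gamma):
    e = 0.0
    for (i, j), xi in x.items():
        e += -alpha * xi * y[(i, j)]        # E(x_i, y_i) = -alpha x_i y_i
        e += gamma * xi                     # E(x_i) = gamma x_i
        for di, dj in ((1, 0), (0, 1)):     # count each neighbouring pair once
            if (i + di, j + dj) in x:
                e += -beta * xi * x[(i + di, j + dj)]   # E(x_i, x_j) = -beta x_i x_j
    return e

# toy 2x2 image; unnormalized probability of (x, y) is exp(-energy(x, y, ...))
y = {(0, 0): 1, (0, 1): 1, (1, 0): -1, (1, 1): 1}
x = dict(y)                                 # start from the observed image
print(math.exp(-energy(x, y, alpha=1.0, beta=2.0, gamma=0.1)))
```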
Graphical Models
MN Example: image de-noising
Note
All cliques of the same type share the same parameter (e.g. β for cliques {xi, xj})
This parameter sharing (or tying) allows us to:
reduce the number of parameters, simplifying learning
build a network template which will work on images of different sizes.
Graphical Models
MN Example: image de-noising
Inference
Assuming parameters are given, the network can be used to produce a clean version of the image:
1 we replace yi with its observed value for each i
2 we compute the most probable configuration for the true pixels given the observed ones
The second computation is complex and can be done in approximate or exact ways (we'll see) with results of different quality.
Graphical Models
MN Example: image de-noising
Graphical Models
Relationship between directed and undirected graphs
Independency assumptions
Directed and undirected graphs are different ways of representing conditional independency assumptions
Some assumptions can be perfectly represented in both formalisms, some in only one, some in neither of the two
However there is a tight connection between the two
It is rather simple to convert a specific directed graphical model into an undirected one (but different directed graphs can be mapped to the same undirected graph)
The converse is more complex because of normalization issues
Graphical Models
From a directed to an undirected model

[Figure: (a) directed chain x1 → x2 → · · · → xN−1 → xN; (b) undirected chain x1 − x2 − · · · − xN−1 − xN]

Linear chain
The joint probability in the directed model (a) is:

p(x) = p(x1)p(x2|x1)p(x3|x2) · · · p(xN|xN−1)

The joint probability in the undirected model (b) is:

p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) · · · ψN−1,N(xN−1, xN)

In order for the latter to represent the same distribution as the former, we set:

ψ1,2(x1, x2) = p(x1)p(x2|x1)    ψ2,3(x2, x3) = p(x3|x2)    · · ·    ψN−1,N(xN−1, xN) = p(xN|xN−1)    Z = 1
Graphical Models
From a directed to an undirected model
[Figure: node x4 with parents x1, x2, x3, and its moralized undirected version in which the parents are linked]

Multiple parents
If a node has multiple parents, we need to construct a clique containing all of them, in order for its potential to represent the conditional probability:
We add an undirected link between each pair of parents of the node
The process is called moralization (marrying the parents! old fashioned..)
Graphical Models
From a directed to an undirected model
Generic graph
1 Build the moral graph connecting the parents in the directed graph and dropping all arrows
2 Initialize all clique potentials to 1
3 For each conditional probability, choose a clique containing all its variables, and multiply its potential with the conditional probability
4 Set Z = 1
Graphical Models
Inference in graphical models
Description
Assume we have evidence e on the state of a subset E of the variables in the model
Inference amounts to computing the posterior probability of a subset X of the non-observed variables given the observations:

p(X | E = e)

Note
When we need to distinguish between variables and their values, we will indicate random variables with uppercase letters, and their values with lowercase ones.
Graphical Models
Inference in graphical models
Efficiency
We can always compute the posterior probability as the ratio of two joint probabilities:

p(X | E = e) = p(X, E = e) / p(E = e)

The problem consists of estimating such joint probabilities when dealing with a large number of variables
Directly working on the full joint probabilities requires time exponential in the number of variables
For instance, if all N variables are discrete and take one of K possible values, a joint probability table has K^N entries
We would like to exploit the structure in graphical models to do inference more efficiently.
Graphical Models
Inference in graphical models
Inference on a chain (1)

p(x) = (1/Z) ψ1,2(x1, x2) · · · ψN−2,N−1(xN−2, xN−1) ψN−1,N(xN−1, xN)

The marginal probability of an arbitrary xn is:

p(xn) = ∑_{x1} ∑_{x2} · · · ∑_{xn−1} ∑_{xn+1} · · · ∑_{xN} p(x)

Only ψN−1,N(xN−1, xN) is involved in the last summation, which can be computed first:

µβ(xN−1) = ∑_{xN} ψN−1,N(xN−1, xN)

giving a function of xN−1 which can be used to compute the successive summation:

µβ(xN−2) = ∑_{xN−1} ψN−2,N−1(xN−2, xN−1) µβ(xN−1)
Graphical Models
Inference in graphical models
Inference on a chain (2)
The same procedure can be applied starting from the other end of the chain, giving

µα(x2) = ∑_{x1} ψ1,2(x1, x2)

up to µα(xn)
The marginal probability is thus computed from the normalized contribution coming from both ends as:

p(xn) = (1/Z) µα(xn) µβ(xn)

The partition function Z can now be computed by just summing µα(xn)µβ(xn) over all possible values of xn
Graphical Models
Inference in graphical models
[Figure: chain x1, . . . , xn−1, xn, xn+1, . . . , xN with forward messages µα and backward messages µβ]

Inference as message passing
We can think of µα(xn) as a message passing from xn−1 to xn:

µα(xn) = ∑_{xn−1} ψn−1,n(xn−1, xn) µα(xn−1)

We can think of µβ(xn) as a message passing from xn+1 to xn:

µβ(xn) = ∑_{xn+1} ψn,n+1(xn, xn+1) µβ(xn+1)

Each outgoing message is obtained by multiplying the incoming message by the clique potential, and summing over the node values
Graphical Models
Inference in graphical models
Full message passing
Suppose we want to know marginal probabilities for a number of different variables xi:
1 We send a message from µα(x1) up to µα(xN)
2 We send a message from µβ(xN) down to µβ(x1)
If all nodes store messages, we can compute any marginal probability as

p(xi) = (1/Z) µα(xi) µβ(xi)

for any i, having sent just double the number of messages wrt a single marginal computation
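A sketch of the two message passes on a chain of K-state variables with hypothetical random pairwise potentials (numpy is used only for compact array arithmetic):

```python
import numpy as np

K, N = 3, 5
rng = np.random.default_rng(0)
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[n][x_n, x_{n+1}]

# forward messages: mu_alpha[n][x_n] arriving at node n from the left
mu_alpha = [np.ones(K) for _ in range(N)]
for n in range(1, N):
    mu_alpha[n] = psi[n - 1].T @ mu_alpha[n - 1]

# backward messages: mu_beta[n][x_n] arriving at node n from the right
mu_beta = [np.ones(K) for _ in range(N)]
for n in range(N - 2, -1, -1):
    mu_beta[n] = psi[n] @ mu_beta[n + 1]

# marginal of any node n: p(x_n) = mu_alpha(x_n) mu_beta(x_n) / Z
n = 2
unnorm = mu_alpha[n] * mu_beta[n]
Z = unnorm.sum()          # the same value is obtained for any choice of n
print(unnorm / Z)
```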
Graphical Models
Inference in graphical models
Adding evidence
If some nodes xe are observed, we simply use their observed values instead of summing over all possible values when computing their messages
After normalization by the partition function Z, this gives the conditional probability given the evidence

Joint distribution for clique variables
It is easy to compute the joint probability for variables in the same clique:

p(xn−1, xn) = (1/Z) µα(xn−1) ψn−1,n(xn−1, xn) µβ(xn)
Graphical Models
Inference
Directed models
The procedure applies straightforwardly to a directed model, replacing clique potentials with conditional probabilities
The marginal p(Xn) does not need to be normalized as it is already a correct distribution
When adding evidence, the message passing procedure computes the joint probability of the variable and the evidence, and it has to be normalized to obtain the conditional probability:

p(Xn | Xe = xe) = p(Xn, Xe = xe) / p(Xe = xe) = p(Xn, Xe = xe) / ∑_{Xn} p(Xn, Xe = xe)
Graphical Models
Inference
[Figure: (a) undirected tree, (b) directed tree, (c) directed polytree]

Inference on trees
Efficient inference can be computed for the broader family of tree-structured models:
undirected trees (a): undirected graphs with a single path for each pair of nodes
directed trees (b): directed graphs with a single node (the root) with no parents, and all other nodes with a single parent
directed polytrees (c): directed graphs with multiple parents per node and multiple roots, but still a single (undirected) path between each pair of nodes
Graphical Models
Factor graphs
Description
Efficient inference algorithms can be better explained using an alternative graphical representation called factor graph
A factor graph is a graphical representation of a (directed or undirected) graphical model highlighting its factorization (conditional probabilities or clique potentials)
The factor graph has one node for each node in the original graph
The factor graph has one additional node (of a different type) for each factor
A factor node has undirected links to each of the node variables in the factor
Graphical Models
Factor graphs: examples
[Figure: a BN over x1, x2, x3 with joint p(x3|x1, x2)p(x1)p(x2), and two factor graphs for it:
one with a single factor f(x1, x2, x3) = p(x3|x1, x2)p(x1)p(x2);
one with factors fc(x1, x2, x3) = p(x3|x1, x2), fa(x1) = p(x1), fb(x2) = p(x2)]
Graphical Models
Factor graphs: examples
[Figure: an undirected graph over x1, x2, x3 (a single clique), and two factor graphs for it: one with a single factor f, one with factors fa (over a maximal clique) and fb (over a non-maximal clique)]

Note
f is associated to the whole clique potential
fa is associated to a maximal clique
fb is associated to a non-maximal clique
Graphical Models
Inference
The sum-product algorithm
The sum-product algorithm is an efficient algorithm for exact inference on tree-structured graphs
It is a message passing algorithm, like its simpler version for chains
We will present it on factor graphs, thus unifying directed and undirected models
We will assume a tree-structured graph, giving rise to a factor graph which is a tree
Graphical Models
Inference
Computing marginals
We want to compute the marginal probability of x:

p(x) = ∑_{x\x} p(x)        (summing the joint over all variables except x)

Generalizing the message passing scheme seen for chains, this can be computed as the product of messages coming from all neighbouring factors fs:

p(x) = (1/Z) ∏_{s∈ne(x)} µ_{fs→x}(x)

[Figure: node x with neighbouring factors fs, each sending a message µ_{fs→x}(x) summarizing its subtree Fs(x, Xs)]
Graphical Models
Inference
[Figure: factor fs with neighbouring nodes x, x1, . . . , xM; each xm sends a message µ_{xm→fs}(xm) summarizing its subtree Gm(xm, Xsm)]

Factor messages
Each factor message is the product of messages coming from nodes other than x, times the factor, summed over all possible values of the factor variables other than x (x1, . . . , xM):

µ_{fs→x}(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m∈ne(fs)\x} µ_{xm→fs}(xm)
Graphical Models
Inference
[Figure: node xm with neighbouring factors fs, fl, . . . , fL; each fl sends a message µ_{fl→xm}(xm) summarizing its subtree Fl(xm, Xml)]

Node messages
Each message from node xm to factor fs is the product of the factor messages to xm coming from factors other than fs:

µ_{xm→fs}(xm) = ∏_{l∈ne(xm)\fs} µ_{fl→xm}(xm)
Graphical Models
Inference
Initialization
Message passing starts from leaves, either factors or nodes
Messages from leaf factors are initialized to the factor itself (there will be no xm different from the destination on which to sum over):

µ_{f→x}(x) = f(x)

Messages from leaf nodes are initialized to 1:

µ_{x→f}(x) = 1
Graphical Models
Inference
Message passing scheme
The node x whose marginal has to be computed is designated as root.
Messages are sent from all leaves to their neighbours
Each internal node sends its message towards the root as soon as it has received messages from all other neighbours
Once the root has collected all messages, the marginal can be computed as the product of them
Graphical Models
Inference
Full message passing scheme
In order to be able to compute marginals for any node, messages need to pass in all directions:
1 Choose an arbitrary node as root
2 Collect messages for the root starting from leaves
3 Send messages from the root down to the leaves
All messages are passed in all directions using only twice the number of computations used for a single marginal
Graphical Models
Inference example

[Figure: factor graph with variable nodes x1, x2, x3, x4 and factors fa(x1, x2), fb(x2, x3), fc(x2, x4)]
Consider the unnormalized distribution
p̃(x) = fa(x1, x2)fb(x2, x3)fc(x2, x4)
Graphical Models
Inference example
Choose x3 as root
Graphical Models
Inference example

Send initial messages from the leaves:

µ_{x1→fa}(x1) = 1        µ_{x4→fc}(x4) = 1
Graphical Models
Inference example

Send messages from the factor nodes to x2:

µ_{fa→x2}(x2) = ∑_{x1} fa(x1, x2)
µ_{fc→x2}(x2) = ∑_{x4} fc(x2, x4)
Graphical Models
Inference example

Send the message from x2 to factor node fb:

µ_{x2→fb}(x2) = µ_{fa→x2}(x2) µ_{fc→x2}(x2)
Graphical Models
Inference example

Send the message from fb to x3:

µ_{fb→x3}(x3) = ∑_{x2} fb(x2, x3) µ_{x2→fb}(x2)
Graphical Models
Inference example

Send the message from the root x3:

µ_{x3→fb}(x3) = 1
Graphical Models
Inference example

Send the message from fb to x2:

µ_{fb→x2}(x2) = ∑_{x3} fb(x2, x3)
Graphical Models
Inference example

Send messages from x2 to the factor nodes:

µ_{x2→fa}(x2) = µ_{fb→x2}(x2) µ_{fc→x2}(x2)
µ_{x2→fc}(x2) = µ_{fb→x2}(x2) µ_{fa→x2}(x2)
Graphical Models
Inference example

Send messages from the factor nodes to the leaves:

µ_{fa→x1}(x1) = ∑_{x2} fa(x1, x2) µ_{x2→fa}(x2)
µ_{fc→x4}(x4) = ∑_{x2} fc(x2, x4) µ_{x2→fc}(x2)
Graphical Models
Inference example

Compute for instance the marginal for x2:

p̃(x2) = µ_{fa→x2}(x2) µ_{fb→x2}(x2) µ_{fc→x2}(x2)
      = [∑_{x1} fa(x1, x2)] [∑_{x3} fb(x2, x3)] [∑_{x4} fc(x2, x4)]
      = ∑_{x1} ∑_{x3} ∑_{x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
      = ∑_{x1} ∑_{x3} ∑_{x4} p̃(x)
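The whole worked example can be verified numerically; a sketch with hypothetical binary factors, comparing the message-passing result for p̃(x2) against brute-force summation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
fa = rng.random((2, 2))   # fa[x1, x2]
fb = rng.random((2, 2))   # fb[x2, x3]
fc = rng.random((2, 2))   # fc[x2, x4]

# leaf node messages are 1, so each factor-to-x2 message sums its factor
mu_fa_x2 = fa.sum(axis=0)          # sum over x1
mu_fb_x2 = fb.sum(axis=1)          # sum over x3
mu_fc_x2 = fc.sum(axis=1)          # sum over x4

p_tilde_x2 = mu_fa_x2 * mu_fb_x2 * mu_fc_x2

# brute-force: sum the unnormalized joint over x1, x3, x4
brute = np.zeros(2)
for x1, x2, x3, x4 in product((0, 1), repeat=4):
    brute[x2] += fa[x1, x2] * fb[x2, x3] * fc[x2, x4]

print(np.allclose(p_tilde_x2, brute))   # True
```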
Graphical Models
Inference
Normalization

p(x) = (1/Z) ∏_{s∈ne(x)} µ_{fs→x}(x)

If the factor graph came from a directed model, a correct probability is already obtained as the product of messages (Z = 1)
If the factor graph came from an undirected model, we simply compute the normalization Z summing the product of messages over all possible values of the target variable:

Z = ∑_x ∏_{s∈ne(x)} µ_{fs→x}(x)
Graphical Models
Inference
Factor marginals

p(xs) = (1/Z) fs(xs) ∏_{m∈ne(fs)} µ_{xm→fs}(xm)

Factor marginals can be computed from the product of the neighbouring node messages and the factor itself

Adding evidence
If some nodes xe are observed, we simply use their observed values instead of summing over all possible values when computing their messages
After normalization by the partition function Z, this gives the conditional probability given the evidence
Graphical Models
Inference example
[Figure: BN over A, B, C, D with CPDs P(A), P(D), P(C|D), P(B|A, C), and its factor graph with joint P(B|A, C)P(A)P(C|D)P(D)]

Directed model
Take a directed graphical model
Build a factor graph representing it
Compute the marginal for a variable (e.g. B)
Graphical Models
Inference example
Compute the marginal for B

The leaf factor nodes send their messages:

µ_{fA→A}(A) = P(A)
µ_{fD→D}(D) = P(D)
Graphical Models
Inference example
Compute the marginal for B
A and D send their messages:

µ_{A→fA,B,C}(A) = µ_{fA→A}(A) = P(A)
µ_{D→fC,D}(D) = µ_{fD→D}(D) = P(D)
Graphical Models
Inference example
Compute the marginal for B

fC,D sends its message:

µ_{fC,D→C}(C) = ∑_D P(C|D) µ_{fD→D}(D) = ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B

C sends its message:

µ_{C→fA,B,C}(C) = µ_{fC,D→C}(C) = ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B

fA,B,C sends its message:

µ_{fA,B,C→B}(B) = ∑_A ∑_C P(B|A, C) µ_{C→fA,B,C}(C) µ_{A→fA,B,C}(A)
              = ∑_A ∑_C P(B|A, C)P(A) ∑_D P(C|D)P(D)
Graphical Models
Inference example
Compute the marginal for B
The desired marginal is obtained:

P(B) = µ_{fA,B,C→B}(B) = ∑_A ∑_C P(B|A, C)P(A) ∑_D P(C|D)P(D)
     = ∑_A ∑_C ∑_D P(B|A, C)P(A)P(C|D)P(D)
     = ∑_A ∑_C ∑_D P(A, B, C, D)
Graphical Models
Inference example
Compute the joint for A, B, C
Send messages to factor node fA,B,C:

P(A, B, C) = fA,B,C(A, B, C) · µ_{A→fA,B,C}(A) · µ_{C→fA,B,C}(C) · µ_{B→fA,B,C}(B)
           = P(B|A, C) · P(A) · ∑_D P(C|D)P(D) · 1
           = ∑_D P(B|A, C)P(A)P(C|D)P(D)
           = ∑_D P(A, B, C, D)
Graphical Models
Inference
Finding the most probable configuration
Given a joint probability distribution p(x)
We wish to find the configuration of the variables x having the highest probability:

x^max = argmax_x p(x)

for which the probability is:

p(x^max) = max_x p(x)

Note
We want the configuration which is jointly maximal for all variables
We cannot simply compute p(xi) for each i (using the sum-product algorithm) and maximize it
Graphical Models
Inference
The max-product algorithm

p(x^max) = max_x p(x) = max_{x1} · · · max_{xM} p(x)

As for the sum-product algorithm, we can exploit the distribution factorization to efficiently compute the maximum
It suffices to replace sum with max in the sum-product algorithm

Linear chain

max_x p(x) = max_{x1} · · · max_{xN} [ (1/Z) ψ1,2(x1, x2) · · · ψN−1,N(xN−1, xN) ]
           = (1/Z) max_{x1} [ ψ1,2(x1, x2) [ · · · max_{xN} ψN−1,N(xN−1, xN) ] ]

Graphical Models
Inference
Message passing
As for the sum-product algorithm, the max-product can be seen as message passing over the graph.
The algorithm is thus easily applied to tree-structured graphs via their factor trees:

µ_{f→x}(x) = max_{x1,...,xM} f(x, x1, . . . , xM) ∏_{m∈ne(f)\x} µ_{xm→f}(xm)

µ_{x→f}(x) = ∏_{l∈ne(x)\f} µ_{fl→x}(x)
Graphical Models
Inference
Recovering the maximal configuration
Messages are passed from the leaves to an arbitrarily chosen root xr
The probability of the maximal configuration is readily obtained as:

p(x^max) = max_{xr} ∏_{l∈ne(xr)} µ_{fl→xr}(xr)

The maximal configuration for the root is obtained as:

x^max_r = argmax_{xr} ∏_{l∈ne(xr)} µ_{fl→xr}(xr)

We need to recover the maximal configuration for the other variables
Graphical Models
Inference
Recovering the maximal configuration
When sending a message towards x, each factor node should store the configuration of the other variables which gave the maximum:

φ_{f→x}(x) = argmax_{x1,...,xM} f(x, x1, . . . , xM) ∏_{m∈ne(f)\x} µ_{xm→f}(xm)

When the maximal configuration for the root node xr has been obtained, it can be used to retrieve the maximal configuration for the variables in neighbouring factors from:

x^max_1, . . . , x^max_M = φ_{f→xr}(x^max_r)

The procedure can be repeated back-tracking to the leaves, retrieving maximal values for all variables
Graphical Models
Recovering the maximal configuration

Example for a linear chain

x^max_N = argmax_{xN} µ_{fN−1,N→xN}(xN)
x^max_{N−1} = φ_{fN−1,N→xN}(x^max_N)
x^max_{N−2} = φ_{fN−2,N−1→xN−1}(x^max_{N−1})
...
x^max_1 = φ_{f1,2→x2}(x^max_2)
Graphical Models
Recovering the maximal configuration

[Figure: trellis diagram with the K states k = 1, 2, 3 (rows) of variables xn−2, . . . , xn+1 (columns)]

Trellis for linear chain
A trellis or lattice diagram shows the K possible states of each variable xn, one per row
For each state k of a variable xn, φ_{fn−1,n→xn}(xn) defines a unique (maximal) previous state, linked by an edge in the diagram
Once the maximal state for the last variable xN is chosen, the maximal states for the other variables are recovered by following the edges backward.
Graphical Models
Inference
Underflow issues
The max-product algorithm relies on products (no summation)
Products of many small probabilities can lead to underflow problems
This can be addressed by computing the logarithm of the probability instead
The logarithm is monotonic, thus the proper maximal configuration is recovered:

log( max_x p(x) ) = max_x log p(x)

The effect is replacing products with sums (of logs) in the max-product algorithm, giving the max-sum algorithm
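A compact sketch of the resulting max-sum recursion on a chain (log-domain max-product with back-pointers, Viterbi style), with hypothetical random potentials:

```python
import numpy as np

K, N = 3, 5
rng = np.random.default_rng(0)
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[n][x_n, x_{n+1}]
log_psi = [np.log(p) for p in psi]

score = np.zeros(K)   # best log-score of configurations of x_1..x_n ending in x_n
phi = []              # back-pointers: phi[n-1][x_n] = argmax over the previous state
for n in range(1, N):
    cand = score[:, None] + log_psi[n - 1]      # cand[x_{n-1}, x_n]
    phi.append(cand.argmax(axis=0))
    score = cand.max(axis=0)

# back-track from the maximal state of the last variable
states = [int(score.argmax())]
for back in reversed(phi):
    states.append(int(back[states[-1]]))
states.reverse()
print("max log p~ =", score.max(), "argmax configuration:", states)
```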
Graphical Models
Inference
Instances of inference algorithms
The sum-product and max-product algorithms are generic algorithms for exact inference in tree-structured models.
Inference algorithms for certain specific types of models can be seen as special cases of them
We will see examples (forward-backward as sum-product and Viterbi as max-product) for Hidden Markov Models and Conditional Random Fields
Graphical Models
Inference
Exact inference on general graphs

The sum-product and max-product algorithms can be applied to tree-structured graphs
Many practical applications require graphs with loops
An extension of these algorithms to generic graphs can be achieved with the junction tree algorithm
The algorithm does not work on factor graphs, but on junction trees: tree-structured graphs with nodes containing clusters of variables of the original graph
A message passing scheme analogous to the sum-product and max-product algorithms is run on the junction tree

Problem
The complexity of the algorithm is exponential in the maximal number of variables in a cluster, making it intractable for large complex graphs.
Graphical Models
Inference
Approximate inference
In cases in which exact inference is intractable, we resort to approximate inference techniques
A number of techniques for approximate inference exist:
loopy belief propagation: message passing on the original graph even if it contains loops
variational methods: deterministic approximations, assuming the posterior probability (given the evidence) factorizes in a particular way
sampling methods: the approximate posterior is obtained by sampling from the network
Graphical Models
Inference
Loopy belief propagation

Apply the sum-product algorithm even if it is not guaranteed to provide an exact solution
We assume all nodes are in condition of sending messages (i.e. they have already received a constant 1 message from all neighbours)
A message passing schedule is chosen in order to decide which nodes start sending messages (e.g. flooding: all nodes send messages in all directions at each time step)
Information flows many times around the graph (because of the loops); each message on a link replaces the previous one and is only based on the most recent messages received from the other neighbours
The algorithm can eventually converge (no more changes in the messages passing through any link), depending on the specific model over which it is applied
Graphical Models
Approximate inference
Sampling methods
Given the joint probability distribution p(X )
A sample from the distribution is an instantiation of all the variables X according to the probability p.
Samples can be used to approximate the probability of a certain assignment Y = y for a subset Y ⊂ X of the variables:
We simply count the fraction of samples which areconsistent with the desired assignment
We usually need to sample from a posterior distributiongiven some evidence E = e
Graphical Models
Sampling methods
Markov Chain Monte Carlo (MCMC)
We usually cannot directly sample from the posterior distribution (too expensive)
We can instead build a random process which gradually samples from distributions closer and closer to the posterior
The state of the process is an instantiation of all the variables
The process can be seen as randomly traversing the graph of states, moving from one state to another with a certain probability.
After enough time, the probability of being in any particular state is the desired posterior.
The random process is a Markov chain
Graphical Models
Learning graphical models
Parameter estimation
We assume the structure of the model is given
We are given a dataset of examples D = {x(1), . . . , x(N)}
Each example x(i) is a configuration for all (complete data) or some (incomplete data) of the variables in the model
We need to estimate the parameters of the model (CPD or weights of clique potentials) from the data
The simplest approach consists of learning the parameters maximizing the likelihood of the data:

θ^max = argmax_θ p(D|θ) = argmax_θ L(D, θ)
Graphical Models
Learning Bayesian Networks
[Figure: BN replicated over the N examples, with variables X1(i), X2(i), X3(i) and shared parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

p(D|θ) = ∏_{i=1}^{N} p(x(i)|θ)                                examples independent given θ
       = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ)           factorization for BN
Graphical Models
Learning Bayesian Networks
[Figure: same BN replicated over the N examples with shared parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

p(D|θ) = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ)           factorization for BN
       = ∏_{i=1}^{N} ∏_{j=1}^{m} p(xj(i)|paj(i), θ_{Xj|paj})  disjoint CPD parameters
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
The parameters of each CPD can be estimated independently:

θ^max_{Xj|Paj} = argmax_{θ_{Xj|Paj}} ∏_{i=1}^{N} p(xj(i)|paj(i), θ_{Xj|Paj})

where the product is the local likelihood L(θ_{Xj|Paj}, D)

A discrete CPD P(X|U) can be represented as a table, with:
a number of rows equal to the number Val(X) of configurations for X
a number of columns equal to the number Val(U) of configurations for its parents U
each table entry θ_{x|u} indicating the probability of a specific configuration of X = x and its parents U = u
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
Replacing p(x(i)|pa(i)) with θ_{x(i)|u(i)}, the local likelihood of a single CPD becomes:

L(θ_{X|Pa}, D) = ∏_{i=1}^{N} p(x(i)|pa(i), θ_{X|Pa})
             = ∏_{i=1}^{N} θ_{x(i)|u(i)}
             = ∏_{u∈Val(U)} ∏_{x∈Val(X)} θ_{x|u}^{N_{u,x}}

where N_{u,x} is the number of times the specific configuration X = x, U = u was found in the data
Graphical Models
Learning graphical models
Maximum likelihood estimation, complete data
A column in the CPD table contains a multinomial distribution over the values of X for a certain configuration of the parents U
Thus each column should sum to one: ∑_x θ_{x|u} = 1
Parameters of different columns can be estimated independently
For each multinomial distribution, zeroing the gradient of the maximum likelihood and considering the normalization constraint, we obtain:

θ^max_{x|u} = N_{u,x} / ∑_x N_{u,x}

The maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
gene  value     counts
A     active    4
A     inactive  8

gene  value     counts
B     active    3
B     inactive  9
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
gene  value     counts
A     active    4/12
A     inactive  8/12

gene  value     counts
B     active    3/12
B     inactive  9/12
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
gene  value     counts
A     active    0.33
A     inactive  0.67

gene  value     counts
B     active    0.25
B     inactive  0.75
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with counts
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          2            1          0
C inactive    0          1            1          6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1/1        2/3          1/2        0/6
C inactive    0/1        1/3          1/2        6/6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          0.67         0.5        0
C inactive    0          0.33         0.5        1
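A sketch reproducing the tables above by counting and column-wise normalization (the tuples are the twelve training examples listed above, with 'in' short for inactive):

```python
from collections import Counter

data = [
    ("act", "act", "act"), ("act", "in", "act"), ("act", "in", "act"),
    ("act", "in", "in"), ("in", "act", "act"), ("in", "act", "in"),
] + [("in", "in", "in")] * 6          # (A, B, C) tuples

# priors of A and B: fraction of examples in which the gene is active
n = len(data)
print("P(A=act) =", sum(a == "act" for a, _, _ in data) / n)   # 4/12
print("P(B=act) =", sum(b == "act" for _, b, _ in data) / n)   # 3/12

# CPT of C given its parents A, B: normalize counts within each parent column
counts = Counter(data)
for a in ("act", "in"):
    for b in ("act", "in"):
        col = counts[(a, b, "act")] + counts[(a, b, "in")]
        print(f"P(C=act | A={a}, B={b}) =", counts[(a, b, "act")] / col)
```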
Graphical Models
Learning graphical models
Adding priors
ML estimation tends to overfit the training set
Configurations not appearing in the training set will receive zero probability
A common approach consists of combining ML with a prior probability on the parameters, achieving a maximum-a-posteriori estimate:

θ^max = argmax_θ p(D|θ)p(θ)
Graphical Models
Learning graphical models
Dirichlet priors
The conjugate (read: natural) prior for a multinomial distribution is a Dirichlet distribution with parameters α_{x|u} for each possible value of x
The resulting maximum-a-posteriori estimate is:

θ^max_{x|u} = (N_{u,x} + α_{x|u}) / ∑_x (N_{u,x} + α_{x|u})

The prior is like having observed α_{x|u} imaginary samples with configuration X = x, U = u
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Fill CPTs with imaginary counts (priors)
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1          1            1          1
C inactive    1          1            1          1
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Add counts from training data
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      1+1        1+2          1+1        1+0
C inactive    1+0        1+1          1+1        1+6
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      2/3        3/4          2/4        1/8
C inactive    1/3        2/4          2/4        7/8
Graphical Models
Learning Bayesian Networks
Example
Training examples as (A,B,C) tuples:
(act,act,act),(act,inact,act),(act,inact,act),(act,inact,inact),(inact,act,act),(inact,act,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact),(inact,inact,inact).
Normalize counts columnwise
              A active                A inactive
              B active   B inactive   B active   B inactive
C active      0.67       0.75         0.5        0.125
C inactive    0.33       0.25         0.5        0.875
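The smoothed CPT above can be reproduced by adding the imaginary counts before normalizing; a sketch (one imaginary sample per configuration, i.e. α = 1, as in the example):

```python
from collections import Counter

data = [
    ("act", "act", "act"), ("act", "in", "act"), ("act", "in", "act"),
    ("act", "in", "in"), ("in", "act", "act"), ("in", "act", "in"),
] + [("in", "in", "in")] * 6
counts = Counter(data)

alpha = 1                       # Dirichlet pseudocount for every configuration
for a in ("act", "in"):
    for b in ("act", "in"):
        n_act = counts[(a, b, "act")] + alpha
        n_in = counts[(a, b, "in")] + alpha
        print(f"P(C=act | A={a}, B={b}) =", n_act / (n_act + n_in))
# gives 2/3, 3/4, 2/4 and 1/8, matching the smoothed CPT above
```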
Graphical Models
Learning graphical models
Incomplete data
With incomplete data, some of the examples miss evidence on some of the variables
Counts of occurrences of different configurations cannot be computed if not all data are observed
The full Bayesian approach of integrating over missing variables is often intractable in practice
We need approximate methods to deal with the problem
Graphical Models
Learning with missing data: Expectation-Maximization
E-M for Bayesian nets in a nutshell
Sufficient statistics (counts) cannot be computed (missing data)
Fill in the missing data, inferring them using the current parameters (solve an inference problem to get expected counts)
Compute the parameters maximizing the likelihood (or posterior) of such expected counts
Iterate the procedure to improve the quality of the parameters
Graphical Models
Learning in Markov Networks
Maximum likelihood estimation
For general MN, the likelihood function cannot be decomposed into independent pieces for each potential, because of the global normalization (Z):

L(E, D) = ∏_{i=1}^{N} (1/Z) exp( −∑_C E_C(x_C(i)) )

However the likelihood is concave in E_C, the energy functionals whose parameters have to be estimated
The problem is an unconstrained concave maximization problem, solved by gradient ascent (or second order methods)
Graphical Models
Learning in Markov Networks
Maximum likelihood estimationFor each configuration uc ∈ Val(xc) we have a parameterEc,uc ∈ IR (ignoring parameter tying)The partial derivative of the log-likelihood wrt Ec,uc is:
∂ logL(E ,D)
∂Ec,uc
=N∑
i=1
[δ(xc(i),uc)− p(X c = uc |E)]
= Nuc − N p(X c = uc |E)
The derivative is zero when the counts of the datacorrespond to the expected counts predicted by the modelIn order to compute p(X c = uc |E), inference has to beperformed on the Markov network, making learning quiteexpensive
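A toy sketch of this update for a Markov network with a single binary clique, where p(X_c = u_c | E) can be computed by full enumeration (the data and learning rate are made up for illustration); each step moves the expected counts N·p(u) towards the empirical counts N_u, which is where the derivative above vanishes:

```python
import math
from itertools import product

# single binary clique {x1, x2}: one energy parameter E[u] per configuration u
data = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
N = len(data)
E = {u: 0.0 for u in product((0, 1), repeat=2)}

def model_probs(E):
    unnorm = {u: math.exp(-E[u]) for u in E}
    Z = sum(unnorm.values())
    return {u: v / Z for u, v in unnorm.items()}

for _ in range(1000):                       # plain gradient ascent on log L
    p = model_probs(E)
    for u in E:
        # move E[u] so that N p(u) approaches the empirical count N_u
        E[u] -= 0.1 * (data.count(u) - N * p[u])

print(model_probs(E))   # approaches the empirical frequencies 2/6, 1/6, 1/6, 2/6
```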
Graphical Models
Learning structure of graphical models
Approaches
constraint-based: test conditional independencies on the data and construct a model satisfying them
score-based: assign a score to each possible structure, define a search procedure looking for the structure maximizing the score
model-averaging: assign a prior probability to each structure, and average predictions over all possible structures weighted by their probabilities (full Bayesian, intractable)
Graphical Models