
Copyright (c) 2002 by SNU CSE Biointelligence Lab. 1

SURVEY: Foundations of Bayesian Networks

O, Jangmin

2002/10/29

Last modified 2002/10/29

Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks

Typical Example of DAG

[Figure: a simple DAG on the vertices A, B, C, D, F, G.]

1. Topological Sort

Algorithm 4.1 [Topological sort]
• Begin with all vertices unnumbered.
• Set counter i := 1.
• While any vertices remain:
  – Select any vertex that has no parents;
  – number the selected vertex as i;
  – delete the numbered vertex and all its adjacent edges from the graph;
  – increment i by 1.

Objective: obtain a well-ordering.
Well-ordering: the predecessors of any node have lower numbers than the node itself.
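The procedure above can be sketched in Python (a minimal Kahn-style implementation; the arc list for the slides' simple DAG is an assumption, since the figure's arc directions are not recoverable from the transcript):

```python
from collections import deque

def topological_sort(nodes, arcs):
    """Algorithm 4.1: repeatedly number and delete a vertex with no parents."""
    parents = {v: set() for v in nodes}
    children = {v: set() for v in nodes}
    for u, v in arcs:                  # arc u -> v
        parents[v].add(u)
        children[u].add(v)
    order, i = {}, 1
    ready = deque(sorted(v for v in nodes if not parents[v]))
    while ready:
        v = ready.popleft()            # a vertex with no remaining parents
        order[v] = i
        i += 1
        for c in sorted(children[v]):  # "delete" v and its outgoing arcs
            parents[c].discard(v)
            if not parents[c]:
                ready.append(c)
    return order

# Assumed arc set, chosen to be consistent with the slides' example.
arcs = [("A","B"), ("A","C"), ("A","D"), ("B","D"), ("B","F"), ("C","F"), ("F","G")]
order = topological_sort(list("ABCDFG"), arcs)
```

With alphabetical tie-breaking this numbers the vertices A=1, ..., G=6, matching the well-ordering property: every arc goes from a lower number to a higher one.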

1. Topological Sort (steps 1–6)

[Figure: the simple DAG with its vertices numbered 1 through 6 in topological order, one vertex per step.]

2. Moral Graph

• Making the moral graph of a DAG:
  – Add an undirected edge between every pair of nodes that share a common child ("marrying" the parents).
  – Drop the directions of all edges.
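A minimal sketch of moralization (the arc list for the slides' DAG is an assumption, as before):

```python
def moralize(nodes, arcs):
    """Moral graph: marry parents that share a child, then drop directions."""
    und = {frozenset(a) for a in arcs}            # drop directions
    for v in nodes:
        ps = [u for u, w in arcs if w == v]       # parents of v
        und |= {frozenset((p, q)) for p in ps for q in ps if p != q}  # marry
    return und

# Assumed arc set consistent with the slides' simple DAG.
arcs = [("A","B"), ("A","C"), ("A","D"), ("B","D"), ("B","F"), ("C","F"), ("F","G")]
moral = moralize(list("ABCDFG"), arcs)
# The marriage B—C is added because B and C share the child F.
```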

2. Moral Graph (example)

[Figure: the simple DAG after moralization — parents sharing a child are joined and all edge directions are dropped.]

Junction Tree

• Definition
  – A tree whose nodes are sets C1, C2, ...
  – For any two nodes C1 and C2, their intersection is contained in every node on the path between them.
• Corollaries
  – For an undirected graph, the following are equivalent: decomposable, chordal, having a junction tree of cliques, and admitting a perfect numbering.

Perfect numbering: ne(vj) ∩ {v1, ..., vj-1} induces a complete subgraph.

3. Maximum Cardinality Search (1)

Algorithm 4.9 [Maximum Cardinality Search]
• Set Output := 'G is chordal'.
• Set counter i := 1.
• Set L := ∅.
• For all v ∈ V, set c(v) := 0.
• While L ≠ V:
  – Set U := V \ L.
  – Select any vertex v ∈ U maximizing c(v) and label it vi.
  – If Λi := ne(vi) ∩ L is not complete in G: set Output := 'G is not chordal'.
  – Otherwise, set c(w) := c(w) + 1 for each vertex w ∈ ne(vi) ∩ U.
  – Set L := L ∪ {vi}.
  – Increment i by 1.
• Report Output.
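Algorithm 4.9 can be sketched as follows (the moral-graph edge list is reconstructed from the example; ties are broken alphabetically so the run reproduces the slides' numbering):

```python
def mcs(nodes, edges):
    """Algorithm 4.9: maximum cardinality search on an undirected graph."""
    ne = {v: set() for v in nodes}
    for a, b in edges:
        ne[a].add(b); ne[b].add(a)
    L, lam, chordal = [], {}, True
    c = {v: 0 for v in nodes}
    while len(L) < len(nodes):
        U = [v for v in sorted(nodes) if v not in L]
        v = max(U, key=lambda x: c[x])       # unnumbered vertex maximizing c(v)
        lam[v] = ne[v] & set(L)              # Λi = ne(vi) ∩ L
        if any(q not in ne[p] for p in lam[v] for q in lam[v] if p != q):
            chordal = False                  # Λi not complete: G is not chordal
        for w in ne[v]:                      # bump counters of unnumbered neighbours
            if w not in L:
                c[w] += 1
        L.append(v)
    return L, lam, chordal

# Edges of the moral graph from the example.
moral_edges = [("A","B"), ("A","C"), ("A","D"), ("B","C"),
               ("B","D"), ("B","F"), ("C","F"), ("F","G")]
order, lam, chordal = mcs(list("ABCDFG"), moral_edges)
```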

3. Maximum Cardinality Search (example)

[Figure: MCS run on the moral graph. The visiting order and Λi = ne(vi) ∩ L at each step:]

  1: A, Λ = {}
  2: B, Λ = {A}
  3: C, Λ = {A, B}
  4: D, Λ = {A, B}
  5: F, Λ = {B, C}
  6: G, Λ = {F}

Output = "G is chordal"

4. Cliques of Chordal Graph (1)

Algorithm 4.11 [Finding the Cliques of a Chordal Graph]
• From the numbering (v1, ..., vk) obtained by maximum cardinality search, let λi = |Λi|, the cardinality of Λi = ne(vi) ∩ {v1, ..., vi-1}.
• Mark the ladder nodes: vi is a ladder node if i = k, or if i < k and λi+1 < 1 + λi.
• Define the cliques: for each ladder node vj, Cj = {vj} ∪ Λj.

C1, C2, ... possess the RIP (running intersection property).
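Reading the cliques off the MCS numbering can be sketched directly (the Λ sets below are the ones from the MCS example):

```python
def cliques_from_mcs(order, lam):
    """Algorithm 4.11: cliques from an MCS numbering.
    order: vertices v1..vk in MCS order; lam[v] = ne(v) ∩ {earlier vertices}."""
    k = len(order)
    lams = [len(lam[v]) for v in order]              # λi = |Λi|
    cliques = []
    for i, v in enumerate(order):
        is_ladder = (i == k - 1) or (lams[i + 1] < 1 + lams[i])
        if is_ladder:
            cliques.append(frozenset({v} | lam[v]))  # Cj = {vj} ∪ Λj
    return cliques

lam = {"A": set(), "B": {"A"}, "C": {"A", "B"},
       "D": {"A", "B"}, "F": {"B", "C"}, "G": {"F"}}
cliques = cliques_from_mcs(list("ABCDFG"), lam)
```

On the example this yields exactly the four cliques shown on the next slide.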

4. Cliques of Chordal Graph (example)

[Figure: cliques read off the MCS numbering of the moral graph:]

  C1 = {A, B, C}
  C2 = {A, B, D}
  C3 = {B, C, F}
  C4 = {F, G}

Running Intersection Property

• RIP: definition
  – Given (C1, C2, ..., Ck),
  – for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci.

5. Junction Tree Construction (1)

Algorithm 4.8 [Junction Tree Construction]
• Take the cliques (C1, ..., Cp) of a chordal graph, ordered so that they satisfy the RIP.
• Associate a node of the tree with each clique Cj.
• For j = 2, ..., p, add an edge between Cj and Ci, where i is any one value in {1, ..., j-1} such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci.
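A minimal sketch of Algorithm 4.8, run on the example's RIP-ordered cliques (clique indices are 0-based here):

```python
def junction_tree(cliques):
    """Algorithm 4.8: connect each Cj (j >= 2) to an earlier Ci containing
    Cj ∩ (C1 ∪ ... ∪ Cj-1)."""
    edges = []
    for j in range(1, len(cliques)):
        seen = set().union(*cliques[:j])
        sep = cliques[j] & seen
        i = next(i for i in range(j) if sep <= cliques[i])  # RIP guarantees one exists
        edges.append((i, j))
    return edges

cliques = [frozenset("ABC"), frozenset("ABD"), frozenset("BCF"), frozenset("FG")]
tree = junction_tree(cliques)
```

This produces the edges ABC—ABD, ABC—BCF, BCF—FG of the example's junction tree.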

5. Junction Tree Construction (example)

[Figure: the junction tree built from the cliques, adding one edge per step:]

  C1 = {A, B, C}
  C2 = {A, B, D}
  C3 = {B, C, F}
  C4 = {F, G}

  Edges: ABC — ABD, ABC — BCF, BCF — FG

Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks

Triangulation (1)

• When is triangulation needed?
  – When MCS (Maximum Cardinality Search) fails, i.e., the graph is not chordal.
• Triangulation
  – introduces fill-in edges.
  – produces a perfect numbering.
• Optimal triangulation is NP-hard.
  – The size of each clique matters...

Triangulation (2)

Algorithm 4.13 [One-step Look Ahead Triangulation]
• Start with all vertices unnumbered; set counter i := k.
• While there are still some unnumbered vertices:
  – Select an unnumbered vertex v to optimize the criterion c(v), or select v := σ(i) [σ is an elimination order].
  – Label it with the number i (call it vi).
  – Form the set Ci consisting of vi and its unnumbered neighbours.
  – Fill in edges where none exist between all pairs of vertices in Ci.
  – Eliminate vi and decrement i by 1.
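Triangulation with a fixed elimination order can be sketched as follows (run on the example's moral graph with σ = (A, B, C, D, F, G); since that graph is already chordal, no fill-in edges are added):

```python
def triangulate(nodes, edges, sigma):
    """Algorithm 4.13 with a fixed order sigma = (v1, ..., vk):
    eliminate from the back, filling in edges among unnumbered neighbours."""
    ne = {v: set() for v in nodes}
    for u, v in edges:
        ne[u].add(v); ne[v].add(u)
    filled = {frozenset((u, v)) for u in nodes for v in ne[u]}
    elim_sets, unnumbered = {}, set(nodes)
    for i in range(len(sigma), 0, -1):
        v = sigma[i - 1]
        C = {v} | (ne[v] & (unnumbered - {v}))   # vi plus unnumbered neighbours
        for p in C:
            for q in C:
                if p != q and q not in ne[p]:    # fill-in edge
                    ne[p].add(q); ne[q].add(p)
                    filled.add(frozenset((p, q)))
        elim_sets[i] = C
        unnumbered.discard(v)                    # eliminate vi
    return elim_sets, filled

moral_edges = [("A","B"), ("A","C"), ("A","D"), ("B","C"),
               ("B","D"), ("B","F"), ("C","F"), ("F","G")]
elim, filled = triangulate(list("ABCDFG"), moral_edges, list("ABCDFG"))
```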

Triangulation (example)

[Figure: one-step look ahead triangulation of the moral graph with σ = (A, B, C, D, F, G), eliminating from the back:]

  6: C6 = {F, G}
  5: C5 = {B, C, F}
  4: C4 = {A, B, D}
  3: C3 = {A, B, C}
  2: C2 = {A, B}
  1: C1 = {A}

Elimination sets:
• Cj contains vj.
• vj ∉ Cl for all l < j.
• (C1, ..., Ck) has the RIP.
• The cliques of the triangulated graph G' are contained in (C1, ..., Ck).

Elimination Tree Construction (1)

Algorithm 4.14 [Elimination Tree Construction]
• Associate a node of the tree with each set Ci.
• For j = 1, ..., k, if Cj contains more than one vertex, add an edge between Cj and Ci, where i is the largest index of a vertex in Cj \ {vj}.
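Algorithm 4.14 can be sketched directly from the elimination sets of the example (edges are returned as pairs of 1-based set indices (j, i)):

```python
def elimination_tree(sigma, elim_sets):
    """Algorithm 4.14: join Cj to Ci, i = largest index of a vertex in Cj \\ {vj}."""
    idx = {v: i + 1 for i, v in enumerate(sigma)}
    edges = []
    for j in range(1, len(sigma) + 1):
        rest = elim_sets[j] - {sigma[j - 1]}
        if rest:                                  # Cj has more than one vertex
            i = max(idx[v] for v in rest)
            edges.append((j, i))
    return edges

sigma = list("ABCDFG")
elim_sets = {1: {"A"}, 2: {"A", "B"}, 3: {"A", "B", "C"},
             4: {"A", "B", "D"}, 5: {"B", "C", "F"}, 6: {"F", "G"}}
tree = elimination_tree(sigma, elim_sets)
```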

Elimination Tree Construction (example)

[Figure: the elimination tree, one edge added per step. Nodes (eliminated vertex : remaining elimination-set members): C1 = A:, C2 = B:A, C3 = C:AB, C4 = D:AB, C5 = F:BC, C6 = G:F.]

From etree to jtree (1)

Lemma 4.16
– Let C1, ..., Ck be a sequence of sets with the RIP.
– Assume that Ct ⊆ Cp for some t ≠ p, and that p is minimal with this property for fixed t. Then:
  (i) If t > p, then C1, ..., Ct-1, Ct+1, ..., Ck has the running intersection property.
  (ii) If t < p, then C1, ..., Ct-1, Cp, Ct+1, ..., Cp-1, Cp+1, ..., Ck has the RIP.

Naively removing a redundant elimination set might destroy the RIP; the lemma shows how to remove it safely.

From etree to jtree (2)

[Figure: applying condition (ii) with t = 1, p = 2 removes the redundant set C1 = A: (C1 ⊆ C2); the tree now consists of C2–C6.]

From etree to jtree (3)

[Figure: applying condition (ii) with t = 2, p = 3 removes C2 = B:A (C2 ⊆ C3); the remaining sets C3–C6 are exactly the cliques, and the tree is a junction tree.]

MST for making jtree (1)

Algorithm
• Start from the elimination sets (C1, ..., Ck).
• Remove the redundant Ci's.
• Make the junction graph:
  – If |Ci ∩ Cj| > 0, add an edge between Ci and Cj.
  – Set the weight of the edge to |Ci ∩ Cj|.
• Construct a maximum-weight spanning tree (MST).

The resulting tree is a junction tree, and the clique set has the RIP.
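The junction-graph route can be sketched with Kruskal's algorithm on descending weights (run on the example's cliques; the union-find helper is a plain implementation, not from the slides):

```python
def mst_junction_tree(cliques):
    """Junction graph with weight |Ci ∩ Cj|, then a maximum-weight spanning tree."""
    n = len(cliques)
    edges = [(len(cliques[i] & cliques[j]), i, j)
             for i in range(n) for j in range(i + 1, n)
             if cliques[i] & cliques[j]]
    parent = list(range(n))
    def find(x):                                  # union-find root lookup
        while parent[x] != x:
            x = parent[x]
        return x
    tree = []
    for w, i, j in sorted(edges, reverse=True):   # heaviest separators first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

cliques = [frozenset("ABC"), frozenset("ABD"), frozenset("BCF"), frozenset("FG")]
tree = mst_junction_tree(cliques)
```

On the example this keeps the weight-2 edges ABC—BCF and ABC—ABD plus BCF—FG, and drops ABD—BCF.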

MST for making jtree (2)

[Figure: junction graph over C1 = ABC, C2 = ABD, C3 = BCF, C4 = FG, with edge weights |Ci ∩ Cj|: ABC—ABD (2), ABC—BCF (2), ABD—BCF (1), BCF—FG (1). The maximum-weight spanning tree drops ABD—BCF, giving the junction tree.]

MST for making jtree (3)

• Optimal jtree (for a fixed elimination ordering)
  – Cost of edge e = (v, w):

      e(v, w) = qv + qw − qv∩w,   where qS = Π_{Xi ∈ S} qi  and  qi = # of discrete values Xi can take.

  – Use the cost of an edge to break ties when constructing the MST (minimum cost preferred).

Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks

Collect phase

• From the leaves to the root: each clique Cj first absorbs the messages from its children, then projects onto the separator toward its parent Ck.

    φ̃j = φj · Π_{i ∈ child(j)} μij        (updated potential, from the initial potential φj)
    μjk = Σ_{Cj \ Sjk} φ̃j                 (projection onto the separator Sjk)

[Figure: clique Cj with parent Ck and children Ci, Ci'.]

Distribute phase

• From the root to the leaves: each clique Cj updates its potential by the ratio of the new and old separator messages.

    μ*jk = Σ_{Ck \ Sjk} φ*k
    φ*j = φ̃j · μ*jk / μjk

• φ*j contains the marginal distribution of clique j.

[Figure: clique Cj with parent Ck and children Ci, Ci'.]
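The two phases can be illustrated on the smallest possible junction tree, {A,B}—{B,C} with separator {B} (a hedged sketch: the CPT numbers are made up, and potentials are plain dicts over joint assignments of binary variables):

```python
pA = {0: 0.6, 1: 0.4}
pB_A = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # P(B=b|A=a), keyed (a,b)
pC_B = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}  # P(C=c|B=b), keyed (b,c)

phi1 = {(a, b): pA[a] * pB_A[(a, b)] for a in (0, 1) for b in (0, 1)}  # clique {A,B}
phi2 = {(b, c): pC_B[(b, c)] for b in (0, 1) for c in (0, 1)}          # clique {B,C}

def marg_to_B(phi, axis):
    """Project a two-variable potential onto B (B sits at position `axis`)."""
    out = {0: 0.0, 1: 0.0}
    for k, v in phi.items():
        out[k[axis]] += v
    return out

# Collect: leaf {B,C} sends μ = Σ_C φ2 to the root {A,B}, which absorbs it.
mu = marg_to_B(phi2, 0)
phi1 = {k: v * mu[k[1]] for k, v in phi1.items()}
# Distribute: root sends back μ* = Σ_A φ1*; leaf multiplies by μ*/μ.
mu_star = marg_to_B(phi1, 1)
phi2 = {k: v * mu_star[k[0]] / mu[k[0]] for k, v in phi2.items()}
# phi1 now holds P(A,B) and phi2 holds P(B,C).
```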

Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks

Learning Paradigm

• Known structure or unknown structure
• Full observability or partial observability
• Frequentist or Bayesian

Ks, Fo, Fr (1)

(Known structure, full observability, frequentist)

• Given a training set D = {D1, ..., DM}
• MLE of the parameters of each CPD
  – MLE (Maximum Likelihood Estimate)
  – CPD (Conditional Probability Distribution)

    L = Σ_{m=1}^{M} log Pr(Dm | G) = Σ_{m=1}^{M} Σ_{i=1}^{n} log P(Xi | Pa(Xi), Dm)

  (M = # of data cases, n = # of nodes; the log-likelihood decomposes into one term per node.)

Ks, Fo, Fr (2)

• Multinomial distributions
  – For a tabular CPD, define θijk ≜ P(Xi = k | Pa(Xi) = j).
  – Log-likelihood:

      L = Σ_i Σ_m Σ_{j,k} Iijk(m) log θijk = Σ_i Σ_{j,k} Nijk log θijk

    where Iijk(m) ≜ I(Xi = k, Pa(Xi) = j | Dm) and Nijk ≜ Σ_m Iijk(m).
  – MLE (subject to the constraint Σ_k θijk = 1 for all i, j):

      θ̂ijk = Nijk / Σ_{k'} Nijk'
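A minimal sketch of the MLE for one tabular CPD P(X | Pa(X)) — normalized counts over toy data (the data rows, each a (parent-config j, value k) pair, are made up):

```python
from collections import Counter

# Toy complete-data observations for a single node: (j, k) pairs.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
N = Counter(data)                                   # Nijk as a count table N[j, k]
theta = {(j, k): N[(j, k)] / sum(N[(j, kk)] for kk in (0, 1))
         for j in (0, 1) for k in (0, 1)}           # θ̂ = Njk / Σ_k' Njk'
```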

Ks, Fo, Fr (3)

• MLE of the multinomial distribution
  – Constrained optimization (Lagrangian):

      O = Σ_{ijk} Nijk log θijk + Σ_{ij} λij (1 − Σ_k θijk)

  – Setting the derivative with respect to θijk to zero:

      ∂O/∂θijk = Nijk/θijk − λij = 0   ⇒   Nijk = λij θijk

  – Summing over k gives λij = Σ_k Nijk, hence

      θ̂ijk = Nijk / Σ_{k'} Nijk'


Ks, Fo, Fr (4)

• Conditional linear Gaussian distributions

Ks, Fo, Ba (1)

(Known structure, full observability, Bayesian)

• Frequentist: point estimation
• Bayesian: distributional estimation

Ks, Fo, Ba (2)

• Multinomial distributions
  – Two assumptions on the prior:
    • Global independence: P(θ) = Π_{i=1}^{n} P(θi),  θi = {θijk : j = 1, ..., qi, k = 1, ..., ri}
    • Local independence: P(θi) = Π_{j=1}^{qi} P(θij),  θij = {θijk : k = 1, ..., ri}
  – Global independence + likelihood equivalence leads to a Dirichlet prior, the conjugate prior for the multinomial.

Ks, Fo, Ba (3)

• Remark on the Bayesian approach
  – P(θ|D) ∝ P(D|θ) · P(θ)   (posterior ∝ likelihood × prior)
  – Conjugate priors
    • The posterior has the same form as the prior distribution.
    • Many exponential-family distributions have conjugate priors.

Ks, Fo, Ba (4)

• Multinomial distributions
  – Dirichlet prior on tabular CPDs: θij = P(Xi | Pa(Xi) = j) is a multinomial r.v. with ri possible values.

      θij ~ Dirichlet(αij1, ..., αijri)

      P(θij) = (1 / B(αij1, ..., αijri)) Π_{k=1}^{ri} θijk^(αijk − 1)

      B(α1, ..., αr) = Π_k Γ(αk) / Γ(Σ_k αk),   Γ(n) = (n − 1)! for integer n

  • Posterior distribution:

      θij | D ~ Dirichlet(αij1 + Nij1, ..., αijri + Nijri)

  • Posterior mean:

      E[θijk | D] = (αijk + Nijk) / Σ_{l=1}^{ri} (αijl + Nijl)
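The Dirichlet-multinomial update is a one-liner per row of the CPD — posterior pseudo-counts are prior pseudo-counts plus observed counts (the counts below are made up):

```python
# One row θij of a tabular CPD with ri = 3 values.
alpha = [1.0, 1.0, 1.0]                  # Dirichlet prior (uniform pseudo-counts)
N = [5, 2, 1]                            # observed counts Nij1..Nij3
post = [a + n for a, n in zip(alpha, N)]             # θij | D ~ Dirichlet(α + N)
mean = [p / sum(post) for p in post]                 # E[θijk | D]
```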

Ks, Fo, Ba (5)

• Dirichlet distribution
  – Hyperparameter αijk
    • A positive number.
    • Acts as a pseudo-count: αijk − 1 imaginary cases.
  – Posterior distribution
    • Combines the pseudo-counts with the observed-data counts.
    • A simple sum: θij | D ~ Dirichlet(αij1 + Nij1, ..., αijri + Nijri)


Ks, Fo, Ba (6)

• Gaussian distributions

Ks, Po, Fr (1)

(Known structure, partial observability, frequentist)

• Log-likelihood:

    L = Σ_m log P(Dm) = Σ_m log Σ_h P(H = h, V = Vm)

  (H: hidden variables, V: visible / observed variables)

• Not decomposable into a sum of local terms, one per node.
  – EM algorithm

Ks, Po, Fr (2)

• EM algorithm
  – From Jensen's inequality:  log(Σ_j λj yj) ≥ Σ_j λj log yj,  where Σ_j λj = 1.

      L = Σ_m log Σ_h P(H = h, Vm)
        = Σ_m log Σ_h q(h|Vm) · P(h, Vm) / q(h|Vm)
        ≥ Σ_m Σ_h q(h|Vm) log [ P(h, Vm) / q(h|Vm) ]
        = Σ_m Σ_h q(h|Vm) log P(h, Vm) − Σ_m Σ_h q(h|Vm) log q(h|Vm)

      constraint: Σ_h q(h|Vm) = 1

Ks, Po, Fr (3)

– Maximizing with respect to q (E-step):

    O = Σ_m Σ_h q(h|Vm) log P(h, Vm) − Σ_m Σ_h q(h|Vm) log q(h|Vm) + Σ_m λm (1 − Σ_h q(h|Vm))

    ∂O/∂q(h|Vm) = log P(h, Vm) − log q(h|Vm) − 1 − λm = 0

    ⇒ q(h|Vm) = P(h, Vm) · e^(−1−λm),   and normalization gives  e^(1+λm) = Σ_h P(h, Vm) = P(Vm)

    ⇒ q(h|Vm) = P(h|Vm)

Ks, Po, Fr (4)

– Maximizing with respect to θ (M-step):
  • After q is set to P(h|Vm) in the E-step,
  • maximize the expected complete-data log-likelihood:

      Q(θ'|θ) = Σ_m Σ_h P(h|Vm, θ) log P(H = h, Vm | θ')

      θ* = argmax_{θ'} Q(θ'|θ)

• Iterate until convergence:
  – E-step: compute the expected complete-data log-likelihood.
  – M-step: find θ* maximizing it.

Ks, Po, Fr (5)

• Multinomial distribution
  – E-step: replace the counts Nijk of the complete-data log-likelihood L = Σ Nijk log θijk by their expectations:

      Q(θ'|θ) = Σ_{ijk} E[Nijk] log θ'ijk,   E[Nijk] = Σ_m P(Xi = k, Pa(Xi) = j | Dm, θ)

  – M-step:

      θ' = argmax_{θ'} Q(θ'|θ)   ⇒   θ'ijk = E[Nijk] / Σ_{k'} E[Nijk']
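A minimal EM sketch on an assumed toy model (not from the slides): hidden binary H, observed binary X, network H → X. The E-step computes q(h|x) = P(h|x); the M-step re-estimates the parameters from the expected counts E[N], exactly as above:

```python
data = [0, 0, 0, 1, 1, 0, 0, 1]          # observations of X; H is never observed
pH = [0.5, 0.5]                          # P(H)
pX_H = [[0.6, 0.4], [0.3, 0.7]]          # pX_H[h][x] = P(X=x|H=h), made-up start

for _ in range(50):
    # E-step: posterior q(h | x_m) for each data case
    q = []
    for x in data:
        joint = [pH[h] * pX_H[h][x] for h in (0, 1)]
        z = sum(joint)
        q.append([j / z for j in joint])
    # M-step: expected counts E[N] -> new parameters
    EN_h = [sum(qm[h] for qm in q) for h in (0, 1)]
    pH = [c / len(data) for c in EN_h]
    pX_H = [[sum(qm[h] for qm, x in zip(q, data) if x == v) / EN_h[h]
             for v in (0, 1)] for h in (0, 1)]
```

After the first M-step the learned marginal P(X = 1) = Σ_h P(h) P(X = 1|h) already matches the empirical frequency 3/8 and stays there, as expected for this over-parameterized toy model.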

Ks, Po, Ba (1)

(Known structure, partial observability, Bayesian)

• Gibbs sampling: a stochastic version of EM
• Variational Bayes: approximate P(θ, H|V) ≈ q(θ|V) q(H|V)

Us, Fo, Fr (1)

(Unknown structure, full observability, frequentist)

• Issues
  – Hypothesis space
  – Evaluation function
  – Search algorithm

Us, Fo, Fr (2)

• Search space: DAGs
  – # of DAGs on n nodes ~ O(2^(n^2))
  – 10 nodes → ~O(10^18) DAGs
  – Exhaustively finding the optimal DAG is doomed to failure.

Us, Fo, Fr (3)

• Search algorithm: local search
  – Operators: adding, deleting, or reversing a single arc

    Choose G somehow
    While not converged
        For each G' in nbd(G)
            Compute score(G')
        G* := argmax_{G'} score(G')
        If score(G*) > score(G)
            then G := G*
            else converged := true

  Pseudo-code for hill-climbing. nbd(G) is the neighborhood of G, i.e., the models that can be reached by applying a single local change operator.
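The loop above can be made runnable with stand-in pieces (both nbd() and score() are assumptions for illustration: the "model" is a bit-vector rather than a graph, one local change flips one bit, and the score is a toy function rather than a network score):

```python
def nbd(g):
    """Neighborhood: all models reachable by one local change (one bit flip)."""
    return [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]

def score(g):
    """Toy stand-in for a network score, peaked at (1, 0, 1, 1)."""
    target = (1, 0, 1, 1)
    return sum(a == b for a, b in zip(g, target))

G = (0, 0, 0, 0)                     # choose G somehow
converged = False
while not converged:
    best = max(nbd(G), key=score)    # G* := argmax over the neighborhood
    if score(best) > score(G):
        G = best
    else:
        converged = True
```

The loop climbs greedily and stops at the first local optimum, which for this unimodal toy score is the global one.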

Us, Fo, Fr (4)

• Search algorithm: PC algorithm
  – Starts with a fully connected undirected graph.
  – CI (conditional independence) tests:
    • If X ⊥ Y | S for some conditioning set S, the arc between X and Y is removed.

Us, Fo, Fr (5)

• Scoring function
  – MLE alone would select the fully connected graph.
  – score(G) ∝ P(D|G) P(G)

      P(G|D) = P(D|G) P(G) / P(D)      (MAP model)

      score(G) ≜ P(D|G) P(G),   P(D|G) = ∫ P(D|G, θ) P(θ|G) dθ

  – The marginal likelihood automatically penalizes overly complex models:
    • a complex model has more parameters, so
    • not much probability mass falls on the region where the data actually lies.

Us, Fo, Fr (6)

• Scoring function
  – Under global parameter independence and conjugate priors, the marginal likelihood factors over the nodes:

      P(D|G) = Π_{i=1}^{n} ∫ P(Xi | Pa(Xi), θi) P(θi) dθi ≜ Π_{i=1}^{n} score(Xi, Pa(Xi))

  – Each integral can be evaluated in closed form.

Us, Fo, Fr (7)

• Scoring function
  – Under non-conjugate priors: approximation is needed.
  – The Laplace approximation leads to the BIC (Bayesian Information Criterion):

      log P(D|G) ≈ log P(D|G, θ̂G) − (d/2) log M

    (d: dimension of the model; θ̂G: ML estimate of the parameters)
  – Case of the multinomial distribution:

      BIC-score(G) = Σ_i Σ_m log P(Xi | Pa(Xi), θ̂i, Dm) − (d/2) log M
                   = Σ_{ijk} Nijk log θ̂ijk − (d/2) log M
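The multinomial BIC can be computed directly from the count table of a CPD (a hedged sketch for a single node; the counts and the parameter count d are toy assumptions):

```python
import math

# Toy counts N[j, k] for one tabular CPD with 2 parent configs and 2 values.
N = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 2}
M = sum(N.values())                                 # number of data cases

# Maximized log-likelihood Σ Njk log θ̂jk with θ̂jk = Njk / Σ_k' Njk'.
loglik = sum(n * math.log(n / sum(N[(j, kk)] for kk in (0, 1)))
             for (j, k), n in N.items() if n > 0)
d = 2                                               # qi * (ri - 1) free parameters
bic = loglik - d / 2 * math.log(M)
```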

Us, Fo, Fr (8)

• Scoring function
  – Advantage of a decomposed score: graphs that differ by a single arc differ in at most two terms of the marginal likelihood.
  – Ex) G1: X1→X2→X3→X4,  G2: X1→X2←X3→X4

      P(D|G1) = score(X1) · score(X2, X1) · score(X3, X2) · score(X4, X3)
      P(D|G2) = score(X1) · score(X2, {X1, X3}) · score(X3) · score(X4, X3)

    Only the X2 and X3 terms differ.

Us, Fo, Fr (9)

• Scoring function
  – Marginal likelihood for the multinomial distribution with a Dirichlet prior: the Bayesian Dirichlet (BD) score.

      P(D|G) = ∫ P(D|θ, G) P(θ|G) dθ,   P(D|θ, G) = Π_{i=1}^{n} Π_{j=1}^{qi} Π_{k=1}^{ri} θijk^Nijk

      P(D|G) = Π_{i=1}^{n} Π_{j=1}^{qi} B(αij1 + Nij1, ..., αijri + Nijri) / B(αij1, ..., αijri)
             = Π_{i=1}^{n} Π_{j=1}^{qi} [ Γ(αij) / Γ(αij + Nij) ] Π_{k=1}^{ri} Γ(αijk + Nijk) / Γ(αijk)

    (αij = Σ_k αijk and Nij = Σ_k Nijk; each factor is a ratio of Dirichlet normalizers, i.e., a posterior mean.)

Us, Fo, Ba (1)

(Unknown structure, full observability, Bayesian)

• The posterior over all models is intractable.
  – Focus on some features instead.
• Bayesian model averaging:

      P(f|D) = Σ_G f(G) P(G|D)       (e.g., f(G) = 1 if G contains a certain edge)

• Needs P(G|D):

      P(G|D) = P(D|G) P(G) / Σ_{G'} P(D|G') P(G')      (the normalizing sum is intractable)

  – Solution: MCMC — the Metropolis-Hastings algorithm.
    • Only the ratio R is needed, so the intractable sum is avoided:

        R = P(G2|D) / P(G1|D) = [P(G2) P(D|G2)] / [P(G1) P(D|G1)]

Us, Fo, Ba (2)

• Calculation of P(G|D): sampling G

    Choose G somehow
    While not converged
        Pick a G' u.a.r. from nbd(G)
        Compute R = P(G'|D) q(G|G') / [P(G|D) q(G'|G)]
        Sample u ~ uniform(0, 1)
        If u < min{1, R}
            then G := G'

  Pseudo-code for the MC3 algorithm. "u.a.r." means uniformly at random.

Us, Po, Fr (1)

(Unknown structure, partial observability, frequentist)

• Partially observable case
  – Computation of the marginal likelihood is intractable:

      P(V|G) = Σ_Z ∫ P(V, Z | θ, G) P(θ|G) dθ      (Z: hidden variables)

  – Not decomposable into a product of local terms.
  – Solutions:
    • Approximating the marginal likelihood
    • Structural EM

Us, Po, Fr (2)

• Approximating the marginal likelihood
  – Candidate's method: for any parameter value θ*G (e.g., the MLE),

      P(D|G) = P(D | θ*G, G) · P(θ*G | G) / P(θ*G | G, D)

    • P(D | θ*G, G): from a BN inference algorithm
    • P(θ*G | G): trivial (the prior)
    • P(θ*G | G, D): from Gibbs sampling

Us, Po, Fr (3)

• Structural EM
  – Idea: decomposition of the expected complete-data log-likelihood (BIC score).
  – Search inside EM (EM inside search is a high-cost process).

      BIC-score(G) = Σ_{ijk} Nijk log θijk − (d/2) log M

      E-BIC-score(G) = Σ_{ijk} E[Nijk] log θ̂ijk − (d/2) log M,
      E[Nijk] = Σ_m P(Xi = k, Pa(Xi) = j | Dm, θ)

    (θ̂: MLE of the parameters)

Us, Po, Ba (1)

(Unknown structure, partial observability, Bayesian)

• Combined MCMC:
  – MCMC for Bayesian model averaging over structures, and
  – MCMC over the values of the unobserved nodes.

Conclusion

• Does learning of structure have important meaning?
  – On paper, yes.
  – In engineering, no.
• What can AI do for humans?
• What can humans do for machine learning algorithms?
