Chapter 2 Bayesian Networks:
Representation
2016 Fall
Jin Gu, Michael Zhang
Reviews
Outline
• Conditional independence
• Conditional parameterization
• Naïve Bayes model
• Bayesian networks
– BNs and local independences
– I-map and factorization
– d-separation
– From distributions to BNs
Decision with Probability
• When the number of variables you need to consider is very large, the human brain struggles to reach an "optimal" decision.
• Why?
The number of parameters increases exponentially with the number of variables!
For n binary variables: ~2^n parameters
Representing Joint Distributions
• Random variables: 𝑋1, … , 𝑋𝑛
• 𝑃 is a joint distribution over 𝑋1, … , 𝑋𝑛
If 𝑋1, …, 𝑋𝑛 are binary, we need 2^n − 1 parameters to describe 𝑃
Can we represent 𝑷 more compactly?
Key: Exploit independence properties
Independent Random Variables
• Two variables 𝑋 and 𝑌 are independent if
– 𝑃(𝑋 = 𝑥 | 𝑌 = 𝑦) = 𝑃(𝑋 = 𝑥) for all values 𝑥, 𝑦
– Equivalently, knowing 𝑌 does not change predictions of 𝑋
• If X and Y are independent then:
– 𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌) = 𝑃(𝑋)𝑃(𝑌)
• If 𝑋1, … , 𝑋𝑛 are independent then:
– 𝑃(𝑋1, … , 𝑋𝑛) = 𝑃(𝑋1) … 𝑃(𝑋𝑛)
– 𝑂(𝑛) parameters
– All 2^n probabilities are implicitly defined
– Cannot represent many types of distributions
Conditional Independence
• Two variables 𝑋 and 𝑌 are conditionally
independent given 𝑍, if:
– 𝑃(𝑋 = 𝑥|𝑌 = 𝑦, 𝑍 = 𝑧) = 𝑃(𝑋 = 𝑥|𝑍 = 𝑧) for all
values 𝑥, 𝑦, 𝑧
– Equivalently, if we know 𝑍, then knowing 𝑌 does
not change predictions of 𝑋
– Notation: 𝐼𝑛𝑑(𝑋; 𝑌 | 𝑍) or (𝑋 ⊥ 𝑌 | 𝑍)
Student Example
• D = Course difficulty, Val(D)= {d0,d1}
• I = Intelligence, Val(I) = {i0,i1}
• S = Score on SAT, Val(S) = {s0,s1}
• G = Course grade, Val(G) = {g0,g1,g2}
• L = Recommendation letter, Val(L) = {l0,l1}
• Assume that G and S are independent given I
(Figure: the student network, a DAG over D, I, G, L, S)
A concrete example to show that
independences can reduce the required
parameters for representing a distribution
Conditional Parameterization
• S = Score on SAT, Val(S) = {s0,s1}
• I = Intelligence, Val(I) = {i0,i1}
𝑃(𝐼, 𝑆) (joint parameterization):
I | S | P(I,S)
i0 | s0 | 0.665
i0 | s1 | 0.035
i1 | s0 | 0.06
i1 | s1 | 0.24

𝑃(𝑆|𝐼):
I \ S | s0 | s1
i0 | 0.95 | 0.05
i1 | 0.2 | 0.8

𝑃(𝐼):
i0 | i1
0.7 | 0.3

𝑃(𝐼, 𝑆) = 𝑃(𝑆|𝐼) 𝑃(𝐼)
Joint parameterization: 3 parameters. Conditional parameterization: 3 parameters.
Alternative conditional parameterization: 𝑃(𝑆) and 𝑃(𝐼|𝑆)
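The tables above can be checked mechanically; a small sketch (variable and value names are just labels) that reproduces the joint table 𝑃(𝐼, 𝑆) from the conditional parameterization 𝑃(𝑆|𝐼)𝑃(𝐼):

```python
# Rebuild the joint P(I,S) from P(I) and P(S|I) and compare it against
# the joint table on the slide.
pI = {'i0': 0.7, 'i1': 0.3}
pS_given_I = {'i0': {'s0': 0.95, 's1': 0.05},
              'i1': {'s0': 0.2, 's1': 0.8}}

pIS = {(i, s): pI[i] * pS_given_I[i][s]
       for i in pI for s in ('s0', 's1')}

expected = {('i0', 's0'): 0.665, ('i0', 's1'): 0.035,
            ('i1', 's0'): 0.06, ('i1', 's1'): 0.24}
for cell in expected:
    assert abs(pIS[cell] - expected[cell]) < 1e-9
```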
Conditional Parameterization
• 𝑆 = Score on SAT, Val(S) = {s0,s1}
• 𝐼 = Intelligence, Val(I) = {i0,i1}
• 𝐺 = Grade, Val(G) = {g0,g1,g2}
• Assume that 𝐺 and 𝑆 are independent given 𝐼
Joint parameterization:
2·2·3 − 1 = 12 − 1 = 11 independent parameters
Conditional parameterization:
P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I)
P(I) – 1 independent parameter
P(S|I) – 2·1 = 2 independent parameters
P(G|I) – 2·2 = 4 independent parameters
Total: 1 + 2 + 4 = 7 independent parameters
Independences can reduce
the required parameters
Naïve Bayes Model
• Class variable 𝐶, 𝑉𝑎𝑙(𝐶) = {𝑐1, … , 𝑐𝑘}
• Evidence variables 𝑋1, … , 𝑋𝑛
• Naïve Bayes assumption: evidence variables are conditionally
independent given C
• Applications in medical diagnosis, text classification
• Used as a classifier (e.g., 𝑘 = 2):
• Problem: Double counting correlated evidence
P(C, X1, …, Xn) = P(C) · ∏_{i=1}^{n} P(Xi | C)

P(C = c1 | x1, …, xn) / P(C = c2 | x1, …, xn)
= [P(C = c1) / P(C = c2)] · ∏_{i=1}^{n} [P(xi | C = c1) / P(xi | C = c2)]
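The classification rule amounts to picking the class with the largest posterior. A minimal naive Bayes classifier sketch; the class names, feature CPDs, and all numbers below are made up for illustration:

```python
import math

def nb_predict(prior, cpds, x):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c).

    prior: {class: P(c)}
    cpds:  {class: [ {value: P(X_i = value | c)}, ... one dict per feature ]}
    x:     tuple of observed feature values
    """
    best, best_score = None, -math.inf
    for c, p_c in prior.items():
        score = math.log(p_c) + sum(math.log(cpds[c][i][v])
                                    for i, v in enumerate(x))
        if score > best_score:
            best, best_score = c, score
    return best

prior = {'spam': 0.4, 'ham': 0.6}
cpds = {'spam': [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
        'ham':  [{0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}]}
```

Working in log space avoids underflow when n is large. The double-counting problem from the last bullet shows up here directly: each correlated feature still contributes its own likelihood ratio to the product, as if it were independent evidence.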
Bayesian Networks (Intuitive)
• Directed acyclic graph (DAG) 𝐺
– Nodes 𝑋1, …, 𝑋𝑛 represent random variables
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its parents
– Formally: (𝑋𝑖 ⊥ NonDesc(𝑋𝑖) | Pa(𝑋𝑖))
(Figure: example DAG over A, B, C, D, E, F, G, in which 𝐸 ⊥ {𝐴, 𝐶, 𝐷, 𝐹} | 𝐵)
(Figure: the student network over I, D, G, L, S)
Can we find a simple graph model that represents the probability distribution, fully or partially, with the same independences?
Independency Mappings (I-Maps)
• I-Maps (Independence Maps)
– Let 𝑃 be a distribution over 𝑿
– Let 𝐼(𝑃) be the independencies in 𝑃
– A Bayesian network 𝐺 is an I-map of 𝑃 if 𝐼(𝐺) ⊆ 𝐼(𝑃)
Example 1: graph with no edge between I and S
I | S | P(I,S)
i0 | s0 | 0.25
i0 | s1 | 0.25
i1 | s0 | 0.25
i1 | s1 | 0.25
Here I(G) = {I ⊥ S} and I(P) = {I ⊥ S}, so G is an I-map of P.

Example 2: graph with an edge between I and S
I | S | P(I,S)
i0 | s0 | 0.4
i0 | s1 | 0.3
i1 | s0 | 0.2
i1 | s1 | 0.1
Here I(G) = ∅ and I(P) = ∅.
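The two I(P) computations can be verified numerically; a quick sketch testing whether I ⊥ S holds in each joint table:

```python
# Test whether P(i, s) = P(i) P(s) for every cell of a 2x2 joint table.
def independent(p):
    """p: dict {(i_value, s_value): probability} over binary I and S."""
    for (i, s) in p:
        p_i = sum(v for (ii, _), v in p.items() if ii == i)
        p_s = sum(v for (_, ss), v in p.items() if ss == s)
        if abs(p[(i, s)] - p_i * p_s) > 1e-9:
            return False
    return True

p1 = {('i0', 's0'): 0.25, ('i0', 's1'): 0.25,
      ('i1', 's0'): 0.25, ('i1', 's1'): 0.25}   # uniform: I and S independent
p2 = {('i0', 's0'): 0.4, ('i0', 's1'): 0.3,
      ('i1', 's0'): 0.2, ('i1', 's1'): 0.1}     # P(i0)P(s0) = 0.42 != 0.4
```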
Factorization Theorem
• If 𝐺 is an I-Map of 𝑃, then
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
• If
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)),
then 𝐺 is an I-Map of 𝑃
G is a given graph. If G is an I-Map of P, P can be
factorized according to G.
G is a given graph. If P can be factorized according
to G, G is an I-Map of P.
If we define the independences in G as
𝑋𝑖 ⊥𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖)| 𝑃𝑎(𝑋𝑖)
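The "factorization ⇒ I-map" direction can be illustrated on the three-variable student fragment with graph I → S, I → G. This is a sketch: P(I) and P(S|I) reuse the tables from the conditional-parameterization slide, while P(G|I) is invented for the example.

```python
from itertools import product

# Build the joint by multiplying CPDs over the graph I -> S, I -> G,
# then confirm numerically that the local independence (S ⊥ G | I) holds.
pI = {'i0': 0.7, 'i1': 0.3}
pS_given_I = {'i0': {'s0': 0.95, 's1': 0.05},
              'i1': {'s0': 0.2, 's1': 0.8}}
pG_given_I = {'i0': {'g0': 0.2, 'g1': 0.3, 'g2': 0.5},   # made-up CPD
              'i1': {'g0': 0.7, 'g1': 0.2, 'g2': 0.1}}

joint = {(i, s, g): pI[i] * pS_given_I[i][s] * pG_given_I[i][g]
         for i, s, g in product(pI, ('s0', 's1'), ('g0', 'g1', 'g2'))}

assert abs(sum(joint.values()) - 1.0) < 1e-12   # a valid distribution

for i in pI:                       # check P(s, g | i) = P(s | i) P(g | i)
    p_i = sum(p for (ii, _, _), p in joint.items() if ii == i)
    for s, g in product(('s0', 's1'), ('g0', 'g1', 'g2')):
        p_sg = joint[(i, s, g)] / p_i
        p_s = sum(joint[(i, s, gg)] for gg in ('g0', 'g1', 'g2')) / p_i
        p_g = sum(joint[(i, ss, g)] for ss in ('s0', 's1')) / p_i
        assert abs(p_sg - p_s * p_g) < 1e-9
```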
Proof: I-Map to Factorization
• If 𝐺 is an I-Map of 𝑃, then
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
Proof:
• wlog. 𝑋1, …, 𝑋𝑛 is an ordering consistent with 𝐺
• By chain rule:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi−1)
• From the assumption that the ordering is consistent with 𝐺:
Pa(Xi) ⊆ {X1, …, Xi−1}
{X1, …, Xi−1} \ Pa(Xi) ⊆ NonDesc(Xi)
• Since 𝐺 is an I-Map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), hence
P(Xi | X1, …, Xi−1) = P(Xi | Pa(Xi))
Proof: Factorization Implies I-Map
• If
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)),
then 𝐺 is an I-Map of 𝑃
• Need to prove (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P),
or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
Proof:
• wlog. 𝑋1, …, 𝑋𝑛 is an ordering consistent with 𝐺, chosen so that the non-descendants of 𝑋𝑖 are exactly {X1, …, Xi−1}
P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
= ∏_{k=1}^{i} P(Xk | Pa(Xk)) / ∏_{k=1}^{i−1} P(Xk | Pa(Xk))
= P(Xi | Pa(Xi))
Formal Bayesian Network Definition
• A Bayesian network is a pair (𝐺, 𝑃)
– 𝑃 factorizes over 𝐺
– 𝑃 is specified as a set of conditional probability distributions (CPDs) associated with 𝐺's nodes
• Parameters
– Joint distribution: ~2^n
– Bayesian network (bounded in-degree 𝑘): ~n·2^k
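For a feel of the gap, a two-line sketch comparing the counts (n = 30 and k = 3 are arbitrary example values):

```python
# Parameter counts for n binary variables (the slide's approximation):
# the full joint needs 2^n - 1 independent parameters, while a BN whose
# nodes have at most k parents needs about n * 2^k.
def joint_params(n):
    return 2 ** n - 1

def bn_params(n, k):
    return n * 2 ** k   # at most 2^k parent configurations per node

print(joint_params(30), bn_params(30, 3))  # 1073741823 vs 240
```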
d-Separation in BNs
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its
parents
– Formally: (𝑋𝑖 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖) | 𝑃𝑎(𝑋𝑖))
If the variables in the upward closure are given, what other independences hold?
d-Separation in BNs
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its
parents
– Formally: (𝑋𝑖 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖) | 𝑃𝑎(𝑋𝑖))
Does 𝐺 encode other independence assumptions that hold
in every distribution 𝑃 that factorizes over 𝐺?
Devise a procedure to find all independencies in 𝐺
Not Separated: Direct Connection
• If 𝑋 and 𝑌 are directly connected in 𝐺, then no 𝑍 exists for which 𝐼𝑛𝑑(𝑋; 𝑌 | 𝑍) holds in every distribution that factorizes over 𝐺
– Example: 𝑌 is a deterministic function of 𝑋
(Figure: X → Y)
Not Separated: Indirect Connection
• Case 1: Indirect causal effect (𝑋 → 𝑍 → 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 2: Indirect evidential effect (𝑋 ← 𝑍 ← 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 3: Common cause (𝑋 ← 𝑍 → 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 4: Common effect, v-structure (𝑋 → 𝑍 ← 𝑌): blocked when 𝑍 is not observed; active when 𝑍 (or one of its descendants) is observed
Not Separated: the General Case
• Let 𝐺 be a Bayesian network structure
• Let 𝑋1…𝑋𝑛 be a trail in 𝐺
• Let 𝑬 be a subset of evidence nodes in 𝐺
The trail 𝑋1 – … – 𝑋𝑛 is active given evidence 𝑬 if:
– For every v-structure 𝑋𝑖−1 → 𝑋𝑖 ← 𝑋𝑖+1 along the trail, 𝑋𝑖 or one of its descendants is in 𝑬
– No other node along the trail is in 𝑬
d-Separation
• 𝑿 and 𝒀 are d-separated in 𝐺 given 𝒁, denoted d-sep_G(𝑿; 𝒀 | 𝒁), if there is no active trail between any node 𝑋 ∈ 𝑿 and any node 𝑌 ∈ 𝒀 in 𝐺
• Get all independences from d-separation:
– 𝐼(𝐺) = {(𝑿 ⊥ 𝒀 | 𝒁) : d-sep_G(𝑿; 𝒀 | 𝒁)}
d-Separation Examples
(Figure: an example DAG over A, B, C, D, E, shown three times with different evidence sets)
• d-sep(B, C | D) = no
• d-sep(B, C) = yes
• d-sep(B, C | A, D) = yes
Algorithm for d-Separation
Aim: find all reachable nodes
from X given Z
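A sketch of such a reachability procedure (in the spirit of the Bayes-ball / reachable-nodes algorithm; the dict-of-children graph encoding is an illustrative choice): traverse (node, direction) states, letting evidence block chains and common causes while activating v-structures.

```python
from collections import deque

def reachable(children, x, z):
    """Nodes reachable from x (including x) via an active trail given z.

    children: dict node -> list of children; z: set of observed nodes.
    """
    parents = {n: [] for n in children}
    for n, ch in children.items():
        for c in ch:
            parents[c].append(n)

    # Phase 1: z and its ancestors, needed for the v-structure test.
    ancestors, frontier = set(), set(z)
    while frontier:
        n = frontier.pop()
        if n not in ancestors:
            ancestors.add(n)
            frontier.update(parents[n])

    # Phase 2: BFS over (node, direction) states; 'up' means the trail
    # entered the node from a child, 'down' from a parent.
    visited, result = set(), set()
    queue = deque([(x, 'up')])
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z:
            result.add(node)
        if direction == 'up' and node not in z:
            for p in parents[node]:       # trail continues upward
                queue.append((p, 'up'))
            for c in children[node]:      # or turns downward (common cause)
                queue.append((c, 'down'))
        elif direction == 'down':
            if node not in z:             # causal chain continues downward
                queue.extend((c, 'down') for c in children[node])
            if node in ancestors:         # v-structure activated by evidence
                queue.extend((p, 'up') for p in parents[node])
    return result
```

Then 𝒀 is d-separated from 𝑋 given 𝒁 exactly when no node of 𝒀 is returned. On a student-style network with edges D→G, I→G, I→S, G→L (consistent with the slides' assumption that G and S are independent given I), S is unreachable from D with no evidence, but becomes reachable once G is observed.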
I-Equivalence
• Two graphs 𝐺1 and 𝐺2 are I-equivalent if 𝐼(𝐺1) = 𝐼(𝐺2), i.e., the two graphs encode exactly the same independences.
• 𝐺1 and 𝐺2 have the same skeleton and the same set of immoralities (v-structures) if and only if they are I-equivalent.
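This characterization can be checked directly; a sketch (graphs again encoded as dicts of children lists, an illustrative choice) comparing skeletons and immoralities:

```python
def skeleton(children):
    """The undirected edge set of the DAG."""
    return {frozenset((a, b)) for a, ch in children.items() for b in ch}

def immoralities(children):
    """V-structures a -> c <- b whose tails a, b are non-adjacent."""
    parents = {n: [] for n in children}
    for n, ch in children.items():
        for c in ch:
            parents[c].append(n)
    sk = skeleton(children)
    out = set()
    for c, ps in parents.items():
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in sk:
                    out.add((frozenset((ps[i], ps[j])), c))
    return out

def i_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and immoralities(g1) == immoralities(g2))
```

For example, the chain X → Z → Y and its reversal X ← Z ← Y are I-equivalent (same skeleton, no immoralities), while the v-structure X → Z ← Y is not equivalent to either.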
From Distributions to BNs
• If P factorizes over G, we can derive a rich set
of independence assertions that hold for P by
simply examining G.
• Given a distribution P (typically a complex distribution whose encoded independencies are hard to read off directly), to what extent can we construct a graph G whose independencies are a reasonable surrogate for the independencies in P?
Minimal I-Maps
• A graph G is a minimal I-map for a set of independences I if it is an I-map for I, and if the removal of even a single edge from G renders it not an I-map.
Removal of an edge means additional independences!
Basic idea: for the i-th variable Xi, find a minimal set of parents for Xi among the previous variables X1, …, Xi−1.
Note: different initial orderings may generate different networks.
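The construction can be sketched end-to-end on a toy distribution. Everything below is illustrative: the joint factorizes as P(I)P(S|I)P(G|I) with a made-up P(G|I), the conditional-independence test is a brute-force numeric check on the joint table, and the search tries smaller parent sets first, so it returns minimum-size parent sets (a slightly stronger property than minimality).

```python
from itertools import combinations, product

VALS = {'I': (0, 1), 'S': (0, 1), 'G': (0, 1, 2)}
ORDER = ['I', 'S', 'G']

pI = {0: 0.7, 1: 0.3}
pS = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
pG = {0: {0: 0.3, 1: 0.4, 2: 0.3}, 1: {0: 0.1, 1: 0.2, 2: 0.7}}
JOINT = {(i, s, g): pI[i] * pS[i][s] * pG[i][g]
         for i, s, g in product(*(VALS[v] for v in ORDER))}

def marg(assign):
    """Marginal probability of a partial assignment {var: value}."""
    idx = {v: k for k, v in enumerate(ORDER)}
    return sum(p for a, p in JOINT.items()
               if all(a[idx[v]] == val for v, val in assign.items()))

def ci(x, rest, u):
    """Numeric test of (x ⊥ rest | u) in JOINT."""
    for uv in product(*(VALS[v] for v in u)):
        base = dict(zip(u, uv))
        pu = marg(base)
        if pu == 0:
            continue
        for xv in VALS[x]:
            for rv in product(*(VALS[v] for v in rest)):
                lhs = marg({**base, x: xv, **dict(zip(rest, rv))}) / pu
                rhs = (marg({**base, x: xv}) / pu) * \
                      (marg({**base, **dict(zip(rest, rv))}) / pu)
                if abs(lhs - rhs) > 1e-9:
                    return False
    return True

def minimal_imap(order):
    """Smallest parent set among predecessors for each variable."""
    parents = {}
    for i, x in enumerate(order):
        prev = order[:i]
        done = False
        for size in range(len(prev) + 1):   # smallest sets first
            for u in combinations(prev, size):
                rest = [v for v in prev if v not in u]
                if ci(x, rest, u):
                    parents[x] = set(u)
                    done = True
                    break
            if done:
                break
    return parents
```

With the ordering I, S, G this recovers parents {}, {I}, {I} — i.e., the graph I → S, I → G. A different initial ordering can, as the note above says, produce a different (possibly denser) network.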
Perfect Maps
• A graph G is a perfect map (P-map) for a set of independences I if I(G) = I; likewise, G is a P-map for a distribution P if I(G) = I(P).
• Enumerate all independences in G and P to see whether G is a P-map of P.
• How to find a graph that is a P-map of a distribution? (Please read textbook 3.4.3)
Summary
• Independences can reduce the required
parameters to represent a distribution
• The factorization theorem establishes a mapping between a distribution and a graph
• Minimal I-Maps provide a possible way to find
a graph representation of a distribution
Assignment #2