Chapter 2 Bayesian Networks:
Representation
2016 Fall
Jin Gu, Michael Zhang
Reviews
Outline
• Conditional independence
• Conditional parameterization
• Naïve Bayes model
• Bayesian networks
– BNs and local independences
– I-map and factorization
– d-separation
– From distributions to BNs
Decision with Probability
• When the number of variables you need to consider is very large, the human brain struggles to reach an "optimal" decision.
• Why?
The number of parameters increases exponentially with the number of variables!
For n binary variables: ~2^n parameters
Representing Joint Distributions
• Random variables: 𝑋1, … , 𝑋𝑛
• 𝑃 is a joint distribution over 𝑋1, … , 𝑋𝑛
If 𝑋1, …, 𝑋𝑛 are binary, we need 2^n − 1 parameters to describe 𝑃
Can we represent 𝑷 more compactly?
Key: Exploit independence properties
Independent Random Variables
• Two variables 𝑋 and 𝑌 are independent if
– 𝑃(𝑋 = 𝑥 | 𝑌 = 𝑦) = 𝑃(𝑋 = 𝑥) for all values 𝑥, 𝑦
– Equivalently, knowing 𝑌 does not change predictions of 𝑋
• If X and Y are independent then:
– 𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌) = 𝑃(𝑋)𝑃(𝑌)
• If 𝑋1, … , 𝑋𝑛 are independent then:
– 𝑃(𝑋1, … , 𝑋𝑛) = 𝑃(𝑋1) … 𝑃(𝑋𝑛)
– 𝑂(𝑛) parameters
– All 2^n probabilities are implicitly defined
– Cannot represent many types of distributions
Conditional Independence
• Two variables 𝑋 and 𝑌 are conditionally
independent given 𝑍, if:
– 𝑃(𝑋 = 𝑥|𝑌 = 𝑦, 𝑍 = 𝑧) = 𝑃(𝑋 = 𝑥|𝑍 = 𝑧) for all
values 𝑥, 𝑦, 𝑧
– Equivalently, if we know 𝑍, then knowing 𝑌 does
not change predictions of 𝑋
– Notation: 𝐼𝑛𝑑(𝑋; 𝑌 | 𝑍) or (𝑋 ⊥ 𝑌 | 𝑍)
Student Example
• D = Course difficulty, Val(D)= {d0,d1}
• I = Intelligence, Val(I) = {i0,i1}
• S = Score on SAT, Val(S) = {s0,s1}
• G = Course grade, Val(G) = {g0,g1,g2}
• L = Recommendation letter, Val(L) = {l0,l1}
• Assume that G and S are independent given I
(Figure: the student network, a DAG over D, I, G, L, S)
A concrete example to show that
independences can reduce the required
parameters for representing a distribution
Conditional Parameterization
• S = Score on SAT, Val(S) = {s0,s1}
• I = Intelligence, Val(I) = {i0,i1}
𝑃(𝐼, 𝑆) (joint parameterization):
I | S | P(I,S)
i0 | s0 | 0.665
i0 | s1 | 0.035
i1 | s0 | 0.06
i1 | s1 | 0.24

𝑃(𝑆|𝐼):
I \ S | s0 | s1
i0 | 0.95 | 0.05
i1 | 0.2 | 0.8

𝑃(𝐼):
i0 | i1
0.7 | 0.3

𝑃(𝐼, 𝑆) = 𝑃(𝑆|𝐼) 𝑃(𝐼)
Joint parameterization: 3 parameters. Conditional parameterization: 3 parameters.
Alternative conditional parameterization: 𝑃(𝑆) and 𝑃(𝐼|𝑆)
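The tables above can be checked mechanically; a small sketch (variable and value names are just labels) that reproduces the joint table 𝑃(𝐼, 𝑆) from the conditional parameterization 𝑃(𝑆|𝐼)𝑃(𝐼):

```python
# Rebuild the joint P(I,S) from P(I) and P(S|I) and compare it against
# the joint table on the slide.
pI = {'i0': 0.7, 'i1': 0.3}
pS_given_I = {'i0': {'s0': 0.95, 's1': 0.05},
              'i1': {'s0': 0.2, 's1': 0.8}}

pIS = {(i, s): pI[i] * pS_given_I[i][s]
       for i in pI for s in ('s0', 's1')}

expected = {('i0', 's0'): 0.665, ('i0', 's1'): 0.035,
            ('i1', 's0'): 0.06, ('i1', 's1'): 0.24}
for cell in expected:
    assert abs(pIS[cell] - expected[cell]) < 1e-9
```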
Conditional Parameterization
• 𝑆 = Score on SAT, Val(S) = {s0,s1}
• 𝐼 = Intelligence, Val(I) = {i0,i1}
• 𝐺 = Grade, Val(G) = {g0,g1,g2}
• Assume that 𝐺 and 𝑆 are independent given 𝐼
Joint parameterization:
2·2·3 − 1 = 12 − 1 = 11 independent parameters
Conditional parameterization:
P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I)
P(I) – 1 independent parameter
P(S|I) – 2·1 = 2 independent parameters
P(G|I) – 2·2 = 4 independent parameters
Total: 1 + 2 + 4 = 7 independent parameters
Independences can reduce
the required parameters
Naïve Bayes Model
• Class variable 𝐶, 𝑉𝑎𝑙(𝐶) = {𝑐1, … , 𝑐𝑘}
• Evidence variables 𝑋1, … , 𝑋𝑛
• Naïve Bayes assumption: evidence variables are conditionally
independent given C
• Applications in medical diagnosis, text classification
• Used as a classifier (e.g., 𝑘 = 2):
• Problem: Double counting correlated evidence
P(C, X1, …, Xn) = P(C) · ∏_{i=1}^{n} P(Xi | C)

P(C = c1 | x1, …, xn) / P(C = c2 | x1, …, xn)
= [P(C = c1) / P(C = c2)] · ∏_{i=1}^{n} [P(xi | C = c1) / P(xi | C = c2)]
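The classification rule amounts to picking the class with the largest posterior. A minimal naive Bayes classifier sketch; the class names, feature CPDs, and all numbers below are made up for illustration:

```python
import math

def nb_predict(prior, cpds, x):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c).

    prior: {class: P(c)}
    cpds:  {class: [ {value: P(X_i = value | c)}, ... one dict per feature ]}
    x:     tuple of observed feature values
    """
    best, best_score = None, -math.inf
    for c, p_c in prior.items():
        score = math.log(p_c) + sum(math.log(cpds[c][i][v])
                                    for i, v in enumerate(x))
        if score > best_score:
            best, best_score = c, score
    return best

prior = {'spam': 0.4, 'ham': 0.6}
cpds = {'spam': [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
        'ham':  [{0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}]}
```

Working in log space avoids underflow when n is large. The double-counting problem from the last bullet shows up here directly: each correlated feature still contributes its own likelihood ratio to the product, as if it were independent evidence.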
Bayesian Networks (Intuitive)
• Directed acyclic graph (DAG) 𝐺
– Nodes 𝑋1, …, 𝑋𝑛 represent random variables
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its parents
– Formally: (𝑋𝑖 ⊥ NonDesc(𝑋𝑖) | Pa(𝑋𝑖))
(Figure: example DAG over A, B, C, D, E, F, G, in which 𝐸 ⊥ {𝐴, 𝐶, 𝐷, 𝐹} | 𝐵)
(Figure: the student network over I, D, G, L, S)
Can we find a simple graph model that represents the probability distribution, fully or partially, with the same independences?
Independency Mappings (I-Maps)
• I-Maps (Independence Maps)
– Let 𝑃 be a distribution over 𝑿
– Let 𝐼(𝑃) be the independencies in 𝑃
– A Bayesian network 𝐺 is an I-map of 𝑃 if 𝐼(𝐺) ⊆ 𝐼(𝑃)
Example 1: graph with no edge between I and S
I | S | P(I,S)
i0 | s0 | 0.25
i0 | s1 | 0.25
i1 | s0 | 0.25
i1 | s1 | 0.25
Here I(G) = {I ⊥ S} and I(P) = {I ⊥ S}, so G is an I-map of P.

Example 2: graph with an edge between I and S
I | S | P(I,S)
i0 | s0 | 0.4
i0 | s1 | 0.3
i1 | s0 | 0.2
i1 | s1 | 0.1
Here I(G) = ∅ and I(P) = ∅.
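The two I(P) computations can be verified numerically; a quick sketch testing whether I ⊥ S holds in each joint table:

```python
# Test whether P(i, s) = P(i) P(s) for every cell of a 2x2 joint table.
def independent(p):
    """p: dict {(i_value, s_value): probability} over binary I and S."""
    for (i, s) in p:
        p_i = sum(v for (ii, _), v in p.items() if ii == i)
        p_s = sum(v for (_, ss), v in p.items() if ss == s)
        if abs(p[(i, s)] - p_i * p_s) > 1e-9:
            return False
    return True

p1 = {('i0', 's0'): 0.25, ('i0', 's1'): 0.25,
      ('i1', 's0'): 0.25, ('i1', 's1'): 0.25}   # uniform: I and S independent
p2 = {('i0', 's0'): 0.4, ('i0', 's1'): 0.3,
      ('i1', 's0'): 0.2, ('i1', 's1'): 0.1}     # P(i0)P(s0) = 0.42 != 0.4
```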
Factorization Theorem
• If 𝐺 is an I-Map of 𝑃, then
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
• If
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)),
then 𝐺 is an I-Map of 𝑃
G is a given graph. If G is an I-Map of P, P can be
factorized according to G.
G is a given graph. If P can be factorized according
to G, G is an I-Map of P.
If we define the independences in G as
𝑋𝑖 ⊥𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖)| 𝑃𝑎(𝑋𝑖)
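The "factorization ⇒ I-map" direction can be illustrated on the three-variable student fragment with graph I → S, I → G. This is a sketch: P(I) and P(S|I) reuse the tables from the conditional-parameterization slide, while P(G|I) is invented for the example.

```python
from itertools import product

# Build the joint by multiplying CPDs over the graph I -> S, I -> G,
# then confirm numerically that the local independence (S ⊥ G | I) holds.
pI = {'i0': 0.7, 'i1': 0.3}
pS_given_I = {'i0': {'s0': 0.95, 's1': 0.05},
              'i1': {'s0': 0.2, 's1': 0.8}}
pG_given_I = {'i0': {'g0': 0.2, 'g1': 0.3, 'g2': 0.5},   # made-up CPD
              'i1': {'g0': 0.7, 'g1': 0.2, 'g2': 0.1}}

joint = {(i, s, g): pI[i] * pS_given_I[i][s] * pG_given_I[i][g]
         for i, s, g in product(pI, ('s0', 's1'), ('g0', 'g1', 'g2'))}

assert abs(sum(joint.values()) - 1.0) < 1e-12   # a valid distribution

for i in pI:                       # check P(s, g | i) = P(s | i) P(g | i)
    p_i = sum(p for (ii, _, _), p in joint.items() if ii == i)
    for s, g in product(('s0', 's1'), ('g0', 'g1', 'g2')):
        p_sg = joint[(i, s, g)] / p_i
        p_s = sum(joint[(i, s, gg)] for gg in ('g0', 'g1', 'g2')) / p_i
        p_g = sum(joint[(i, ss, g)] for ss in ('s0', 's1')) / p_i
        assert abs(p_sg - p_s * p_g) < 1e-9
```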
Proof: I-Map to Factorization
• If 𝐺 is an I-Map of 𝑃, then
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
Proof:
• wlog. 𝑋1, …, 𝑋𝑛 is an ordering consistent with 𝐺
• By chain rule:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi−1)
• From the assumption that the ordering is consistent with 𝐺:
Pa(Xi) ⊆ {X1, …, Xi−1}
{X1, …, Xi−1} \ Pa(Xi) ⊆ NonDesc(Xi)
• Since 𝐺 is an I-Map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), hence
P(Xi | X1, …, Xi−1) = P(Xi | Pa(Xi))
Proof: Factorization Implies I-Map
• If
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)),
then 𝐺 is an I-Map of 𝑃
• Need to prove (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P),
or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
Proof:
• wlog. 𝑋1, …, 𝑋𝑛 is an ordering consistent with 𝐺, chosen so that the non-descendants of 𝑋𝑖 are exactly {X1, …, Xi−1}
P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
= ∏_{k=1}^{i} P(Xk | Pa(Xk)) / ∏_{k=1}^{i−1} P(Xk | Pa(Xk))
= P(Xi | Pa(Xi))
Formal Bayesian Network Definition
• A Bayesian network is a pair (𝐺, 𝑃)
– 𝑃 factorizes over 𝐺
– 𝑃 is specified as a set of conditional probability distributions (CPDs) associated with 𝐺's nodes
• Parameters
– Joint distribution: ~2^n
– Bayesian network (bounded in-degree 𝑘): ~n·2^k
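For a feel of the gap, a two-line sketch comparing the counts (n = 30 and k = 3 are arbitrary example values):

```python
# Parameter counts for n binary variables (the slide's approximation):
# the full joint needs 2^n - 1 independent parameters, while a BN whose
# nodes have at most k parents needs about n * 2^k.
def joint_params(n):
    return 2 ** n - 1

def bn_params(n, k):
    return n * 2 ** k   # at most 2^k parent configurations per node

print(joint_params(30), bn_params(30, 3))  # 1073741823 vs 240
```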
d-Separation in BNs
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its
parents
– Formally: (𝑋𝑖 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖) | 𝑃𝑎(𝑋𝑖))
If the variables in the upward closure are given, what other independences hold?
d-Separation in BNs
• 𝐺 encodes local independence assumptions
– 𝑋𝑖 is independent of its non-descendants given its
parents
– Formally: (𝑋𝑖 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐(𝑋𝑖) | 𝑃𝑎(𝑋𝑖))
Does 𝐺 encode other independence assumptions that hold
in every distribution 𝑃 that factorizes over 𝐺?
Devise a procedure to find all independencies in 𝐺
Not Separated: Direct Connection
• If 𝑋 and 𝑌 are directly connected in 𝐺, then no 𝑍 exists for which 𝐼𝑛𝑑(𝑋; 𝑌 | 𝑍) holds in every distribution that factorizes over 𝐺
– Example: 𝑌 is a deterministic function of 𝑋
(Figure: X → Y)
Not Separated: Indirect Connection
• Case 1: Indirect causal effect (𝑋 → 𝑍 → 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 2: Indirect evidential effect (𝑋 ← 𝑍 ← 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 3: Common cause (𝑋 ← 𝑍 → 𝑌): blocked when 𝑍 is observed; active when 𝑍 is not observed
• Case 4: Common effect, v-structure (𝑋 → 𝑍 ← 𝑌): blocked when 𝑍 is not observed; active when 𝑍 (or one of its descendants) is observed
Not Separated: the General Case
• Let 𝐺 be a Bayesian network structure
• Let 𝑋1…𝑋𝑛 be a trail in 𝐺
• Let 𝑬 be a subset of evidence nodes in 𝐺
The trail 𝑋1 – … – 𝑋𝑛 is active given evidence 𝑬 if:
– For every v-structure 𝑋𝑖−1 → 𝑋𝑖 ← 𝑋𝑖+1 along the trail, 𝑋𝑖 or one of its descendants is in 𝑬
– No other node along the trail is in 𝑬
d-Separation
• 𝑿 and 𝒀 are d-separated in 𝐺 given 𝒁, denoted d-sep_G(𝑿; 𝒀 | 𝒁), if there is no active trail between any node 𝑋 ∈ 𝑿 and any node 𝑌 ∈ 𝒀 in 𝐺
• Get all independences from d-separation:
– 𝐼(𝐺) = {(𝑿 ⊥ 𝒀 | 𝒁) : d-sep_G(𝑿; 𝒀 | 𝒁)}
d-Separation Examples
(Figure: an example DAG over A, B, C, D, E, shown three times with different evidence sets)
• d-sep(B, C | D) = no
• d-sep(B, C) = yes
• d-sep(B, C | A, D) = yes
Algorithm for d-Separation
Aim: find all reachable nodes
from X given Z
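A sketch of such a reachability procedure (in the spirit of the Bayes-ball / reachable-nodes algorithm; the dict-of-children graph encoding is an illustrative choice): traverse (node, direction) states, letting evidence block chains and common causes while activating v-structures.

```python
from collections import deque

def reachable(children, x, z):
    """Nodes reachable from x (including x) via an active trail given z.

    children: dict node -> list of children; z: set of observed nodes.
    """
    parents = {n: [] for n in children}
    for n, ch in children.items():
        for c in ch:
            parents[c].append(n)

    # Phase 1: z and its ancestors, needed for the v-structure test.
    ancestors, frontier = set(), set(z)
    while frontier:
        n = frontier.pop()
        if n not in ancestors:
            ancestors.add(n)
            frontier.update(parents[n])

    # Phase 2: BFS over (node, direction) states; 'up' means the trail
    # entered the node from a child, 'down' from a parent.
    visited, result = set(), set()
    queue = deque([(x, 'up')])
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z:
            result.add(node)
        if direction == 'up' and node not in z:
            for p in parents[node]:       # trail continues upward
                queue.append((p, 'up'))
            for c in children[node]:      # or turns downward (common cause)
                queue.append((c, 'down'))
        elif direction == 'down':
            if node not in z:             # causal chain continues downward
                queue.extend((c, 'down') for c in children[node])
            if node in ancestors:         # v-structure activated by evidence
                queue.extend((p, 'up') for p in parents[node])
    return result
```

Then 𝒀 is d-separated from 𝑋 given 𝒁 exactly when no node of 𝒀 is returned. On a student-style network with edges D→G, I→G, I→S, G→L (consistent with the slides' assumption that G and S are independent given I), S is unreachable from D with no evidence, but becomes reachable once G is observed.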
I-Equivalence
• Two graphs 𝐺1 and 𝐺2 are I-equivalent if 𝐼(𝐺1) = 𝐼(𝐺2), i.e., the two graphs encode exactly the same independences.
• 𝐺1 and 𝐺2 have the same skeleton and the same set of immoralities (v-structures) if and only if they are I-equivalent.
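This characterization can be checked directly; a sketch (graphs again encoded as dicts of children lists, an illustrative choice) comparing skeletons and immoralities:

```python
def skeleton(children):
    """The undirected edge set of the DAG."""
    return {frozenset((a, b)) for a, ch in children.items() for b in ch}

def immoralities(children):
    """V-structures a -> c <- b whose tails a, b are non-adjacent."""
    parents = {n: [] for n in children}
    for n, ch in children.items():
        for c in ch:
            parents[c].append(n)
    sk = skeleton(children)
    out = set()
    for c, ps in parents.items():
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in sk:
                    out.add((frozenset((ps[i], ps[j])), c))
    return out

def i_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and immoralities(g1) == immoralities(g2))
```

For example, the chain X → Z → Y and its reversal X ← Z ← Y are I-equivalent (same skeleton, no immoralities), while the v-structure X → Z ← Y is not equivalent to either.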
From Distributions to BNs
• If P factorizes over G, we can derive a rich set
of independence assertions that hold for P by
simply examining G.
• Given a distribution P (typically a complex distribution whose encoded independencies are hard to read off directly), to what extent can we construct a graph G whose independencies are a reasonable surrogate for the independencies in P?
Minimal I-Maps
• A graph G is a minimal I-map for a set of independences I if it is an I-map for I, and if the removal of even a single edge from G renders it not an I-map.
Removal of an edge means additional independences!
Basic idea: for the i-th variable Xi, find a minimal set of parents for Xi among the previous variables X1, …, Xi−1.
Note: different initial orderings may generate different networks.
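The construction can be sketched end-to-end on a toy distribution. Everything below is illustrative: the joint factorizes as P(I)P(S|I)P(G|I) with a made-up P(G|I), the conditional-independence test is a brute-force numeric check on the joint table, and the search tries smaller parent sets first, so it returns minimum-size parent sets (a slightly stronger property than minimality).

```python
from itertools import combinations, product

VALS = {'I': (0, 1), 'S': (0, 1), 'G': (0, 1, 2)}
ORDER = ['I', 'S', 'G']

pI = {0: 0.7, 1: 0.3}
pS = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
pG = {0: {0: 0.3, 1: 0.4, 2: 0.3}, 1: {0: 0.1, 1: 0.2, 2: 0.7}}
JOINT = {(i, s, g): pI[i] * pS[i][s] * pG[i][g]
         for i, s, g in product(*(VALS[v] for v in ORDER))}

def marg(assign):
    """Marginal probability of a partial assignment {var: value}."""
    idx = {v: k for k, v in enumerate(ORDER)}
    return sum(p for a, p in JOINT.items()
               if all(a[idx[v]] == val for v, val in assign.items()))

def ci(x, rest, u):
    """Numeric test of (x ⊥ rest | u) in JOINT."""
    for uv in product(*(VALS[v] for v in u)):
        base = dict(zip(u, uv))
        pu = marg(base)
        if pu == 0:
            continue
        for xv in VALS[x]:
            for rv in product(*(VALS[v] for v in rest)):
                lhs = marg({**base, x: xv, **dict(zip(rest, rv))}) / pu
                rhs = (marg({**base, x: xv}) / pu) * \
                      (marg({**base, **dict(zip(rest, rv))}) / pu)
                if abs(lhs - rhs) > 1e-9:
                    return False
    return True

def minimal_imap(order):
    """Smallest parent set among predecessors for each variable."""
    parents = {}
    for i, x in enumerate(order):
        prev = order[:i]
        done = False
        for size in range(len(prev) + 1):   # smallest sets first
            for u in combinations(prev, size):
                rest = [v for v in prev if v not in u]
                if ci(x, rest, u):
                    parents[x] = set(u)
                    done = True
                    break
            if done:
                break
    return parents
```

With the ordering I, S, G this recovers parents {}, {I}, {I} — i.e., the graph I → S, I → G. A different initial ordering can, as the note above says, produce a different (possibly denser) network.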
Perfect Maps
• A graph G is a perfect map (P-map) for a set of independences I if I(G) = I; likewise, G is a P-map for a distribution P if I(G) = I(P).
• Enumerate all independences in G and P to see whether G is a P-map of P.
• How to find a graph that is a P-map of a distribution? (Please read textbook 3.4.3)
Summary
• Independences can reduce the required
parameters to represent a distribution
• The factorization theorem establishes a mapping between a distribution and a graph
• Minimal I-Maps provide a possible way to find
a graph representation of a distribution
Assignment #2