
Probabilistic Graphical Models

David Sontag

New York University

Lecture 1, January 26, 2012


One of the most exciting advances in machine learning (AI, signal processing, coding, control, ...) in the last few decades

How can we gain global insight based on local observations?

Key idea

1. Represent the world as a collection of random variables X1, ..., Xn with joint distribution p(X1, ..., Xn)

2. Learn the distribution from data

3. Perform "inference": compute conditional distributions p(Xi | X1 = x1, ..., Xm = xm)

Reasoning under uncertainty

As humans, we are continuously making predictions under uncertainty

Classical AI and ML research ignored this phenomenon

Many of the most recent advances in technology are possible because of this new, probabilistic approach

Applications: Deep question answering

Applications: Machine translation

Applications: Speech recognition

Applications: Stereo vision

[Figure: input is two images; output is a disparity map]

Key challenges

1. Represent the world as a collection of random variables X1, ..., Xn with joint distribution p(X1, ..., Xn)
   How does one compactly describe this joint distribution?
   Directed graphical models (Bayesian networks)
   Undirected graphical models (Markov random fields, factor graphs)

2. Learn the distribution from data
   Maximum likelihood estimation. Other estimation methods?
   How much data do we need?
   How much computation does it take?

3. Perform "inference": compute conditional distributions p(Xi | X1 = x1, ..., Xm = xm)

Syllabus overview

We will study Representation, Inference & Learning

First in the simplest case:
  Only discrete variables
  Fully observed models
  Exact inference & learning

Then generalize:
  Continuous variables
  Partially observed data during learning (hidden variables)
  Approximate inference & learning

Learn about algorithms, theory & applications

Logistics

Class webpage: http://cs.nyu.edu/~dsontag/courses/pgm12/
  Sign up for mailing list!
  Draft slides posted before each lecture

Book: Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman, MIT Press (2009)

Office hours: Tuesday 5-6pm and by appointment. 715 Broadway, 12th floor, Room 1204

Grading: problem sets (70%) + final exam (30%)
  Grader is Chris Alberti ([email protected])
  6-7 assignments (every 2 weeks). Both theory and programming
  First homework out today, due Feb. 9 at 5pm
  See collaboration policy on class webpage

Quick review of probability
Reference: Chapter 2 and Appendix A

What are the possible outcomes?
  Coin toss: Ω = {"heads", "tails"}
  Die: Ω = {1, 2, 3, 4, 5, 6}

An event is a subset of outcomes S ⊆ Ω
  Examples for die: {1, 2, 3}, {2, 4, 6}, ...

We measure each event using a probability function

Probability function

Assign a non-negative weight p(ω) to each outcome, such that

\sum_{\omega \in \Omega} p(\omega) = 1

  Coin toss: p("heads") + p("tails") = 1
  Die: p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 1

Probability of an event S ⊆ Ω:

p(S) = \sum_{\omega \in S} p(\omega)

  Example for die: p({2, 4, 6}) = p(2) + p(4) + p(6)

Claim: p(S1 ∪ S2) = p(S1) + p(S2) − p(S1 ∩ S2)
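To make these definitions concrete, here is a minimal Python sketch of the die example; the variable names and the inclusion-exclusion check are ours, not the slides'.

```python
# A finite probability space: outcomes of a fair die.
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1 / 6 for w in omega}  # non-negative weights summing to 1

def prob(event):
    """p(S) = sum of p(w) over w in S, for an event S (a subset of outcomes)."""
    return sum(p[w] for w in event)

S1, S2 = {1, 2, 3}, {2, 4, 6}
# The claim above: p(S1 ∪ S2) = p(S1) + p(S2) - p(S1 ∩ S2)
assert abs(prob(S1 | S2) - (prob(S1) + prob(S2) - prob(S1 & S2))) < 1e-12
```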

Independence of events

Two events S1, S2 are independent if

p(S1 ∩ S2) = p(S1) p(S2)

Conditional probability

Let S1, S2 be events with p(S2) > 0. Then

p(S_1 \mid S_2) = \frac{p(S_1 \cap S_2)}{p(S_2)}

Claim 1: \sum_{\omega \in S} p(\omega \mid S) = 1

Claim 2: If S1 and S2 are independent, then p(S1 | S2) = p(S1)

Two important rules

1. Chain rule: Let S1, ..., Sn be events with p(Si) > 0.

p(S_1 \cap S_2 \cap \cdots \cap S_n) = p(S_1)\, p(S_2 \mid S_1) \cdots p(S_n \mid S_1, \ldots, S_{n-1})

2. Bayes' rule: Let S1, S2 be events with p(S1) > 0 and p(S2) > 0.

p(S_1 \mid S_2) = \frac{p(S_1 \cap S_2)}{p(S_2)} = \frac{p(S_2 \mid S_1)\, p(S_1)}{p(S_2)}
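Continuing the die sketch from above, a quick numeric check of Bayes' rule; the particular events chosen here are ours.

```python
# Reuses omega, p, and prob() from the earlier sketch.
S1, S2 = {2, 4, 6}, {4, 5, 6}  # "even" and "greater than 3"

def cond(a, b):
    """p(A | B) = p(A ∩ B) / p(B), assuming p(B) > 0."""
    return prob(a & b) / prob(b)

# Bayes' rule: p(S1 | S2) = p(S2 | S1) p(S1) / p(S2)
assert abs(cond(S1, S2) - cond(S2, S1) * prob(S1) / prob(S2)) < 1e-12
```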

Discrete random variables

Often each outcome corresponds to a setting of various attributes (e.g., "age", "gender", "hasPneumonia", "hasDiabetes")

A random variable X is a mapping X : Ω → D
  D is some set (e.g., the integers)
  Induces a partition of all outcomes Ω

For some x ∈ D, we say

p(X = x) = p({ω ∈ Ω : X(ω) = x})

"probability that variable X assumes state x"

Notation: Val(X) = set D of all values assumed by X (we will interchangeably call these the "values" or "states" of variable X)

p(X) is a distribution: \sum_{x \in \mathrm{Val}(X)} p(X = x) = 1
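One more line of the die sketch shows a random variable as a function of outcomes; the parity variable below is our own illustration.

```python
# X maps each die outcome to its parity, so Val(X) = {0, 1}.
X = lambda w: w % 2

def p_X(x):
    """p(X = x) = p({w in omega : X(w) = x})."""
    return prob({w for w in omega if X(w) == x})

assert abs(p_X(0) + p_X(1) - 1) < 1e-12  # p(X) is a distribution
```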

Multivariate distributions

Instead of one random variable, we have a random vector

X(ω) = [X1(ω), ..., Xn(ω)]

Xi = xi is an event. The joint distribution

p(X1 = x1, ..., Xn = xn)

is simply defined as p(X1 = x1 ∩ · · · ∩ Xn = xn)

We will often write p(x1, ..., xn) instead of p(X1 = x1, ..., Xn = xn)

Conditioning, chain rule, Bayes' rule, etc. all apply

Working with random variables

For example, the conditional distribution

p(X_1 \mid X_2 = x_2) = \frac{p(X_1, X_2 = x_2)}{p(X_2 = x_2)}

This notation means

p(X_1 = x_1 \mid X_2 = x_2) = \frac{p(X_1 = x_1, X_2 = x_2)}{p(X_2 = x_2)} \quad \forall x_1 \in \mathrm{Val}(X_1)

Two random variables are independent, X1 ⊥ X2, if

p(X1 = x1, X2 = x2) = p(X1 = x1) p(X2 = x2)

for all values x1 ∈ Val(X1) and x2 ∈ Val(X2).

Example

Consider three binary-valued random variables X1, X2, X3 with Val(Xi) = {0, 1}

Let the outcome space Ω be the cross-product of their states:

Ω = Val(X1) × Val(X2) × Val(X3)

Xi(ω) is the value for Xi in the assignment ω ∈ Ω

Specify p(ω) for each outcome ω ∈ Ω by a big table:

  x1  x2  x3  p(x1, x2, x3)
  0   0   0   0.11
  0   0   1   0.02
  ...
  1   1   1   0.05

How many parameters do we need to specify? 2^3 − 1
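A sketch of such a table in Python; apart from the two entries shown on the slide (0.11 and 0.02), the probabilities below are invented so that the table normalizes.

```python
import itertools

# Full joint table over three binary variables: 2**3 entries, 2**3 - 1 free
# parameters (the last is fixed by normalization). Values are illustrative.
probs = [0.11, 0.02, 0.08, 0.09, 0.14, 0.21, 0.30, 0.05]
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))
assert abs(sum(joint.values()) - 1) < 1e-12
```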

Marginalization

Suppose X and Y are random variables with distribution p(X, Y)
  X: Intelligence, Val(X) = {"Very High" (vh), "High" (h)}
  Y: Grade, Val(Y) = {"a", "b"}

Joint distribution specified by:

          X = vh   X = h
  Y = a    0.7     0.15
  Y = b    0.1     0.05

p(Y = a) = ? = 0.7 + 0.15 = 0.85

More generally, suppose we have a joint distribution p(X1, ..., Xn). Then

p(X_i = x_i) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_n} p(x_1, \ldots, x_n)
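A NumPy sketch of this marginalization on the table above; the row/column layout is our choice.

```python
import numpy as np

# Rows: Y in {a, b}; columns: X in {vh, h}.
p_xy = np.array([[0.7, 0.15],
                 [0.1, 0.05]])
p_y = p_xy.sum(axis=1)              # sum out X
assert abs(p_y[0] - 0.85) < 1e-12   # p(Y = a)
```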

Conditioning

Suppose X and Y are random variables with distribution p(X, Y)
  X: Intelligence, Val(X) = {"Very High" (vh), "High" (h)}
  Y: Grade, Val(Y) = {"a", "b"}

          X = vh   X = h
  Y = a    0.7     0.15
  Y = b    0.1     0.05

Can compute the conditional probability

p(Y = a \mid X = vh) = \frac{p(Y = a, X = vh)}{p(X = vh)}
                     = \frac{p(Y = a, X = vh)}{p(Y = a, X = vh) + p(Y = b, X = vh)}
                     = \frac{0.7}{0.7 + 0.1} = 0.875
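And the corresponding conditioning step, continuing the NumPy sketch:

```python
# Condition on X = vh (column 0): renormalize that column of the joint.
p_y_given_vh = p_xy[:, 0] / p_xy[:, 0].sum()
assert abs(p_y_given_vh[0] - 0.875) < 1e-12  # p(Y = a | X = vh)
```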

Example: Medical diagnosis

Variable for each symptom (e.g. "fever", "cough", "fast breathing", "shaking", "nausea", "vomiting")

Variable for each disease (e.g. "pneumonia", "flu", "common cold", "bronchitis", "tuberculosis")

Diagnosis is performed by inference in the model:

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings

Representing the distribution

Naively, we could represent a multivariate distribution with a table of probabilities for each outcome (assignment)

How many outcomes are there in QMR-DT? 2^4600

Estimation of the joint distribution would require a huge amount of data

Inference of conditional probabilities, e.g.

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

would require summing over exponentially many variables' values

Moreover, this defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations

Structure through independence

If X1, ..., Xn are independent, then

p(x_1, \ldots, x_n) = p(x_1)\, p(x_2) \cdots p(x_n)

2^n entries can be described by just n numbers (if |Val(Xi)| = 2)!

However, this is not a very useful model: observing a variable Xi cannot influence our predictions of Xj

If X1, ..., Xn are conditionally independent given Y, denoted as Xi ⊥ X−i | Y, then

p(y, x_1, \ldots, x_n) = p(y)\, p(x_1 \mid y) \prod_{i=2}^{n} p(x_i \mid x_1, \ldots, x_{i-1}, y) = p(y)\, p(x_1 \mid y) \prod_{i=2}^{n} p(x_i \mid y)

This is a simple, yet powerful, model
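The savings are easy to quantify; a quick count for binary variables (n = 20 is our arbitrary choice):

```python
# Free parameters needed to specify each model over n binary variables.
n = 20
full_joint  = 2**n - 1   # one parameter per assignment, minus normalization
independent = n          # one p(Xi = 1) per variable
cond_indep  = 1 + 2 * n  # p(Y = 1), plus p(Xi = 1 | y) for y in {0, 1}
print(full_joint, independent, cond_indep)  # 1048575 20 41
```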

Example: naive Bayes for classification

Classify e-mails as spam (Y = 1) or not spam (Y = 0)
  Let 1 : n index the words in our vocabulary (e.g., English)
  Xi = 1 if word i appears in an e-mail, and 0 otherwise
  E-mails are drawn according to some distribution p(Y, X1, ..., Xn)

Suppose that the words are conditionally independent given Y. Then,

p(y, x_1, \ldots, x_n) = p(y) \prod_{i=1}^{n} p(x_i \mid y)

Estimate the model with maximum likelihood. Predict with:

p(Y = 1 \mid x_1, \ldots, x_n) = \frac{p(Y = 1) \prod_{i=1}^{n} p(x_i \mid Y = 1)}{\sum_{y \in \{0, 1\}} p(Y = y) \prod_{i=1}^{n} p(x_i \mid Y = y)}

Are the independence assumptions made here reasonable?

Philosophy: Nearly all probabilistic models are "wrong", but many are nonetheless useful
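A sketch of the prediction rule for a toy three-word vocabulary; the parameter values below are invented for illustration, whereas a real model would estimate them by maximum likelihood from labeled e-mails.

```python
import numpy as np

p_y1 = 0.3                                   # p(Y = 1), assumed
p_word = {0: np.array([0.01, 0.10, 0.05]),   # p(Xi = 1 | Y = 0)
          1: np.array([0.20, 0.60, 0.02])}   # p(Xi = 1 | Y = 1)

def p_spam(x):
    """p(Y = 1 | x1, ..., xn) for a 0/1 word-indicator vector x."""
    x = np.asarray(x)
    def lik(y):
        q = p_word[y]
        return np.prod(np.where(x == 1, q, 1 - q))
    num = p_y1 * lik(1)
    return num / (num + (1 - p_y1) * lik(0))

print(p_spam([1, 1, 0]))  # posterior probability that this e-mail is spam
```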

Bayesian networks
Reference: Chapter 3

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1. One node i ∈ V for each random variable Xi
2. One conditional probability distribution (CPD) per node, p(xi | x_Pa(i)), specifying the variable's probability conditioned on its parents' values

Corresponds 1-1 with a particular factorization of the joint distribution:

p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})

Powerful framework for designing algorithms to perform probability computations

Example

Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with edges D → G, I → G, I → S, and G → L, and CPDs:

  p(D):  d0 = 0.6, d1 = 0.4
  p(I):  i0 = 0.7, i1 = 0.3

  p(S | I):     s0     s1
    i0          0.95   0.05
    i1          0.2    0.8

  p(L | G):     l0     l1
    g1          0.1    0.9
    g2          0.4    0.6
    g3          0.99   0.01

  p(G | I, D):  g1     g2     g3
    i0, d0      0.3    0.4    0.3
    i0, d1      0.05   0.25   0.7
    i1, d0      0.9    0.08   0.02
    i1, d1      0.5    0.3    0.2

What is its joint distribution?

p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
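With the CPTs transcribed (the dictionary layout below is ours), the probability of any full assignment is just a product of CPD entries:

```python
# Student network CPDs; g in {1, 2, 3}, all other variables in {0, 1}.
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],   # key: (i, d)
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}
p_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                        # key: i
p_l = {1: [0.1, 0.9], 2: [0.4, 0.6], 3: [0.99, 0.01]}         # key: g

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g - 1] * p_s[i][s] * p_l[g][l]

print(joint(d=0, i=1, g=1, s=1, l=1))  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9
```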

More examples

[Figures: the naive Bayes network, with label Y the parent of features X1, X2, X3, ..., Xn; and the medical diagnosis network, with diseases d1, ..., dn the parents of findings f1, ..., fm]

Evidence is denoted by shading in a node

Can interpret a Bayesian network as a generative process. For example, to generate an e-mail, we

1. Decide whether it is spam or not spam, by sampling y ∼ p(Y)
2. For each word i = 1 to n, sample xi ∼ p(Xi | Y = y)
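This generative view is directly executable; a sketch reusing the invented naive Bayes parameters (p_y1 and p_word) from the earlier spam example:

```python
import numpy as np
rng = np.random.default_rng(0)

def sample_email():
    y = int(rng.random() < p_y1)                 # 1. y ~ p(Y)
    x = (rng.random(3) < p_word[y]).astype(int)  # 2. xi ~ p(Xi | Y = y)
    return y, x

print(sample_email())
```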

Bayesian network structure implies conditional independencies!

[Figure: the student network (Difficulty, Intelligence, Grade, SAT, Letter) from the earlier example]

The joint distribution corresponding to the above BN factors as

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

However, by the chain rule, any distribution can be written as

p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | d, i, g, s)

Thus, we are assuming the following additional independencies: D ⊥ I;  S ⊥ {D, G} | I;  L ⊥ {I, D, S} | G. What else?

Bayesian network structure implies conditional independencies!

Generalizing the above arguments, we obtain that a variable is independent from its non-descendants given its parents

Common parent (A ← B → C) – fixing B decouples A and C

Cascade (A → B → C) – knowing B decouples A and C

V-structure (A → C ← B) – knowing C couples A and B

This important phenomenon is called explaining away, and is what makes Bayesian networks so powerful

A simple justification (for common parent)

We'll show that p(A, C | B) = p(A | B) p(C | B) for any distribution p(A, B, C) that factors according to this graph structure (A ← B → C), i.e.

p(A, B, C) = p(B) p(A | B) p(C | B)

Proof.

p(A, C \mid B) = \frac{p(A, B, C)}{p(B)} = \frac{p(B)\, p(A \mid B)\, p(C \mid B)}{p(B)} = p(A \mid B)\, p(C \mid B)

D-separation ("directed separated") in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when the variables Y are observed:

[Figures: the four canonical three-node paths between X and Z through Y – the two cascades, the common parent, and the v-structure – with Y observed]

If there is no such path, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
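One standard way to implement this test is the reduction to ordinary graph separation: X ⊥ Z | Y holds iff X and Z are disconnected, avoiding Y, in the moralized ancestral graph of X ∪ Z ∪ Y. A sketch (the graph encoding and function names are ours):

```python
from collections import deque

def d_separated(parents, X, Z, Y):
    """True iff X ⊥ Z | Y in the DAG given by the parents dict."""
    # 1. Keep only X ∪ Z ∪ Y and their ancestors.
    relevant, stack = set(), list(X | Z | Y)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: drop edge directions and marry co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for a in ps:
            adj[v].add(a); adj[a].add(v)
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # 3. d-separated iff no undirected path from X to Z avoids Y.
    seen, q = set(X), deque(X)
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in seen and u not in Y:
                seen.add(u); q.append(u)
    return not (seen & Z)

# Student network: D and S are independent a priori (the v-structure at G
# blocks the path), but observing the letter L activates it.
parents = {'G': ['D', 'I'], 'S': ['I'], 'L': ['G']}
print(d_separated(parents, {'D'}, {'S'}, set()))   # True
print(d_separated(parents, {'D'}, {'S'}, {'L'}))   # False
```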

D-separation example 1

[Figure: a Bayesian network over X1, ..., X6]

D-separation example 2

[Figure: a second query on the same network over X1, ..., X6]

Summary

Bayesian networks are given by (G, P) where P is specified as a set of local conditional probability distributions associated with G's nodes

One interpretation of a BN is as a generative model, where variables are sampled in topological order

Local and global independence properties are identifiable via d-separation criteria

Computing the probability of any assignment is obtained by multiplying CPDs

Bayes' rule is used to compute conditional probabilities

Marginalization or inference is often computationally difficult