
Lecture Notes
Probabilistic Graphical Models
Fall 2017
January 8, 2018

Johannes Textor, Perry Groot, Marcos Bueno


Contents

Preface

1 Introduction
  1.1 Hello, World!
  1.2 Optimal Bayes
  1.3 Naive Bayes
  1.4 Bayesian Networks
  Problems for This Chapter

2 Conditional Independence
  2.1 Conditional Independence
  2.2 d-Separation
  2.3 Testing Conditional Independence
  Problems for This Chapter

3 Managing Network Complexity
  3.1 Recap: Interpretation of Bayesian Networks
  3.2 Manageable Networks: Tricks of the Trade
  3.3 Parametric Representations
  Problems for This Chapter

4 Exact Inference
  4.1 Exact Inference is NP-hard
  4.2 Elimination Orderings
  Problems for This Chapter

5 Approximate Inference
  5.1 Factor Graphs
  5.2 Message Passing on Factor Graphs
  5.3 Loopy Message Passing
  Problems for This Chapter

6 Structural Equation Models
  6.1 Gaussian Distributions
  6.2 Structural Equation Models (SEMs)
  6.3 Testing SEMs
  Problems for This Chapter

7 Latent Variables
  7.1 Latent Variables
  7.2 Implied Covariance Matrices
  7.3 Estimating Latent Variables
  7.4 Examples
  Problems for This Chapter

8 Markov Equivalence
  8.1 Faithfulness
  8.2 Markov Equivalence
  8.3 The IC Algorithm
  Problems for This Chapter

9 Structure Learning
  9.1 Testing Conditional Independence
  9.2 The PC Algorithm
  Problems for This Chapter

10 Causality
  10.1 Motivation
  10.2 Covariate Adjustment
  10.3 Instrumental Variables
  Problems for This Chapter

Appendix: Solutions


Preface

Probabilistic graphical models (PGMs) are powerful, expressive, and intuitive tools for knowledge representation, reasoning under uncertainty, inference, prediction, and classification. There are different kinds of PGMs, each with their own strengths and weaknesses. This lecture will start out from the perhaps most commonly known kind: directed PGMs, also known as Bayesian networks. From there, we will explore other kinds of PGMs such as Gaussian PGMs, undirected PGMs, PGMs with latent variables, and ancestral PGMs. I hope that a student who has successfully participated in this course will throughout their later career remember these valuable tools and recognize situations in which they will be applicable.

The history of Bayesian networks dates back to the groundbreaking work of Judea Pearl and others in the late 1980s, for which Pearl was given the Turing Award in 2012. Since then, Bayesian networks have evolved to become key parts of the data scientist's toolbox and are used in many application domains, notably medicine and molecular biology. This course will cover the necessary theory to understand, build, and work with Bayesian networks. Practical work will focus on implementing Bayesian networks in real-world domains.

This course builds upon the past "Bayesian networks" courses given at Radboud University Nijmegen by Peter Lucas, who is now at Leiden University. The LaTeX templates used for this course were originally developed by Till Tantau, University of Lübeck – who is also well known as the author of the beamer and TikZ packages. I thank Till for these very useful packages and templates, but also, even more importantly, for his inspirational teaching methods.


1-1

Lecture 1: Introduction

Today’stopic

Today’stopic

Welcome to this course and to the world of probabilistic graphical models (PGMs)! We hope that you will thoroughly enjoy our journey, which will start off with the venerable subject of Bayesian networks, take us to neighbouring lands such as undirected networks and ancestral graphs, and even bring us to a final frontier of today's science – the intricate and fascinating subject of Causality, currently one of the most important domains of use for Bayesian networks and other PGMs.

This first lecture starts off gently by showing some real-world applications of PGMs from different application domains. Then, we will get down to the core of things by explaining how a directed PGM, or Bayesian network, represents a joint probability distribution – a key fundamental property, and the technical foundation for the inference algorithms and practical applications that we will discover later.

1-2 Chapter Objectives

1. Get to know each other!
2. Know what a Bayesian network looks like.
3. Understand what a Bayesian network represents.

Chapter Contents
1.1 Hello, World!
1.2 Optimal Bayes
1.3 Naive Bayes
1.4 Bayesian Networks
Problems for This Chapter

1.1 Hello, World!

1-4 Introducing Ourselves

– Johannes Textor – postdoc at Tumor Immunology, RUMC – [email protected] – or Blackboard
– Perry Groot – postdoc at iCIS, student advisor Computing Sciences & Information Sciences – [email protected]
– Marcos Bueno – PhD student at iCIS, currently absent, may be involved later – [email protected]

We are all active researchers in the field of probabilistic graphical models and their applications.

1-5 Introducing the Course

– Lectures: You will be asked questions during lectures!
– Exercises: I may sometimes use the second hour of a lecture for an exercise. There will also be a few supervised exercise blocks.
– Assignment I: Implement a Bayesian network for a real-world domain.
– Assignment II: Learn a Bayesian network from data.

You can use a language of your choice.


Assessment
– Exam: 50%
– Assignment I: 25%
– Assignment II: 25%

1-6 Introducing the Resources

– Blackboard: Official communication channel between you and us. Handing in assignments. At blackboard.ru.nl
– OpenCourseWare: Everything else (slides, assignments, literature, schedule, ...). At ocw.cs.ru.nl/NWI-IMC012/

Literature: Korb & Nicholson, Bayesian Artificial Intelligence

1-7 Bayesian Networks

– Bayesian networks are models of reality.
– Bayesian networks are models of knowledge.
– Bayesian networks are models of probability distributions.

Visually, Bayesian networks are graphs consisting of nodes and edges.

1-8 What Constitutes a Liberal Democracy?

This famous study by Bollen constructs a so-called latent variable model in which the health of political liberties in a society, a conceptual variable that cannot be observed (represented by an ellipse), is estimated through several measurable indicator variables (represented by squares). Sussman, Gastil and Banks refer to other established methods for measuring degrees of liberal democracy.

Bollen, American Journal of Political Science 37(4):1207–30, 1993


1-9 How Does Schizophrenic Disease Unfold?

SAN: Social Anxiety; AIS: Active Isolation; AFF: Affective Flattening; SUS: Suspiciousness; EGC: Egocentrism; FTW: Living in a Fantasy World; ALN: Alienation; APA: Apathy; HOS: Hostility; CDR: Cognitive Derailment; PER: Perceptual Aberrations; DET: Delusional Thinking.

This is a graphical model of causal influences between various symptomatic indicators of schizophrenic disease. Each node represents a psychological concept, which is measured using a questionnaire. The edges of the model are labelled with numbers indicating the relative strengths of the causal effects. These numbers are estimated from the given model structure as well as a dataset of patient questionnaire responses. An important difference from the previous model is that no distinction between measured and unmeasured variables is made.

van Kampen, European Psychiatry 29(7):437–48, 2014


1-10 Is There a Benefit of Replaced Teeth?

This so-called "causal diagram" was drawn to help interpret the results of a study on the effect of tooth loss on later mortality, in which the implicit question was whether replacing someone's teeth would be beneficial for their later life expectancy. As with all observational studies, there are a number of confounding factors (red) which all might partly explain any observed correlation between tooth loss and mortality. In the presence of such confounding factors, it is important to adjust for confounding, and causal diagrams can be used to inform the researcher which variables they should adjust for. Again, this graph makes no distinction between observed or unobserved variables, and its edges do not carry any numbers, as only the presence or absence of an effect, but not its strength, is modeled.

Polzer et al, Clinical Oral Investigations 16(2):333–351, 2012

1-11 Which Genes Are Relevant in Alzheimer's Disease?

This directed probabilistic graphical model has not been hand-constructed like the previous three, but rather has been created by a computer algorithm from a data set containing several hundred Alzheimer patients. The researchers wanted to understand which gene networks might be involved in the development of the disease. The real network contains several hundred nodes and only some parts are depicted here. The researchers also summarised several network nodes in what they called "modules" (squares). Line thickness depicts influence strength.

Zhang et al, Cell 153(3):702–720, 2013

1-12 Applications of Probabilistic Graphical Models

Once built, PGMs can be used for many purposes. Here are some of them:

– Classification, e.g. for diagnosis
– Prediction, e.g. the result of an operation
– Planning, e.g. an optimal schedule
– Reasoning, e.g. answering "what-if" queries

Today we’ll focus on classification, and introduce three approaches.

1.2 Optimal Bayes

1-13 Example: Diagnosing Lung Cancer

A patient walks into a doctor's office with breathing problems. The doctor wishes to diagnose whether the patient might have lung cancer. The doctor knows that the following variables are relevant for her diagnosis:

Variable                      Domain
Recent visit to Asia          {yes, no}
X-ray shows shadow on lung    {yes, no}
Pollution exposure            {weak, medium, strong}
Patient smokes                {yes, no}
Patient has bronchitis        {yes, no}
Patient has tuberculosis      {yes, no}

1-14 Some Notation: Probabilities

We consider (discrete) random variables (e.g. X, Y) that can take on certain values (e.g. X ∈ {x1, x2}; Y ∈ {y1, y2, y3}).

Symbol       Meaning
P(X = x1)    Probability of the event X = x1
P(x)         Probability density function of X
P(x, y)      Joint probability density of X and Y
P(x | y)     Conditional probability density of X given Y

Important identities:

∑_a P(a) = ∑_a P(a | b) = 1
P(a) = ∑_b P(a, b)
P(a, b) = P(a | b) P(b)
P(a, b, c) = P(a | b, c) P(b | c) P(c)

1-15 Exercise
Prove Bayes' theorem:

P(h | e) = P(e | h) P(h) / P(e)


1-16 Formalizing the Classification Task
Given the evidence e, choose the "best" hypothesis h from the hypothesis space H. If we read "best" as "most likely", then this means

argmax_{h∈H} P(h | e)
  (Bayes' rule)  = argmax_{h∈H} P(e | h) P(h) / P(e)
                 = argmax_{h∈H} P(e | h) P(h)
  (by def'n)     = argmax_{h∈H} P(e, h).

All we need to perform this classification is the joint probability distribution P(e, h).
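To make this rule concrete, here is a minimal R sketch with a made-up joint table P(e, h) over one binary evidence variable and one binary hypothesis (the numbers are illustrative only, not from the lecture):

# Hypothetical joint distribution P(e, h) over binary e and h.
joint <- data.frame(
  e = c(0, 0, 1, 1),
  h = c(0, 1, 0, 1),
  p = c(0.3, 0.2, 0.1, 0.4)
)

# MAP classification: given observed evidence e, pick the h that maximizes P(e, h).
map_classify <- function(e_obs) {
  rows <- joint[joint$e == e_obs, ]
  rows$h[which.max(rows$p)]
}

map_classify(1)  # returns 1, since P(e=1, h=1) = 0.4 > P(e=1, h=0) = 0.1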

1-17 Joint Probability Distributions
One way to represent a (discrete) joint probability distribution is by using a table.

asia  poll'n  smoke  TB  bronch  xRay  dyspn  cancer  P
yes   low     no     no  yes     no    no     no      .01
no    low     no     no  yes     no    no     no      .012
yes   med     no     no  yes     no    no     no      .011
no    med     no     no  yes     no    no     no      .009
yes   high    no     no  yes     no    no     no      .02
no    high    no     no  yes     no    no     no      .015
...

1-18 Exercise
How many parameters (= table rows) are needed to describe the joint probability distribution of the trip-to-Asia example?

[Figure: the Asia Bayesian network, with nodes Asia, TB, XRay, Pollution, Smoker, Cancer, Bronchitis, Dyspnoea]

Reminder: "pollution" is ternary, the others are binary.

1-19 Optimal Bayes
Application of Bayes' theorem leads to the maximum-a-posteriori (MAP) classifier

argmax_{h∈H} P(e, h).

In some sense (the 0-1 loss function), this is an optimal classifier – it can't be outperformed by any other classifier.

But: the number of parameters of this classifier is exponential in the number of variables – we allow that everything is linked to everything else.

[Figure: the Asia network variables (Asia, TB, XRay, Pollution, Smoker, Cancer, Bronchitis, Dyspnoea)]


1.3 Naive Bayes

1-20 A Simplistic Classifier

By specifying a probability for every combination of evidence data, we take an extreme approach: everything depends on everything.

The other extreme approach is to assume that everything is (conditionally) independent. That is, for n evidence variables E_1, ..., E_n, we assume that

P(e_1, ..., e_n | h) = ∏_i P(e_i | h).

This allows us to rewrite the MAP classifier as

argmax_{h∈H} P(h) ∏_i P(e_i | h),

which is called the naive Bayes classifier.

1-21 Exercise
How many parameters do we need for the naive Bayes classifier

argmax_{h∈H} P(h) ∏_i P(e_i | h)

with our example?

[Figure: the Asia network variables (Asia, TB, XRay, Pollution, Smoker, Cancer, Bronchitis, Dyspnoea)]

1-22 Naive Bayes

In practice, we take the logarithm to arrive at

argmax_{h∈H} log P(h) + ∑_i log P(e_i | h),

which avoids numerical underflow. The probabilities P(e_i | h) can be estimated from training data by simple counting.
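As a minimal sketch (with made-up binary training data, not part of the course materials), the counting estimates and the log-space decision rule could look like this in R:

set.seed(1)
# Hypothetical training data: binary hypothesis h and two binary evidence variables.
train <- data.frame(
  h  = rbinom(200, 1, 0.5),
  e1 = rbinom(200, 1, 0.6),
  e2 = rbinom(200, 1, 0.3)
)

# Estimate P(h) and P(e_i | h) by simple counting.
p_h <- table(train$h) / nrow(train)
p_e_given_h <- lapply(c("e1", "e2"), function(v)
  prop.table(table(train[[v]], train$h), margin = 2))
names(p_e_given_h) <- c("e1", "e2")

# Classify a new case by maximizing log P(h) + sum_i log P(e_i | h).
naive_bayes <- function(e1, e2) {
  scores <- sapply(c("0", "1"), function(h)
    log(p_h[h]) +
    log(p_e_given_h$e1[as.character(e1), h]) +
    log(p_e_given_h$e2[as.character(e2), h]))
  as.integer(names(which.max(scores)))
}

naive_bayes(1, 0)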

Advantages of the naive Bayes classifier
– Easy to train
– Easy to implement
– Easy to evaluate
– Often quite hard to beat!

However, the assumption of complete (conditional) independence of all evidence is often too extreme.


1.4 Bayesian Networks

1-23 Graphs and DAGs

Bayesian networks interpolate between optimal and naive Bayes. They specify the dependencies between the involved variables in terms of a directed acyclic graph (DAG).

– A graph consists of nodes (vertices) and edges (arrows).

[Figure: an example DAG with nodes Z, A, D, E]

– We describe node relations using kinship terminology.
  – Z is a parent of E, D.
  – D is a child of Z, E.
  – D is a descendant of A.
  – A is an ancestor of D.

By convention, each node is a "trivial" ancestor and descendant of itself. A DAG is a graph that contains no directed cycle, i.e., no directed path from a node back to itself (e.g., X → Y → Z → X).

1-24 Bayesian Networks

Definition
Given a set of variables V = {X_1, ..., X_n}, their probability density P(x_1, ..., x_n), and a DAG G = (V, E) (whose nodes are the variables), let pa_G(x_i) denote the set of all parents of X_i in G. Then P is said to factorize according to G if

P(x_1, ..., x_n) = ∏_i P(x_i | pa_G(x_i)).

[Figure: h with children e_1, e_2, e_3, e_4]

P(e_1, e_2, e_3, e_4, h) = P(h) P(e_1 | h) P(e_2 | h) P(e_3 | h) P(e_4 | h)

[Figure: a DAG over a, b, c, d]

P(a, b, c, d) = P(a) P(b | a) P(c | a) P(d | a, b)

1-25 Exercise
How many parameters do we need to represent this Bayesian network?

[Figure: the Asia Bayesian network (Asia, TB, XRay, Pollution, Smoker, Cancer, Bronchitis, Dyspnoea)]


1-26 Querying Marginal Probabilities

Having built a Bayesian network, we can now perform inference – we can ask various types of queries. For example, suppose we are interested in the marginal probability P(Cancer).

[Figure: the Asia Bayesian network (Asia, TB, XRay, Pollution, Smoker, Cancer, Bronchitis, Dyspnoea)]

Which variables will we need to determine P(Cancer)? First, recall that

P(c) = ∑_{x,d,b,s,p,a,t} P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s) P(x | t,c) P(d | t,c,b).

1-27 The Ancestral Network

Let's first get rid of the variable X. This turns out to be easy:

∑_x P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s) P(x | t,c) P(d | t,c,b)
  = P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s) P(d | t,c,b).

In fact, by going "bottom-up" through the network, we can get rid of all variables that are not ancestors of C. In each step, we remove a term in which the summed-out variable occurs only before the "|".

[Figure: the Asia network with nodes abbreviated A, T, X, P, S, C, B, D]

P(c) = ∑_{x,d,b,s,p,a,t} P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s) P(x | t,c) P(d | t,c,b)
     = ∑_{d,b,s,p,a,t} P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s) P(d | t,c,b)
     = ∑_{b,s,p,a,t} P(a) P(p) P(s) P(t | a) P(c | p,s) P(b | s)
     = ∑_{b,s,p,a} P(a) P(p) P(s) P(c | p,s) P(b | s)
     = ∑_{b,s,p} P(p) P(s) P(c | p,s) P(b | s)
     = ∑_{s,p} P(p) P(s) P(c | p,s).
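As a small numeric check (with made-up probability tables for Pollution, Smoker, and Cancer; these numbers are not from the lecture), the final expression can be evaluated directly in R:

# Hypothetical marginal tables: ternary Pollution and binary Smoker.
p_pollution <- c(low = 0.5, med = 0.3, high = 0.2)
p_smoker    <- c(no = 0.7, yes = 0.3)

# Hypothetical P(Cancer = yes | Pollution, Smoker); rows = pollution, columns = smoker.
p_cancer_given <- matrix(
  c(0.01, 0.05,
    0.02, 0.10,
    0.05, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(names(p_pollution), names(p_smoker))
)

# P(Cancer = yes) = sum over p and s of P(p) P(s) P(Cancer = yes | p, s).
sum(outer(p_pollution, p_smoker) * p_cancer_given)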


1-28 A General Inference Strategy

This suggests the following general strategy to answer queries of type P(a) and P(a | b):

– Convert conditional to marginal probabilities using the identity P(a | b) = P(a, b) / P(b).
– For each marginal probability, determine the ancestral network and write down its factorization.
– Compute the required probabilities from the factorization.

Although this works, it's not exactly an efficient strategy. We'll return to this point later.

1-29 Bayesian Networks and Causality

The formal definition of Bayesian networks obscures the fact that they are intuitively thought of as causal diagrams, in which arrows point from causes to their effects.

This intuition can also itself be formalized – we will return to this topic at the end of this lecture.

For now, it suffices to say that thinking of what causes what is one way to construct a Bayesian network for a given problem domain.

Chapter Summary

1-30
1. Bayesian networks are graphical models of joint probability distributions.
2. Joint probability distributions can be used for classification and many other inference tasks.
3. There is no better classifier than optimal Bayes.
4. Naive Bayes is worse, but often pretty good.
5. Bayesian networks are the complex middle ground between naive and optimal Bayes.

For Further Reading
[1] Kevin B. Korb and Ann E. Nicholson. Bayesian Artificial Intelligence, Chapters 1 and 2. Chapman & Hall, 2004.

Problems for This Chapter

Problem 1.1 Joint Distributions
Let P be a joint probability distribution for the variables A ∈ {0,1}, B ∈ {0,1}, defined by the following table:

A  B  P(a,b)
0  0  0.3
1  0  0.2
0  1  0.4
1  1  0.1

1. Compute P(A = 0) and P(B = 0).
2. Compute P(A = 0 | B = 0).
3. Compute P(a | B = 0).
4. Using the last results, use Bayes' theorem to compute P(B = 0 | A = 0).


Problem 1.2 Conditional Probabilities
The famous Monty Hall problem is a classic example of how conditional probabilities often defy our intuition. In the Monty Hall game, there are three doors. Behind one of the doors is a car, and behind the other two doors are goats. The player chooses one of the doors; if this is the door with the car behind it, the player wins. The host now opens one of the other doors, showing a goat. The player is now given the choice to either keep the current door, or switch to the other one.
Prove that it is to the player's advantage to switch the door. Hint: Formulate this problem using three variables X, Y, Z, indicating the door chosen by the player (X), the door hiding the car (Y), and the door opened by the host (Z).

Problem 1.3 Evidence-Based Diagnostics
The new company "42andYou" introduces a new, cheap test for a rare disease, which affects only 0.01% of the population. The test is based on a certain genetic mutation. People having the disease have a 99% chance of having the mutation. However, even in those people who do not have the disease, the same mutation can still occur with a 1% probability. People having the mutation have a positive test result with a 99% probability. If the mutation is absent, the test is never positive.
The test competes in the market with another, established test, which is unrelated to the genetics-based test and has a false-positive rate of 1% and a false-negative rate of 1%.

– Draw a Bayesian network with four variables describing the disease, the two tests, and the naturally occurring mutation. Annotate the nodes of the network with the respective probability tables.
– Compute from your Bayesian network the marginal probability for being tested positive by 42andYou.
– Compute from your Bayesian network the probability of actually having the disease, given that you were tested positive by 42andYou.
– Argue why the number computed in the previous exercise could be an underestimation of the real probability to bear the disease if you have taken a test and it was positive.
– Suppose you were tested positive by 42andYou and take the established test in addition. The established test is negative. How likely are you now to have the disease, according to your Bayesian network?

Problem 1.4 Constructing a Bayesian Network
This exercise is taken from Korb & Nicholson, "Bayesian Artificial Intelligence".
Fred is debugging a LISP program. He just typed an expression to the LISP interpreter and now it will not respond to any further typing. He can't see the visual prompt that usually indicates the interpreter is waiting for further input. As far as Fred knows, there are only two situations that could cause the LISP interpreter to stop running: (1) there are problems with the computer hardware; (2) there is a bug in Fred's code. Fred is also running an editor in which he is writing and editing his LISP code; if the hardware is functioning properly, then the text editor should still be running. And if the editor is running, the editor's cursor should be flashing. Additional information is that the hardware is pretty reliable, and is OK about 99% of the time, whereas Fred's LISP code is often buggy, say 40% of the time.

– Build a Bayesian network that describes Fred's situation! (5 or 6 variables should be sufficient.) For the probability tables that do not follow from the description, assign values that appear reasonable to you.
– Compute the marginal probability that the LISP interpreter is running.
– Compute the conditional probability that Fred has a bug in his code given that the interpreter is not running.
– Compute the conditional probability that Fred has a bug in his code given that the interpreter is not running and the editor's cursor is not flashing.


2-1

Lecture 2: Conditional Independence

Today’stopic

Today’stopic

You are now familiar with one way to view Bayesian networks (Bayes nets): as a map of how a probability distribution factorizes. Now we will turn to a different (but equivalent) view: as a map of the conditional independencies that hold in a probability distribution.

The independence map view leads to a method by which we can derive testable implications from a Bayesian network to evaluate its consistency with a given dataset that the network is intended to represent. Ultimately, by automating this testing procedure and automatically "repairing" failed tests, we can construct algorithms that can learn a possible Bayesian network structure from a dataset – a topic that we will return to later on in the lecture.

2-2 Chapter Objectives

1. Understand the concept of conditional independence.
2. Understand and be able to apply the d-separation criterion.
3. Be able to test Bayesian network structure against data.

Chapter Contents
2.1 Conditional Independence
2.2 d-Separation
2.3 Testing Conditional Independence
Problems for This Chapter

2-4 Motivation

– Bayes nets are models of variable relationships in a certain domain.
– Sparse Bayes net models encode certain assumptions about these relationships.
– Incorrect assumptions may lead to incorrect inferences.
– Once a Bayes net is constructed, we can test some of the assumptions it encodes against data.

No free lunch!
Model testing never guarantees a correct model! It can only refute a model, but never prove it.


2.1 Conditional Independence

2-5 Conditional Independence
Two variables X, Y are called independent if

P(x, y) = P(x) P(y).

Two variables X, Y are called conditionally independent given a set of variables Z if

P(x, y | z) = P(x | z) P(y | z).

Equivalently, X and Y are called independent given Z if

P(x | y, z) = P(x | z).

Interpretation: Once we know Z, Y provides us no additional information about X.

Notation
– X ⊥ Y means: X and Y are independent.
– X ⊥ Y | Z means: X and Y are independent given Z.
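As a minimal numeric sketch (with a made-up joint table over three binary variables that was constructed so that X ⊥ Y | Z holds), the definition can be checked directly in R:

# Hypothetical joint distribution P(X, Y, Z) in which X and Y are independent given Z.
joint <- expand.grid(x = 0:1, y = 0:1, z = 0:1)
joint$p <- c(0.28, 0.07, 0.12, 0.03,   # stratum z = 0
             0.06, 0.14, 0.09, 0.21)   # stratum z = 1

# Check P(x, y | z) = P(x | z) P(y | z) for x = 1, y = 1 in the stratum z = 0.
p_z   <- sum(joint$p[joint$z == 0])
p_xyz <- sum(joint$p[joint$x == 1 & joint$y == 1 & joint$z == 0]) / p_z
p_xz  <- sum(joint$p[joint$x == 1 & joint$z == 0]) / p_z
p_yz  <- sum(joint$p[joint$y == 1 & joint$z == 0]) / p_z

all.equal(p_xyz, p_xz * p_yz)  # TRUE: the conditional independence holds in this stratum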

2-6 Consistency
We call a probability density P consistent with G if P factorizes according to G.

[Figure: scatter plots of X versus Y for two example distributions, each shown next to a small graph over X, Y, Z]

It appears that consistency is intimately linked to statistical dependencies.

2-7 Example: The Mediation Model
Take the following Bayes net, known as the "full mediation model":

X → M → Y

The factorization is:

P(x, m, y) = P(x) P(m | x) P(y | m) = P(x | m) P(y | m) P(m).

Therefore,

P(x, y | m) = P(x, y, m) / P(m) = P(x | m) P(y | m),

and this means that X ⊥ Y | M.

2-8 Exercise
The Bayes net X → M → Y "claims" that X and Y are conditionally independent given {M}. Fill in the missing probabilities in the table below such that this claim is violated.

M  X  Y  P
0  0  0
0  0  1
0  1  0
0  1  1  1/8
1  0  0  1/8
1  0  1  1/8
1  1  0  1/8
1  1  1  1/8


2.2 d-Separation

2-9 d-Separation

In the mediation example, we derived a conditional independence by re-arranging the factorization. This always works, but it is tedious. d-Separation is a graphical criterion that allows us to derive all conditional independencies from the Bayes net graph.

Logic of the d-separation criterion
Let G be a DAG and let P be a probability density that factorizes according to G.

– If X and Y are "d-separated" by Z, then X ⊥ Y | Z is guaranteed to hold.
– If X and Y are not "d-separated" by Z, then X ⊥ Y | Z may or may not hold.

In other words, d-separation is a sufficient, but not a necessary criterion.

2-10 Paths in Bayesian Networks

Definition
A path in a Bayes net is a sequence of variables in which each adjacent pair is connected by an edge.

Note that this differs from the classic graph-theoretical concept of a path: we allow moving against arrow directions. X → M ← Y is a path in the Bayes-net sense, but not in the classic sense. By convention, we consider only those paths that contain each variable at most once.

2-11 The 3-Variable Case

Let us consider the simplest four Bayes nets with two unconnected variables.

Independence implication
Network      Name            Unconditional   Conditional
X → M → Y    chain           none            X ⊥ Y | {M}
X ← M ← Y    inverse chain   none            X ⊥ Y | {M}
X ← M → Y    fork            none            X ⊥ Y | {M}
X → M ← Y    collider        X ⊥ Y           none

For a path π = (X, M, Y), we define:

1. If π is a collider X → M ← Y:
   – {} d-separates X and Y
   – {M} d-connects X and Y
2. If π is not a collider:
   – {} d-connects X and Y
   – {M} d-separates X and Y

2-12 Interpretation of the Collider Case

Example
There is no striking correlation between IQ and wealth in the general population. But surveying students on a campus of a private elite university, one might well find a striking inverse correlation – smarter students tend to be poorer. Why could this happen?

[Figure: "Pay tuition (wealth)" → "Go to university" ← "Get scholarship (IQ)"]


2-13 d-Separation for Paths

Consider a Bayes net that consists of a single path π = (X_1, X_2, ..., X_n), and a set Z ⊆ {X_2, ..., X_{n−1}}. We say that Z d-separates X_1 and X_n if:

– π contains a collider X_{i−1} → X_i ← X_{i+1}, where X_i ∉ Z; or
– π contains a non-collider
  – X_{i−1} → X_i → X_{i+1},
  – X_{i−1} ← X_i → X_{i+1}, or
  – X_{i−1} ← X_i ← X_{i+1},
  where X_i ∈ Z.

2-14 Exercise
Take the Bayes net below.

[Figure: a Bayes net over the nodes A, B, C, D, E, F]

– Give two sets Z that d-separate A and F.
– Give a set Z containing D that d-separates A and F.

2-15 d-Separation for Paths in Graphs

An additional rule is required for colliders that appear in larger graphs. In this Bayes net, we know that conditioning on M can render X and Y dependent.

[Figure: X → M ← Y, with D a descendant of M]

But: conditioning on D can also render X and Y dependent. To see this, consider that D could contain similar information about X and Y as M itself does. Then conditioning on D or M would have quite similar effects.

2-16 The Full d-Separation Criterion

Consider a Bayes net G with variables V = {V_1, ..., V_n}. We say that Z ⊆ V \ {V_i, V_j} d-separates V_i and V_j in G if every path π = (V_i, V_{k_1}, ..., V_{k_m}, V_j), m ≥ 0, satisfies one of the following:

– π contains a collider X_{i−1} → X_i ← X_{i+1} such that X_i is not an ancestor of any node in Z; or
– π contains a non-collider
  – X_{i−1} → X_i → X_{i+1},
  – X_{i−1} ← X_i → X_{i+1}, or
  – X_{i−1} ← X_i ← X_{i+1},
  where X_i ∈ Z.

Theorem (Verma & Pearl, 1984)
If Z d-separates X and Y in a DAG G, then X ⊥ Y | Z in every probability density P that factorizes according to G.

2-17 d-Separation for Sets

We can extend the definition of d-separation from single variables X and Y to sets X and Y. Consider a Bayes net G with variables V = {V_1, ..., V_n}. Given three pairwise disjoint sets X, Y, Z ⊆ V, we say that Z d-separates X and Y if for all X ∈ X, Y ∈ Y, Z d-separates X and Y. (In short, this works exactly like you would expect.)
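For small examples, d-separation claims can also be checked mechanically. A minimal sketch using the dagitty R package (an assumption here; the lecture itself does not prescribe any particular tool) for the chain and the collider:

# install.packages("dagitty")  # assumed to be available
library(dagitty)

chain    <- dagitty("dag { X -> M -> Y }")
collider <- dagitty("dag { X -> M <- Y }")

dseparated(chain,    "X", "Y", list())  # FALSE: X and Y are d-connected marginally
dseparated(chain,    "X", "Y", "M")     # TRUE:  {M} d-separates X and Y
dseparated(collider, "X", "Y", list())  # TRUE:  the empty set d-separates X and Y
dseparated(collider, "X", "Y", "M")     # FALSE: conditioning on the collider d-connects them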


2-18 The Ancestral Graph

The d in d-separation stands for "directed". But surprisingly, the somewhat intricate d-separation criterion can be quite elegantly reduced to standard separation in undirected graphs (à la max-flow-min-cut). We need two definitions for this.

Definition
Given a DAG G = (V, E) and a variable subset Z ⊆ V, the ancestral graph G_a(Z) is obtained as follows: delete all variables except those that are ancestors of any variable in Z.

[Figure: a DAG G over the nodes A, T, H, L, R, B, D, X, M, and its ancestral graph G_a({H,B}) over the nodes A, T, H, L, R, B]

2-19 The Moral Graph

Definition
Given a DAG G, the moral graph G_m is the undirected graph created as follows:

1. Connect all variables that have common children with an undirected edge.
   (This ensures no extramarital children, hence moral graph.)
2. Replace all directed edges by undirected edges between the same nodes.

[Figure: the DAG G over A, T, H, L, R, B and its moral graph G_m]

Theorem (Lauritzen & Spiegelhalter)
A set Z d-separates X and Y in the DAG G if and only if Z separates X and Y in the ancestral moral graph (G_a({X,Y} ∪ Z))_m.

2.3 Testing Conditional Independence

2-20 Discovering Model Misspecifications

Failed d-separation implications inform us about errors in the model structure. For instance, in the example below, our Bayes net fails to take into account an important variable U1. How could we detect such a mistake?

[Figure: the assumed model over X, M1, M2, Y, and the true model, which additionally contains U1]

assumed model              true model
M1 ⊥ M2 | X                M1 ⊥ M2 | X
X ⊥ Y | {M1, M2}


2-21 Generating Data from a True Model

Let us generate some binary data that follows the "true" Bayes net structure. We use a small simulation in R for this.

The exact rationale behind this simulation will be explained in a later lecture, when we will discuss how Bayesian networks can also be viewed as models of data-generating processes.

set.seed(123)
# A utility function that converts log-odds to probabilities.
odds2p <- function(o) exp(o)/(exp(o)+1)
n <- 10000

# X does not depend on anything.
X <- rbinom(n,1,.5)
# X=1 increases the odds of U1=1 and M2=1.
U1 <- rbinom(n,1,odds2p(4*X-2))
M2 <- rbinom(n,1,odds2p(4*X-2))
# U1=1 increases the odds of M1=1.
M1 <- rbinom(n,1,odds2p(2*U1-1))
# U1=1 and M2=1 both increase the odds of Y=1.
Y <- rbinom(n,1,odds2p(2*U1+2*M2))

[Figure: the true model over X, M1, M2, Y, U1]

2-22 Testing the Assumed Model (I)

Let us start with the first implied independence: M1 ⊥ M2 | X. A general strategy for testing this is performing standard independence tests between M1 and M2 in every stratum (for every value) of X, and combining the results.

chisq.test( M1[X==0], M2[X==0] )

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  M1[X == 0] and M2[X == 0]
## X-squared = 0.66084, df = 1, p-value = 0.4163

chisq.test( M1[X==1], M2[X==1] )

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  M1[X == 1] and M2[X == 1]
## X-squared = 1.0784, df = 1, p-value = 0.2991

There is no evidence for dependence here.


2-23 Testing the Assumed Model (II)

Now let us use the same approach to test the second implied independence: X ⊥ Y | {M1, M2}.

chisq.test( X[M1==1 & M2==1], Y[M1==1 & M2==1] )

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  X[M1 == 1 & M2 == 1] and Y[M1 == 1 & M2 == 1]
## X-squared = 53.846, df = 1, p-value = 2.169e-13

If a dependence can be found for any stratum of the conditioning variables, the conditional independence is refuted.

Note
Obviously, for large numbers of strata, we will have to perform a large number of tests. This gives rise to both computational (runtime) and statistical issues (multiple testing problem). Those issues will not be covered in detail today.

2-24 Summary of Test Results

– We assumed the following Bayes net:

[Figure: the assumed model over X, M1, M2, Y]

– Using d-separation, we derived two conditional independencies from the net:
  1. M1 ⊥ M2 | X
  2. X ⊥ Y | {M1, M2}
– We could not refute the first independence.
– We did refute the second independence.

Chapter Summary

2-25
1. Bayesian networks put conditional independence constraints on compatible probability distributions.
2. The d-separation criterion allows us to read off these constraints from the graphical model structure.
3. The constraints can be tested statistically.

Problems for This Chapter

Problem 2.1 d-Separation in Paths
Consider again the following Bayesian network from the lecture:

[Figure: the Bayes net over A, B, C, D, E, F from slide 2-14]

1. Which pairs of variables are d-separated by the empty set?
2. Which is the largest set Z of variables for which {D,E} d-separates A from Z?
3. Which pairs of variables are d-separated by the set {B,E}?
4. For each pair of non-adjacent variables, which is the smallest set that d-separates them?


Problem 2.2 d-Separation in DAGs
Consider the following Bayesian network:

[Figure: a DAG over the variables A, B, C, X, Y, Z]

1. For each variable V in the network, find the smallest set Z that d-separates V from all other variables in the network.
2. Suppose you wish to test the network against data, but only the variables A, B, X, and Z can actually be measured. How do you proceed?

Problem 2.3 Model Modifications After Failed Test
Go back to slide 2-24. In general, we do of course not know the "true" model. Propose at least four modifications to the assumed Bayesian network that would "fix" the error we discovered, in the sense that the modified network would no longer imply the conditional independence that we falsified using the simulated data. Would any of your modified networks imply new independencies that were not implied by the original network?

Problem 2.4 Testing d-Separation
Describe an algorithm with polynomial time complexity that tests whether a set of nodes Z d-separates two nodes X and Y in a given DAG G. Do not use moralization, but work directly on the DAG.

Problem 2.5 Difficulty of Testing Conditional Independence
In a certain sense, conditional independence is much harder to test than unconditional independence. Why could this be? Could you think of a way to create a "worst-case" probability distribution in which a conditional dependence is "hidden" to escape a test?


3-1

Lecture 3: Managing Network Complexity

Today’stopic

Today’stopic

We now complete our tour of possible interpretations of Bayesian networks by noting that they can also be seen as maps of data-generating processes, in which each variable's values are generated from the values of its parents and a random variable called the residual. While the functions that generate the variables can in principle be arbitrary, one often chooses to restrict those functions to certain families. We will consider the case of logistic regression, and in the future this path will lead us to Gaussian networks – also known as structural equation models.

Further, we will discuss some "tricks" that are commonly used in practice when encountering a notorious problem: the amount of data is barely ever enough to estimate an entire conditional probability distribution, even if that distribution contains only a few variables.

3-2 Chapter Objectives

1. Understand how Bayesian networks represent data-generating processes.
2. Keep Bayesian networks manageable.
3. Be able to interpret and fit parametric Bayesian networks.

Chapter Contents
3.1 Recap: Interpretation of Bayesian Networks
3.2 Manageable Networks: Tricks of the Trade
3.3 Parametric Representations
Problems for This Chapter

3.1 Recap: Interpretation of Bayesian Networks

3-4 Bayesian Networks as Factorizations

A Bayes net is a graphical representation of a joint probability density as a factorization into conditional PDFs.

Example
Consider the probability density P(A, Z, D, E) with the following Bayes net:

[Figure: a DAG over Z, A, D, E]

P(A, Z, D, E) = P(Z) P(A | Z) P(E | A, Z) P(D | E, Z)


3-5 Bayesian Networks as Independence Maps

Instead of a factorization, we can also characterize the compatible densities by a basis set of conditional independencies. For instance, the following set of statements is a basis set:

X ⊥ non-descendants(X) | parents(X)

Example
[Figure: the same DAG over Z, A, D, E]

D ⊥ A | E, Z
(D is the only node with non-descendants that are not parents)

3-6 Structural Models

The last equivalent view of a Bayes net is as a layout of a data-generating process – a structural model. A structural model consists of a graph G = (V, E) and a set of functions { f_X | X ∈ V } such that, for each variable X:

X := f_X(pa_X, ε_X)

– f_X is any deterministic function.
– pa_X is the set of all parents of X in G.
– ε_X is a random variable.
– All ε_X are mutually independent.

Example
[Figure: the same DAG over Z, A, D, E]

Z := f_Z(ε_Z)
A := f_A(Z, ε_A)
E := f_E(A, Z, ε_E)
D := f_D(Z, E, ε_D)
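As a minimal simulation sketch (the functions f and their coefficients below are made up for illustration, not taken from the lecture), this structural model could be written in R as:

set.seed(42)
n <- 1000

# Mutually independent residuals.
eps_Z <- rnorm(n); eps_A <- rnorm(n); eps_E <- rnorm(n); eps_D <- rnorm(n)

# Each variable is a deterministic function of its parents and its residual.
Z <- eps_Z                       # Z := f_Z(eps_Z)
A <- 0.8 * Z + eps_A             # A := f_A(Z, eps_A)
E <- 0.5 * A - 0.3 * Z + eps_E   # E := f_E(A, Z, eps_E)
D <- 1.2 * Z + 0.7 * E + eps_D   # D := f_D(Z, E, eps_D)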


3.2 Manageable Networks: Tricks of the Trade

3-7 Bayesian Networks Aren't Always Efficient
We motivated Bayesian networks as a way to store a probability density P as many small tables instead of one big table. But we may still encounter scalability issues.
– "Rich" discrete values imply large probability tables: P(Salary | Age, Gender)
– Continuous values cannot be stored in tables at all.
– Tables can still be large for nodes with many parents.

3-8 Binning
Continuous or "rich" discrete variables are often binned into ordinal variables.
– Age ↦ {child, adolescent, adult, middle age, old age}

age     category
0-10    child
11-20   adolescent
21-40   adult
41-60   middle age
>60     old age

– Temperature ↦ {cold, mild, hot}

Binning avoids having to estimate too many parameters, especially if some combinations of values are not observed in the data (zero probabilities).
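As a minimal sketch (using the age boundaries from the table above), binning a continuous variable in R can be done with cut():

age <- c(4, 15, 33, 52, 71)
# Bin a continuous age variable into the ordinal categories from the table above.
age_cat <- cut(age,
               breaks = c(0, 10, 20, 40, 60, Inf),
               labels = c("child", "adolescent", "adult", "middle age", "old age"))
table(age_cat)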

3-9 Pruning: The Markov Blanket
For some use cases of a Bayes net, such as predicting the value of a specific node from the other nodes, it may turn out that not all nodes are actually necessary.

Definition
The Markov blanket ∂X of a node X ∈ V in a DAG G = (V, E) contains the parents of X, the children of X, and the children's other parents.

In the network below, the gray nodes are the Markov blanket of the encircled node X.

[Figure: an example DAG over A, B, D, E, X, F, G, H, I, J with the Markov blanket of X shaded]

Nodes that are not in the Markov blanket of X can be ignored for predicting X.
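As a minimal sketch (using a small hypothetical DAG stored as an edge list, not the network in the figure), the Markov blanket can be computed directly from the definition:

# Hypothetical DAG as an edge list (from -> to).
edges <- data.frame(
  from = c("A", "B", "X", "X", "H"),
  to   = c("X", "X", "D", "E", "E")
)

markov_blanket <- function(v, edges) {
  parents  <- edges$from[edges$to == v]            # parents of v
  children <- edges$to[edges$from == v]            # children of v
  spouses  <- edges$from[edges$to %in% children]   # the children's other parents
  setdiff(unique(c(parents, children, spouses)), v)
}

markov_blanket("X", edges)  # "A" "B" "D" "E" "H"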

3-10 Exercise
Prove that the Markov blanket ∂X d-separates X from all other nodes in G!

3-11 Markov Blanket: Example

[Figure: the Asia network with the Markov blanket of Cancer shaded]

Here we wish to predict the value of the node Cancer from the other nodes. It turns out that, when all information is available, the node Asia is not required for this task, as it is not part of the Markov blanket (gray).


3-12 Divorcing Too Many Parents

Networks in which some nodes have too many parents frequently pose problems. For example, 2^n parameters are necessary to represent a conditional density P(y | x_1, ..., x_n) where all variables are binary.

[Figure: Y with parents X1, X2, X3, X4, X5]
5 + 2^5 = 37 parameters

By divorcing, one creates a new node that is a deterministic combination of some other nodes (often, a logical disjunction). All of the combined nodes' former children now depend on this single node.

[Figure: X1, ..., X5 point into a new node AnyX, which is the only parent of Y]
5 + 2^1 = 7 parameters

3-13 Divorcing: Example

Let's revisit the "Asia" network, where all variables are binary, except for "Pollution", which is ternary. This network requires 26 unique parameters.

[Figure: the Asia network with per-node parameter counts – Asia (1), TB (2), XRay (4), Pollution (2), Smoker (1), Cancer (6), Bronchitis (2), Dyspnoea (8)]

Divorcing "TB" and "Cancer" reduces this to 20 parameters.

[Figure: the divorced network – Asia (1), TB (2), XRay (2), Pollution (2), Smoker (1), Cancer (6), TB/C (0), Bronchitis (2), Dyspnoea (4)]

Node "TB/C" is the disjunction of "TB" and "Cancer".

3.3 Parametric Representations

3-14 Issues With Binning and Divorcing

So far, we have annotated the nodes of Bayesian networks with simple probability tables. To reduce the size of such tables, we have introduced binning and divorcing. But these approaches bring new problems:

– Binning loses information, and relies on (often arbitrary) cut-offs.
– Divorcing loses flexibility and induces new independencies.

A further approach is to replace probability tables by parametric probability distributions (e.g., binomial, negative binomial, geometric, hypergeometric, Poisson, ...).

Such parametric networks are often called directed probabilistic graphical models rather than Bayes nets.


3-15 Example: The Logit Model
For n parent variables,

[Figure: Y with parents X1, ..., X5]
5 + 2^5 parameters

For 0 < p < 1, define

logit(p) = log( p / (1 − p) ).

Then

log [ P(Y = 1 | x_1, ..., x_n) / (1 − P(Y = 1 | x_1, ..., x_n)) ] = β_0 + β_1 x_1 + ... + β_n x_n.

Note that the X_i do not need to be binary for this model – they can even be continuous! The logit model is also called the logistic regression model.
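As a minimal sketch (with made-up coefficients β, not from the lecture), such a logistic conditional probability can be evaluated in R using the inverse logit function plogis():

# Hypothetical coefficients (beta0, beta1, beta2) for P(Y = 1 | x1, x2).
beta <- c(-1.0, 2.0, 0.5)

# P(Y = 1 | x1, x2) = inverse-logit(beta0 + beta1*x1 + beta2*x2).
p_y_given <- function(x1, x2) plogis(beta[1] + beta[2] * x1 + beta[3] * x2)

p_y_given(0, 0)  # about 0.27
p_y_given(1, 1)  # about 0.82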

3-16 Parameter Estimation for Bayesian Networks
Formally, we are replacing the population probability density P by an estimate P_θ with parameters θ. The parameters θ are usually found by maximizing the likelihood for a sample S = (x_1^(1), ..., x_n^(1)), ..., (x_1^(k), ..., x_n^(k)):

L(S | θ) = ∏_i P_θ(X_1 = x_1^(i), ..., X_n = x_n^(i))

Typically, we instead use the log-likelihood

log L(S | θ) = ∑_i log P_θ(X_1 = x_1^(i), ..., X_n = x_n^(i))

By inserting the factorization, we can express this as

log L(S | θ) = ∑_j ∑_i log P_θ(X_j = x_j^(i) | pa_{X_j} = pa_{x_j}^(i))

This means we can estimate θ separately for each node.

3-17 Example: The Sprinkler Network
We observe that the grass is wet. There are two possible reasons: it could have rained, or the sprinkler could have been turned on. Both possible reasons depend on whether it is cloudy on that day.

[Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]

C  P
1  0.5

C  R  P
0  1  0.2
1  1  0.8

C  S  P
0  1  0.5
1  1  0.1

S  R  W  P
0  0  1  0.01
0  1  1  0.9
1  0  1  0.9
1  1  1  0.99

3-18 Estimating the Sprinkler Network Parameters
In R, we can find the parameters of a logistic regression model by maximum likelihood as follows:


d <- read.csv("sprinkler.csv")
m.cloudy <- glm( cloudy~1, d, family="binomial" )
m.sprinkler <- glm( sprinkler~cloudy, d, family="binomial" )
m.rain <- glm( rain~cloudy, d, family="binomial" )
m.wetgrass <- glm( wetgrass~sprinkler+rain, d, family="binomial" )
coef( m.wetgrass )

## (Intercept)   sprinkler        rain
##   -3.076410    6.815945    5.459369

For each node, we get one intercept value (= base log-odds) and a coefficient for each parent (= log odds-ratio). The coefficient represents the strength of the influence.

3-19 The Fitted Sprinkler Network

[Figure: the fitted sprinkler network, with the estimated intercepts and coefficients annotated at the nodes and edges (e.g., the WetGrass intercept -3.1 and the coefficients 6.8 for Sprinkler and 5.5 for Rain)]

This representation of the parametric network is much more readable than the probability tables in the regular Bayesian network. Several important features are directly discernible. For instance, the marginal probabilities of cloudiness and rain are around 50%, given the base log odds ratios that are close to 0. From the signs of the coefficients, we can readily discern that, for instance, clouded skies make rain more likely but make it less likely that we will turn on the sprinkler. Both the sprinkler and rain vastly increase the probability of having a wet lawn (their coefficients are the biggest, and remember that the scale is logarithmic).

Chapter Summary

3-20
1. We now know three interpretations of Bayesian networks: factorization maps, independence maps, and structural models.
2. These interpretations are different, but equivalent to each other.
3. We learnt some "tricks of the trade" to keep Bayesian network size manageable.
4. Parametric distributions are often used in real-world Bayesian networks instead of raw probability tables.


Problems for This Chapter

Problem 3.1 Divorcing
The following Bayesian network is proposed for a given problem domain. All variables are binary.

[Figure: the original network over X1, X2, Y1, Y2, Y3]

Unfortunately, it turns out that too little data is available to properly estimate the conditional probability distributions for X1 and X2. Therefore, the divorcing technique is used to arrive at the following network, where M is the logical disjunction of its parents, i.e., M = 1 holds if and only if Y1 + Y2 + Y3 > 0.

[Figure: the divorced network with the additional node M]

1. How many parameters need to be inferred for the original network and how many for the divorced network?
2. Construct an explicit probability density that is compatible with the original, but not with the divorced network.

Problem 3.2 Logistic Graphical Model
Some (artificial) data for our beloved Asia network is found at johannes-textor.name/asia.csv. Download this dataset and find logistic regression parameters for the Asia network that best explain this dataset! Use software of your own choice.


4-1

Lecture 4: Exact Inference

Today’stopic

Today’stopic

By now, we have worked our way through various flavours of Bayesian networks, and learnt some ways to keep their complexity manageable. In other words, we are in pretty good shape when it comes to building our own Bayesian networks. Naturally, this means that we now want to know what we can do with such a network once we are finished building it.

As mentioned in the beginning, one of the key tasks in Bayesian networks is inference: deducing probabilities for some variables given some "evidence" (values for some of the other variables). By virtue of Bayes' theorem, this boils down to computing marginal probabilities (which we can then convert into conditional probabilities). Today's lecture starts with some bad news: even when restricted to the apparently simple case of computing a marginal probability for a single variable in an all-binary network, inference is an NP-hard problem. In other words, no efficient (polynomial-runtime) algorithm for exactly solving this problem in all cases can be built using the techniques known today, and perhaps this will never be possible.

Still, this does not mean that inference is infeasible in practice. This lecture will introduce a concept called elimination ordering that we will use in the next lecture to design an inference algorithm that "works well" (in the sense of being fast and accurate) in many relevant scenarios.

4-2 Chapter Objectives

1. Understand why exact inference in Bayesian networks is a hard problem.
2. Know what factor graphs are and how to derive them from Bayesian networks (next week).
3. Be able to apply the message passing algorithm to factor graphs (next week).

Chapter Contents
4.1 Exact Inference is NP-hard
4.2 Elimination Orderings
Problems for This Chapter

4.1 Exact Inference is NP-hard

4-4 Recap: Inference

We said that inference in a Bayes net means computing marginal or conditional probabilities – that is, expressions of the form P(x_i) or P(x_i | e).

We can restrict our attention to marginal probabilities because

P(x_i | e) = P(x_i, e) / P(e).

In this lecture, we first focus on a subclass of inference problems: compute marginal probabilities P(x_i) for a single variable X_i.


4-5 Recap: NP-hardness

Definition
A decision problem is a set S ⊆ ℕ.

Definition
NP is the set of decision problems whose membership function can be computed in polynomial time by a non-deterministic Turing machine.

Reduction
We say that a polynomial-time reduction from P to S exists,

P ≤_P S,

if there is a polynomial-time computable function r such that

x ∈ P ⇔ r(x) ∈ S.

(S can simulate P with polynomial-time overhead.)

4-6 3-SAT

Definition
A decision problem S is called NP-hard if, for every other decision problem P ∈ NP, P ≤_P S.

Boolean satisfiability is the "canonical" NP-hard problem:

SAT = {ϕ | ϕ is a satisfiable Boolean formula}

We have defined a decision problem as a set of numbers, but now we are talking about a set of formulas. This works because we can easily construct a mapping from formulas to natural numbers: write the formula as a string, and consider the resulting string as a single (huge) binary number.

It suffices to consider only formulas of a certain normal form:

3-SAT = {ϕ | ϕ is a satisfiable Boolean formula in 3-CNF}

3-CNF: a conjunction of disjunctions ("clauses") with ≤ 3 literals per clause, e.g.

(X1 ∨ X3) ∧ (¬X1 ∨ ¬X2 ∨ ¬X3) ∧ (¬X1 ∨ X2 ∨ X4)

where (X1 ∨ X3) is a clause and ¬X3 is a literal.

4-7 Inference as a Decision Problem

Our goal is to show that exact inference in Bayes nets is NP-hard. Our first step is to formulate it as a decision problem.

IBN = {(G, x) | G = (V, E, P) is a Bayes net, x ∈ V, P(x) > 0}

We will build a function r that takes as input a 3-CNF formula ϕ and outputs a Bayesian network G_ϕ and a variable (name) x_ϕ such that

ϕ ∈ 3-SAT ⇔ r(ϕ) = (G_ϕ, x_ϕ) ∈ IBN.


4-8 Translating the Formula into a Network

Our network will consist of three layers:

1. The variable layer (one node per variable).
2. The clause layer (one node per clause).
3. The formula layer (one node).

[Figure: nodes X1, X2, X3, X4 (variables), C1, C2, C3 (clauses), and ϕ (formula)]

ϕ = (X1 ∨ X3) ∧ (¬X1 ∨ ¬X2 ∨ ¬X3) ∧ (¬X1 ∨ X2 ∨ X4), with clauses C1, C2, C3

4-9 Connecting Nodes in the Formula Network

– Draw an edge from each variable to all clauses in which that variable appears.
– Draw an edge from each clause to the formula node.

[Figure: the three-layer network for ϕ = (X1 ∨ X3) ∧ (¬X1 ∨ ¬X2 ∨ ¬X3) ∧ (¬X1 ∨ X2 ∨ X4), with edges from each variable to its clauses and from the clauses C1, C2, C3 to ϕ]

4-10 Probability Tables for the Formula Network

– Annotate each variable with probability 0.5.
– Annotate each clause node with a probability table that encodes this clause.
– Annotate the formula node with a probability table that encodes a logical "and".

X1 P        X2 P        X3 P
1  0.5      1  0.5      1  0.5

C1 = X1 ∨ X3
X1 X3 C1 P
0  0  1  0
0  1  1  1
1  0  1  1
1  1  1  1

C2 = ¬X1 ∨ ¬X2 ∨ ¬X3
X1 X2 X3 C2 P
0  0  0  1  1
0  0  1  1  1
0  1  0  1  1
0  1  1  1  1
1  0  0  1  1
1  0  1  1  1
1  1  0  1  1
1  1  1  1  0

ϕ
C1 C2 ϕ P
0  0  1  0
0  1  1  0
1  0  1  0
1  1  1  1

Page 35: Probabilistic Graphical Models - Radboud Universiteit · Probabilistic graphical models (PGMs) ... – AssignmentI ImplementaBayesiannetworkforareal-worlddomain. ... :437–48,2014

4 Exact Inference4.2 Exact Inference is NP-hard 31
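
To make the clause tables above concrete, here is a small R sketch of how such a table can be generated mechanically; it is not part of the original notes, and the helper name clause_cpt is ours.

# Sketch: build the table P(C = 1 | parents) for one clause.
# 'literals' is a named logical vector: names are the parent variables,
# TRUE stands for a positive literal, FALSE for a negated one.
clause_cpt <- function(literals) {
  vars <- names(literals)
  grid <- expand.grid(rep(list(c(0, 1)), length(vars)))  # all parent assignments
  names(grid) <- vars
  satisfied <- apply(grid, 1, function(row) any((row == 1) == literals))
  cbind(grid, P_C_is_1 = as.numeric(satisfied))
}

# Example: the clause (not X1 or not X2 or not X3) from above
clause_cpt(c(X1 = FALSE, X2 = FALSE, X3 = FALSE))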

4-11 Exercise
Draw the network r(ϕ) for the following 2-SAT formula:

ϕ = (X1 ∨ X2) ∧ (X1 ∨ ¬X2)

Compute the associated marginal probability P(ϕ).

4-12 Probability Tables for the Formula Network

Network size for a formula ϕ with n variables and k clauses:

– n variable nodes ⇒ n parameters.
– k clause nodes with ≤ 3 parents each ⇒ ≤ 8k parameters.
– 1 formula node with k parents ⇒ 2^k parameters.

The network still has exponential size!

This is a problem because our reduction needs to be computable in polynomial time, but we cannot output an exponential-size network in polynomial time. Solution: compute the "and" function in k−1 steps instead of a single step. Each step is a node with only 2 parents, i.e. 4 table entries.

[Figure: left, the one-step construction with a single formula node ϕ and n + 8k + 2^k parameters; right, the chained construction with intermediate nodes ϕ1∧2, ϕ(1∧2)∧3, ... and n + 8k + 4(k−1) parameters]

4-13 Completing the Proof

Given a 3-CNF formula ϕ with n variables and k clauses, we can run the following algorithm to determine whether ϕ is satisfiable:

1. Construct the Bayesian network r(ϕ), which has n + k + (k−1) nodes, at most 3k + 2(k−1) edges, and n + 8k + 4(k−1) probability table entries.
2. Determine whether r(ϕ) ∈ IBN.
3. ϕ is satisfiable ⇔ r(ϕ) ∈ IBN.

Therefore, exact inference in Bayesian networks can simulate Boolean satisfiability with polynomial-time overhead:

3-SAT ≤P IBN

Since 3-SAT is NP-hard, this means that IBN is also NP-hard.

4-14 Conclusions

Take-home message from complexity analysis
Even for "sparse" Bayesian networks (≤ 3 parents per node), computing marginal probabilities remains a hard problem – no polynomial-time algorithm is known, and perhaps none exists.

This leaves us with 3 options to approach inference:

– Use brute-force (exponential-time) algorithms.
– Try to find efficient algorithms for restricted subclasses of Bayesian networks.
– Use approximation algorithms.

4.2 Elimination Orderings

4-15 Naive Marginalization

Consider a Markov chain with n binary variables (below, n = 4):

X1 → X2 → X3 → X4

To compute P(xn), we need to sum over all x1, ..., x(n−1):

P(x4) = ∑_{x1,x2,x3} P(x1) P(x2 | x1) P(x3 | x2) P(x4 | x3)

For example,

P(X4 = 1) = P(X1 = 0) P(X2 = 0 | X1 = 0) P(X3 = 0 | X2 = 0) P(X4 = 1 | X3 = 0)
          + P(X1 = 1) P(X2 = 0 | X1 = 1) P(X3 = 0 | X2 = 0) P(X4 = 1 | X3 = 0)
          + P(X1 = 0) P(X2 = 1 | X1 = 0) P(X3 = 0 | X2 = 1) P(X4 = 1 | X3 = 0)
          + P(X1 = 1) P(X2 = 1 | X1 = 1) P(X3 = 0 | X2 = 1) P(X4 = 1 | X3 = 0)
          + P(X1 = 0) P(X2 = 0 | X1 = 0) P(X3 = 1 | X2 = 0) P(X4 = 1 | X3 = 1)
          + P(X1 = 1) P(X2 = 0 | X1 = 1) P(X3 = 1 | X2 = 0) P(X4 = 1 | X3 = 1)
          + P(X1 = 0) P(X2 = 1 | X1 = 0) P(X3 = 1 | X2 = 1) P(X4 = 1 | X3 = 1)
          + P(X1 = 1) P(X2 = 1 | X1 = 1) P(X3 = 1 | X2 = 1) P(X4 = 1 | X3 = 1)

A naive evaluation like this requires adding 2^(n−1) terms!
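
As an illustration (this sketch is ours, not part of the notes), the naive sum can be spelled out directly in R, using the conditional probability tables from the next slide stored as P(child = 1 | parent):

p_x1 <- 0.5              # P(X1 = 1)
p_x2 <- c(0.4, 0.4)      # P(X2 = 1 | X1 = 0), P(X2 = 1 | X1 = 1)
p_x3 <- c(0.8, 0.6)      # P(X3 = 1 | X2 = 0), P(X3 = 1 | X2 = 1)
p_x4 <- c(0.2, 0.5)      # P(X4 = 1 | X3 = 0), P(X4 = 1 | X3 = 1)

bern <- function(p, x) ifelse(x == 1, p, 1 - p)   # P(X = x) given P(X = 1) = p

total <- 0
for (x1 in 0:1) for (x2 in 0:1) for (x3 in 0:1) {  # 2^(n-1) = 8 terms
  total <- total + bern(p_x1, x1) * bern(p_x2[x1 + 1], x2) *
                   bern(p_x3[x2 + 1], x3) * bern(p_x4[x3 + 1], 1)
}
total   # P(X4 = 1)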

4-16 The Distributive Law

By using the distributive law, we can save a lot of work:

P(x4) = ∑_{x3} P(x4 | x3) ( ∑_{x2} P(x3 | x2) ( ∑_{x1} P(x2 | x1) P(x1) ) )

X1 → X2 → X3 → X4, with the probability tables

P(X1 = 1) = 0.5
P(X2 = 1 | X1 = 0) = 0.4,  P(X2 = 1 | X1 = 1) = 0.4
P(X3 = 1 | X2 = 0) = 0.8,  P(X3 = 1 | X2 = 1) = 0.6
P(X4 = 1 | X3 = 0) = 0.2,  P(X4 = 1 | X3 = 1) = 0.5

Consider the innermost term:

∑_{x1} P(x2 | x1) P(x1) = P(x2 | X1 = 0) P(X1 = 0) + P(x2 | X1 = 1) P(X1 = 1)

Each summand is a function of X2 and can be written as a vector:

∑_{x1} P(x2 | x1) P(x1) = 0.5 (0.4, 0.6)_X2 + 0.5 (0.4, 0.6)_X2 = (0.4, 0.6)_X2

4-17 The Distributive Law

Now let us go one step further:

∑_{x2} P(x3 | x2) ( ∑_{x1} P(x2 | x1) P(x1) )

Here P(x3 | x2) is a function of X2 and X3, and the inner sum is a function of X2 only. If we represent P(x3 | x2) as a matrix

P(x3 | x2) = ( P(X3 = 0 | X2 = 0)  P(X3 = 0 | X2 = 1) ; P(X3 = 1 | X2 = 0)  P(X3 = 1 | X2 = 1) ),

then we can write the term above as a matrix-vector product:

∑_{x2} P(x3 | x2) ( ∑_{x1} P(x2 | x1) P(x1) ) = (0.2 0.4 ; 0.8 0.6)_{X3,X2} × (0.4, 0.6)_{X2} = (0.32, 0.68)_{X3}

4-18 The Distributive Law

Now let us apply this repeatedly to our Markov chain X1 → X2 → X3 → X4, with the probability tables

P(X1 = 1) = 0.5
P(X2 = 1 | X1 = 0) = 0.4,  P(X2 = 1 | X1 = 1) = 0.4
P(X3 = 1 | X2 = 0) = 0.8,  P(X3 = 1 | X2 = 1) = 0.6
P(X4 = 1 | X3 = 0) = 0.2,  P(X4 = 1 | X3 = 1) = 0.5

We get:

P(x4) = ∑_{x3} P(x4 | x3) ( ∑_{x2} P(x3 | x2) ( ∑_{x1} P(x2 | x1) P(x1) ) )
      = (0.8 0.5 ; 0.2 0.5)_{X4,X3} × ( (0.2 0.4 ; 0.8 0.6)_{X3,X2} × (0.4, 0.6)_{X2} )

This gives us n−1 matrix-vector multiplications instead of 2^n additions.
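
The same computation can be written as two matrix-vector products in R (a sketch of ours; the vector and matrices are exactly the ones appearing in the expression above, with the value for 0 listed first):

v_x2 <- c(0.4, 0.6)                       # innermost sum over x1, as a function of X2
M_x3_x2 <- matrix(c(0.2, 0.4,
                    0.8, 0.6), nrow = 2, byrow = TRUE)   # P(x3 | x2), rows = x3
M_x4_x3 <- matrix(c(0.8, 0.5,
                    0.2, 0.5), nrow = 2, byrow = TRUE)   # P(x4 | x3), rows = x4

v_x3 <- M_x3_x2 %*% v_x2    # (0.32, 0.68)
v_x4 <- M_x4_x3 %*% v_x3    # P(x4) = (0.596, 0.404)
v_x4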

4-19 Elimination Orderings

What did we just do, and why did this work? Let us look again at our expression:

P(x4) = ∑_{x3} P(x4 | x3) ( ∑_{x2} P(x3 | x2) ( ∑_{x1} P(x2 | x1) P(x1) ) )

The innermost sum eliminates X1 and keeps X2, the next sum eliminates X2 and keeps X3, and the outermost sum eliminates X3 and keeps X4.

We ordered our factorization such that in each step, we eliminate one variable and keep a function of only one variable. This requires that the factors be ordered such that, for instance, X1 does not occur "before" the expression ∑_{x1}. Such an ordering of a factorization is called an elimination ordering.

How can we find elimination orderings for networks other than Markov chains?

4-20 Elimination Ordering of a Fork

Let us now look at a slightly different network:

X1 ← X2 → X3 → X4

P(x4) = ∑_{x1,x2,x3} P(x1 | x2) P(x2) P(x3 | x2) P(x4 | x3)
      = ∑_{x3} ( P(x4 | x3) ( ∑_{x2} P(x2) P(x3 | x2) ( ∑_{x1} P(x1 | x2) ) ) )
      = ∑_{x3} ( P(x4 | x3) ( ∑_{x2} P(x2) P(x3 | x2) × (1, 1)_{X2} ) )

Importantly, the term P(x2) P(x3 | x2) can again be represented as a 2×2 matrix with X2 in the columns and X3 in the rows.

4-21 Elimination Orderings with Multiple Parents

X1 → X3 ← X2,  X3 → X4

P(x4) = ∑_{x1,x2,x3} P(x1) P(x2) P(x3 | x1, x2) P(x4 | x3)
      = ∑_{x3} P(x4 | x3) ∑_{x1,x2} P(x1) P(x2) P(x3 | x1, x2)

The inner term cannot be simplified further. It can be written as a product M × v1 × v2, where M is a 3-dimensional matrix and v1 and v2 are column vectors. The result is again a column vector.

Parent complexity
For nodes with k parents, we need to eliminate k−1 variables at the same time. For binary variables, this requires a k-dimensional matrix with 2^k entries.

4-22 Exercise
Find an elimination ordering for P(x5) in the following network:

[Figure: Bayesian network over X1, X2, X3, X4, X5]

4-23 Elimination Orderings and d-Separation

Let us revisit the fork elimination ordering for X1 ← X2 → X3 → X4:

P(x4) = ∑_{x3} ( P(x4 | x3) ( ∑_{x2} P(x2) P(x3 | x2) ( ∑_{x1} P(x1 | x2) ) ) )

The outer sum over x3 uses X4 ⊥ X1, X2 | X3; the sum over x2 uses X3, X4 ⊥ X1 | X2.

In each step, the network is cut into three parts:
– An inner part Xi.
– An outer part Xo.
– A variable X such that P(xo | x, xi) = P(xo | x).

This is only possible if X d-separates Xi and Xo.

4-24 Elimination Orderings and Loops

X1 → X2, X1 → X3, X2 → X4 ← X3

P(x4) = ∑_{x1,x2,x3} P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x3, x2)

No elimination ordering can be found, because there is no single variable that cuts the network into two different parts!

4-25 Trees and Polytrees

I Definition
A tree is an undirected graph in which there is one, and only one, path between any pair of nodes.

[Figure: an undirected graph over X1, ..., X4 that is a tree, and one that is not a tree]

I Definition
The skeleton of a directed graph is the same graph in which all directed edges are replaced by undirected edges.

[Figure: a directed graph over X1, ..., X4 and its skeleton]

I Definition
A polytree is a directed graph whose skeleton is a tree.

4-26 Elimination Orderings in Polytrees

For polytree networks, an efficient elimination ordering is guaranteed to exist: at every node X, we have that

Ch*(X) ⊥ Pa*(X) | X,

where Ch*(X) is the component of the network containing the children of X, and Pa*(X) is the component of the network containing the parents of X.

(There can be no path from an ancestor of X to a descendant of X that does not pass through X.)

Chapter Summary

4-27
1. Exact inference in Bayesian networks is NP-hard.
2. But for some subclasses of Bayesian networks, efficient exact inference algorithms do exist.
3. Elimination orderings are one approach for efficient exact inference, and they work in polytrees.

Problems for This Chapter

Problem 4.1 Elimination Ordering in a Polytree
Find an elimination ordering for P(z) in the following polytree network!

[Figure: polytree network over A, B, C, D, E, F, and Z]

5-1

Lecture 5: Approximate Inference

Today's topic

In this lecture, we will recover somewhat from last lecture's "bad news" and learn about an algorithm that performs inference in Bayesian networks. For a subset of networks, the polytrees, this algorithm has a polynomial runtime and delivers an exact solution. For non-polytree networks, it can still be used, but will then deliver only an approximate solution. Exactly when this solution is useful or close to the true solution is not entirely clear yet, but the algorithm is widely used in practice. The original version of this algorithm is due to Pearl and called "Belief Propagation", but we will explain it here in a different and more general way using so-called factor graphs. The advantage of the factor graph approach is that the same construction can be used to implement many graph-related algorithms, even if they are not related to Bayesian networks. An example is the so-called Viterbi algorithm for inference and training in hidden Markov models.

5-2 Chapter Objectives

1. Know what factor graphs are and how to derive them from Bayesian networks.
2. Be able to apply the message passing algorithm to factor graphs.

Chapter Contents
5.1 Factor Graphs
5.2 Message Passing on Factor Graphs
5.3 Loopy Message Passing
Problems for This Chapter

5.1 Factor Graphs

5-4 Putting the Fun Into Elimination Orderings

Since the previous lecture, we know about elimination orderings and what they are. But let's be honest – they are no fun.

Today, we will learn about a fun, generic algorithm to find elimination orderings. And this same algorithm can be used to solve a whole host of other problems, too!

5-5 Factor Graphs

Suppose we have a function g(x1, ..., xn) that factorizes as

g(x1, ..., xn) = fA(x1) fB(x2) fC(x1, x2, x3) fD(x3, x4) fE(x3, x5)

The factor graph of g contains:

– 1 node for each variable.
– 1 node for each factor.
– An edge between each factor and the variables it contains.

[Figure: factor graph with factor nodes fA, fB, fC, fD, fE and variable nodes x1, ..., x5]

5-6 Exercise
Draw a factor graph for a probability density that factorizes according to the following Bayesian network:

[Figure: Bayesian network over X1, X2, X3, X4, X5]

5-7 Factor Graphs and Elimination Orderings

Consider again our function

g(x1, ..., xn) = fA(x1) fB(x2) fC(x1, x2, x3) fD(x3, x4) fE(x3, x5).

For x1, this has an elimination ordering

g(x1) = fA(x1) ∑_{x2,x3} fB(x2) fC(x1, x2, x3) ( ∑_{x4} fD(x3, x4) ) ( ∑_{x5} fE(x3, x5) )

Like any nested arithmetic expression, this can be drawn as a (rooted) expression tree.

[Figure: expression tree with leaves fA, fB, fC, fD, fE and internal nodes for the sums ∑_{x5}, ∑_{x4}, ∑_{x2,x3} and the products]

5-8 Factor Graphs and Expression Trees

If we compare the expression tree of the elimination ordering for g(x1) with the factor graph rooted at x1, we will note a topological similarity.

[Figure: the expression tree for g(x1) side by side with the factor graph rooted at x1, with variable nodes x1, ..., x5 and factor nodes fA, ..., fE]

In fact, we can turn every rooted (tree) factor graph into an expression tree!

5-9 Elimination Orderings from Factor Graphs

To begin, we first need to orient the tree such that the desired variable becomes the tree's root node.

[Figure: the factor graph redrawn with x3 as the root]

Let's now discuss some replacement rules that we can use to turn this tree into an expression tree!

5-10 Replacement Rules for Variable Nodes

If x is a leaf variable node with parent f, simply remove it.

Otherwise, substitute x by the product of all its child nodes (which need to be processed before!).

[Figure: the two replacement rules for variable nodes]

5-11 Replacement Rules for Factor Nodes

If f is a leaf factor node with parent v, simply keep it as is.

Otherwise, substitute it by a node that first multiplies all children with f, and then marginalizes out all variables except v (written ∑∼{v}).

[Figure: the two replacement rules for factor nodes]

5-12

Step 1: process the leaf variable nodes x4 and x5. These can just be removed.

[Figure: the rooted factor graph before and after removing x4 and x5]

5-13

Step 2: process the leaf factor nodes fB and fA. These can just stay the same.

[Figure: the factor graph is unchanged by this step]

5-14

Step 3: process the interior (non-leaf) factor nodes fD and fE. These have no children left and can be replaced by a sum operator over all variables except x3 (the parent node).

[Figure: fD and fE replaced by ∑∼x3 fD and ∑∼x3 fE]

5-15

Step 4: process the interior variable nodes x1 and x2. They should be replaced by a product of all their children. But both have only one child, so the product is simply that child. Therefore, these nodes can simply be bypassed.

[Figure: x1 and x2 bypassed]

5-16

Step 5: process the factor node fC. It is replaced by a product of all its children and fC, summarized over all variables except x3 (the parent node).

[Figure: fC replaced by ∑∼x3 applied to the product of fC and its children]

5-17

Step 6: process the root variable node x3. It is replaced by a product of all its children.

[Figure: the completed expression tree]

The resulting tree describes an elimination ordering for g(x3)!

g(x3) = ( ∑_{x1,x2} fA(x1) fB(x2) fC(x1, x2, x3) ) ( ∑_{x4} fD(x3, x4) ) ( ∑_{x5} fE(x3, x5) )

5.2 Message Passing on Factor Graphs

5-18 Message Passing

In practice, we will not actually compute the expression trees and the elimination orderings. Instead, we will perform computations directly on the factor graph. To do this, we imagine that each node in the factor graph is a processor with three capabilities:

– Send messages (functions) along edges.
– Receive messages (functions) along edges.
– Process messages by multiplication and addition.

5-19 Revisiting the Markov Chain

Consider again our 4-variable Markov chain from the last lecture, X1 → X2 → X3 → X4, with

P(X1 = 1) = 0.5
P(X2 = 1 | X1 = 0) = 0.4,  P(X2 = 1 | X1 = 1) = 0.4
P(X3 = 1 | X2 = 0) = 0.8,  P(X3 = 1 | X2 = 1) = 0.6
P(X4 = 1 | X3 = 0) = 0.2,  P(X4 = 1 | X3 = 1) = 0.5

The factor graph is also a chain:

f1 — x1 — f2 — x2 — f3 — x3 — f4 — x4

5-20 Message Passing on the Markov Chain

To compute P(x4), we interpret the factor graph as a tree and orient its edges towards the root x4.

f1 → x1 → f2 → x2 → f3 → x3 → f4 → x4

– Messages will be passed along the edge directions.
– Each message is a function of one variable.
– Each edge is associated with one variable.

The Sum-Product Algorithm
The local function at a variable node is the unit function 1. The local function at a factor node is that factor. The message on an edge e = v → w is the product of the local function at v with all messages received on edges → v, summed over all variables except the one associated with e.

5-21 Message Passing on the Markov Chain

We begin with the leaf node f1. This node simply passes on its local function f1(x1) = P(x1): the message (0.5, 0.5) over X1. The variable node x1 multiplies all incoming messages – in this case, there is only one message, which is therefore simply passed along to f2.

5-22 Message Passing on the Markov Chain

The next factor node, f2, multiplies the incoming message with f2(x1, x2) = P(x2 | x1), summarizes the result for x2, and passes it on: the message (0.4, 0.6) over X2. The next variable node, x2, again simply passes on the message.

At f3, we multiply with f3(x2, x3) = P(x3 | x2) and pass on a summary for x3: the message (0.32, 0.68) over X3. At x3, we pass the message through.

5-23 Message Passing on the Markov Chain

Our last step, at f4, is multiplication with f4(x3, x4) = P(x4 | x3) and summarization for x4. We arrive at our final outcome, the marginal distribution P(x4) = (0.596, 0.404).

Practical Consideration
For Bayesian networks, the message transmitted on an edge e will be a marginal distribution of the variable associated with e. This means it will sum to 1. To avoid numerical issues, we can also pass on an arbitrarily scaled version of that function, such as (596, 404) instead of (0.596, 0.404).
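
The forward pass above can be written compactly in R (a sketch of ours, not from the notes); we reuse the matrices from slides 4-17 and 4-18 as the local functions of f3 and f4 and start from the message that x2 sends to f3:

msg_x2_f3 <- c(0.4, 0.6)   # message x2 -> f3 over X2, value for 0 first
f3 <- matrix(c(0.2, 0.4, 0.8, 0.6), nrow = 2, byrow = TRUE)   # P(x3 | x2), rows = x3
f4 <- matrix(c(0.8, 0.5, 0.2, 0.5), nrow = 2, byrow = TRUE)   # P(x4 | x3), rows = x4

# factor-to-variable message: multiply the incoming message by the local factor
# and sum out every variable except the one on the outgoing edge
msg_f3_x3 <- f3 %*% msg_x2_f3   # (0.32, 0.68)
msg_x3_f4 <- msg_f3_x3          # a variable node with one incoming message passes it on
msg_f4_x4 <- f4 %*% msg_x3_f4   # (0.596, 0.404) = P(x4)
msg_f4_x4 / sum(msg_f4_x4)      # rescaling is harmless, as noted above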

5-24 Exercise
Use the message-passing algorithm to compute the marginal probability P(x5) on the following network:

X1 → X2,  X3 → X4,  X2 → X5 ← X4, with the probability tables

P(X1 = 1) = 0.5
P(X2 = 1 | X1 = 0) = 0.4,  P(X2 = 1 | X1 = 1) = 0.4
P(X3 = 1) = 0.8
P(X4 = 1 | X3 = 0) = 0.2,  P(X4 = 1 | X3 = 1) = 0.5
P(X5 = 1 | X2 = 0, X4 = 0) = 0.2,  P(X5 = 1 | X2 = 0, X4 = 1) = 0.4,
P(X5 = 1 | X2 = 1, X4 = 0) = 0.6,  P(X5 = 1 | X2 = 1, X4 = 1) = 0.8

5.3 Loopy Message Passing

5-25 Loopy Factor Graphs

For Bayesian networks that are not polytrees, the factor graph will contain cycles.

[Figure: a non-polytree Bayesian network over X1, ..., X4 and its factor graph with factor nodes f1, ..., f4, which contains a cycle]

This means that we cannot derive an elimination ordering from the factor graph, since it does not describe an expression tree.

5-26 Loopy Message Passing

Loopy message passing is an extension of the message passing algorithm that also works for graphs with loops. It is an approximation algorithm, which is not even guaranteed to converge in most cases, but appears to "work well in practice".

With loopy message passing, messages are passed in both directions along every edge.

[Figure: the loopy factor graph with messages in both directions of every edge]

5-27 Loopy Message Passing

We initialize each message to the unit function 1.

In each step, we compute the message on each edge according to the normal rules of the sum-product algorithm.

At some point, we stop the computation.

For trees, this procedure converges to the correct solution after a fixed number of steps.

[Figure: the loopy factor graph with all messages initialized to 1]

We now give a simple example, in which we assume that all probabilities are 0.5.

5-28 Loopy Message Passing

Let us run the algorithm, assuming that all probabilities are 0.5.

[Figure: the messages on the loopy factor graph before and after the first update step; values such as (.5, .5), (1, 1), and (2, 2) appear on the edges]

5-29 Loopy Message Passing

Let us keep running the algorithm.

[Figure: the messages before and after the next update step]

5-30 Example: Loopy Message Passing

[Figure: the messages before and after a further update step]

5-31 Example: Loopy Message Passing

[Figure: the messages before and after a further update step]

5-32 Example: Loopy Message Passing

[Figure: the messages before and after a further update step]

Chapter Summary

5-33
1. Factor graphs are a useful representation of Bayesian networks (and many other graphical models).
2. For polytree networks, the sum-product algorithm on factor graphs can be used for exact inference.
3. The sum-product algorithm is also often useful for loopy networks, in which case it is an iterative approximation algorithm.

For Further Reading
[1] F. R. Kschischang, B. J. Frey, H.-A. Loeliger: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47(2):498–519, 2001.

Problems for This Chapter

Problem 5.1 Loopy Message Passing
Will the example for loopy message passing that is given in the lecture converge? If so, will it converge to the correct solution?

Problem 5.2 Loopy Message Passing
Show whether or not the following statement holds: "A Bayesian network is a polytree if and only if its factor graph is a tree."

6-1

Lecture 6: Structural Equation Models

Today's topic

This lecture will introduce you to a whole new world – the world of structural equation modeling (SEM). This world has its own tradition, which is almost a century old – it started in 1921, with a seminal paper by the geneticist Sewall Wright that was crisply titled "Correlation and Causation" (Journal of Agricultural Research 20: 557–585). To this day, SEM remains a discipline that is largely unknown to many computer scientists, but it is all the more important in empirical fields, above all in psychology, the social sciences, and econometrics. It may seem surprising, then, that SEM has deep connections to Bayesian networks – indeed, SEMs are Bayesian networks; it's just that they use specific probability distributions to represent relationships between variables.

6-2 Chapter Objectives

1. Understand the relationships between Bayesian networks and (a subset of) Structural Equation Models (SEMs).
2. Apply Bayesian network methodology to SEMs, including model testing.

Chapter Contents
6.1 Gaussian Distributions
6.2 Structural Equation Models (SEMs)
6.3 Testing SEMs
Problems for This Chapter

6.1 Gaussian Distributions

6-4 Continuous Probability Densities

So far, we used discrete random variables and interpreted P(X = x) as "the probability that variable X has the value x".

This interpretation no longer works for continuous variables. Since continuous variables have infinitely many possible values, the probability of a continuous variable attaining any particular one of those values is 0.

Therefore, a different interpretation is needed. In this lecture, we only consider real-valued variables. For that case, P(x) has the following interpretation:

∫_{x0}^{x1} P(x) dx = Pr[x0 ≤ X ≤ x1].

(For even more general settings, we need measure theory, which I'm not going to cover here.)

6-5 Properties of Continuous Densities

All of our favourite properties of discrete densities also hold for continuous densities, if we replace summation by integration:

∫_a P(a) da = ∫_a P(a | b) da = 1
P(a) = ∫_b P(a, b) db
P(a, b) = P(a | b) P(b)
P(a, b, c) = P(a | b, c) P(b | c) P(c)

Two variables X, Y are called conditionally independent given a set of variables Z if

P(x, y | z) = P(x | z) P(y | z).

Equivalently, X and Y are called independent given Z if

P(x | y, z) = P(x | z).

(Yes, that's the same as for discrete variables.)

6-6 The Uniform Density

The simplest example of a continuous probability density is the uniform distribution, defined by

P(x) = 1 / (b − a)  if a ≤ x ≤ b,  and 0 otherwise.

This is, however, not a very interesting distribution, since it rarely occurs in nature.

6-7 Variance and Covariance

For a random variable X, the variance of X is defined by

Var(X) = E((X − E(X))²).

For two random variables X and Y, the covariance is defined by

Cov(X, Y) = E((X − E(X)) (Y − E(Y))).

(Note that Var(X) = Cov(X, X).)

6-8 The Univariate Gaussian Distribution

The Gaussian distribution, also called the normal distribution, is defined by

P(x) = 1 / sqrt(2πσ²) · exp( −(x − µ)² / (2σ²) ),

where µ = E(X) is the mean and σ² = Var(X) is the variance.

If X has a Gaussian distribution, we write X ∼ N(µ, σ²).

[Figure: density curves for µ = 0, σ = 1; µ = 0, σ = 2; and µ = 1, σ = 1]

6-9 Exercise
Below, you see three scatterplots (A, B, C) of samples drawn from three pairs of random variables. Order these plots with respect to: Var(X), Var(Y), Cov(X, Y).

[Figure: three scatterplots A, B, C of (X, Y) samples on axes from −4 to 4]

(For instance, you could write something like "Var(X): A > B > C".)

6-10 The Multivariate Gaussian Distribution

The multivariate Gaussian distribution for a vector X of k variables is defined by

P(x) = 1 / sqrt( (2π)^k |Σ| ) · exp( −(1/2) (x − µ)′ Σ⁻¹ (x − µ) ),

where:

Symbol  Meaning
µ       mean vector
Σ       covariance matrix
|Σ|     determinant of the covariance matrix
Σ⁻¹     inverse of the covariance matrix

6-11 Properties of Gaussian Distributions

Summing and scaling
Let X ∼ N(µ, σ²). Then for a scalar α,

αX ∼ N(αµ, α²σ²).

Let X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²). Then

X + Y ∼ N(µ1 + µ2, σ1² + σ2² + 2 Cov(X, Y)).

(The sum of two Gaussians is also a Gaussian.)

Linear combination property
A vector of random variables (X1, ..., Xn) has a multivariate normal distribution if and only if every linear combination of its elements (e.g., 2X1 + 5X2) has a univariate normal distribution.
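
As a quick illustration (ours, not part of the notes), the summing rule can be checked by simulation in R; the seed, sample size, and covariance value are arbitrary choices:

set.seed(1)
n <- 1e5
x <- rnorm(n)                              # X ~ N(0, 1)
y <- 0.5 * x + rnorm(n, sd = sqrt(0.75))   # Y ~ N(0, 1) with Cov(X, Y) = 0.5
var(x + y)                                 # close to 1 + 1 + 2 * 0.5 = 3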

6-12 Exercise
Let X ∼ N(0, 1), Y ∼ N(0, 1), where Cov(X, Y) = 0. Please compute:

– Var(2X)
– Var(X + X)
– Var(X + Y)
– Var(X + 2Y)

6-13 Conditional Covariances

The conditional covariance is the covariance in the conditional probability distribution. It can be computed using the recursive formula

Cov(X, Y | Z ∪ {W}) = Cov(X, Y | Z) − Cov(X, W | Z) Cov(Y, W | Z) / Cov(W, W | Z)

In the case of a single conditioning variable,

Cov(X, Y | W) = Cov(X, Y) − Cov(X, W) Cov(Y, W) / Var(W)

Note that the conditional covariance does not depend on the specific value of Z! This is different from conditional densities, where, in general, P(x, y | Z = z1) ≠ P(x, y | Z = z2).
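
The single-variable formula is easy to apply directly to a covariance matrix; the following R sketch (ours, with a hypothetical example matrix) does exactly that:

# Cov(X, Y | W) = Cov(X, Y) - Cov(X, W) * Cov(Y, W) / Var(W)
cond_cov <- function(S, x, y, w) S[x, y] - S[x, w] * S[y, w] / S[w, w]

S <- matrix(c(1.0, 0.3, 0.2,
              0.3, 1.0, 0.6,
              0.2, 0.6, 1.0), nrow = 3, byrow = TRUE,
            dimnames = list(c("X", "Y", "W"), c("X", "Y", "W")))

cond_cov(S, "X", "Y", "W")   # 0.3 - 0.2 * 0.6 / 1 = 0.18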

6-14 Exercise
Take the following covariance matrix:

     A    B     X     Y
A  1.0  0.4  0.50  0.70
B  0.4  1.0  0.80  0.70
X  0.5  0.8  1.00  0.65
Y  0.7  0.7  0.65  1.00

Compute the conditional covariance Cov(A, B | X).

6.2 Structural Equation Models (SEMs)

6-15 Structural Equation Models

I Definition
A Structural Equation Model (SEM, SE model) is a Bayesian network in which each variable X is a linear function of its parents and a Gaussian variable εX (called the residual).

It is customary to represent a SEM as a path diagram labelled with the linear coefficients and residual variances.

[Figure: path diagram over Z, A, E, D with coefficients βZA, βZD, βZE, βAE, βED and residual variances σ²εZ, σ²εA, σ²εE, σ²εD]

Z := εZ
A := βZA Z + εA
E := βAE A + βZE Z + εE
D := βED E + βZD Z + εD

Residual means can also be represented, but we will assume here that all means are 0. (Real data is often "centered" to mean 0 anyway, as a preprocessing step.)
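
To make the definition concrete, here is a small R sketch (ours, not from the notes) that simulates data from the path diagram above; the coefficient values are arbitrary illustrative choices, and all residual variances are set to 1:

set.seed(42)
n <- 1000
bZA <- 0.5; bZE <- 0.3; bZD <- 0.2; bAE <- 0.4; bED <- 0.6   # illustrative values

Z <- rnorm(n)                        # Z := eZ
A <- bZA * Z + rnorm(n)              # A := bZA * Z + eA
E <- bAE * A + bZE * Z + rnorm(n)    # E := bAE * A + bZE * Z + eE
D <- bED * E + bZD * Z + rnorm(n)    # D := bED * E + bZD * Z + eD

# each node can then be fitted by regressing it on its parents (see Section 6.2):
coef(lm(D ~ E + Z))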

6-16 Example: Schizophrenic Disease Unfolding

[Figure: path diagram of schizophrenic disease unfolding]

SAN: Social Anxiety; AIS: Active Isolation; AFF: Affective Flattening; SUS: Suspiciousness; EGC: Egocentrism; FTW: Living in a Fantasy World; ALN: Alienation; APA: Apathy; HOS: Hostility; CDR: Cognitive Derailment; PER: Perceptual Aberrations; DET: Delusional Thinking

van Kampen, European Psychiatry 29(7):437–48, 2014

6-17 Multivariate Normality of SEMs

SEMs generate multivariate normal distributions.

– Each variable is a linear combination of residuals. For example, in the path diagram from slide 6-15:

D = βED E + βZD Z + εD
  = βED (βAE A + βZE Z + εE) + βZD εZ + εD
  = βED (βAE (βZA Z + εA) + βZE Z + εE) + βZD εZ + εD
  = βED (βAE (βZA εZ + εA) + βZE εZ + εE) + βZD εZ + εD

– Each linear combination of variables is therefore also a linear combination of residuals.
– Residuals are normally distributed.

6-18 Fitting SEMs Locally

How do we estimate the SEM from data? One option, as for any Bayesian network, is to estimate each node separately. This can be done by running a linear regression of each node on its parents (just like for the logistic regression in Chapter 3).

[Figure: path diagram over read, write, math, and science]

hsb2 <- read.table(header=T, sep=",",
  'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.csv')
mdl <- lm( science ~ read + write + math,
  data=data.frame( scale( hsb2 ) ) )
coef(mdl) # extract path coefficients

## (Intercept)          read         write          math
## -2.027376e-16  3.122533e-01  1.977167e-01  3.018540e-01

summary( mdl )$sigma # extract residual standard deviation

## [1] 0.7125376

6.3 Testing SEMs

6-19 Applying d-Separation to SEMs

Like any Bayesian network, SEMs can be tested by deriving conditional independence relationships. In multivariate Gaussian distributions, conditional independence entails a vanishing conditional covariance (= the conditional covariance is 0).

Model: [path diagram over A, B, C]

Implication: A ⊥ C | B, i.e. Cov(A, C | B) = 0

Data covariance matrix:

     A    B    C
A    1  0.5  0.1
B  0.5    1  0.5
C  0.1  0.5    1

Cov(A, C | B) = Cov(A, C) − Cov(A, B) Cov(C, B) / Var(B) = 0.1 − 0.5² = −0.15

The model does not fit the data!

6-20 Testing for Vanishing Conditional Covariance

In practice, we can also use linear regression to test for a vanishing covariance. Consider the linear regression model

Y = βXY X + βZY Z + εY.

Then

βXY = Cov(X, Y | Z) / Cov(X, X | Z).

Therefore,

βXY = 0 ⇔ Cov(X, Y | Z) = 0.

For example, to test whether Cov(X, Y | Z) = 0, we can run the regression model Y ∼ X + Z and examine the coefficient of X. This is useful because statistical software will often provide confidence intervals and/or tests for regression coefficients.

6-21 Exercise
Look at the following SEM again:

[Figure: SEM over A, B, C, X, Y, Z]

Write down 4 different regression equations to test this SEM (the format can be like Y ∼ X + Z). Which coefficients would need to be 0?

6-22 Example: Testing SEMs Locally

Testing B ← A → C when the "true network" is B ← A → C:

set.seed(1234)
library(dagitty)
g <- dagitty("dag{ A -> B [beta=0.5] ; A -> C [beta=0.5] }")
# Simulate data from SEM
N <- 1000
d <- simulateSEM( g, N=1000 )

# Test whether B _||_ C | A
lm( B ~ C + A, data=d )

##
## Call:
## lm(formula = B ~ C + A, data = d)
##
## Coefficients:
## (Intercept)            C            A
##    -0.01790      0.03632      0.52089

To assess statistical significance, we can look at the confidence intervals:

confint( lm( B ~ C + A, data=d ) )

##                   2.5 %     97.5 %
## (Intercept) -0.07147937 0.03568010
## C           -0.02562048 0.09825195
## A            0.46004334 0.58174176

As expected, the coefficient of C is (close to) 0.

6-23 Testing SEMs Locally

Testing B ← A → C when the "true network" is B → A ← C:

set.seed(1234)
library(dagitty)
g <- dagitty("dag{ A <- B [beta=0.5] ; A <- C [beta=0.5] }")
# Simulate data from SEM
N <- 1000
d <- simulateSEM( g, N=1000 )

# Test whether B _||_ C | A
lm( B ~ C + A, data=d )

##
## Call:
## lm(formula = B ~ C + A, data = d)
##
## Coefficients:
## (Intercept)            C            A
##    -0.01912     -0.29732      0.68128

To assess statistical significance, we look at the confidence intervals again:

confint( lm( B ~ C + A, data=d ) )

##                   2.5 %      97.5 %
## (Intercept) -0.06975809  0.03151936
## C           -0.35598798 -0.23864608
## A            0.62379699  0.73876032

Now, the coefficient of C is no longer 0.

Chapter Summary

6-24
1. SEMs are Bayesian networks in which the nodes and arrows represent linear-Gaussian functions.
2. SEMs can be fitted node by node, and they can be tested using d-separation and linear regression.
3. SEMs are widely used (in certain fields), and mature software packages exist to support them.

Problems for This Chapter

Problem 6.1 Conditional Covariance
Use the recursive formula to compute the conditional covariance Cov(X, Y | A, B) for the covariance matrix shown on slide 6-14.

Problem 6.2
Provide a covariance matrix that can not be generated by the following structural equation model:

[Figure: SEM over X, M, Y, and A]

Assume that the residual variances are set such that all variable variances are equal to 1.

Problem 6.3
Devise a set of regression equations that fully test the model shown in Problem 6.2. Which coefficients should vanish (be equal to zero) in each of the equations?

7-1

Lecture 7: Latent Variables

Today's topic

Until now, we have built our probabilistic graphical models from observed variables – variables that we actually have data on. In many applications, however, the most interesting variables are ones that are not observed. For example, suppose we were building a model of the relationship between success in school and one's wealth in adult life, only to realize later on that our data does not contain any information on the wealth of one's parents, which is certainly an important factor in this context. This would be a variable that we have not measured, although we could have. In other cases, the variable is actually not measurable for substantive reasons. For instance, the measurement of psychological traits like shyness or intelligence is itself the subject of scientific research. It may not be meaningful to treat such variables as "observed".

Therefore, many probabilistic graphical models contain non-observed, or latent, variables. This raises some immediate questions, first and foremost: how do we model the probability distribution of a latent variable, and how do we infer its parameters? This lecture will very briefly delve into this subject and introduce one method to approach these questions – the method of global model fit. We will restrict our attention here to structural equation models, but similar principles apply in other classes of probabilistic graphical models.

7-2 Chapter Objectives

1. Be able to derive implied covariance matrices from SEMs.
2. Understand the concept of global model fit.
3. Understand how latent variables can be incorporated into SEMs.

Chapter Contents
7.1 Latent Variables
7.2 Implied Covariance Matrices
7.3 Estimating Latent Variables
7.4 Examples
Problems for This Chapter

7.1 Latent Variables

7-4 Motivation

Let us consider an example dataset in which students' abilities in different areas, such as reading, writing, and science, were measured. We could hypothesize that more "advanced" skills, such as math, are influenced by more "basic" skills such as reading and writing. We could then draw the diagram below to represent this hypothesis.

[Figure: path diagram over read, write, math, and science]

Implication: read ⊥ write

hsb2 <- read.table(paste0('https://stats.idre.ucla.edu/',
  'wp-content/uploads/2016/02/hsb2-2.csv'),
  header=T, sep=",")
with( hsb2, cor.test( read, write ) )

##
## Pearson's product-moment correlation
##
## data: read and write
## t = 10.465, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4993831 0.6792753
## sample estimates:
##       cor
## 0.5967765

The implication is clearly false! How do we fix this?

7-5 Hidden Common Causes

Possible fixes:

[Figure: the same diagram with an added edge read → write, and with an added edge write → read]

Neither seems plausible – reading and writing are both "basic" skills, and in this study they were measured at the same time. More likely, there is a hidden common cause of the two:

[Figure: the diagram with a bi-directed edge read ↔ write]

Intuitively: another variable influences both reading and writing skills, but we have not measured it. Therefore, we cannot include it in our model.

7-6 Latent Variables

A latent variable is a variable that is not observed/measured.

– Hidden variables could be measured in principle, but are not present in the data, perhaps for cost or ethical reasons. Example: family income.
– Conceptual variables (constructs) are variables that are not directly measurable quantities, and perhaps exist only in theory. Example: intelligence, charisma.

Often, we think that some latent common cause of two variables exists, but we are unable to specify it exactly. We can then represent the latent variable implicitly using a bi-directed arrow. Otherwise, we can depict the latent variable explicitly.

[Figure: implicit representation read ↔ write, and explicit representation with a latent node "pre-school education" pointing to read and write]

7-7 Representing Latent Variables Nonparametrically

How can we represent a variable that is not observed in a Bayesian network? In a general Bayesian network, we can represent some latent variables implicitly by joint probability tables.

Without the latent variable, read and write have separate tables:

read:  P(good) = 0.5, P(poor) = 0.5
write: P(good) = 0.5, P(poor) = 0.5

With the latent common cause represented implicitly, they share a joint table:

read  write  P
good  good   0.3
good  poor   0.2
poor  good   0.2
poor  poor   0.3

7-8 Representing Latent Variables

The representation as joint probability tables only works if the two variables linked by ↔ have no other parents. Here's an example where this approach fails:

X → Y ↔ Z

Y and Z depend on a latent variable, but Y also depends on X. There is no straightforward way to express this in a probability table except a full joint table for X, Y, Z – which renders the network useless.

(One) solution: use parametric representations of latent variables.

7-9 Representing Latent Variables in Structural Equation Models

X → Y → Z, with a bi-directed edge Y ↔ Z (residual covariance σ²YZ):

X = εX
Y = βXY X + εY
Z = βYZ Y + εZ

In a regular SEM, we assume that Cov(εY, εZ) = 0. We can relax this assumption and introduce a new parameter: the residual covariance Cov(εY, εZ) = σ²YZ.

[Figure: the read/write/math/science path diagram with estimated path coefficients and a residual covariance between read and write]

7-10 Estimating Residual Covariances

How can we estimate residual covariances from data? In some simple cases, the residual covariance can be set to the observed covariance between the two variables.

[Figure: the read/write/math/science path diagram, where the residual covariance of read and write can be set to their observed covariance]

In other cases, that does not work. For the model X → Y → Z with Y ↔ Z,

Cov(Y, Z) = βYZ + σ²YZ.

Before showing how we can estimate the residual covariance in such cases, we first need to learn about a new concept: global estimation.

7.2 Implied Covariance Matrices

7-11 Global Estimation of Bayesian Networks

– Bayesian network parameter estimation normally works locally at each node, conditioned on its parents.
– For many classes of parametric models, parameters can also be estimated globally. This means that all parameters are fitted simultaneously to the data, often using a numerical optimization algorithm.
– Global estimation can sometimes be used even if some variables are not observed.
– For structural equation models, global estimation works through the implied covariance matrix.

7-12 Reduction to Residuals

As we saw in the previous lecture, every variable in a SEM can be expressed as a linear combination of residuals. Consider the model X → Y → Z, Y → W, Z → W:

X = εX

Y = βXY X + εY
  = βXY εX + εY

Z = βYZ Y + εZ
  = βYZ βXY εX + βYZ εY + εZ

W = βYW Y + βZW Z + εW
  = βYW βXY εX + βYW εY + βZW βYZ βXY εX + βZW βYZ εY + βZW εZ + εW

7-13 Exercise
Write down Y as a linear combination of residuals in the following SEM:

[Figure: SEM with edges X → A, X → B, A → Y, B → Y and coefficients βXA, βXB, βAY, βBY]

7-14 The Ancestor Rule

In general, we can express each variable X in a SEM S as follows:

X = ∑_{A ∈ An(X)} ∑_{paths p from A to X} λp εA,

where

An(X) = {A | A is an ancestor of X in S}

and

λp = ∏_{U→V on p} βUV.

For example, if p = A → B → C, then λp = βAB βBC.

The ancestor rule
To compute the reduction to residuals r of X:
r := 0
For each ancestor A of X:
    For each directed path p from A to X:
        r := r + (product of all coefficients on p) · εA

7-15 Exercise
Often, structural equation models are applied to "scaled" data, where all variances are 1. In that case, the residual variances are often omitted from the model.

In the following model, what does the value of σ²Z need to be such that the variance of Z is 1?

[Figure: chain X → Y → Z with path coefficients 0.5 and 0.5 and residual variances 1, 1, and σ²Z = ?]

7-16 Implied Covariances

Consider again the model X → Y → Z, Y → W, Z → W. Assuming all variables have mean 0, we can write

Cov(X, Z) = E(XZ)
          = E( εX (βYZ βXY εX + βYZ εY + εZ) )
          = E( εX βYZ βXY εX + εX βYZ εY + εX εZ )
          = E( εX βYZ βXY εX ) + E( εX βYZ εY ) + E( εX εZ )
          = βYZ βXY E(εX εX) + βYZ E(εX εY) + E(εX εZ)
          = βYZ βXY Var(εX) + βYZ Cov(εX, εY) + Cov(εX, εZ)
          = βYZ βXY Var(εX) + 0 + 0

7-17 Exercise
Please determine Cov(Y, Z)!

[Figure: the same model X → Y → Z, Y → W, Z → W with coefficients βXY, βYZ, βYW, βZW]

7-18 The Trek Rule

The trek rule is a general algorithm for computing implied covariances Cov(X, Y) in SEMs. It is based on considering all paths from common ancestors of X and Y to X and Y.

The trek rule
r := 0
For each common ancestor A of X and Y:
    For each path pX from A to X:
        For each path pY from A to Y:
            r := r + (product of all coefficients on pX) · (product of all coefficients on pY) · Var(εA)

The name trek rule is inspired by considering a path pair as a trek up and down a mountain, from X up to the common ancestor A and down again to Y.
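
As a cross-check on the trek rule (this sketch and its coefficient values are ours, not from the notes), the full implied covariance matrix can also be obtained with the standard SEM identity Σ = (I − Bᵀ)⁻¹ Ω (I − Bᵀ)⁻ᵀ, where B[i, j] is the coefficient of the edge i → j and Ω is the covariance matrix of the residuals:

vars <- c("X", "Y", "Z", "W")
B <- matrix(0, 4, 4, dimnames = list(vars, vars))
B["X", "Y"] <- 0.5; B["Y", "Z"] <- 0.4           # illustrative coefficients
B["Y", "W"] <- 0.3; B["Z", "W"] <- 0.2
Omega <- diag(4)                                  # residual variances 1, no residual covariances

inv <- solve(diag(4) - t(B))                      # (I - B^T)^{-1}
Sigma <- inv %*% Omega %*% t(inv)                 # implied covariance matrix

Sigma[1, 3]   # Cov(X, Z) = 0.4 * 0.5 * 1 = 0.2, as the trek rule predicts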

7.3 Estimating Latent Variables

7-19 Estimating Latent Variables

Now let us return to our initial question: how do we estimate the parameters of this model?

X → Y → Z with Y ↔ Z (coefficients βXY, βYZ, residual covariance σ²YZ)

First, we replace the bi-directed arrow by a latent variable U:

X → Y → Z,  U → Y,  U → Z  (coefficients βXY, βYZ, βUY, βUZ)

We now compute the implied covariance matrix of this model with respect to the observed variables X, Y, and Z.

The covariance matrix has 6 unique entries: Var(X), Var(Y), Var(Z), Cov(X, Y), Cov(Y, Z), Cov(X, Z).

And we have 8 parameters: σ²X, σ²Y, σ²Z, σ²U, βXY, βYZ, βUY, βUZ.

7-20 Exercise
Find two parameter combinations for this SEM that generate exactly the same covariance matrix.

[Figure: the model X → Y → Z with latent U → Y, U → Z and coefficients βXY, βYZ, βUY, βUZ]

7-21 Identification

If there are more unknowns (= parameters) than data points (= covariance matrix entries), then the model is not identified. In this case, we need to impose some constraints on the parameters. Typical constraints:

– Var(U) = 1 (we can't know the unit of something we don't measure)
– βUY = βUZ (we can't distinguish between the two effects)

[Figure: the model with Var(U) fixed to 1 and both loadings set equal to βU]

This brings us down to 6 parameters for 6 covariance entries. The resulting model is (just) identified.

7-22 Exercise
Please devise the implied covariance matrix of this model:

[Figure: X → Y → Z with latent U (variance 1) pointing to Y and Z with equal loadings βU, coefficients βXY, βYZ, and residual variances σ²X, σ²Y, σ²Z]

7-23 Remark on Identification

– We have said that models with more parameters than covariance matrix entries are not identified.
– But having no more parameters than covariance matrix entries is not sufficient for having an identified model.

[Figure: a counterexample model X → Y → Z with latent U → Y, U → Z]

– Deciding whether a model is identified or not is a difficult problem, and an active research topic.

7.4 Examples

7-24 Example: Measurement Model

The example below shows how an abstract latent variable (industrialization) can be defined by so-called manifest variables, or measures. In this case, we have three manifest variables that are all assumed to reflect the latent concept. Because a latent concept is not measured, something needs to be done to set its scale – otherwise, the model is not identified, because the same model could always be obtained by multiplying the standard deviation of the latent variable by a constant a and dividing all path coefficients from the latent to the manifest variables by a. Setting the scale can be done in two different ways: either fix the variance of the latent variable to 1, or fix one of the path coefficients to 1. The path coefficients going from the latent variable to the measures are also called loadings.

[Figure: latent variable "Industrialization" with loadings on three manifest variables: gross national product per capita (X1), inanimate energy consumption per capita (X2), and percentage of the labor force in industry (X3)]

Example implementation in R

suppressMessages( library(lavaan) )
mdl <- sem( 'ind =~ x1 + x2 + x3', data=PoliticalDemocracy )
coef( mdl )

##  ind=~x2  ind=~x3   x1~~x1   x2~~x2   x3~~x3 ind~~ind
##    2.193    1.824    0.084    0.108    0.468    0.446

7-25 Example: Structural Model

[Figure: full structural equation model from the lavaan tutorial]

Source: http://lavaan.ugent.be

Chapter Summary

7-26
1. Latent variables are unobserved, but important, variables that affect our observed variables.
2. Because latent variables are not observed, we need to do "smart things" to estimate their parameters.
3. In SEMs, we can estimate latent variables through the implied covariance matrix.

Problems for This Chapter

Problem 7.1 Implied Covariance Matrix, advanced
This exercise is adapted from Mark Gilthorpe, University of Leeds, and used with his kind permission.

The basic "trek rule" to compute covariances in SEMs can be very cumbersome in large SEMs, such as the model below:

[Figure: path diagram over "Area type", "Area size", "Population size", "Nr. doctors", and "Nr. emergencies" with path coefficients 0.4, −0.4, −0.4, −0.3, 0.5, 0.5, and −0.2]

Often, it is assumed that the variables in a SEM are standardized, such that all variances are 1. If that is the case, then the "trek rule" can be simplified for quicker computation. Please read the explanation below of how this simplification works.

Recall that a trek between X and Y is a pair π = (πX, πY) of directed paths πX and πY that both start at the same variable M (called the top of the trek), where πX ends at X and πY ends at Y. For example, (Area type → Area size, Area type → Population size) is a trek between "Area size" and "Population size". In the standard trek rule, we compute the covariance between X and Y by enumerating all treks between X and Y. For each trek, we then compute the path monomial, defined as the product of all coefficients of the edges on the trek and the residual variance at the top of the trek.

Assume that the variables are standardized, such that all variances are 1. Then the trek rule can be changed as follows. A trek π = (πX, πY) is called simple if the only variable that occurs in both πX and πY is the top of the trek. For example, (Area type → Area size, Area type → Population size) is a simple trek. On the other hand, (Area type → Population size → Nr. doctors, Area type → Population size → Nr. emergencies) is a trek, but not a simple trek.

The simple trek rule states that, for standardized SEMs, we can compute the covariance between X and Y by adding the simple path monomials of all simple treks between X and Y. The simple path monomial is defined as the product of all coefficients of the edges on the trek, without the residual variance.

For example, the simple path monomial of the trek (Area type → Area size, Area type → Population size) is 0.4 · −0.4 = −0.16. Because this is the only simple trek between "Area size" and "Population size", the covariance between these variables is −0.16.

This approach has two advantages compared to the standard trek rule: we do not need to compute the residual covariances, and we need to consider (sometimes significantly) fewer treks.

Apply the simple trek rule to compute the covariance matrix of the above SEM!

Problem 7.2 Identification of latent variable models, intermediate
Two of the three SEMs below are identified (the residual covariance can be uniquely estimated), whereas the other is not. Which one is not identified, and why not?

[Figure: three SEMs of the form X → Y → Z, with a residual covariance σ²YZ between Y and Z, σ²XY between X and Y, and σ²XZ between X and Z, respectively]

8-1

Lecture 8: Markov Equivalence

Today's topic

The edges in a Bayesian network are often read as causal effects (and we will formalize this intuition later on). Therefore it may appear unnatural that edges in a network can often be reversed without changing anything – that is, the resulting network can represent exactly the same probability distributions as the unchanged one. This fact is a result of the symmetry of statistical dependencies, whereas causal relationships are by definition asymmetric. Different causal relations can sometimes give rise to the same statistical dependencies. This may seem complicated, but there is a surprisingly simple way to determine when two Bayesian networks are equivalent in this sense, and we will learn about it here.

The issue of network equivalence is also front and center in structure learning, where we derive Bayesian networks from data by means of an algorithm, instead of constructing them manually. If the same data is consistent with many different networks, then structure learning must involve a degree of ambiguity, which we have to deal with somehow. We'll see how this can be done at the end of this lecture, when we introduce our first structure learning algorithm.

8-2 Chapter Objectives

1. Recognize when two Bayesian networks are statistically equivalent.
2. Understand the concept of "faithfulness" to a probability distribution.
3. Understand how Bayesian network structure can be derived from data.

Chapter Contents
8.1 Faithfulness
8.2 Markov Equivalence
8.3 The IC Algorithm
Problems for This Chapter

8-4 8-4Today’s TopicSo far, we’ve worked with Bayesian networks assuming they were built by a human expert.But often, we have no strong theory on how a network should look like.We will now learn about how to derive networks from data algorithmically.However, there are two fundamental limitations to any such approach, which we first needto appreciate:

– Faithfulness– Markov Equivalence

8-5 Structure Learning: Defining Our Goal

The structure learning problem is defined as follows:

input: a probability distribution P.
output: a Bayesian network G such that P is consistent with G.

We have seen before that consistency is fully determined by conditional independence – P is consistent with G if and only if each conditional independence implied by G holds in P.

This means that we can rephrase the structure learning problem as follows:


input: a list of all conditional independence statements that hold in a probability distribution P.
output: a Bayesian network G such that P is consistent with G.

(In reality, we will need to infer the conditional independence statements from data.)

8-6 Exercise
Consider the following example structure learning problem for three variables X, M, Y:

input: X ⊥ Y | M
output: ?

Please fill in the question mark and provide a valid output for the structure learning problem!

8.1 Faithfulness

8-7 Trivial Consistency

Let us make the previous exercise slightly more complicated. We have four variables A, B, C, D.

input: C ⊥ A | B,  D ⊥ A, B | C
output: ?

What is the "laziest" solution we can come up with? Recall the definition:

P is consistent with G if and only if each conditional independence implied by G holds in P.

So what if G does not imply any conditional independence?

[Figure: a fully connected network over A, B, C, D]

This is a valid solution, no matter what the input is!

Structure Learning and Model Complexity

Expanding our previous example, the structure learning problem

input: C ⊥ A | B,  D ⊥ A,B | C
output: ?

is solved by all of the following networks:

[Figure: four different DAGs over A, B, C, D]

Occam's Razor
If there are several possible explanations for a given phenomenon, prefer the simplest explanation.

In our case, the "simplest" (most parsimonious) explanation means that we pick the solution with the fewest edges.


Faithfulness

A Bayesian network G is faithful to a probability distribution P if G implies exactly the set of conditional independences that hold in P.

Faithful structure learning problem

input: a list of all conditional independence statements that hold in a probability distribution P.
output: a Bayesian network G such that P is faithful to G.

Faithfulness is a heuristic – there is often no reason to assume that only "faithful" network structures are interesting! In particular, the true "causal" Bayesian network that has generated P (if it exists) need not necessarily be faithful to P.

However, faithfulness is a useful heuristic because it delivers sparse networks.

Exercise
Please provide a faithful solution for the following structure learning problem for variables A, B, C:

input: A ⊥ C

output: ?

8.2 Markov Equivalence

The Concept of Markov Equivalence

Consider the following two Bayesian networks:

[Figure: two Bayesian networks with skeleton X – M – Y]

Both of these networks imply one, and only one, conditional independence:

X ⊥ Y | M

This means that both networks are consistent with exactly the same probability distributions. Such networks are called Markov equivalent.

If there are several Markov equivalent Bayesian networks G that fulfil an input list of conditional independences, then there is no unique output for the faithful structure learning problem.

Markov Equivalence Classes

The set [G] of all Bayesian networks G′ that are equivalent to some Bayesian network G is called the Markov equivalence class of G. The Markov equivalence class can contain only one network; in those cases, there is in fact a unique faithful solution to the structure learning problem! For example:

[Figure: an equivalence class over X, M, Y that contains only a single network]

But most often, Markov equivalence classes contain several networks.

[Figure: an equivalence class over X, M, Y that contains three networks]


Characterizing Markov Equivalence

The skeleton of a Bayesian network G = (V, E) is the undirected graph Gs = (V, Es) in which Es = {i − j | i → j ∈ E}. (The skeleton is obtained from G by replacing all arrows with undirected edges.)

A v-structure in a Bayesian network G = (V, E) is a set of three variables U, V, W such that U → V ← W is an induced subgraph of G. (Thus, U → V ∈ E, W → V ∈ E, U → W ∉ E, and W → U ∉ E.)

Theorem (Verma and Pearl, 1990)
Two Bayesian networks G = (V, EG) and H = (V, EH) are Markov equivalent if and only if

– G and H have the same skeleton; and
– G and H have the same v-structures.

For the proof, see https://arxiv.org/pdf/1304.1108.pdf
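The theorem translates directly into a small check. The base-R sketch below is our own illustration (not part of the original notes): it represents a DAG as a binary adjacency matrix with a[i, j] = 1 meaning i → j, and compares skeletons and v-structures of two such matrices.

# Minimal sketch: Markov equivalence via "same skeleton + same v-structures".
markov_equivalent <- function(a1, a2) {
  skel1 <- (a1 | t(a1)); skel2 <- (a2 | t(a2))
  if (!identical(skel1, skel2)) return(FALSE)
  vstruct <- function(a) {
    s <- a | t(a); out <- character(0); n <- nrow(a)
    for (v in 1:n) for (u in 1:n) for (w in 1:n)
      if (u < w && a[u, v] && a[w, v] && !s[u, w])   # u -> v <- w, u and w non-adjacent
        out <- c(out, paste(u, "->", v, "<-", w))
    sort(out)
  }
  identical(vstruct(a1), vstruct(a2))
}

# Example with variables 1 = X, 2 = M, 3 = Y:
chain <- matrix(c(0,1,0, 0,0,1, 0,0,0), 3, 3, byrow = TRUE)  # X -> M, M -> Y
fork  <- matrix(c(0,0,0, 1,0,1, 0,0,0), 3, 3, byrow = TRUE)  # M -> X, M -> Y
coll  <- matrix(c(0,1,0, 0,0,0, 0,1,0), 3, 3, byrow = TRUE)  # X -> M, Y -> M
markov_equivalent(chain, fork)   # TRUE: same skeleton, no v-structures
markov_equivalent(chain, coll)   # FALSE: the collider is a v-structure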

Exercise
For each of the following pairs of Bayesian networks, please decide whether or not they are Markov equivalent.

[Figure: four pairs of Bayesian networks over the variables A, B, C, D, E]

Graph Patterns and CPDAGs

A simple way to represent an equivalence class is a mixed graph in which only the edges that are part of v-structures are drawn directed, and the other edges undirected. Such a mixed graph is called a graph pattern.

[Figure: an example Bayesian network over A, B, C, D, E and its graph pattern]

Often, undirected edges in a graph pattern can only be oriented in one direction without creating a new v-structure (which would change the equivalence class). If all such edges are replaced by directed edges, we obtain the completed partial DAG or CPDAG.

[Figure: the same example network and its CPDAG]

A CPDAG is a good way to represent all possible outputs of a faithful structure learning problem.
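For reference, the dagitty R package (used again in later lectures) can compute this representation directly; the graph below is an arbitrary example of ours.

library( dagitty )
g <- dagitty( "dag { A -> C ; B -> C ; C -> D ; D -> E }" )
equivalenceClass( g )   # the CPDAG representing the Markov equivalence class of g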


8.3 The IC Algorithm

Faithful Structure Learning Finalized

We have now learned about two fundamental limitations of the structure learning problem: (un)faithfulness and Markov equivalence. Keeping in mind these limitations, we can now formulate our final version of the problem, which reflects the extent to which we can actually solve it:

Faithful structure learning problem

input: a list of all conditional independence statements that hold in a probability distribution P.
output: a CPDAG G such that P is faithful to all DAGs represented by G.

We will now learn about an algorithm that solves this kind of structure learning problem.

The IC Algorithm

The inferred causation algorithm (Pearl and Verma, 1994) proceeds in two steps. It first infers the skeleton, and then the v-structures.

The IC Algorithm

– Start with a graph containing no edges.
– For each pair of variables X and Y, search for a set Z_XY such that the independence X ⊥ Y | Z_XY is either in the input list, or follows from those in the input list. If no such set exists, link X and Y by an undirected edge X − Y.
– For each pair of variables X and Y that are not linked, but have a common neighbour W (X − W − Y), check whether W ∈ Z_XY. If not, then add arrowheads pointing to W, i.e. X → W ← Y.
– Orient the resulting graph pattern into a CPDAG.

The original paper and correctness proof can be found at: http://ftp.cs.ucla.edu/pub/stat_ser/r156-reprint.pdf
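To make the two phases concrete, here is a minimal base-R sketch of the skeleton and v-structure steps. It is our own illustration rather than the original implementation: the independence oracle indep(x, y, z) and all helper names are hypothetical, and the subset search is brute force.

# Phase 1: find the skeleton and record the separating sets Z_XY.
ic_skeleton <- function(vars, indep) {
  adj <- matrix(TRUE, length(vars), length(vars), dimnames = list(vars, vars))
  diag(adj) <- FALSE
  sepset <- list()
  for (x in vars) for (y in vars) if (x < y) {
    others  <- setdiff(vars, c(x, y))
    subsets <- list(character(0))
    for (k in seq_along(others))
      subsets <- c(subsets, combn(others, k, simplify = FALSE))
    for (z in subsets)
      if (adj[x, y] && indep(x, y, z)) {     # first (smallest) separating set wins
        sepset[[paste(x, y)]] <- z           # remember Z_XY
        adj[x, y] <- adj[y, x] <- FALSE      # and remove the edge
      }
  }
  list(adj = adj, sepset = sepset)
}

# Phase 2: orient X -> W <- Y whenever X - W - Y, X and Y non-adjacent, W not in Z_XY.
ic_vstructures <- function(skel) {
  adj <- skel$adj; vars <- rownames(adj); out <- character(0)
  for (w in vars) for (x in vars) for (y in vars)
    if (x < y && x != w && y != w && adj[x, w] && adj[y, w] && !adj[x, y] &&
        !(w %in% skel$sepset[[paste(x, y)]]))
      out <- c(out, paste(x, "->", w, "<-", y))
  out
}

# Toy oracle encoding the single independence X _||_ Y | M (the chain X -> M -> Y):
indep <- function(x, y, z) setequal(c(x, y), c("X", "Y")) && "M" %in% z
skel  <- ic_skeleton(c("M", "X", "Y"), indep)
skel$adj                 # skeleton: X - M - Y
ic_vstructures(skel)     # character(0): no v-structure, because M is in Z_XY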

Exercise
Apply the IC algorithm to derive a graph pattern for the following set of conditional independences:

– C ⊥ E,D | A,B
– E ⊥ B,C | A
– D ⊥ A,C | B,E

Keep in mind the decomposition property of conditional independence – e.g., C ⊥ E,D | A,B implies both C ⊥ E | A,B and C ⊥ D | A,B.

Chapter Summary

1. Different Bayesian networks can imply exactly the same constraints on their compatible probability distributions.

2. Therefore, it is most often not possible to uniquely identify Bayesian network structures from data.

3. Instead, the IC algorithm can identify equivalence classes of Bayesian networks from data.


Problems for This Chapter

Problem 8.1 Edge orientations, basic
The following mixed graph represents the skeleton and the v-structures of a Bayesian network. Please derive the CPDAG from these graphs by orienting as many additional edges as possible! (That is, add an orientation to those undirected edges whose direction is the same in every network of the equivalence class, even though they are not part of v-structures.)

[Figure: mixed graph over A, B, C, D, E, F]

Problem 8.2 Edge orientations, basic
Derive a CPDAG from the following Bayesian network by "undirecting" all edges that do not have the same direction in all networks of the Markov equivalence class!

[Figure: Bayesian network over A, B, C, D, E, F, Z]

Problem 8.3 Markov Equivalence, basic
For each of the following networks, answer these two questions:

– Which of the arrows in this network can be reversed without changing the set of probability distributions that are consistent with it?
– List all the graphs that are Markov equivalent to this network.

1. [Figure: a network over A, B, C, D]

2. [Figure: a network over B, D, E, Z]

3. [Figure: a network over A, D, E, Z]

Lecture 9: Structure Learning

The last lecture has equipped us with a basic understanding of structure learning – we know how the structure learning problem is defined, what its fundamental limitations are, and we have seen a first basic structure learning algorithm. Now, we are going to tackle some critical practical issues that come up with any kind of structure learning algorithm based on conditional independence constraints. First, in practice, conditional independence statements must somehow be derived from data. We will see in this lecture why that problem is far from trivial. Second, we will discuss some heuristics for the structure learning algorithm from the last lecture, turning it into perhaps the simplest such algorithm that is considered useful – the PC algorithm.

Both of these topics – how do we test conditional independence, and how do we improve structure learning? – are very active research areas. Of course this lecture can do nothing more than scratch the surface and, I hope, attract your interest.

Chapter Objectives

1. Understand the principles and limitations of conditional independence testing for continuous data.

2. Apply structure learning ideas in practice.

3. Efficiently implement the IC algorithm.

Chapter Contents

9.1 Testing Conditional Independence 71

9.2 The PC Algorithm 75

Problems for This Chapter 77

9.1 Testing Conditional Independence

The IC Algorithm

The inferred causation algorithm (Pearl and Verma, 1994) operates on a list of conditional independencies.

But how do we obtain the list of conditional independencies?

Testing Conditional Independence

In practice, we do not have an input list of conditional independencies. Instead, we need to test conditional independence statistically. This is no easy task, and a major barrier to structure learning.

We briefly touched on this topic before, and we know that a statement like X ⊥ Y | Z can be tested by examining the coefficient of Y in the regression

X ∼ Y + Z

However, this was for normally distributed data only.
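As a small illustration, a sketch with simulated Gaussian data (the generating model Z → X, Z → Y and all numbers are made up):

set.seed(1)
z <- rnorm(500)
x <- z + rnorm(500)
y <- z + rnorm(500)
# Coefficient of y (and its p-value) in the regression X ~ Y + Z:
summary(lm(x ~ y + z))$coefficients["y", ]
# A coefficient near 0 / large p-value is consistent with X _||_ Y | Z.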

Testing Independence

We start with the "simplest" case: independence statements of the form X ⊥ Y (thus, Z = ∅). An important issue is that dependence can be complex.

[Figure: three scatterplots of simulated (x, y) data showing linear, monotone, and non-monotone dependence]

linear:
lm( y ~ x )$coef[2]
##     x
## 0.997

monotone:
lm( y ~ x )$coef[2]
##     x
## 0.443

non-monotone:
lm( y ~ x )$coef[2]
##      x
## -0.158

Our linear regression test is designed for linear relations. This can lead to wrong conclusions.

Exercise
Which, if any, of the below statements is correct?

– If X and Y are statistically independent, then there is no linear dependence between them.
– If X and Y are statistically dependent, then there is a linear dependence between them.
– If there is a linear dependence between X and Y, then X and Y are not statistically independent.
– If there is no linear dependence between X and Y, then X and Y are statistically independent.

Modelling Non-Linear Relations

Faced with complex non-linear relationships, one option is to use free-form regression instead of linear regression.

One approach to free-form regression is local polynomial regression, often called LOWESS (locally weighted scatterplot smoothing) or LOESS.

m <- lm( y ~ x )

[Figure: linear fit to non-monotone (x, y) data]

m <- loess( y ~ x )

[Figure: LOESS fit to the same data]

Many functions are locally well approximated by polynomials.

Testing Independence with LOESS

One way to test for dependence is to determine the correlation of the regression predictions E[Y | X] with the actual values Y. This works for both linear and non-linear regression.

m <- lm( y ~ x )
cor( predict( m ), y )
## [1] 0.0948

[Figure: linear fit to the non-monotone data]

m <- loess( y ~ x )
cor( predict( m ), y )
## [1] 0.996

[Figure: LOESS fit to the same data]

We can assess the statistical significance of these correlations using a permutation test.

Testing Independence with LOESS

Permutation test for independence between X and Y

– Determine θ = Cor(E[Y | X], Y).
– Generate n random permutations X^(i) of X.
– Determine θ^(i) = Cor(E[Y | X^(i)], Y).
– Compute the fraction of permutations whose correlation is at least as large as the observed one:

p = |{ i : |θ^(i)| > |θ| }| / n
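The following sketch is our own illustration of this procedure on simulated data; the function name perm_indep_test is hypothetical and not from any package.

perm_indep_test <- function(x, y, n = 100) {
  theta <- cor(predict(loess(y ~ x)), y)          # observed correlation
  theta_perm <- replicate(n, {
    xs <- sample(x)                               # permuting x breaks any dependence
    cor(predict(loess(y ~ xs)), y)
  })
  mean(abs(theta_perm) > abs(theta))              # permutation p-value
}

set.seed(1)
x <- rnorm(200)
y <- x^2 + rnorm(200)        # non-monotone dependence
perm_indep_test(x, y)        # small p-value: dependence is detected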

The histogram plots below show the distribution of correlations found in randomly permuted samples, compared to the correlation in the real sample (red line). With linear regression, the correlations in the permuted samples cluster around the real sample, so we would conclude that the correlation is not that important. With LOESS regression, all of the correlations in the permuted samples are much lower than the correlation in the real sample, so we would conclude that the correlation differs substantially from what would be expected if there were no dependence.

replicate(100,cor(predict(lm(y~sample(x))),y))

[Histogram: correlations in permuted samples under linear regression]

replicate(100,cor(predict(loess(y~sample(x))),y))

[Histogram: correlations in permuted samples under LOESS regression]

Conditional Independence for Linear Relations

Let's now consider conditional independence again – i.e., Z ≠ ∅. For multivariate normal data, we can run the regression

Y ∼ X + Z

and examine the coefficient of X. An equivalent way to phrase this is by using residuals:

– Regress X on Z and determine the residuals rX = X − E[X | Z].
– Regress Y on Z and determine the residuals rY = Y − E[Y | Z].
– If rX and rY correlate, then X ⊥ Y | Z does not hold.

[Figure: scatterplot with regression line and residuals]
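A minimal sketch of this residual-based test on simulated data (the generating model Z → X, Z → Y and the variable names are ours):

set.seed(1)
z <- rnorm(500)
x <- z + rnorm(500)
y <- z + rnorm(500)
rx <- resid(lm(x ~ z))        # X minus its linear prediction from Z
ry <- resid(lm(y ~ z))        # Y minus its linear prediction from Z
cor(rx, ry)                   # near 0: consistent with X _||_ Y | Z
cor.test(rx, ry)$p.value      # significance of the residual correlation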

Example: Testing X ⊥ Y | Z

[Figure: two example models over Z, X, Y (the second also includes W), each shown with scatterplots of x against z, y against z, and of resid(mx) against resid(my)]

Residual Correlatedness Can Be Misleading

In the simple case Z = ∅, linear regression testing was prone to false negatives (concluding independence for dependent variables), but not to false positives (concluding dependence for independent variables). Unfortunately, this is not true for Z ≠ ∅. Below we give an example for a nonlinear model

X ← Y → Z.

[Figure: scatterplots of X against Y, Z against Y, and resid(mx) against resid(mz), with linear fits]

The residuals appear correlated because linear regression does not fit the X ∼ Y and Z ∼ Y relationships. But X and Z are nevertheless conditionally independent.

Free-Form Regression Residuals

Again, we can attempt to model the nonlinear X ∼ Y and Z ∼ Y relationships using free-form regression. This approach is less prone to false positives if there is a clear nonlinear pattern.

[Figure: the same scatterplots of X against Y, Z against Y, and resid(mx) against resid(mz), now with LOESS fits]
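A small simulated sketch of this comparison for the model X ← Y → Z with a quadratic relationship (data and names are made up):

set.seed(1)
y <- rnorm(300)
x <- y^2 + 0.2 * rnorm(300)
z <- y^2 + 0.2 * rnorm(300)
# Linear fits miss the quadratic shape, so the residuals look correlated:
cor(resid(lm(x ~ y)), resid(lm(z ~ y)))
# LOESS fits capture it, and the residual correlation is near 0, as it should be:
cor(resid(loess(x ~ y)), resid(loess(z ~ y)))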

9.2 The PC Algorithm

Recap: The IC Algorithm

We recapitulate the IC algorithm from last week:

The IC Algorithm

– Start with a graph containing no edges.
– For each pair of variables X and Y, search for a set Z_XY such that the independence X ⊥ Y | Z_XY is either in the input list, or follows from those in the input list. If no such set exists, link X and Y by an undirected edge X − Y.
– For each pair of variables X and Y that are not linked, but have a common neighbour W (X − W − Y), check whether W ∈ Z_XY. If not, then add arrowheads pointing to W, i.e. X → W ← Y.
– Orient the resulting graph pattern into a CPDAG.

Given what we now know about conditional independence, how would we go about finding the set Z_XY?

Modification 1

We begin by rephrasing the algorithm slightly.

The IC Algorithm

– Start with a graph containing edges between all pairs of variables.
– For each pair of variables X and Y, search for a set Z_XY such that the independence X ⊥ Y | Z_XY is either in the input list, or follows from those in the input list. If any such set exists, remove the edge between X and Y.
– For each pair of variables X and Y that are not linked, but have a common neighbour W (X − W − Y), check whether W ∈ Z_XY. If not, then add arrowheads pointing to W, i.e. X → W ← Y.
– Orient the resulting graph pattern into a CPDAG.

We shall see in a minute why this is better.


Example: The Sachs et al. Data

Sachs et al. were interested in the interactions between a set of 11 proteins. They used a Bayesian network learning algorithm for this purpose.

Inference Algorithm: Step 1

We begin with a network containing all possible links between variables.

[Figure: complete graph over the 11 proteins praf, pmek, plcg, PIP2, PIP3, p44/42, pakts473, PKA, PKC, P38, pjnk]

Keeping S Small

Now we need to examine sets Z_XY for each pair of variables X and Y. There are 11 variables, so 2^9 = 512 sets can be examined per pair.

Insight: A smaller Z_XY is better than a larger Z_XY, and an empty Z_XY is best.

Thus, we first check which variables are independent. We do this by applying a cutoff of 0.05.

library( Hmisc )
Mp <- rcorr( d )$P           # matrix of pairwise correlation p-values
1 * ( Mp < 0.05 )            # 1 = dependence detected at the 0.05 cutoff

##          praf pmek plcg PIP2 PIP3 p44/42 pakts473 PKA PKC P38 pjnk
## praf       NA    1    0    0    0      0        0   0   0   0    0
## pmek        1   NA    0    0    0      0        0   0   0   0    0
## plcg        0    0   NA    0    1      0        0   0   0   0    0
## PIP2        0    0    0   NA    1      0        0   0   0   0    0
## PIP3        0    0    1    1   NA      0        0   0   0   0    0
## p44/42      0    0    0    0    0     NA        1   1   0   0    0
## pakts473    0    0    0    0    0      1       NA   1   0   0    0
## PKA         0    0    0    0    0      1        1  NA   0   0    0
## PKC         0    0    0    0    0      0        0   0  NA   1    1
## P38         0    0    0    0    0      0        0   0   1  NA    1
## pjnk        0    0    0    0    0      0        0   0   1   1   NA

Inference Algorithm: After Step 1

We have already reduced our graph a lot:

[Figure: the remaining graph over the 11 proteins after removing edges between unconditionally independent pairs]

Exercise
For those variable pairs X, Y that are still linked by an edge, we now need to continue searching for a separator Z_XY. Do we still need to search through all possible sets of variables when searching for separators, or can we ignore some of these sets to speed up the search?

Chapter Summary

1. Testing conditional independence is difficult, and requires proper modelling of the relationships between variables.

2. It is preferable to perform tests for small or empty Z whenever possible.

3. Most algorithms for structure learning are aware of this and use tricks to avoid tests with large Z.

4. The IC algorithm plus such tricks is known as the PC algorithm.

Problems for This Chapter

Problem 9.1 Structure Learning, intermediate to hard
Consider the following undirected graph in which each edge represents a pair of nodes that are not unconditionally independent. That is, for each edge X − Y in this graph, we know that X and Y are not independent. (A graph like this is generated in the first step of the PC algorithm.)

1. (Intermediate) Which among the DAGs that could have generated this graph has the fewest edges?
2. (Hard) How many arrow directions can you already infer from this graph?
3. (Intermediate) How many conditional independence tests do you need to perform in the worst case to learn the entire equivalence class?

[Figure: undirected graph over A, B, C, D, E]

Problem 9.2 Structure Learning, hard
As in the previous exercise, this graph is the result of the first step of the PC algorithm and shows pairs of nodes that are not unconditionally independent. How many DAGs exist that could have generated this pattern?

[Figure: undirected graph over A, B, C, D, E, F]

Lecture 10: Causality
Interventions, Covariate Adjustment, and Instrumental Variables

We have spent most of this course introducing what Bayesian networks are, and how we can perform inference and structure learning. But to be truly successful, a methodology needs what is sometimes called a "killer application" – something that this methodology can do much better or much more elegantly than others. In the entirely subjective and biased opinion of the writer, the "killer application" for Bayesian networks is the study of cause-effect relationships. Made prominent by Judea Pearl's textbook "Causality", which at the time of writing had recorded its 10,000th citation on Google Scholar, causal inference based on Bayesian networks has seen rapid adoption in many empirical sciences over the past two decades.

Causality is a complex subject and can easily fill up a whole dedicated course. Yet, this final lecture will attempt to convey the gist of how Bayesian networks can be used in this context, explain some key definitions and results, and illustrate some applications. The writer would wager a guess that if students of this course come across Bayesian networks in the future, it may well be in the context of causality.

Chapter Objectives

1. Understand the difference between correlation and causation.

2. Appreciate how this complicates empirical research.

3. Be familiar with the do-operator to define causal effects.

4. Understand how Bayesian networks help to infer causation.

Chapter Contents

10.1 Motivation 80

10.2 Covariate Adjustment 83

10.3 Instrumental Variables 90

Problems for This Chapter 94

10.1 Motivation

So What is This About?

The Boring Part: Correlation Does Not Imply Causation.

The Interesting Part: Explaining Correlation Patterns.

Explanations for Correlations

We observe:

– Across Germany, Christians are more likely to vote right-wing.
– In both east and west, Christians are less likely to vote right-wing.

Our intuition tells us that the nationwide correlation is "driven by" differences between east and west, and is therefore not "real". But how do we know that?

Simpson's Paradox

Suppose a new treatment for a disease is tested in a trial with the following results:

              Cured   Not Cured
Treated         20        20
Not Treated     16        24

P(C = 1 | T = 1) = 0.5    P(C = 1 | T = 0) = 0.4

This appears to show that the treatment is effective – more people are cured who have taken the treatment.

Now the investigator wants to know whether the treatment is more effective in men or women, and gets the following results:

Males         Cured   Not Cured
Treated         18        12
Not Treated      7         3

P(C = 1 | T = 1, S = m) = 0.6    P(C = 1 | T = 0, S = m) = 0.7

Females       Cured   Not Cured
Treated          2         8
Not Treated      9        21

P(C = 1 | T = 1, S = f) = 0.2    P(C = 1 | T = 0, S = f) = 0.3

This appears to show the opposite: for both men and women, fewer people are cured who have taken the treatment.

Do we give the treatment or not?
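A quick numeric check of the proportions quoted above (plain arithmetic in R):

20 / (20 + 20); 16 / (16 + 24)   # aggregate: 0.5 vs 0.4
18 / (18 + 12);  7 / ( 7 +  3)   # males:     0.6 vs 0.7
 2 / ( 2 +  8);  9 / ( 9 + 21)   # females:   0.2 vs 0.3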

For Further Reading
[1] Judea Pearl: Simpson's Paradox: An Anatomy. http://bayes.cs.ucla.edu/R264.pdf

Simpson's Paradox

Suppose a new treatment for a disease is tested in a trial with the following results:

              Cured   Not Cured
Treated         20        20
Not Treated     16        24

P(C = 1 | T = 1) = 0.5    P(C = 1 | T = 0) = 0.4

This appears to show that the treatment is effective – more people are cured who have taken the treatment.

The investigator knows that treatment affects blood pressure, and measures it after treatment. She gets the following results:

High BP       Cured   Not Cured
Treated         18        12
Not Treated      7         3

P(C = 1 | T = 1, B = high) = 0.6    P(C = 1 | T = 0, B = high) = 0.7

Low BP        Cured   Not Cured
Treated          2         8
Not Treated      9        21

P(C = 1 | T = 1, B = low) = 0.2    P(C = 1 | T = 0, B = low) = 0.3

This appears to show the opposite: for both high and low blood pressure, fewer people are cured who have taken the treatment.

Do we give the treatment or not?

Exercise
We have just seen two examples of Simpson's paradox. The data were exactly the same, just the labels were different. Does this mean that the answer – do we give the treatment or not – must be the same in both cases?

Causal Inference

– Statistics is the art of describing data.
– Causal inference is the art of deriving causal insight from data:
  – Policy recommendations
  – Intervention targets
  – Explanation
– Causal inference can never be unambiguously performed based on data alone. We always require extra information or assumptions.

No causes in – no causes out
(Nancy Cartwright, "Hunting Causes and Using Them")

10.2 Covariate Adjustment

Covariate Adjustment

– We often work with observational data (cohort studies, case-control studies, ...).
– Causal inference from observational data requires making assumptions about the underlying process.

No causes in – no causes out
(Nancy Cartwright, "Hunting Causes and Using Them")

– Bayesian networks allow us to encode those assumptions in a principled fashion and derive their implications.

The structural approach to causality
Given this Bayesian network, is the causal effect of X on Y identifiable, and if so, how?

Modelling Interventions: The do-Operator

Observational regime

[Figure: DAG with U → coffee, U → smoke, and coffee, smoke → cancer]

P(cancer, coffee, smoke, U) = P(cancer | coffee, smoke) P(U) P(smoke | U) P(coffee | U)

Interventional regime

[Figure: the same DAG with the arrow into coffee removed by the intervention]

P(cancer, smoke, U | do(coffee)) = P(cancer | coffee, smoke) P(U) P(smoke | U)

These equations imply (summing the interventional distribution over U and smoke, and using Σ_U P(smoke | U) P(U) = P(smoke)) that:

P(cancer | do(coffee)) = Σ_smoke P(cancer | smoke, coffee) P(smoke)

Covariate Adjustment

We just derived the covariate adjustment formula:

P(cancer | do(coffee)) = Σ_smoke P(cancer | smoke, coffee) P(smoke)

To estimate the causal effect of coffee drinking on cancer, we adjust for the confounder "smoking".

Back-Door Criterion (Pearl, 2009)
Given a dataset that has been generated by a structural causal model G, let Z be a set of covariates such that
– Z does not contain any descendants of X;
– Z d-separates all paths from X to Y starting with X ←.

Then Z is an adjustment set for the causal effect of X on Y, and

P(y | do(x)) = Σ_z P(y | x, z) P(z)
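As a numeric illustration, a simulated-data sketch of the formula (the data-generating model and all numbers are made up; by construction, coffee has no causal effect on cancer here):

set.seed(1)
n <- 1e5
u      <- rbinom(n, 1, 0.5)
smoke  <- rbinom(n, 1, 0.2 + 0.6 * u)
coffee <- rbinom(n, 1, 0.2 + 0.6 * u)
cancer <- rbinom(n, 1, 0.05 + 0.30 * smoke)

# Naive comparison is confounded and suggests an effect of coffee:
mean(cancer[coffee == 1]) - mean(cancer[coffee == 0])

# Adjustment formula: sum_z P(cancer | coffee, smoke = z) P(smoke = z)
p_z <- prop.table(table(smoke))
adj <- function(c_val) sum(sapply(0:1, function(z)
  mean(cancer[coffee == c_val & smoke == z]) * p_z[as.character(z)]))
adj(1) - adj(0)   # near 0: the adjusted effect of coffee correctly vanishes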


Examples

[Figure: DAG with Z → X, Z → Y, X → Y]

g <- "dag {Z -> {X->Y}}"
adjustmentSets(g,"X","Y")

## { Z }

[Figure: DAG with X → Z, Y → Z, X → Y]

g <- "dag{Z <- {X->Y}}"
adjustmentSets(g,"X","Y")

## {}

[Figure: DAG over A, B, M, X, Y as specified by the dagitty string below]

g <- "dag{A->{M->X}->Y B->{M->Y}}"
adjustmentSets(g,"X","Y")

## { A, M }
## { B, M }
## { A, B, M }

Causal Bayesian Networks (DAGs) in Epidemiology

How large is the effect of low education on diabetes risk?

[Figure: DAG with the variables family income during childhood, mother's genetics, mother's diabetes, low education, and diabetes]

Rothman, Greenland & Lash, Modern Epidemiology, 2008

– Epidemiologists use DAGs to represent causal assumptions.
– The DAGs are drawn by hand (most often), generated from data (seldom), or both (sometimes).


Questions for a Causal Bayesian Network

Hi Mr. Textor, I am trying to learn more about [causal Bayesian networks]. I want to see if DAGitty can be used for the attached causal diagram to answer a few of my questions. I am having problems with using the program to help answer these questions. Can you give me some assistance?

1. Which variable would control for confounding and so reduce bias in estimating the causal effect of the exposure (E) on the disease (D)?

2. Which variable would not impact on the bias in the estimate of causal effect of E on D?

3. Which variable in the model potentially introduces (additional) bias in the estimate of the causal effect of E on D?

[Figure: the attached causal diagram with the variables family income, maternal genes, maternal diabetes, low education, and diabetes]

Exercise
Suppose we want to compute P(i | do(wue)) and we know that the following Bayesian network represents our domain correctly:

[Figure: DAG with the variables Coach (C), Fitness Level (FL), Injury (I), Intra-Game Proprioception (IGP), Neuromuscular Fatigue (NMF), Pre-Game Proprioception (PGP), Previous Injury (PI), Team Motivation (TM), Warm-Up Exercises (WUE), and Contact Sport (CS)]

Give 3 different adjustment sets that can be used to remove confounding!

Necessity of Causal Analysis

The "more is better" misconception: "Adjust for all pre-treatment covariates that correlate with exposure and outcome."

Counterexample: the M-bias graph.

[Figure: the M-bias graph with U1 → X, U1 → M, U2 → M, U2 → Y, and X → Y]


isAdjustmentSet(g,Z=c())

## [1] TRUE

isAdjustmentSet(g,Z="M")

## [1] FALSE

Famous exchange between Donald Rubin and Judea Pearl in “Statistics in Medicine” (2009)

Covariate Adjustment in Practice

Wang and Bautista, Int. J. Epidemiol. (2014)

Robust Covariate Adjustment

Validity of covariate adjustment depends on correctness of the input DAG.

Hopefully, the DAG has been tested by evaluating its conditional independencies. But even then, the problem of statistical equivalence remains.

The DAGs below are statistically equivalent, but they imply very different adjustment sets.

[Figure: DAG with Z → X, Z → Y, X → Y]

g <- "dag {Z -> {X->Y}}"
adjustmentSets(g,"X","Y")

## { Z }

[Figure: DAG with X → Z, Y → Z, X → Y]

g <- "dag{Z <- {X->Y}}"
adjustmentSets(g,"X","Y")

## {}

Reminder: Markov Equivalence

– Models with the same testable implications are called Markov equivalent.

[Figure: two DAGs over Z, A, D, E, each implying D ⊥ A and A ⊥ Z]

– Two models are Markov equivalent iff they have the same "skeleton" (edges without arrowheads) and the same "immoralities" (children of "unmarried" = unlinked parents).

[Figure: the shared skeleton and immoralities of the two DAGs]

Adjustment for Markov Equivalence Classes

It isn't always necessary to know the whole DAG for causal inference.

Generalized adjustment criterion (Perkovic et al., 2015)
Given a CPDAG (equivalence class) G = (V, E), let Z be a set of covariates such that

– No possible directed path from X to Y starts with an undirected edge;
– Z does not contain any possible descendants of a node W ∈ V \ X on a proper causal path from X to Y;
– Z blocks all proper non-causal paths from X to Y.

Then Z is an adjustment set for the causal effect of X on Y.

directed: X → W → Y;  possibly directed: X − W → Y

Adjustment for Markov Equivalence Classes

[Figure: a CPDAG over the variables Alcohol use, C-Reactive Protein, Marital status, Physical activity, Psychosocial stress, Serum Albumin, Uric Acid, Age, Bilirubin, Creatinine, Diet, Education, Hypertension, Income, Liv, Obesity, and Smoking]

adjustmentSets(equivalenceClass(g))

## Age ; Alcohol use ; Creatinine ; Education ;
## Obesity ; Serum Albumin ; Uric Acid ;

Summary

The DAG approach to covariate adjustment:

– Draw the DAG based on the best domain knowledge (and validate it!)
– Compute adjustment sets from the DAG.
– Collect data.
– If the DAG is correct, then the adjustment sets will remove confounding bias.

Model misspecifications are not always critical

Benito van der Zander, Maciej Liskiewicz, Johannes Textor: Constructing Separators and Adjustment Sets in Ancestral Graphs. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pp. 907–916. AUAI Press, 2014.

Emilija Perkovic, Johannes Textor, Markus Kalisch, Marloes Maathuis: A Complete Generalized Adjustment Criterion. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI 2015), pp. 682–691. AUAI Press, 2015.


10.3 Instrumental Variables

Unobserved Confounding

There is much evidence that Vietnam veterans are worse off later in life. It is tempting to conclude that these problems are due to their experiences in the war. But various unknown factors could have caused both the decision to join the military and the later earnings.

[Figure: DAG in which Unknown factors confound the effect of Served in Vietnam War on Lifetime Earnings]

We discussed before how covariate adjustment can be used to control for these confounding factors, but that requires that we know and measure the important sources of confounding.

For Further Reading
[1] Joshua D. Angrist: Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants. Econometrica 66(2):249–288, 1998.

Instrumental Variables (IVs)

– Goal: estimate the causal effect of X on Y.
– Problem: an unobserved confounder U.
– Solution: two-stage least squares using an instrumental variable (IV).

[Figure: the IV model with I → X, U → X, U → Y, and X → Y]

Exogeneity: I ⊥ U.
Exclusion restriction: I only affects Y through X, but not directly.

m1 <- lm( Y ~ X )
library( AER ); m2 <- ivreg( Y ~ X | I )

[Figure: bar plot comparing the β estimates from m1 and m2]

Two-stage least squares

1. Regress X on I and obtain the prediction X̂.
2. Regress Y on X̂.
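A simulated sketch of the two stages, compared against AER::ivreg (variable names and the true effect of 1 are made up):

set.seed(1)
n    <- 1e4
u    <- rnorm(n)                 # unobserved confounder
inst <- rnorm(n)                 # instrument I
x    <- inst + u + rnorm(n)
y    <- 1 * x + 2 * u + rnorm(n) # true causal effect of x on y is 1

coef(lm(y ~ x))["x"]             # ordinary regression: biased upwards by u

xhat <- fitted(lm(x ~ inst))     # stage 1: predict x from the instrument
coef(lm(y ~ xhat))["xhat"]       # stage 2: close to the true effect 1

library(AER)
coef(ivreg(y ~ x | inst))["x"]   # same point estimate via ivreg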

The Vietnam Draft Lottery

In 1969, men were called in a random order determined by their birthdays, and asked to serve in the war. 195 out of 366 possible birthdays were "drafted". For example, men born on a September 14th were drafted, but men born on a June 20th were not. Not every drafted person enlisted.

[Figure: DAG with "Won" Lottery → Served in Vietnam War, and Unknown factors confounding the effect of Served in Vietnam War on Lifetime Earnings]

Angrist used "draft" as an IV to determine the effect of serving in the war on lifetime earnings.

m1 <- lm( earn ~ served )
m2 <- ivreg( earn ~ served | won )
m3 <- ivreg( earn ~ served + yearofbirth | won + yearofbirth )

Conditional IVs

The Vietnam Draft Lottery was repeated a few times, but with slightly different procedures. When combining these data, the "exogeneity" criterion I ⊥ U is violated.

[Figure: the DAG from before, now including Year of Birth, which is connected to both the "Won" Lottery (I) and the unknown confounders]

However, we can condition on the year of birth to d-separate I from U. We say that I is a conditional IV.

[Figure: bar plot comparing the β estimates from m1, m2, and m3]

Conditional IVs in Bayesian Networks

[Figure: the classic IV model with I, X, U, Y]

[Figure: a model with an additional variable W; I is a conditional IV given W]

Graphical Definition (Brito and Pearl)
I is a conditional IV if for some observed variables W,
(a) I correlates with X conditional on W,
(b) W d-separates I and Y when X → Y is removed,
(c) W contains no descendants of Y.

Instrumentalization: Given a network G and X → Y, I, find W that
– d-separates I and Y but
– does not d-separate I and X.

Instrumentalization is NP-hard

Theorem
Determining whether I is a conditional IV relative to X → Y is an NP-hard problem.

[Figure: the reduction gadget, a DAG built from the clauses Ci of a 3-SAT formula, connecting X, I, Y and auxiliary variables]

Proof by reduction from 3-SAT (Ci: clauses in the formula). Task: choose Pn, Fn to connect X to I but keep it separate from Y.


Conditional IVs and Spurious Correlations

Conditioning on W is normally meant to remove confounding from an IV. But we can also use it to create an IV.

[Figure: model with X → Y, confounder U, instrument I1, and variables W and I2]

m1 <- lm( Y ~ X )
m2 <- ivreg( Y ~ X | I2 )
m3 <- ivreg( Y ~ X + W | I2 + W )

[Figure: bar plot comparing the β estimates from m1, m2, and m3]

Ancestral IVs

We introduce a restricted definition that is more in line with the "remove confounding" interpretation of conditional IVs.

Definition: Ancestral IVs
I is an ancestral IV if for some observed variables W,
(a) I correlates with X conditional on W,
(b) W d-separates I and Y when X → Y is removed,
(c) W contains only ancestors of Y, I and no descendants of Y.

[Figure: the model from the previous slide; I1 is an ancestral IV, while I2 is a non-ancestral IV conditional on W]

Instrumentalization for Ancestral IVs: Easy

function Ancestral-Instrument(G, X, Y, Z)
    Gc := G with edge X → Y removed
    W := Nearest-Separator(Gc, Y, Z)
    if (W = ⊥) ∨ (W ∩ De(Y) ≠ ∅) ∨ (X ∈ W) then
        return ⊥
    if Z and X are d-connected given W in Gc then return W, else return ⊥

[Figure: an example graph G over I, A, B, C, D, U1, U2, X, Y, and the corresponding moral graph Gc]

Nearest-Separator(Gc, Y, Z) = {D, B} (but not {A, B})
Blocks only those paths necessary to separate I and Y, but not more.


Putting it Together

– We cannot (efficiently) instrumentalize IVs, but
– we can (efficiently) instrumentalize ancestral IVs.

Theorem
For a given DAG G and variables X and Y, a conditional IV I relative to X → Y exists if and only if an ancestral IV I′ relative to X → Y exists.

Whenever there exists a conditional IV in a given graphical model, we can find one efficiently.

For Further Reading
[1] B. van der Zander, J. Textor, and M. Liśkiewicz: Efficiently finding conditional instruments for causal inference. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 3243–3249. AAAI Press, 2015.

Implementation in R

The R package dagitty provides a command to list IVs, including conditional IVs.

library(dagitty)
d <- dagitty("dag{ U -> X ; U -> Y ; I -> X -> Y }")
exposures(d) <- "X" ; outcomes(d) <- "Y"
instrumentalVariables( d )

## I

d <- getExample("Shrier") # The contact sport DAG
instrumentalVariables( d )

## Coach | NeuromuscularFatigue, TissueWeakness
## ConnectiveTissueDisorder | NeuromuscularFatigue, TissueWeakness
## FitnessLevel | NeuromuscularFatigue, TissueWeakness
## Genetics | NeuromuscularFatigue, TissueWeakness
## PreGameProprioception | NeuromuscularFatigue, TissueWeakness
## PreviousInjury | ContactSport, NeuromuscularFatigue, TissueWeakness
## TeamMotivation | NeuromuscularFatigue, TissueWeakness

Summary

– Conditional IVs are a natural generalization of IVs, but they are hard to find.
– It suffices to consider ancestral conditional IVs.
– Ancestral conditional IVs can be found quickly in Bayesian networks.

Conclusion

– Bayesian networks can be used to encode causal assumptions and derive their implications.
– This is especially useful when attempting to derive causal effects from observational data.
– Two ways to do that are covariate adjustment and instrumental variables.

dagitty.net/primer


Problems for This Chapter

Problem 10.1 The do-operator, intermediate
Consider the following Bayesian network.

[Figure: DAG with Z → X, Z → Y, and X → Y]

Z  P
1  0.5

Z  X  P
0  1  0.1
1  1  0.5

Z  X  Y  P
0  0  1  0.1
0  1  1  0.2
1  0  1  0.2
1  1  1  0.5

1. Determine p(Y = 1 | X = 1).
2. Determine p(Y = 1 | do(X = 1)). What is the difference to p(Y = 1 | X = 1)? How can you explain this difference?
3. Determine Σ_z p(Y = 1 | X = 1, z) p(z). Does this match p(Y = 1 | do(X = 1))?

Problem 10.2 Covariate adjustment, hard
Consider this Bayesian network:

[Figure: DAG over X, Y, Z]

Obviously, it is not necessary to adjust for Z since there is no back-door path from X to Y. What would happen if we were to adjust for Z anyway?

Problem 10.3 Covariate adjustment, intermediate
Determine three different adjustment sets for the causal effect of X on Y based on the Bayesian network below.

[Figure: DAG over A, B, C, D, E, X, Y]

Assuming all variables are binary, which of these adjustment sets would you prefer in practice and why?

Problem 10.4 Covariate adjustment, hard
Is it always possible to find a set Z that satisfies the back-door criterion with respect to the causal effect of a variable X on another variable Y? If so, which set can you always use? If not, why not?
Does your answer change if you consider X to be a set instead of a single variable?

Appendix

Solutions

Suggested Solutions for Selected Problems

Suggested Solution for Problem 1.1

1. P(A = 0) = 0.7; P(B = 0) = 0.5

2. P(A = 0 | B = 0) = P(A = 0, B = 0) / P(B = 0) = 0.3 / 0.5 = 0.6

3. P(a | B = 0) is a function of A:

   P(a | B = 0) = 0.6 if A = 0, and 0.4 if A = 1.

4. We could of course also just insert the definition of conditional probability. But by Bayes' theorem, we have

   P(B = 0 | A = 0) = P(A = 0 | B = 0) P(B = 0) / P(A = 0) = (0.6 · 0.5) / 0.7 = 3/7

Suggested Solution for Problem 1.2

We consider the scenario where the player has chosen door 1 (X = 1), and the host has opened door 3 (Z = 3). We then want to prove that the chance of the car being behind door 2 (Y = 2) is now higher than the chance of the car being behind door 1 (Y = 1), so it is better to switch doors. In a formula:

P(Y = 1 | X = 1, Z = 3) < P(Y = 2 | X = 1, Z = 3)

We first compute that

P(Y = 1 | X = 1, Z = 3) = P(Y = 1, X = 1, Z = 3) / P(X = 1, Z = 3) = (1/9 · 1/2) / (1/9 · 1/2 + 1/9) = 1/3.

And in the same way, we get that

P(Y = 2 | X = 1, Z = 3) = P(Y = 2, X = 1, Z = 3) / P(X = 1, Z = 3) = (1/9) / (1/9 · 1/2 + 1/9) = 2/3.

So if we switch doors, we get twice as high a chance to win.

Suggested Solution for Problem 1.3

1. We call the four variables as follows. D is having the disease. M is having the mutation. T1 is being tested positive by 42andYou. T2 is being tested positive by the other test. Then our network looks as follows.

   [Figure: DAG with D → M, M → T1, and D → T2]

   D  P
   1  0.0001

   D  M  P
   0  1  0.01
   1  1  0.99

   M  T1  P
   0  1   0
   1  1   0.99

   D  T2  P
   0  1   0.01
   1  1   0.99

   Note that in the probability tables, we have omitted some values that follow from the other values. For example, because P(M = 1 | D = 1) = 0.99, we know that P(M = 0 | D = 1) = 0.01, so we do not list it explicitly in the table.


2. We first use the law of total probability to bring in the parent of T1: P(T1 = 1) = P(T1 = 1, M = 0) + P(T1 = 1, M = 1) = P(T1 = 1, M = 1) = P(T1 = 1 | M = 1) P(M = 1).
   Now we again use the law of total probability to determine P(M = 1) = P(M = 1, D = 0) + P(M = 1, D = 1) = P(M = 1 | D = 0) P(D = 0) + P(M = 1 | D = 1) P(D = 1).
   Putting these together, we get P(T1 = 1) = P(T1 = 1 | M = 1) P(M = 1) = 0.99 · (0.01 · 0.9999 + 0.99 · 0.0001) = 0.00999702.

3. We first apply Bayes' rule:

   P(D = 1 | T1 = 1) = P(T1 = 1 | D = 1) P(D = 1) / P(T1 = 1)

   We know that P(D = 1) = 0.0001 and we know from the previous exercise that P(T1 = 1) = 0.00999702. So the missing ingredient is P(T1 = 1 | D = 1). Because the test can only be positive when the mutation is carried, we know that P(T1 = 1 | D = 1) = P(T1 = 1, M = 1 | D = 1) = P(T1 = 1 | D = 1, M = 1) P(M = 1 | D = 1). Further, we know that P(T1 = 1 | D = 1, M = 1) = P(T1 = 1 | M = 1). So in total we get P(T1 = 1 | D = 1) = P(T1 = 1 | M = 1) P(M = 1 | D = 1) = 0.99 · 0.99 = 0.9801. This leads us to

   P(D = 1 | T1 = 1) = (0.9801 · 0.0001) / 0.00999702 = 0.009803922

   This is a quite small number – despite the positive test, we are still vastly more likely to not have the disease.
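A quick numeric re-check of these computations in R:

p_d  <- 0.0001
p_m  <- 0.99 * p_d + 0.01 * (1 - p_d)   # P(M = 1)
p_t1 <- 0.99 * p_m                       # P(T1 = 1) = 0.00999702
0.99 * 0.99 * p_d / p_t1                 # P(D = 1 | T1 = 1) = 0.009803922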

4. We will probably have had some reason to take the test in the first place. For instance, we might have felt certain symptoms of the disease, or a family member might have been affected. Many of such potential reasons to take the test probably increase our risk to bear the disease to more than 0.01%. However, this additional information is not factored into our Bayesian network.

5. Again we use Bayes' rule to rewrite P(D = 1 | T1 = 1, T2 = 0) as P(T1 = 1, T2 = 0 | D = 1) P(D = 1) / P(T1 = 1, T2 = 0). An important property is now that T1 and T2 are conditionally independent given D. This allows us to rewrite

   P(D = 1 | T1 = 1, T2 = 0) = P(T1 = 1 | D = 1) P(T2 = 0 | D = 1) P(D = 1) / [ P(T1 = 1 | D = 0) P(T2 = 0 | D = 0) P(D = 0) + P(T1 = 1 | D = 1) P(T2 = 0 | D = 1) P(D = 1) ]

   Inserting numbers we already computed in previous exercises or can look up from the probability tables, this turns into

   P(D = 1 | T1 = 1, T2 = 0) = (0.9801 · 0.01 · 0.0001) / [ P(T1 = 1 | D = 0) · 0.99 · 0.9999 + 0.9801 · 0.01 · 0.0001 ]

   We now still need to determine P(T1 = 1 | D = 0). We compute this in the same fashion as P(T1 = 1 | D = 1), that is, we write P(T1 = 1 | D = 0) = P(T1 = 1 | M = 1) P(M = 1 | D = 0), which gives P(T1 = 1 | D = 0) = 0.99 · 0.01 = 0.0099. That leaves us with

   P(D = 1 | T1 = 1, T2 = 0)
   = (0.9801 · 0.01 · 0.0001) / (0.0099 · 0.99 · 0.9999 + 0.9801 · 0.01 · 0.0001)
   = 0.0001 / (0.9999 + 0.0001)
   = 0.0001

   In other words, the two tests "cancel each other out" exactly, because the a posteriori probability now equals the a priori probability.

Suggested Solution for Problem 2.1

1. All pairs of variables that are on different "sides" of the collider D are d-separated by the empty set. These are six different pairs: (A,E), (A,F), (B,E), (B,F), (C,E), (C,F).

2. {D,E} d-separates A only from the variable F. Hence the answer is Z = {F}.

3. All pairs of other variables except the pair (C,D). These are: (A,C), (A,D), (A,F), (C,F), and (D,F).

4. From the first part, we already know the six pairs that are d-separated by the empty set. The pair (A,C) is d-separated by {B} but not by the empty set. The only remaining pairs that can be d-separated all involve the collider D. This variable can be d-separated from A by either {B} or {C}, from B by {C}, and from F by {E}.


Suggested Solution for Problem 2.2

1. This set is also known as the "Markov blanket". A: Z = {C,B,X}; B: Z = {A,C,X,Y,Z}; C: Z = {A,B,Y,Z}; X: Z = {A,B,Y}; Y: Z = {X,Z,B,C}; Z: Z = {C,B,Y}.

2. If only the variables A, B, X, and Z can be measured, then the network cannot be tested. This is because A, B, and X cannot be d-separated from each other, and to d-separate either of those variables from Z, we would always need to condition on B, or Y, or both. Conditioning on B renders A and C dependent.

Suggested Solution for Problem 2.3

Many options exist. Note that for some of the below, the modification could also lead to new implications or change the first implication.

– Add an arrow X → Y.
– Remove variable X or Y.
– Reverse the arrow from Y to M2.
– Reverse the arrow from Y to M1.
– Reverse both the arrow from Y to M1 and from Y to M2.
– Replace M1 by a new variable U1 and make M1 a child of U1 (and possibly some other variables).

Suggested Solution for Problem 2.4

An algorithm is described in the paper "Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)" by Ross D. Shachter (see http://arxiv.org/abs/1301.7412). The basic idea is to use a simple graph traversal algorithm, such as breadth-first search or depth-first search, but to traverse the edges of the graph in both directions. It is necessary to remember from which direction a node has already been visited. Therefore, one needs to work with two different queues – one for the nodes that are visited going along the arrow directions, and another for the nodes that are visited going against the arrow direction. When arriving at a node v, the following cases can occur:

– Visited from a child of v (against arrow direction) and v ∈ Z: do nothing.
– Visited from a child of v (against arrow direction) and v ∉ Z: visit all parents and children of v.
– Visited from a parent of v (following arrow direction) and v is an ancestor of some node in Z: visit all other parents of v.
– Visited from a parent of v (following arrow direction) and v is not in Z: visit all children of v.

Nodes that have been visited need to be labelled such that they are not visited again in the same direction. An example implementation of this algorithm can be found on the course website.

Suggested Solution for Problem 2.5

The basic problem with testing conditional independence between two variables X and Y given a variable Z is that X and Y could be independent given "almost all" values of Z. For example, suppose that Z is a variable that can take on a million different values, and that X and Y are only dependent given a single value z. Then several million samples would be necessary to discover this dependence. Therefore, real-world tests of conditional independence need to assume that any dependence between X and Y holds for "enough" values of Z to be detected.

For discrete variables, we would still be guaranteed to discover the dependence at some point, i.e., when we obtain enough samples. But for continuous variables, not even that is guaranteed. For a more mathematically rigorous explanation, see "Testing Conditional Independence for Continuous Random Variables" by Bergsma (http://www.eurandom.tue.nl/reports/2004/048-report.pdf).

Suggested Solution for Problem 3.1

1. 3 parameters are necessary to represent the distribution of the Y's. Then another 8 parameters are necessary for each X. That makes 3 + 2·8 = 19 parameters. For the divorced network, we need only one parameter for each X. That means the number of parameters is reduced to 3 + 2 = 5.

2. The original network could make each X dependent on the precise combination of the Y's. This is no longer possible in the divorced network. Once one of the Y's has a value of 1, then the values of the other two Y's can no longer influence the X. An explicit example that is not compatible with the divorced network is the following (where all probabilities not shown in the table are 0):


X1  X2  Y1  Y2  Y3  P
 0   0   1   1   0  0.4
 0   0   1   1   1  0.1
 1   0   1   1   0  0.25
 1   0   1   1   1  0.25

The divorced network implies that X1 ⊥ Y3 | M. Then it must hold that P(X1 = 0, Y3 = 0 | M = 1) = P(X1 = 0 | M = 1) P(Y3 = 0 | M = 1), which means that 0.4 = 0.5 · (0.4 + 0.25), which is false. Note that M = 1 is always true in this probability distribution.

The same example is compatible with the original network. The distribution can be generated as follows (we do not show table rows for value combinations that cannot occur, e.g., any combination in which Y2 = 0):

(Figure: the original network, in which Y1, Y2, and Y3 are parents of both X1 and X2.)

P(Y1 = 1):

Y1  P
1   1

P(Y2 = 1):

Y2  P
1   1

P(Y3 = 1):

Y3  P
1   0.35

P(X1 = 1 | Y1, Y2, Y3) (only the rows with Y1 = Y2 = 1 are shown):

Y1  Y2  Y3  X1  P
1   1   0   1   0.3846154
1   1   1   1   0.7142857

P(X2 = 1 | Y1, Y2, Y3) (only the rows with Y1 = Y2 = 1 are shown):

Y1  Y2  Y3  X2  P
1   1   0   1   0
1   1   1   1   0
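As a quick sanity check (our own, in Python), we can verify from the joint table that X1 and Y3 are indeed dependent given M = 1 (recall that M = 1 holds with probability 1 here):

joint = [
    # (X1, X2, Y1, Y2, Y3, P)
    (0, 0, 1, 1, 0, 0.40),
    (0, 0, 1, 1, 1, 0.10),
    (1, 0, 1, 1, 0, 0.25),
    (1, 0, 1, 1, 1, 0.25),
]

p_x1_0_y3_0 = sum(p for x1, x2, y1, y2, y3, p in joint if x1 == 0 and y3 == 0)
p_x1_0 = sum(p for x1, x2, y1, y2, y3, p in joint if x1 == 0)
p_y3_0 = sum(p for x1, x2, y1, y2, y3, p in joint if y3 == 0)

print(p_x1_0_y3_0)          # 0.4
print(p_x1_0 * p_y3_0)      # 0.5 * 0.65 = 0.325, so X1 and Y3 are not independent given M = 1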

Suggested Solution for Problem 3.2

The base log-odds, also called "intercepts", of the fitted logistic regression models are as follows:

Variable     Intercept
Asia         -0.016
Pollution     0.008
Smoker       -0.06002
TB           -0.4187
Cancer       -0.6669
Bronchitis   -1.362
XRay         -4.987
Dyspnoea     -2.786

The model coefficients are as follows:

Parent      Child       Coefficient
Pollution   Cancer      0.5935
Smoker      Cancer      0.8670
Asia        TB          0.9209
Smoker      Bronchitis  2.685
TB          XRay        4.935
Cancer      XRay        5.120
TB          Dyspnoea    1.783
Cancer      Dyspnoea    1.623
Bronchitis  Dyspnoea    1.905

These values were all obtained using R's generalized linear model (glm) command. If you obtain different values using some other software, please let me know.
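For readers working in Python rather than R, the same kind of fit can be obtained with statsmodels. The sketch below is only an illustration: the placeholder data frame and its 0/1 columns named after the network's variables are our own assumption, so the fitted numbers will not match the table above.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Placeholder data frame; in the exercise this would be the actual sample.
rng = np.random.default_rng(0)
asia_df = pd.DataFrame(rng.integers(0, 2, size=(1000, 3)),
                       columns=["Pollution", "Smoker", "Cancer"])

# One logistic regression per node, with the node's parents as predictors,
# e.g. for the Cancer node with parents Pollution and Smoker:
cancer_fit = smf.glm("Cancer ~ Pollution + Smoker", data=asia_df,
                     family=sm.families.Binomial()).fit()
print(cancer_fit.params)   # intercept for Cancer plus one coefficient per parent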

Suggested Solution for Problem 4.1

One possible elimination ordering, written out with all factors retained, is the following:

P(z) = ∑_{a,b,c,d,e,f} P(a) P(b) P(f) P(c | b) P(z | a,b) P(d | z) P(e | f,z)

     = ∑_{a,b} P(z | a,b) P(a) ( ∑_c P(b) P(c | b) ) ( ∑_d P(d | z) ) ( ∑_{e,f} P(f) P(e | f,z) )

But note that

∑_c P(b) P(c | b) = P(b) ,   ∑_d P(d | z) = 1 ,   and   ∑_{e,f} P(f) P(e | f,z) = ∑_{e,f} P(e,f | z) = 1 .

Therefore, the elimination ordering simplifies to

P(z) = ∑_{a,b} P(z | a,b) P(a) P(b) ,

which, as it should be, corresponds to the factorization of the ancestor graph of Z.
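To make the simplification concrete, here is a small numeric check (our own, with made-up binary CPTs): it confirms that summing the full factorization and summing only over the ancestors of Z give the same value for P(Z = 1).

import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    # P(child = 1 | parent configuration), indexed by the parents' 0/1 values
    return rng.random(size=(2,) * n_parents)

p_a, p_b, p_f = rng.random(3)          # P(A = 1), P(B = 1), P(F = 1)
p_c_b  = random_cpt(1)                 # P(C = 1 | B)
p_z_ab = random_cpt(2)                 # P(Z = 1 | A, B)
p_d_z  = random_cpt(1)                 # P(D = 1 | Z)
p_e_fz = random_cpt(2)                 # P(E = 1 | F, Z)

def bern(p, v):                        # P(V = v) for a variable with P(V = 1) = p
    return p if v == 1 else 1 - p

# Full factorization, summed over a, b, c, d, e, f (with Z fixed to 1):
full = sum(
    bern(p_a, a) * bern(p_b, b) * bern(p_f, f) * bern(p_c_b[b], c)
    * bern(p_z_ab[a, b], 1) * bern(p_d_z[1], d) * bern(p_e_fz[f, 1], e)
    for a, b, c, d, e, f in itertools.product((0, 1), repeat=6)
)

# Simplified expression: only the ancestors of Z remain.
simplified = sum(
    bern(p_z_ab[a, b], 1) * bern(p_a, a) * bern(p_b, b)
    for a, b in itertools.product((0, 1), repeat=2)
)

print(np.isclose(full, simplified))    # True: both equal P(Z = 1)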

Suggested Solution for Problem 5.2

This statement is correct. To prove it, first note that an edge X → Y in a Bayesian network G corresponds to a path X − f_Y − Y in the factor graph of G, where f_Y is the factor for the conditional distribution of Y given its parents (including X).

We now prove the statement "A Bayesian network is a polytree if and only if its factor graph is a tree" in the following way. Let B(G) be shorthand for "the Bayesian network G is a polytree" and let F(G) be shorthand for "the factor graph of the Bayesian network G is a tree".

First, we show that ¬B(G) ⇒ ¬F(G). So, suppose that G is not a polytree. Then there exists some path π = X1, ..., Xn, X1 in G such that for each i ∈ {1, ..., n−1}, either Xi → Xi+1 or Xi ← Xi+1 is an edge in G. We now transform π into a sequence π′ by replacing each Xi → Xi+1 by Xi − f_{Xi+1} − Xi+1 and each Xi ← Xi+1 by Xi − f_{Xi} − Xi+1. The resulting sequence π′ is a path in the factor graph of G. Since π′ is a cycle (a path starting and ending at the same variable X1), the factor graph of G cannot be a tree.

Now, we show that ¬F(G) ⇒ ¬B(G). If the factor graph of G is not a tree, it contains a cycle. Since the factor graph is bipartite (variable nodes are only connected to factor nodes and vice versa), this cycle is an alternating sequence of variable nodes and factor nodes. Without loss of generality, we assume that the cycle starts and ends at a variable node X1. Now, replace each subsequence Xi − f_{Xi} − Xi+1 by Xi ← Xi+1 and each subsequence Xi − f_{Xi+1} − Xi+1 by Xi → Xi+1. The result is a path in G that starts and ends at the same variable. Hence, G is not a polytree.
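The correspondence used in this proof can also be checked mechanically. The sketch below (our own, using the networkx library, with edge lists invented for illustration) builds the factor graph of a Bayesian network and tests whether it is a tree:

import networkx as nx

def factor_graph(bn_edges, variables):
    """Factor graph of a Bayesian network given as a list of (parent, child) edges."""
    fg = nx.Graph()
    fg.add_nodes_from(variables)
    for v in variables:
        factor = ("f", v)                      # one factor node f_v per variable
        fg.add_edge(factor, v)                 # f_v is connected to v ...
        fg.add_edges_from((factor, p) for p, c in bn_edges if c == v)  # ... and to v's parents
    return fg

# A polytree (no cycle in the underlying undirected graph) ...
polytree_edges = [("A", "C"), ("B", "C"), ("C", "D")]
print(nx.is_tree(factor_graph(polytree_edges, "ABCD")))       # True

# ... versus a DAG whose underlying undirected graph has the cycle A - B - D - C - A.
cyclic_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(nx.is_tree(factor_graph(cyclic_edges, "ABCD")))         # False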

Suggested Solution for Problem 6.1

Applying the recursive formula once, we can express Cov(X, Y | A, B) as follows:

Cov(X, Y | A, B) = Cov(X, Y | A) − Cov(X, B | A) Cov(Y, B | A) / Cov(B, B | A)

Therefore, we now first have to compute 4 more conditional covariances:

Cov(X, Y | A) = Cov(X, Y) − Cov(X, A) Cov(Y, A) / Cov(A, A) = 0.65 − 0.5 · 0.7 = 0.3
Cov(X, B | A) = Cov(X, B) − Cov(X, A) Cov(B, A) / Cov(A, A) = 0.8 − 0.5 · 0.4 = 0.6
Cov(Y, B | A) = Cov(Y, B) − Cov(Y, A) Cov(B, A) / Cov(A, A) = 0.7 − 0.7 · 0.4 = 0.42
Cov(B, B | A) = Cov(B, B) − Cov(B, A) Cov(B, A) / Cov(A, A) = 1 − 0.4² = 0.84

Inserting these, we get

Cov(X, Y | A, B) = 0.3 − 0.6 · 0.42 / 0.84 = 0
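The same recursion is easy to program. The sketch below (ours) re-does the computation above, with the covariances from the problem stored in a small dictionary:

def cond_cov(cov, x, y, given=()):
    """Cov(x, y | given), conditioning recursively on one variable at a time."""
    if not given:
        return cov[frozenset((x, y))] if x != y else cov[frozenset((x,))]
    *rest, b = given
    return (cond_cov(cov, x, y, rest)
            - cond_cov(cov, x, b, rest) * cond_cov(cov, y, b, rest)
              / cond_cov(cov, b, b, rest))

cov = {frozenset(("X", "Y")): 0.65, frozenset(("X", "A")): 0.5, frozenset(("Y", "A")): 0.7,
       frozenset(("X", "B")): 0.8,  frozenset(("Y", "B")): 0.7, frozenset(("A", "B")): 0.4,
       frozenset(("A",)): 1.0, frozenset(("B",)): 1.0, frozenset(("X",)): 1.0, frozenset(("Y",)): 1.0}

print(cond_cov(cov, "X", "Y", ("A", "B")))   # approximately 0 (up to floating-point rounding)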



Suggested Solution for Problem 6.2

The model implies, for example, that X ⊥ Y | M. Therefore, Cov(X, Y | M) = 0 must hold. Using the recursive formula, this means that the covariance matrix would need to fulfill the equation

Cov(X, Y | M) = Cov(X, Y) − Cov(X, M) Cov(Y, M) / Cov(M, M) = 0 ,

which, because we assumed unit variances, would simplify to

Cov(X, Y) = Cov(X, M) Cov(Y, M) .

Therefore, we only need to provide a covariance matrix for which this identity does not hold. A simple example would be, for instance,

    A    M    X    Y
A   1    0    0    0
M   0    1    0    0
X   0    0    1    0.5
Y   0    0    0.5  1
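A one-line check (ours) confirms that this matrix violates the implied constraint:

# Values taken from the covariance matrix above.
cov_XY, cov_XM, cov_YM, var_M = 0.5, 0.0, 0.0, 1.0
cov_XY_given_M = cov_XY - cov_XM * cov_YM / var_M
print(cov_XY_given_M)   # 0.5, not 0, so the model is rejected for this covariance matrix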

Suggested Solution for Problem 6.3

We can use the local Markov property (see also Tutorial 1, Exercise 1.9) to devise a set of such equations. The local Markov property says that each node must be independent of its non-descendants, given its parents. In our example, this would mean

1. X ⊥ Y, A | M
2. Y ⊥ X | M
3. A ⊥ X, M | Y

Translating this to regression equations, we would get:

1. X ∼ Y + A + M; the coefficients of Y and A would need to vanish.
2. Y ∼ X + M; the coefficient of X would need to vanish.
3. A ∼ X + M + Y; the coefficients of X and M would need to vanish. (A sketch of how to run these checks is given below.)
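In Python, these checks could be run roughly as follows. This is only a sketch on our part: statsmodels OLS stands in for whatever regression software is used, and the placeholder data are simulated from a linear SEM that satisfies the three independence statements above (our own choice of graph M → X, M → Y, Y → A and of coefficients).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000
M = rng.normal(size=n)
X = 0.8 * M + rng.normal(size=n)
Y = 0.6 * M + rng.normal(size=n)
A = 0.7 * Y + rng.normal(size=n)
df = pd.DataFrame({"A": A, "M": M, "X": X, "Y": Y})

checks = {
    "X ~ Y + A + M": ["Y", "A"],   # coefficients of Y and A should vanish
    "Y ~ X + M":     ["X"],        # coefficient of X should vanish
    "A ~ X + M + Y": ["X", "M"],   # coefficients of X and M should vanish
}
for formula, should_vanish in checks.items():
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.params[should_vanish].round(3).to_dict())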

Suggested Solution for Problem 7.1

Instead of listing the matrix, we provide the individual entries below, and spell out each entry as the sum of the simple treks that define it.

1. Cov(Area type, Area size) = .4
2. Cov(Area type, Nr. emergencies) = .4 · (−.3) + (−.4 · .5) + (−.4 · .5 · (−.2)) + (−.4)
3. Cov(Area type, Population size) = −.4
4. Cov(Area type, Nr. doctors) = −.4 · .5
5. Cov(Area size, Nr. emergencies) = −.3 + (.4 · (−.4) · .5) + (.4 · (−.4) · .5 · (−.2)) + (.4 · (−.4))
6. Cov(Area size, Population size) = .4 · (−.4)
7. Cov(Area size, Nr. doctors) = .4 · (−.4) · .5
8. Cov(Population size, Nr. doctors) = .5
9. Cov(Population size, Nr. emergencies) = .5 + (.5 · (−.2)) + ((−.4) · (−.4)) + ((−.4) · .4 · (−.3))
10. Cov(Nr. doctors, Nr. emergencies) = −.2 + (.5 · .5) + (.5 · (−.4) · .4 · (−.3)) + (.5 · (−.4) · (−.4))

Suggested Solution for Problem 7.2

The second model is not identified. The covariance between X and Y depends on the path coefficient and the residual variance. It is possible to change these to get exactly the same covariance between X and Y without affecting any of the other parameters.

More specifically, the three covariances are implied by the parameters of the second model as follows (assuming all residual variances are set such that all variables have variance 1):

Cov(X, Y) = σ²_XY + β_XY
Cov(X, Z) = β_YZ σ²_XY + β_YZ β_XY = β_YZ (σ²_XY + β_XY)
Cov(Y, Z) = β_YZ

This implies, for example, that setting σ²_XY = 0 and β_XY = 0.5, or setting σ²_XY = 0.5 and β_XY = 0, gives exactly the same covariances.
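A short numeric illustration (ours; the value chosen for β_YZ is arbitrary) shows that the two parameter settings mentioned above are indistinguishable from the implied covariances:

def implied_covs(sigma2_XY, beta_XY, beta_YZ):
    # The three implied covariances from the formulas above.
    cov_XY = sigma2_XY + beta_XY
    cov_XZ = beta_YZ * (sigma2_XY + beta_XY)
    cov_YZ = beta_YZ
    return cov_XY, cov_XZ, cov_YZ

print(implied_covs(sigma2_XY=0.0, beta_XY=0.5, beta_YZ=0.3))
print(implied_covs(sigma2_XY=0.5, beta_XY=0.0, beta_YZ=0.3))   # identical output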



Suggested Solution for Problem 8.1

All edges of the graph pattern can be directed – in the resulting CPDAG below, no edge can be reversed without either creating or destroying a v-structure.

(Figure: the fully directed CPDAG over the nodes A, B, C, D, E, F.)

Suggested Solution for Problem 8.2

Only the leftmost two edges (highlighted in bold below) can be "undirected". The edge A → C can be reversed without creating or destroying a v-structure; afterwards, the edge C → E can also be reversed without creating or destroying a v-structure. The edges A → Z, E → Z, B → F and D → F are all part of v-structures and therefore cannot be changed, which also implies that the edges Z → B and Z → D cannot be changed without creating a new v-structure at Z.

(Figure: the graph over the nodes A, B, C, D, E, F, Z, with the edges A → C and C → E highlighted in bold.)

Suggested Solution for Problem 8.3

1. Only the first arrow can be reversed:

(Figure: the original DAG over the nodes A, B, C, D.)

There are 3 Markov equivalent graphs (plus the original one):

(Figure: the three Markov equivalent DAGs over the nodes A, B, C, D.)

2. Only the B → Z arrow can be reversed. Reversing Z → E, E → D or B → D would create or destroy a v-structure; reversing Z → D would create a cycle. There are 2 Markov equivalent networks in addition to the original one:

(Figure: the two Markov equivalent DAGs over the nodes B, Z, D, E.)

3. The only arrow that cannot be reversed is the A → E arrow, as reversing it would create a new v-structure. There are 9 Markov equivalent networks in addition to the original one:

(Figure: the nine Markov equivalent DAGs over the nodes A, Z, D, E.)

Suggested Solution for Problem 9.1

1. We see that A and C are not connected, but both are connected to B. This means that in the final DAG, there must be a path p1 from A to B and a path p2 from C to B, such that neither p1 nor p2 contains a collider. (For example, p1 could simply be a direct path from A to B or an indirect path via D.) If we combine p1 and p2, then we get a path from A to C, and B must be a collider on this path. The same is true if we consider any pair consisting of one of {A, D} and one of {C, E}. Therefore, we know that no arrow can point away from B. For example, the DAG may or may not contain an edge between A and B, but if it does contain one, then it must be oriented as A → B.



We can summarize this as follows:

(Figure: the pattern over the nodes A, B, C, D, E.)

2. Let us start with the variables A, B, and D. These variables form a triangle, so at least two of these three pairs of variables must be directly connected. We also know that A and D must be directly connected, because no arrow can point away from B. So, the edge A − B or the edge D − B could be absent, but not both. For each of these edges, we need to perform one conditional independence test to find out whether it is absent. For example, the edge A − B can only be absent if A ⊥ B | D. Hence, for this triangle, we need to perform two conditional independence tests in the worst case, which is the case where the edges A − B and D − B are both present. The same argument works for the second triangle, so we get at most 4 conditional independence tests in total.

3. There are 4 DAGs that can generate this pattern with 4 edges. One of them is:

(Figure: one such DAG over the nodes A, B, C, D, E.)

Suggested Solution for Problem 9.2

In fact, there is only a single DAG that could possibly have generated this pattern:

(Figure: the unique DAG over the nodes A, B, C, D, E, F.)

To see this, we first observe – using the same arguments as in the previous exercise – that no arrow can point away from the variables B, D and E. This implies that there cannot be any edges between these variables at all. This leaves us with only 6 possible edges, which point away from A, C, and F, respectively. Each of these edges must be present to explain the correlation between the two variables that it links; for example, we cannot remove the A → B edge, because then the variables A and B would have to be uncorrelated.

Thus, in this fortunate case, the (thresholded) correlation matrix already suffices to determine a unique faithful DAG for the data.

Suggested Solution for Problem 10.1

It is best to first compute the full joint distribution for this network. This allows us to answer the questions most effectively. To save some work, we can omit the rows for which X = 0, which we will never need.

X  Y  Z  P
1  0  0  .04
1  0  1  .125
1  1  0  .01
1  1  1  .125

1. From the table, we can readily compute p(X = 1, Y = 1) = .135 and p(X = 1) = 0.3. This gives p(Y = 1 | X = 1) = p(X = 1, Y = 1) / p(X = 1) = 0.45.

2. We are now working with a modified distribution in which the term p(x | z) is removed:

p(y,z | do(x)) = p(y | x,z)p(z) .

Again, it is useful to compute the table for p(y,z | do(x)), at least for X = 1.

X  Y  Z  P
1  0  0  .4
1  0  1  .25
1  1  0  .1
1  1  1  .25



From this table, we can quickly determine that p(Y = 1 | do(X = 1)) = 0.35. Thus, the causal effect of X on Y appears lower than the conditional expectation of Y given X. This is to be expected, given the variable Z, which has a positive influence on both X and Y.

3. Again, we can turn to our table to compute this expression, and we get .5 · (.01/(.04 + .01) + .125/(.125 + .125)) = .5 · (.2 + .5) = .35. Indeed, this matches the causal effect, as it should: {Z} is an adjustment set for the effect of X on Y. (The sketch below re-does these computations in Python.)
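For completeness, here is a small Python sketch (our own) of the computations in this exercise. The X = 1 rows of the joint and the value p(Z = 0) = p(Z = 1) = 0.5 are read off the tables above; everything else is illustrative.

# (y, z) -> p(X = 1, Y = y, Z = z), taken from the joint table above
joint_x1 = {(0, 0): 0.04, (0, 1): 0.125, (1, 0): 0.01, (1, 1): 0.125}
p_z = {0: 0.5, 1: 0.5}

p_x1 = sum(joint_x1.values())                                   # p(X = 1) = 0.3
p_y1_given_x1 = (joint_x1[1, 0] + joint_x1[1, 1]) / p_x1        # 0.45

def p_y1_given_x1_z(z):
    # p(Y = 1 | X = 1, Z = z), computed from the joint
    return joint_x1[1, z] / (joint_x1[0, z] + joint_x1[1, z])

# Back-door adjustment for {Z}, which equals p(Y = 1 | do(X = 1)):
p_y1_do_x1 = sum(p_z[z] * p_y1_given_x1_z(z) for z in (0, 1))   # 0.35

print(p_y1_given_x1, p_y1_do_x1)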

Suggested Solution for Problem 10.2

In such a case, adjusting for Z is not necessary. However, it can still be advantageous: if the effect of X on Y is estimated, e.g., by a regression of Y on X alone, then the influence of Z on Y ends up as "noise" in this regression. Including Z in the regression reduces this noise, and therefore increases the precision of the estimate.

Suggested Solution for Problem 10.3

Using the back-door criterion, we can find, for instance, the following 3 adjustment sets:

1. {A, B, C, D, E}
2. {A, B, C}
3. {B, C, D, E}

There is no reason to use set 1 instead of set 3. However, the choice between set 2 and set 3 is not clear: set 2 is smaller, but set 3 contains more direct causes of Y, and including such direct causes can improve precision (see Exercise 10.2).

Suggested Solution for Problem 10.4

If we take the set of parents of X, then this will always block each back-door path from X to Y. The only exception is when Y is itself a parent of X, but then we have a back-door path that cannot be blocked at all (X ← Y).

If we consider multiple X, then things become much more difficult. Take the following example:

(Figure: an example graph over the nodes X1, X2, J, Y, and U, in which J is a descendant of X1 and a parent of X2.)

Here, the set of parents of X = {X1, X2} is {J}. But adjusting for J is forbidden by the back-door criterion, because J is also a descendant of X1. (The situation that a variable is both a parent and a descendant of X can only happen when there are multiple X.)