The mind is a neural computer, fitted by natural selection
TRANSCRIPT
“The mind is a neural computer, fitted by natural selection with combinatorial algorithms for causal and probabilistic reasoning about plants, animals, objects, and people.”
. . . “In a universe with any regularities at all, decisions informed about the past are better than decisions made at random. That has always been true, and we would expect organisms, especially informavores such as humans, to have evolved acute intuitions about probability. The founders of probability, like the founders of logic, assumed they were just formalizing common sense.”
Steven Pinker, How the Mind Works, 1997, pp. 524, 343.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.1
Learning Objectives
At the end of the class you should be able to:
justify the use and semantics of probability
know how to compute marginals and apply Bayes’ theorem
build a belief network for a domain
predict the inferences for a belief network
explain the predictions of a causal model
Using Uncertain Knowledge
Agents don’t have complete knowledge about the world.
Agents need to make (informed) decisions given their uncertainty.
It isn’t enough to assume what the world is like. Example: wearing a seat belt.
An agent needs to reason about its uncertainty.
When an agent acts under uncertainty, it is gambling =⇒ probability.
Probability
Probability is an agent’s measure of belief in some proposition — subjective probability.
An agent’s belief depends on its prior belief and what it observes.
Example: an agent’s probability of a particular bird flying
  - Other agents may have different probabilities.
  - An agent’s belief in a bird’s flying ability is affected by what the agent knows about that bird.
Random Variables
A random variable starts with an upper-case letter.
The domain of a variable X, written domain(X), is the set of values X can take. (Sometimes called “range”, “frame”, or “possible values”.)
A tuple of random variables ⟨X1, . . . , Xn⟩ is a complex random variable with domain domain(X1) × · · · × domain(Xn). Often the tuple is written as X1, . . . , Xn.
Assignment X = x means variable X has value x.
A proposition is a Boolean formula made from assignments of values to variables, or inequalities (e.g., <, ≤, . . . ) between variables and values.
Possible World Semantics
A possible world specifies an assignment of one value to each random variable.
A random variable is a function from possible worlds into the domain of the random variable.
ω ⊨ X = x means variable X is assigned value x in world ω.
Logical connectives have their standard meaning:
ω ⊨ α ∧ β if ω ⊨ α and ω ⊨ β
ω ⊨ α ∨ β if ω ⊨ α or ω ⊨ β
ω ⊨ ¬α if ω ⊭ α
Let Ω be the set of all possible worlds.
Semantics of Probability
Probability defines a measure on sets of possible worlds. A probability measure is a function µ from sets of worlds into the non-negative real numbers such that:
µ(Ω) = 1
µ(S1 ∪ S2) = µ(S1) + µ(S2) if S1 ∩ S2 = ∅
Then P(α) = µ({ω | ω ⊨ α}).
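This semantics can be sketched directly in code. The worlds, variables, and measure below are hypothetical, chosen only to illustrate that P(α) is the measure of the set of worlds in which α holds.

```python
# Possible-world semantics of probability (hypothetical example).
# Each world assigns a value to every random variable.
worlds = [
    {"Shape": "circle", "Color": "orange"},
    {"Shape": "circle", "Color": "blue"},
    {"Shape": "star", "Color": "orange"},
    {"Shape": "star", "Color": "blue"},
]
# A measure: non-negative, sums to 1 over all worlds.
mu = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}

def P(alpha):
    """P(alpha) = mu({w | w satisfies alpha})."""
    return sum(mu[i] for i, w in enumerate(worlds) if alpha(w))

assert abs(sum(mu.values()) - 1.0) < 1e-9  # mu(Omega) = 1
print(round(P(lambda w: w["Shape"] == "star"), 10))    # 0.5
print(round(P(lambda w: w["Color"] == "orange"), 10))  # 0.7
```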
Semantics
Possible Worlds: [figure of the possible worlds not reproduced in this transcript]
Suppose the measure of each singleton world is 0.1.
What is the probability of circle?
What is the probability of star?
What is the probability of triangle?
What is the probability of orange?
What is the probability of orange and star?
What are the random variables?
Axioms of Probability (finite case)
Three axioms define what follows from a set of probabilities:
Axiom 1 0 ≤ P(a) for any proposition a.
Axiom 2 P(true) = 1
Axiom 3 P(a ∨ b) = P(a) + P(b) if a and b cannot both be true.
These axioms are sound and complete with respect to the semantics.
Conditioning
Probabilistic conditioning specifies how to revise beliefs based on new information.
An agent builds a probabilistic model taking all background information into account. This gives a prior probability.
All other information must be conditioned on.
If evidence e is all of the information obtained subsequently, the conditional probability P(h | e) of h given e is the posterior probability of h.
Semantics of Conditional Probability
Evidence e rules out possible worlds incompatible with e.
Evidence e induces a new measure, µe, over possible worlds:

µe(S) = c × µ(S)  if ω ⊨ e for all ω ∈ S
µe(S) = 0         if ω ⊭ e for all ω ∈ S

We can show c = 1/P(e).
The conditional probability of formula h given evidence e is

P(h | e) = µe({ω : ω ⊨ h}) = P(h ∧ e) / P(e)
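The induced measure µe can be sketched the same way as the unconditional one. The worlds and numbers below are hypothetical; condition() zeroes out worlds incompatible with the evidence and rescales the rest by c = 1/P(e).

```python
# Conditioning as a new measure over possible worlds (hypothetical numbers).
worlds = {
    ("circle", "orange"): 0.2,
    ("circle", "blue"):   0.3,
    ("star",   "orange"): 0.1,
    ("star",   "blue"):   0.4,
}

def P(alpha, mu):
    """Probability of alpha under measure mu."""
    return sum(m for w, m in mu.items() if alpha(w))

def condition(mu, e):
    """Return the measure mu_e induced by observing evidence e."""
    c = 1.0 / P(e, mu)  # renormalizing constant c = 1/P(e)
    return {w: (c * m if e(w) else 0.0) for w, m in mu.items()}

orange = lambda w: w[1] == "orange"
star = lambda w: w[0] == "star"

mu_e = condition(worlds, orange)      # observe Color = orange
print(round(P(star, mu_e), 10))       # P(star | orange) = 0.1/0.3 ≈ 0.3333333333
```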
Conditioning
Possible Worlds: [figure of the possible worlds not reproduced in this transcript]
Observe Color = orange:
Exercise
Flu    Sneeze  Snore   µ
true   true    true    0.064
true   true    false   0.096
true   false   true    0.016
true   false   false   0.024
false  true    true    0.096
false  true    false   0.144
false  false   true    0.224
false  false   false   0.336
What is:
(a) P(flu ∧ sneeze)
(b) P(flu ∧ ¬sneeze)
(c) P(flu)
(d) P(sneeze | flu)
(e) P(¬flu ∧ sneeze)
(f) P(flu | sneeze)
(g) P(sneeze | flu∧snore)
(h) P(flu | sneeze∧snore)
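Answers to exercises like this one can be checked mechanically. The dictionary below is exactly the joint distribution from the table; prob and cond are hypothetical helper names that just sum worlds and renormalize.

```python
# Check the Flu/Sneeze/Snore exercise against the joint distribution.
joint = {
    (True,  True,  True):  0.064,
    (True,  True,  False): 0.096,
    (True,  False, True):  0.016,
    (True,  False, False): 0.024,
    (False, True,  True):  0.096,
    (False, True,  False): 0.144,
    (False, False, True):  0.224,
    (False, False, False): 0.336,
}  # keys: (flu, sneeze, snore)

def prob(pred):
    """P(pred) over worlds (flu, sneeze, snore)."""
    return sum(p for w, p in joint.items() if pred(*w))

def cond(h, e):
    """P(h | e) = P(h and e) / P(e)."""
    return prob(lambda *w: h(*w) and e(*w)) / prob(e)

flu = lambda f, sz, sn: f
sneeze = lambda f, sz, sn: sz

print(round(prob(lambda f, sz, sn: f and sz), 6))  # (a) 0.16
print(round(prob(flu), 6))                         # (c) 0.2
print(round(cond(sneeze, flu), 6))                 # (d) 0.8
print(round(cond(flu, sneeze), 6))                 # (f) 0.4
```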
Chain Rule
P(f1 ∧ f2 ∧ . . . ∧ fn)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(f1 ∧ · · · ∧ fn−1)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × P(f1 ∧ · · · ∧ fn−2)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × · · · × P(f3 | f1 ∧ f2) × P(f2 | f1) × P(f1)
= ∏_{i=1}^{n} P(fi | f1 ∧ · · · ∧ fi−1)
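The derivation can be confirmed numerically on a small case: a hypothetical random joint over three binary propositions, where the conditional factors telescope back to the probability of the conjunction.

```python
# Numerical check of the chain rule for n = 3 (hypothetical joint).
import itertools
import random

random.seed(0)
# random joint over the 8 truth assignments of (f1, f2, f3), normalized
raw = {w: random.random() for w in itertools.product([True, False], repeat=3)}
z = sum(raw.values())
joint = {w: p / z for w, p in raw.items()}

def P(pred):
    return sum(p for w, p in joint.items() if pred(w))

def condP(h, e):
    """P(h | e) = P(h and e) / P(e)."""
    return P(lambda w: h(w) and e(w)) / P(e)

f1 = lambda w: w[0]
f12 = lambda w: w[0] and w[1]
f123 = lambda w: w[0] and w[1] and w[2]

lhs = P(f123)
rhs = condP(f123, f12) * condP(f12, f1) * P(f1)  # the factors telescope
assert abs(lhs - rhs) < 1e-12
print("chain rule holds:", round(lhs, 6) == round(rhs, 6))
```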
Bayes’ theorem
The chain rule and commutativity of conjunction (h ∧ e is equivalent to e ∧ h) give us:

P(h ∧ e) = P(h | e) × P(e)
         = P(e | h) × P(h).

If P(e) ≠ 0, divide the right-hand sides by P(e):

P(h | e) = P(e | h) × P(h) / P(e).

This is Bayes’ theorem.
Why is Bayes’ theorem interesting?
Often you have causal knowledge:
P(symptom | disease)
P(light is off | status of switches and switch positions)
P(alarm | fire)
P(image looks like [image] | a tree is in front of a car)
and want to do evidential reasoning:
P(disease | symptom)
P(status of switches | light is off and switch positions)
P(fire | alarm)
P(a tree is in front of a car | image looks like [image])
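A minimal sketch of this causal-to-evidential flip, with hypothetical numbers for P(symptom | disease), P(symptom | ¬disease), and the prior P(disease):

```python
# Turning causal knowledge into evidential reasoning with Bayes' theorem.
# All three numbers below are hypothetical, for illustration only.
p_symptom_given_disease = 0.9      # causal knowledge (assumed)
p_symptom_given_no_disease = 0.05  # assumed
p_disease = 0.01                   # prior (assumed)

# P(symptom) by total probability
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_no_disease * (1 - p_disease))

# Bayes' theorem: P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 4))  # ≈ 0.1538
```

Even with a strong causal link (0.9), the posterior is modest because the prior is small — a pattern the cab exercise below also exhibits.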
Exercise
A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness in the circumstances that existed on the night of the accident and concluded that the witness correctly identified each of the two colours 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue?
[From D. Kahneman, Thinking Fast and Slow, 2011, p. 166.]
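One way to check the exercise is to apply Bayes' theorem directly; every number below comes from the problem statement.

```python
# Bayes' theorem applied to the cab problem.
p_blue, p_green = 0.15, 0.85
p_says_blue_given_blue = 0.80    # witness correct
p_says_blue_given_green = 0.20   # witness wrong

# total probability of the witness saying "Blue"
p_says_blue = (p_says_blue_given_blue * p_blue
               + p_says_blue_given_green * p_green)

# posterior P(Blue | witness says Blue)
p_blue_given_says_blue = p_says_blue_given_blue * p_blue / p_says_blue
print(round(p_blue_given_says_blue, 4))  # ≈ 0.4138
```

Despite the 80%-reliable witness, the posterior stays below one half because the prior for Blue is only 0.15.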
Conditional independence
Random variable X is independent of random variable Y given random variable(s) Z if

P(X | Y, Z) = P(X | Z),

i.e., for all x ∈ dom(X), y, y′ ∈ dom(Y), and z ∈ dom(Z),

P(X = x | Y = y ∧ Z = z)
= P(X = x | Y = y′ ∧ Z = z)
= P(X = x | Z = z).

That is, knowledge of Y’s value doesn’t affect the belief in the value of X, given a value of Z.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.2
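The definition can be tested by brute force over a joint distribution. The joint below is the Flu/Sneeze/Snore table from the earlier exercise; cond_indep is a hypothetical helper that checks whether P(X = x | Y = y, Z = z) is the same for every y.

```python
# Brute-force test of conditional independence from the definition.
joint = {
    (True,  True,  True):  0.064, (True,  True,  False): 0.096,
    (True,  False, True):  0.016, (True,  False, False): 0.024,
    (False, True,  True):  0.096, (False, True,  False): 0.144,
    (False, False, True):  0.224, (False, False, False): 0.336,
}  # keys: (flu, sneeze, snore)

def P(pred):
    return sum(p for w, p in joint.items() if pred(w))

def cond_indep(x_idx, y_idx, z_idx):
    """Is variable x independent of y given z under this joint?"""
    for xv in (True, False):
        for zv in (True, False):
            vals = []
            for yv in (True, False):
                pe = P(lambda w: w[y_idx] == yv and w[z_idx] == zv)
                if pe > 0:
                    ph = P(lambda w: w[x_idx] == xv
                           and w[y_idx] == yv and w[z_idx] == zv)
                    vals.append(ph / pe)
            if vals and any(abs(v - vals[0]) > 1e-9 for v in vals):
                return False
    return True

print(cond_indep(1, 2, 0))  # Sneeze independent of Snore given Flu -> True
```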
Example
Consider a student writing an exam. What are reasonable independences among the following?
Whether the student works hard
Whether the student is intelligent
The student’s answers on the exam
The student’s mark on an exam
Example domain (diagnostic assistant)
[Figure: wiring diagram of the example house — outside power feeds circuit breakers cb1 and cb2; wires w0–w6 connect switches s1–s3 (including a two-way switch, with on/off positions) to lights l1 and l2 and power outlets p1 and p2.]
Examples of conditional independence?
Suppose you know whether there was power in w1 and whether there was power in w2; what information is relevant to whether light l1 is lit? What is independent?
Whether light l1 is lit is independent of the position of light switch s2 given what?
Every other variable may be independent of whether light l1 is lit given whether there is power in wire w0 and the status of light l1 (whether it is ok or, if not, how it is broken).
Idea of belief networks
l1 is lit (L1_lit) depends only on the status of the light (L1_st) and whether there is power in wire w0.
In a belief network, W0 and L1_st are parents of L1_lit.
[Figure: fragment of the belief network — w1, w2, s2_pos, and s2_st are parents of w0; w0 and l1_st are parents of l1_lit.]
W0 depends only on whether there is power in w1, whether there is power in w2, the position of switch s2 (S2_pos), and the status of switch s2 (S2_st).
Belief networks
Totally order the variables of interest: X1, . . . , Xn.
Theorem of probability theory (chain rule):
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | X1, . . . , Xi−1)
The parents parents(Xi) of Xi are those predecessors of Xi that render Xi independent of the other predecessors. That is, parents(Xi) ⊆ {X1, . . . , Xi−1} and
P(Xi | parents(Xi)) = P(Xi | X1, . . . , Xi−1)
So P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))
A belief network is a graph: the nodes are random variables; there is an arc from the parents of each node into that node.
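The factorization ∏ P(Xi | parents(Xi)) is easy to exercise on a toy network. The two-arc network and all of its numbers below are hypothetical (Fire → Alarm, Fire → Smoke):

```python
# Computing joint probabilities from a belief network's factors
# (hypothetical network and numbers).
from itertools import product

p_fire = {True: 0.01, False: 0.99}
p_alarm_given_fire = {True: 0.95, False: 0.05}  # P(alarm=True | fire)
p_smoke_given_fire = {True: 0.90, False: 0.02}  # P(smoke=True | fire)

def joint(fire, alarm, smoke):
    """P(Fire, Alarm, Smoke) = P(Fire) P(Alarm | Fire) P(Smoke | Fire)."""
    pa = p_alarm_given_fire[fire] if alarm else 1 - p_alarm_given_fire[fire]
    ps = p_smoke_given_fire[fire] if smoke else 1 - p_smoke_given_fire[fire]
    return p_fire[fire] * pa * ps

# the factorized joint still sums to 1 over all assignments
total = sum(joint(f, a, s) for f, a, s in product([True, False], repeat=3))
print(round(total, 10))  # 1.0
```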
Example: fire alarm belief network
Variables:
Fire: there is a fire in the building
Tampering: someone has been tampering with the fire alarm
Smoke: what appears to be smoke is coming from an upstairs window
Alarm: the fire alarm goes off
Leaving: people are leaving the building en masse.
Report: a colleague says that people are leaving the building en masse. (A noisy sensor for Leaving.)
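One natural structure for these variables, written as a parents map — the graph itself is an assumption read off the variable descriptions (tampering and fire can set off the alarm, fire causes smoke, the alarm causes leaving, leaving causes a report), not something this slide states:

```python
# A candidate parent structure for the fire-alarm domain (assumed).
parents = {
    "Tampering": [],
    "Fire": [],
    "Alarm": ["Tampering", "Fire"],
    "Smoke": ["Fire"],
    "Leaving": ["Alarm"],
    "Report": ["Leaving"],
}

def is_dag(parents):
    """Acyclicity check: repeatedly peel off nodes with no remaining parents."""
    remaining = dict(parents)
    while remaining:
        roots = [v for v, ps in remaining.items()
                 if all(p not in remaining for p in ps)]
        if not roots:
            return False  # every remaining node has a remaining parent: cycle
        for v in roots:
            del remaining[v]
    return True

print(is_dag(parents))  # True
```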
Components of a belief network
A belief network consists of:
a directed acyclic graph with nodes labeled with random variables
a domain for each random variable
a set of conditional probability tables for each variable given its parents (including prior probabilities for nodes with no parents)
Example belief network
[Figure: the full belief network for the wiring domain — nodes Outside_power, Cb1_st, Cb2_st, W0–W6, S1_pos, S2_pos, S3_pos, S1_st, S2_st, S3_st, P1, P2, L1_st, L1_lit, L2_st, L2_lit, with arcs following the circuit structure.]
Example belief network (continued)
The belief network also specifies:
The domain of the variables:
W0, . . . , W6 have domain {live, dead}
S1_pos, S2_pos, and S3_pos have domain {up, down}
S1_st has domain {ok, upside_down, short, intermittent, broken}
Conditional probabilities, including:
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = live)
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = dead)
P(S1_pos = up)
P(S1_st = upside_down)
Belief network summary
A belief network is a directed acyclic graph (DAG) where nodes are random variables.
The parents of a node n are those variables on which n directly depends.
A belief network is automatically acyclic, by construction.
A belief network is a graphical representation of dependence and independence:
  - A variable is independent of its non-descendants given its parents.
Constructing belief networks
To represent a domain in a belief network, you need to consider:
What are the relevant variables?
  - What will you observe?
  - What would you like to find out (query)?
  - What other features make the model simpler?
What values should these variables take?
What is the relationship between them? This should be expressed in terms of a directed graph, representing how each variable is generated from its predecessors.
How does the value of each variable depend on its parents? This is expressed in terms of the conditional probabilities.
Understanding Independence: Common ancestors
[Figure: fire with arcs to alarm and smoke — fire is a common ancestor.]
alarm and smoke are dependent
alarm and smoke are independent given fire
Intuitively, fire can explain alarm and smoke; learning one can affect the other by changing your belief in fire.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.3
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke are
independent
given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 2 1 / 20
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke are
independent
given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 3 1 / 20
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke areindependent given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 4 1 / 20
Understanding Independence: Chain

alarm → leaving → report

alarm and report are dependent.
alarm and report are independent given leaving.

Intuitively, the only way that the alarm affects report is by affecting leaving.
Understanding Independence: Common descendants

tampering and fire are both parents of alarm.

tampering and fire are independent.
tampering and fire are dependent given alarm.

Intuitively, tampering can explain away fire.
Understanding independence: example
[Figure: an example belief network with nodes A through S.]
Understanding independence: questions
1. On which given probabilities does P(N) depend?

2. If you were to observe a value for B, which variables' probabilities will change?

3. If you were to observe a value for N, which variables' probabilities will change?

4. Suppose you had observed a value for M; if you were to then observe a value for N, which variables' probabilities will change?

5. Suppose you had observed B and Q; which variables' probabilities will change when you observe N?
What variables are affected by observing?
If you observe variable(s) Y, the variables whose posterior probability is different from their prior are:
- the ancestors of Y and
- their descendants.

Intuitively (if you have a causal belief network):
- you do abduction to possible causes and
- prediction from the causes.
d-separation
A connection is a meeting of arcs in a belief network. Whether a connection is open, given a set of variables Z, is defined as follows:
- If there are arcs A → B and B → C such that B ∉ Z, then the connection at B between A and C is open.
- If there are arcs B → A and B → C such that B ∉ Z, then the connection at B between A and C is open.
- If there are arcs A → B and C → B such that B (or a descendant of B) is in Z, then the connection at B between A and C is open.

X is d-connected to Y given Z if there is a path from X to Y along open connections.

X is d-separated from Y given Z if it is not d-connected to Y given Z.

X is independent of Y given Z for all conditional probabilities iff X is d-separated from Y given Z.
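The three open-connection rules translate directly into a small path search. Below is a minimal Python sketch (the dict encoding and function names are illustrative, not from the text): it enumerates simple undirected paths, checking each intermediate node against the chain, common-ancestor, and common-descendant cases, on the fire/alarm network from the earlier slides.

```python
def descendants(dag, node):
    """All nodes reachable from `node` along directed arcs."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for c in dag.get(n, []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(dag, x, y, z):
    """True iff x is d-separated from y given the set z.
    dag maps each node to the list of its children."""
    nodes = set(dag) | {c for cs in dag.values() for c in cs}
    nbrs = {n: set() for n in nodes}
    for p, cs in dag.items():
        for c in cs:
            nbrs[p].add(c); nbrs[c].add(p)

    def open_at(a, b, c):
        into_b = (b in dag.get(a, []), b in dag.get(c, []))
        if all(into_b):                      # a -> b <- c: common descendant
            return b in z or bool(descendants(dag, b) & z)
        return b not in z                    # chain or common ancestor

    def dfs(path):                           # simple-path DFS (small nets only)
        last = path[-1]
        if last == y:
            return True
        for n in nbrs[last] - set(path):
            if len(path) == 1 or open_at(path[-2], last, n):
                if dfs(path + [n]):
                    return True
        return False

    return not dfs([x])

# Fire/alarm network from the slides:
dag = {"tampering": ["alarm"], "fire": ["alarm", "smoke"],
       "alarm": ["leaving"], "leaving": ["report"]}
print(d_separated(dag, "alarm", "smoke", {"fire"}))      # True
print(d_separated(dag, "tampering", "fire", set()))      # True
print(d_separated(dag, "tampering", "fire", {"alarm"}))  # False
```

The three calls reproduce the three earlier slides: a common ancestor blocks when observed, and a common descendant opens when observed.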
Markov Random Field
A Markov random field is composed of
- a set of discrete-valued random variables X = {X1, X2, ..., Xn}, and
- a set of factors {f1, ..., fm}, where a factor is a non-negative function of a subset of the variables,
and defines a joint probability distribution:

P(X = x) = (1/Z) ∏_k f_k(X_k = x_k)

Z = Σ_x ∏_k f_k(X_k = x_k)

where f_k(X_k) is a factor on X_k ⊆ X, and x_k is x projected onto X_k.
Z is a normalization constant known as the partition function.
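As a concrete illustration of the definition, here is a brute-force sketch that computes the partition function Z by enumerating all assignments. The two-variable model and all factor values are made up for illustration.

```python
from itertools import product

variables = ["X1", "X2"]
# Each factor: (scope, table mapping an assignment tuple -> non-negative value).
factors = [
    (("X1",), {(0,): 1.0, (1,): 2.0}),
    (("X1", "X2"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
]

def unnorm(assign):
    """Product of all factors on a full assignment (dict var -> value)."""
    total = 1.0
    for scope, table in factors:
        total *= table[tuple(assign[v] for v in scope)]
    return total

# Partition function: sum the factor product over every full assignment.
Z = sum(unnorm(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

def prob(assign):
    return unnorm(assign) / Z

print(Z)                           # 1*(3+1) + 2*(1+3) = 12
print(prob({"X1": 1, "X2": 1}))    # 2*3/12 = 0.5
```

Brute-force enumeration is exponential in the number of variables; it is only a check of the semantics, not an inference method.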
Markov Networks and Factor graphs
A Markov network is a graphical representation of a Markov random field where the nodes are the random variables and there is an arc between any two variables that appear in a factor together.

A factor graph is a bipartite graph, which contains a variable node for each random variable and a factor node for each factor. There is an edge between a variable node and a factor node if the variable appears in the factor.

A belief network is a type of Markov random field where the factors represent conditional probabilities, there is a factor for each variable, and the directed graph is acyclic.
Independence in a Markov Network
The Markov blanket of a variable X is the set of variables that are in factors with X.

A variable is independent of the other variables given its Markov blanket.

X is connected to Y given Z if there is a path from X to Y in the Markov network that does not contain an element of Z.

X is separated from Y given Z if it is not connected.

A positive distribution is one that does not contain zero probabilities.

X is independent of Y given Z for all positive distributions iff X is separated from Y given Z.
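Separation in a Markov network is plain graph reachability: remove Z and look for a path. A minimal sketch (the chain-shaped network and names are hypothetical):

```python
def separated(nbrs, x, y, z):
    """True iff x is separated from y given z.
    nbrs maps each node to its neighbours (an undirected graph)."""
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        for m in nbrs[n]:
            if m == y:
                return False          # found a path avoiding z
            if m not in seen and m not in z:
                seen.add(m)
                stack.append(m)
    return True

# Hypothetical chain-shaped Markov network a - b - c:
nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(separated(nbrs, "a", "c", {"b"}))  # True: b blocks the only path
print(separated(nbrs, "a", "c", set()))  # False
```

Note how much simpler this is than d-separation: there are no colliders, so observing a variable can only block paths, never open them.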
Canonical Representations
The parameters of a graphical model are the numbers that define the model.

A belief network is a canonical representation: given the structure and the distribution, the parameters are uniquely determined.

A Markov random field is not a canonical representation: many different parameterizations result in the same distribution.
Representations of Conditional Probabilities
There are many representations of conditional probabilities and factors:
Tables
Decision Trees
Rules
Weighted Logical Formulae
Contextual Tables
Logistic Functions
Tabular Representation
P(D | A, B, C):

A     B     C     D     Prob
true  true  true  true  0.9
true  true  true  false 0.1
true  true  false true  0.9
true  true  false false 0.1
true  false true  true  0.2
true  false true  false 0.8
true  false false true  0.2
true  false false false 0.8
false true  true  true  0.3
false true  true  false 0.7
false true  false true  0.4
false true  false false 0.6
false false true  true  0.3
false false true  false 0.7
false false false true  0.4
false false false false 0.6
Decision Tree Representation
[Figure: a decision-tree representation of P(D | A, B, C), splitting on A, then on B or C.]
Rule Representation
0.9 : d ← a ∧ b
0.2 : d ← a ∧ ¬b
0.3 : d ← ¬a ∧ c
0.4 : d ← ¬a ∧ ¬c
Weighted Logical Formulae
d ↔ ((a ∧ b ∧ n0)
   ∨ (a ∧ ¬b ∧ n1)
   ∨ (¬a ∧ c ∧ n2)
   ∨ (¬a ∧ ¬c ∧ n3))

The ni are independent:

P(n0) = 0.9
P(n1) = 0.2
P(n2) = 0.3
P(n3) = 0.4
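Since the ni are independent, the conditional probabilities implied by the weighted formula can be recovered by enumerating the noise variables. A small sketch (variable names are illustrative) checking that under a ∧ b the formula reduces to n0, so P(d | a, b) should be P(n0) = 0.9:

```python
from itertools import product

p_n = [0.9, 0.2, 0.3, 0.4]   # P(n0), P(n1), P(n2), P(n3)

def p_d(a, b, c):
    """P(d) for fixed truth values of a, b, c, summing over the n_i."""
    total = 0.0
    for n in product([False, True], repeat=4):
        w = 1.0                      # probability of this noise assignment
        for ni, pi in zip(n, p_n):
            w *= pi if ni else 1 - pi
        d = ((a and b and n[0]) or (a and not b and n[1])
             or (not a and c and n[2]) or (not a and not c and n[3]))
        if d:
            total += w
    return total

print(p_d(True, True, True))     # ≈ 0.9  (a ∧ b branch: n0)
print(p_d(False, True, False))   # ≈ 0.4  (¬a ∧ ¬c branch: n3)
```

Each conditioning context selects exactly one disjunct, so each probability on the rule slide falls out directly.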
Contextual-Table Representation
[Figure: a contextual (context-specific) table representation of P(D | A, B, C).]
Logistic Functions
P(h | e) = P(h ∧ e) / P(e)
         = P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
         = 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
         = 1 / (1 + e^(−log(P(h ∧ e)/P(¬h ∧ e))))
         = sigmoid(log odds(h | e))

where

sigmoid(x) = 1 / (1 + e^(−x))
odds(h | e) = P(h ∧ e) / P(¬h ∧ e)
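A quick numeric check of the identity above (a conditional probability is the sigmoid of its log-odds), with made-up joint values:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

p_he, p_nhe = 0.12, 0.28            # hypothetical P(h ∧ e), P(¬h ∧ e)
posterior = p_he / (p_he + p_nhe)   # P(h | e) = 0.3
log_odds = math.log(p_he / p_nhe)
print(abs(posterior - sigmoid(log_odds)) < 1e-12)  # True
```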
Logistic Functions
A conditional probability is the sigmoid of the log-odds.
[Figure: plot of sigmoid(x) for −10 ≤ x ≤ 10, rising from 0 to 1.]

sigmoid(x) = 1 / (1 + e^(−x))

A logistic function is the sigmoid of a linear function.
Logistic Representation of Conditional Probability
P(d | A, B, C) = sigmoid(0.9† ∗ A ∗ B
                       + 0.2† ∗ A ∗ (1 − B)
                       + 0.3† ∗ (1 − A) ∗ C
                       + 0.4† ∗ (1 − A) ∗ (1 − C))

where 0.9† is sigmoid⁻¹(0.9).

P(d | A, B, C) = sigmoid(0.4†
                       + (0.2† − 0.4†) ∗ A
                       + (0.9† − 0.2†) ∗ A ∗ B
                       + ...
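Because exactly one of the four indicator products is 1 for any boolean A, B, C, the first form reproduces the earlier table for P(D | A, B, C) exactly. A sketch verifying this (sigmoid⁻¹ is written here as logit; the function names are illustrative):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):                    # sigmoid⁻¹; written 0.9† etc. on the slide
    return math.log(p / (1 - p))

def p_d(A, B, C):
    return sigmoid(logit(0.9) * A * B
                   + logit(0.2) * A * (1 - B)
                   + logit(0.3) * (1 - A) * C
                   + logit(0.4) * (1 - A) * (1 - C))

# Spot-check against the tabular representation:
for (A, B, C), expected in [((1, 1, 1), 0.9), ((1, 1, 0), 0.9),
                            ((1, 0, 1), 0.2), ((0, 1, 1), 0.3),
                            ((0, 0, 0), 0.4)]:
    assert abs(p_d(A, B, C) - expected) < 1e-9
print("matches the table")
```

For binary inputs the argument collapses to a single sigmoid⁻¹(p) term, and sigmoid(sigmoid⁻¹(p)) = p.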
Belief network inference
Main approaches to determine posterior distributions in graphical models:

Variable elimination, recursive conditioning: exploit the structure of the network to eliminate (sum out) the non-observed, non-query variables one at a time.

Stochastic simulation: random cases are generated according to the probability distributions.

Variational methods: find the closest tractable distribution to the (posterior) distribution we are interested in.

Bounding approaches: bound the conditional probabilities above and below, and iteratively reduce the bounds.

...
Factors
A factor is a representation of a function from a tuple of random variables into a number.

We write factor f on variables X1, ..., Xj as f(X1, ..., Xj).

We can assign some or all of the variables of a factor:
- f(X1 = v1, X2, ..., Xj), where v1 ∈ dom(X1), is a factor on X2, ..., Xj.
- f(X1 = v1, X2 = v2, ..., Xj = vj) is a number that is the value of f when each Xi has value vi.

The former is also written as f(X1, X2, ..., Xj)_{X1 = v1}, etc.
Example factors
r(X, Y, Z):

X Y Z  val
t t t  0.1
t t f  0.9
t f t  0.2
t f f  0.8
f t t  0.4
f t f  0.6
f f t  0.3
f f f  0.7

r(X=t, Y, Z):

Y Z  val
t t  0.1
t f  0.9
f t  0.2
f f  0.8

r(X=t, Y, Z=f):

Y  val
t  0.9
f  0.8

r(X=t, Y=f, Z=f) = 0.8
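One simple computational representation of a factor is a table mapping value tuples to numbers. The dict-based encoding below is an illustrative choice, not from the text; it reproduces the three assignments to r(X, Y, Z) above.

```python
# The factor r(X, Y, Z), keyed by (x, y, z) value tuples:
r = {("t","t","t"): 0.1, ("t","t","f"): 0.9, ("t","f","t"): 0.2,
     ("t","f","f"): 0.8, ("f","t","t"): 0.4, ("f","t","f"): 0.6,
     ("f","f","t"): 0.3, ("f","f","f"): 0.7}

def assign(factor, scope, **values):
    """Fix some variables of the factor; return (new_scope, new_table)."""
    keep = [i for i, v in enumerate(scope) if v not in values]
    table = {tuple(k[i] for i in keep): val
             for k, val in factor.items()
             if all(k[scope.index(v)] == x for v, x in values.items())}
    return [scope[i] for i in keep], table

scope = ["X", "Y", "Z"]
print(assign(r, scope, X="t")[1])                # the r(X=t, Y, Z) table
print(assign(r, scope, X="t", Z="f")[1])         # {('t',): 0.9, ('f',): 0.8}
print(assign(r, scope, X="t", Y="f", Z="f")[1])  # {(): 0.8}
```

Assigning all variables leaves a factor on no variables, i.e. a single number, matching r(X=t, Y=f, Z=f) = 0.8.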
Multiplying factors
The product of factor f1(X, Y) and f2(Y, Z), where Y are the variables in common, is the factor (f1 ∗ f2)(X, Y, Z) defined by:

(f1 ∗ f2)(X, Y, Z) = f1(X, Y) f2(Y, Z).
Multiplying factors example
f1:

A B  val
t t  0.1
t f  0.9
f t  0.2
f f  0.8

f2:

B C  val
t t  0.3
t f  0.7
f t  0.6
f f  0.4

f1 ∗ f2:

A B C  val
t t t  0.03
t t f  0.07
t f t  0.54
t f f  0.36
f t t  0.06
f t f  0.14
f f t  0.48
f f f  0.32
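The same table encoding supports multiplication directly: build the union of the two scopes, and for each joint assignment multiply the matching entries. A sketch (illustrative encoding, not from the text) that reconstructs the f1 ∗ f2 table above:

```python
from itertools import product

f1 = {("t","t"): 0.1, ("t","f"): 0.9, ("f","t"): 0.2, ("f","f"): 0.8}  # A, B
f2 = {("t","t"): 0.3, ("t","f"): 0.7, ("f","t"): 0.6, ("f","f"): 0.4}  # B, C

def multiply(fa, sa, fb, sb):
    """Pointwise product of two table factors; scopes are variable-name lists."""
    scope = sa + [v for v in sb if v not in sa]
    table = {}
    for vals in product("tf", repeat=len(scope)):
        a = dict(zip(scope, vals))           # the joint assignment
        table[vals] = (fa[tuple(a[v] for v in sa)]
                       * fb[tuple(a[v] for v in sb)])
    return scope, table

scope, f3 = multiply(f1, ["A", "B"], f2, ["B", "C"])
print(scope)                      # ['A', 'B', 'C']
print(round(f3[("t","f","t")], 2))  # 0.9 * 0.6 = 0.54
```

Note the result's size is exponential in the size of the combined scope, which is why the order of operations matters later for variable elimination.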
Summing out variables
We can sum out a variable, say X1 with range {v1, ..., vk}, from factor f(X1, ..., Xj), resulting in a factor on X2, ..., Xj defined by:

(Σ_{X1} f)(X2, ..., Xj) = f(X1 = v1, ..., Xj) + ··· + f(X1 = vk, ..., Xj)
Summing out a variable example
f3:

A B C  val
t t t  0.03
t t f  0.07
t f t  0.54
t f f  0.36
f t t  0.06
f t f  0.14
f f t  0.48
f f f  0.32

Σ_B f3:

A C  val
t t  0.57
t f  0.43
f t  0.54
f f  0.46
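Summing out is a one-pass aggregation over the table: drop the variable's position from each key and add up the values that collide. A sketch (illustrative encoding, not from the text) reproducing Σ_B f3:

```python
f3 = {("t","t","t"): 0.03, ("t","t","f"): 0.07, ("t","f","t"): 0.54,
      ("t","f","f"): 0.36, ("f","t","t"): 0.06, ("f","t","f"): 0.14,
      ("f","f","t"): 0.48, ("f","f","f"): 0.32}          # scope A, B, C

def sum_out(factor, scope, var):
    """Sum `var` out of a table factor; returns (new_scope, new_table)."""
    i = scope.index(var)
    out = {}
    for key, val in factor.items():
        rest = key[:i] + key[i+1:]           # key with var's position dropped
        out[rest] = out.get(rest, 0.0) + val
    return [v for v in scope if v != var], out

scope, g = sum_out(f3, ["A", "B", "C"], "B")
print(scope)                     # ['A', 'C']
print(round(g[("t","t")], 2))    # 0.03 + 0.54 = 0.57
print(round(g[("f","f")], 2))    # 0.14 + 0.32 = 0.46
```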
Exercise
Given factors:

s:

A  val
t  0.75
f  0.25

t:

A B  val
t t  0.6
t f  0.4
f t  0.2
f f  0.8

o:

A  val
t  0.3
f  0.1

What are the following factors over?

(a) s ∗ t
(b) Σ_A s ∗ t
(c) Σ_B s ∗ t
(d) Σ_A Σ_B s ∗ t
(e) s ∗ t ∗ o
(f) Σ_B s ∗ t ∗ o
Evidence
If we want to compute the posterior probability of Z given evidence Y1 = v1 ∧ ... ∧ Yj = vj:

P(Z | Y1 = v1, ..., Yj = vj)
  = P(Z, Y1 = v1, ..., Yj = vj) / P(Y1 = v1, ..., Yj = vj)
  = P(Z, Y1 = v1, ..., Yj = vj) / Σ_Z P(Z, Y1 = v1, ..., Yj = vj).

So the computation reduces to computing P(Z, Y1 = v1, ..., Yj = vj).

Normalize at the end, by summing out Z and dividing.
Probability of a conjunction
Suppose the variables of the belief network are X1, ..., Xn.
To compute P(Z, Y1 = v1, ..., Yj = vj), we sum out the other variables, {Z1, ..., Zk} = {X1, ..., Xn} − {Z} − {Y1, ..., Yj}.
We order the Zi into an elimination ordering.

P(Z, Y1 = v1, ..., Yj = vj)
  = Σ_{Zk} ··· Σ_{Z1} P(X1, ..., Xn)_{Y1 = v1, ..., Yj = vj}
  = Σ_{Zk} ··· Σ_{Z1} ∏_{i=1}^{n} P(Xi | parents(Xi))_{Y1 = v1, ..., Yj = vj}.
Computing sums of products
Computation in belief networks reduces to computing the sums ofproducts.
How can we compute ab + ac efficiently?
Distribute out the a giving a(b + c)
How can we compute Σ_{Z1} Π_{i=1}^{n} P(Xi | parents(Xi)) efficiently?
Distribute out those factors that don’t involve Z1.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 38 11 / 17
Variable elimination algorithm
To compute P(Z |Y1 = v1 ∧ . . . ∧ Yj = vj):
Construct a factor for each conditional probability.
Set the observed variables to their observed values.
Sum out each of the other variables (the Z1, . . . ,Zk)according to some elimination ordering.
Multiply the remaining factors. Normalize by dividing the resulting factor f(Z) by Σ_Z f(Z).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 42 12 / 17
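The steps above can be sketched in Python. This is a minimal illustration with a toy factor representation and Boolean variables only, not the book's AIPython implementation; the factor encoding (a pair of a variable tuple and a table keyed by value tuples) is an assumption made for brevity.

```python
from itertools import product

# A factor is (variables, table): table maps tuples of values to numbers.
# All variables here are Boolean for brevity.

def restrict(factor, var, val):
    """Set an observed variable to its observed value."""
    vs, table = factor
    if var not in vs:
        return factor
    i = vs.index(var)
    return (vs[:i] + vs[i+1:],
            {k[:i] + k[i+1:]: p for k, p in table.items() if k[i] == val})

def multiply(f1, f2):
    """Pointwise product of two factors."""
    (vs1, t1), (vs2, t2) = f1, f2
    vs = vs1 + tuple(v for v in vs2 if v not in vs1)
    table = {}
    for vals in product([True, False], repeat=len(vs)):
        asg = dict(zip(vs, vals))
        table[vals] = (t1[tuple(asg[v] for v in vs1)]
                       * t2[tuple(asg[v] for v in vs2)])
    return (vs, table)

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    vs, table = factor
    i = vs.index(var)
    out = {}
    for k, p in table.items():
        nk = k[:i] + k[i+1:]
        out[nk] = out.get(nk, 0.0) + p
    return (vs[:i] + vs[i+1:], out)

def normalize(factor):
    """Divide the final factor f(Z) by the sum over Z."""
    vs, table = factor
    z = sum(table.values())
    return (vs, {k: p / z for k, p in table.items()})

# Tiny network A -> B with made-up numbers: query P(A | B = true).
pA = (("A",), {(True,): 0.4, (False,): 0.6})
pBgivenA = (("A", "B"), {(True, True): 0.9, (True, False): 0.1,
                         (False, True): 0.2, (False, False): 0.8})
posterior = normalize(multiply(pA, restrict(pBgivenA, "B", True)))
# posterior is P(A | B = true) = 0.36 / 0.48 = 0.75 for A = true
```

Here `restrict` sets the observed variables, `sum_out` eliminates a variable, and `normalize` performs the final division by Σ_Z f(Z).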
Summing out a variable
To sum out a variable Zj from a product f1, . . . , fk of factors:
Partition the factors into:
  - those that don't contain Zj, say f1, . . . , fi,
  - those that contain Zj, say fi+1, . . . , fk.

We know:

  Σ_{Zj} f1 ∗ · · · ∗ fk = f1 ∗ · · · ∗ fi ∗ (Σ_{Zj} fi+1 ∗ · · · ∗ fk).

Explicitly construct a representation of the rightmost factor. Replace the factors fi+1, . . . , fk by the new factor.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 43 13 / 17
Example
[Belief network diagram: variables A, B, C, D, E, F, G]

P(E | g) = P(E ∧ g) / Σ_E P(E ∧ g)

P(E ∧ g)
  = Σ_F Σ_B Σ_C Σ_A Σ_D P(A) P(B | AC) P(C) P(D | C) P(E | B) P(F | E) P(g | ED)
  = (Σ_F P(F | E)) Σ_B P(E | B) Σ_C (P(C) (Σ_A P(A) P(B | AC)) (Σ_D P(D | C) P(g | ED)))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 44 14 / 17
Variable elimination example
[Belief network diagram: variables A, B, C, D, E, F, G, H, I]

P(A) P(B|A)              --elim A-->  f1(B)
P(C) P(D|BC) P(E|C)      --elim C-->  f2(BDE)
P(F|D) P(G|FE)
P(H|G)                   --obs H-->   f3(G)
P(I|G)                   --elim I-->  f4(G)

P(D, h) = . . . (Σ_A P(A) P(B|A)) (Σ_I P(I|G))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 53 15 / 17
Variable Elimination example
[Belief network diagram: chain A → B → C → D → E → F, with G a child of C and H a child of E]

Query: P(G | f); elimination ordering: A, H, E, D, B, C

P(G | f) ∝ Σ_C Σ_B Σ_D Σ_E Σ_H Σ_A P(A) P(B|A) P(C|B) P(D|C) P(E|D) P(f|E) P(G|C) P(H|E)

  = Σ_C (Σ_B (Σ_A P(A) P(B|A)) P(C|B)) P(G|C) (Σ_D P(D|C) (Σ_E P(E|D) P(f|E) Σ_H P(H|E)))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 61 16 / 17
Pruning Irrelevant Variables (Belief networks)
Suppose you want to compute P(X | e1 . . . ek):
Prune any variables that have no observed or queried descendants.

Connect the parents of any observed variable.

Remove arc directions.

Remove observed variables.

Remove any variables not connected to X in the resulting (undirected) graph.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 64 17 / 17
Markov chain
A Markov chain is a special sort of belief network:
S0 S1 S2 S3 S4
What probabilities need to be specified?
P(S0) specifies initial conditions
P(Si+1|Si) specifies the dynamics
What independence assumptions are made?
P(Si+1|S0, . . . , Si) = P(Si+1|Si).
Often St represents the state at time t. Intuitively, St conveys all of the information about the history that can affect the future states.
“The future is independent of the past given the present.”
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 1 1 / 23
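As a concrete illustration, a Markov chain can be simulated by sampling S0 from P(S0) and then repeatedly sampling Si+1 from P(Si+1 | Si). The two-state chain below ("sunny"/"rainy") and all its numbers are made up for illustration, not from the slides.

```python
import random

# Hypothetical two-state chain: P(S0) and the dynamics P(S_{i+1} | S_i).
P0 = {"sunny": 0.5, "rainy": 0.5}
T = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sample(dist, rng):
    """Draw a value from a discrete distribution."""
    y, acc = rng.random(), 0.0
    for v, p in dist.items():
        acc += p
        if y <= acc:
            return v
    return v

def simulate(n, rng):
    """Sample S0, ..., S_{n-1}; only the current state matters."""
    s = sample(P0, rng)
    states = [s]
    for _ in range(n - 1):
        s = sample(T[s], rng)
        states.append(s)
    return states

states = simulate(1000, random.Random(0))
```

Because only the current state is consulted in the loop, the simulation makes exactly the independence assumption above: the future is independent of the past given the present.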
Stationary Markov chain
A Markov chain is stationary if, for all i > 0 and i′ > 0, P(Si+1 | Si) = P(Si′+1 | Si′).

We specify P(S0) and P(Si+1 | Si).
  - Simple model, easy to specify
  - Often the natural model
  - The network can extend indefinitely
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 7 2 / 23
Stationary Distribution
A distribution P over states is a stationary distribution if, for each state s, P(Si+1 = s) = P(Si = s).

Every Markov chain has a stationary distribution.

A Markov chain is ergodic if, for any two states s1 and s2, there is a non-zero probability of eventually reaching s2 from s1.

A Markov chain is periodic if there is a strict temporal regularity in visiting states: some state is visited only at times t with t mod n = m, for some fixed n and m.

An ergodic and aperiodic Markov chain has a unique stationary distribution P, and P(s) = lim_{i→∞} Pi(s), the equilibrium distribution.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 8 3 / 23
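The equilibrium distribution of a small ergodic, aperiodic chain can be found by iterating P_{i+1}(s′) = Σ_s P_i(s) P(s′ | s) until it stops changing. The two-state transition numbers below are made up for illustration.

```python
# Hypothetical two-state chain; iterate the distribution update to a fixed point.
T = {"a": {"a": 0.9, "b": 0.1},
     "b": {"a": 0.5, "b": 0.5}}

def equilibrium(T, iters=200):
    """Power iteration: start anywhere, apply the dynamics repeatedly."""
    states = list(T)
    p = {s: 1.0 / len(states) for s in states}   # any starting distribution
    for _ in range(iters):
        p = {s2: sum(p[s1] * T[s1][s2] for s1 in states) for s2 in states}
    return p

p = equilibrium(T)
# At the fixed point, flow a -> b (P(a) * 0.1) balances flow b -> a
# (P(b) * 0.5), giving P(a) = 5/6.
```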
Pagerank
Consider the Markov chain:
Domain of Si is the set of all web pages
P(S0) is uniform: P(S0 = pj) = 1/N

P(Si+1 = pj | Si = pk) = (1 − d)/N + d ∗ { 1/nk  if pk links to pj
                                           1/N   if pk has no links
                                           0     otherwise }

where there are N web pages and nk links from page pk

d ≈ 0.85 is the probability someone keeps surfing the web

This Markov chain converges to a distribution over web pages (the original computation used i = 52 iterations over 322 million links): Pagerank, the basis for Google's initial search engine.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 13 4 / 23
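The transition probabilities above can be transcribed directly into a small power-iteration computation. The three-page link graph below is a made-up toy example.

```python
def pagerank(links, d=0.85, iters=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    N = len(pages)
    p = {pg: 1.0 / N for pg in pages}          # uniform P(S0)
    for _ in range(iters):
        new = {pg: (1 - d) / N for pg in pages}
        for pk in pages:
            if links[pk]:                      # follow one of the n_k links
                for pj in links[pk]:
                    new[pj] += d * p[pk] / len(links[pk])
            else:                              # no links: jump uniformly
                for pj in pages:
                    new[pj] += d * p[pk] / N
        p = new
    return p

# Toy web: a 3-page cycle; by symmetry each page gets rank 1/3.
ranks = pagerank({"p1": ["p2"], "p2": ["p3"], "p3": ["p1"]})
```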
Hidden Markov Model
A Hidden Markov Model (HMM) is a belief network:
S0 S1 S2 S3 S4
O0 O1 O2 O3 O4
The probabilities that need to be specified:
P(S0) specifies initial conditions
P(Si+1|Si) specifies the dynamics
P(Oi |Si) specifies the sensor model
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 15 5 / 23
Filtering
Filtering: P(Si | o1, . . . , oi)

What is the current belief state based on the observation history?

P(Si | o1, . . . , oi) ∝ P(oi | Si, o1, . . . , oi−1) P(Si | o1, . . . , oi−1)
  = ??? Σ_{Si−1} P(Si, Si−1 | o1, . . . , oi−1)
  = ???

Observe O0, query S0; then observe O1, query S1; then observe O2, query S2; . . .
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 17 6 / 23
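Simplifying with the HMM's independence assumptions gives the standard recursive update, P(Si | o1..oi) ∝ P(oi | Si) Σ_{Si−1} P(Si | Si−1) P(Si−1 | o1..oi−1). A sketch with made-up two-state numbers:

```python
def filter_step(prev, trans, sensor, obs):
    """One filtering step.
    prev: P(S_{i-1} | o_1..o_{i-1});  trans[s][s2]: P(s2 | s);
    sensor[s][o]: P(o | s)."""
    states = list(prev)
    # Predict: P(S_i | o_1..o_{i-1}), summing out S_{i-1}.
    pred = {s2: sum(prev[s1] * trans[s1][s2] for s1 in states)
            for s2 in states}
    # Update: multiply by P(o_i | S_i) and normalize.
    unnorm = {s: sensor[s][obs] * pred[s] for s in states}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# Made-up two-state example.
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.3, "b": 0.7}}
sensor = {"a": {"ping": 0.9, "quiet": 0.1},
          "b": {"ping": 0.2, "quiet": 0.8}}
belief = filter_step({"a": 0.5, "b": 0.5}, trans, sensor, "ping")
# The prediction stays uniform here, so belief["a"] = 0.45 / 0.55.
```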
Example: localization
Suppose a robot wants to determine its location based on its actions and its sensor readings: Localization
This can be represented by the augmented HMM:
Loc0 Loc1 Loc2 Loc3 Loc4
Obs0 Obs1 Obs2 Obs3 Obs4
Act0 Act1 Act2 Act3
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 22 7 / 23
Example localization domain
Circular corridor, with 16 locations:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Doors at positions: 2, 4, 7, 11.
Noisy Sensors
Stochastic Dynamics
Robot starts at an unknown location and must determine where it is.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 23 8 / 23
Example Sensor Model
P(Observe Door | At Door) = 0.8
P(Observe Door | Not At Door) = 0.1
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 24 9 / 23
Example Dynamics Model
P(loct+1 = L|actiont = goRight ∧ loct = L) = 0.1
P(loct+1 = L + 1|actiont = goRight ∧ loct = L) = 0.8
P(loct+1 = L + 2|actiont = goRight ∧ loct = L) = 0.074
P(loct+1 = L′ | actiont = goRight ∧ loct = L) = 0.002 for any other location L′.

  - All location arithmetic is modulo 16.
  - The action goLeft works the same but to the left.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 25 10 / 23
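Putting the sensor and dynamics models together, one filtering step for the 16-location corridor can be sketched as follows. The numbers come from the two slides above; starting from a uniform belief is an assumption, and this is illustrative code rather than the book's implementation.

```python
DOORS = {2, 4, 7, 11}
N = 16

def go_right(belief):
    """Apply the goRight dynamics; all location arithmetic is modulo 16."""
    new = [0.0] * N
    for loc, p in enumerate(belief):
        targets = {loc % N, (loc + 1) % N, (loc + 2) % N}
        new[loc % N] += 0.1 * p            # stay put
        new[(loc + 1) % N] += 0.8 * p      # move one step right
        new[(loc + 2) % N] += 0.074 * p    # overshoot by one
        for other in range(N):             # small chance of any other location
            if other not in targets:
                new[other] += 0.002 * p
    return new

def observe_door(belief):
    """Condition on observing a door and normalize."""
    unnorm = [p * (0.8 if loc in DOORS else 0.1)
              for loc, p in enumerate(belief)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

belief = observe_door(go_right([1.0 / N] * N))
# A uniform belief stays uniform under goRight; observing a door then puts
# probability 0.8 / (4 * 0.8 + 12 * 0.1) = 2/11 on each door location.
```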
Combining sensor information
Example: we can combine information from a light sensor and the door sensor. Sensor Fusion
Loc0 Loc1 Loc2 Loc3 Loc4
D0 D1 D2 D3 D4
Act0 Act1 Act2 Act3
L0 L1 L2 L3 L4
Loct: robot location at time t
Dt: door sensor value at time t
Lt: light sensor value at time t
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 26 11 / 23
Naive Bayes Classifier: User’s request for help
H
"able" "absent" "add" "zoom". . .
H is the help page the user is interested in.
What probabilities are required?
P(hi) for each help page hi. The user is interested in one best help page, so Σ_i P(hi) = 1.

P(wj | hi) for each word wj given page hi. There can be multiple words used in a query.

Given a help query: condition on the words in the query and display the most likely help page.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 27 12 / 23
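The classifier can be sketched with a hypothetical two-page model over a three-word vocabulary; all pages, words, and probabilities below are made up for illustration. Note this sketch conditions only on the words that appear in the query (ignoring absent words).

```python
# Hypothetical help pages and made-up probabilities.
prior = {"print_help": 0.5, "zoom_help": 0.5}                 # P(h_i)
word_given_page = {                                           # P(w_j | h_i)
    "print_help": {"print": 0.8, "zoom": 0.05, "page": 0.4},
    "zoom_help":  {"print": 0.1, "zoom": 0.7,  "page": 0.3},
}

def best_page(query_words):
    """Condition on the words appearing in the query; return the most
    likely page and the posterior over pages."""
    scores = {}
    for h, ph in prior.items():
        score = ph
        for w in query_words:
            score *= word_given_page[h][w]   # naive Bayes independence
        scores[h] = score
    z = sum(scores.values())
    return max(scores, key=scores.get), {h: s / z for h, s in scores.items()}

page, post = best_page(["zoom", "page"])
# "zoom_help" scores 0.5 * 0.7 * 0.3 = 0.105 vs 0.5 * 0.05 * 0.4 = 0.01.
```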
Simple Language Models: set-of-words
Sentence: w1, w2, w3, . . .

Set-of-words model:

  ”a” ”aardvark” . . . ”zzz”

Each variable is Boolean: true when the word is in the sentence and false otherwise.

What probabilities are provided?
  - P(”a”), P(”aardvark”), . . . , P(”zzz”)

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 33 13 / 23
Simple Language Models: bag-of-words
Sentence: w1, w2, w3, . . . , wn

Bag-of-words or unigram model:

  W1 W2 W3 . . . Wn

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi) is a distribution over words for each position

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 37 14 / 23
Simple Language Models: bigram
Sentence: w1, w2, w3, . . . , wn

Bigram model:

  W1 → W2 → W3 → . . . → Wn

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi | wi−1) is a distribution over words for each position, given the previous word

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 41 15 / 23
Simple Language Models: trigram
Sentence: w1, w2, w3, . . . , wn

Trigram model:

  [Chain W1, W2, W3, W4, . . . , Wn; each Wi depends on the previous two words]

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi | wi−1, wi−2)

N-gram: P(wi | wi−1, . . . , wi−n+1) is a distribution over words given the previous n − 1 words
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 45 16 / 23
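To show how an n-gram model assigns probability to a sentence, here is a bigram sketch over a toy vocabulary. All probabilities are invented, and the start/end markers `<s>` and `</s>` are an assumed convention.

```python
bigram = {                       # P(w_i | w_{i-1}), made-up numbers
    "<s>":   {"how": 0.6, "my": 0.4},
    "how":   {"can": 1.0},
    "can":   {"i": 1.0},
    "i":     {"phone": 0.7, "my": 0.3},
    "my":    {"phone": 1.0},
    "phone": {"my": 0.5, "</s>": 0.5},
}

def sentence_prob(words):
    """Chain the conditional probabilities along the sentence."""
    p, prev = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= bigram[prev].get(w, 0.0)   # unseen bigrams get probability 0
        prev = w
    return p

p = sentence_prob("how can i phone my phone".split())
# 0.6 * 1.0 * 1.0 * 0.7 * 0.5 * 1.0 * 0.5 = 0.105
```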
Logic, Probability, Statistics, Ontology over time
From: Google Books Ngram Viewer (https://books.google.com/ngrams)
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 48 17 / 23
Predictive Typing and Error Correction
[Network diagram: words W1, W2, W3, . . . ; each word Wi has observed letters L1i, L2i, . . . , Lki]

domain(Wi) = {”a”, ”aardvark”, . . . , ”zzz”, ”⊥”, ”?”}
domain(Lji) = {”a”, ”b”, ”c”, . . . , ”z”, ”1”, ”2”, . . . }
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 49 18 / 23
Beyond N-grams
A man with a big hairy cat drank the cold milk.
Who or what drank the milk?
Simple syntax diagram:

[Parse tree: s → np vp, where the np is “a man” with the pp “with a big hairy cat”, and the vp is “drank” with the np “the cold milk”]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 50 19 / 23
Topic Model
[Figure: a topic–word network linking topics (tools, food, fish, finance) to words ("nut", "bolt", "tuna", "shark", "salmon")]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 52 20 / 23
Google’s rephil
[Figure: a bipartite topic–word network with 900,000 topics, 12,000,000 words ("Aaron's beard", "aardvark", . . . , "zzz"), and 350,000,000 links]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 54 22 / 23
Deep Belief Networks
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 55 23 / 23
Stochastic Simulation
Idea: probabilities ↔ samples
Get probabilities from samples:
X      count              X      probability
x1     n1                 x1     n1/m
...    ...          ↔     ...    ...
xk     nk                 xk     nk/m
total  m
If we could sample from a variable's (posterior) distribution, we could estimate its (posterior) probability.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 1 1 / 15
Generating samples from a distribution
For a variable X with a discrete domain or a (one-dimensional) real domain:
Totally order the values of the domain of X .
Generate the cumulative probability distribution:f (x) = P(X ≤ x).
Select a value y uniformly in the range [0, 1].
Select the smallest x such that f (x) ≥ y.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 2 2 / 15
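The procedure above is inverse-CDF sampling. A minimal sketch for a discrete domain, using a hypothetical four-value distribution:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample(values, probs):
    """Inverse-CDF sampling: build f(x) = P(X <= x) over the ordered
    values, pick y uniformly in [0, 1], and return the smallest x
    with f(x) >= y."""
    cdf = list(accumulate(probs))
    y = random.uniform(0, 1)
    # min() guards against float round-off leaving cdf[-1] slightly below 1.
    return values[min(bisect_left(cdf, y), len(values) - 1)]

# Hypothetical distribution over four values.
vals = ["v1", "v2", "v3", "v4"]
probs = [0.1, 0.4, 0.3, 0.2]
print(sample(vals, probs))
```

Over many draws, the frequency of each value approaches its probability.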
Cumulative Distribution
[Figure: the distribution P(X) over values v1, . . . , v4 (left) and its cumulative distribution f(X), a step function rising from 0 to 1 (right)]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 3 3 / 15
Hoeffding’s inequality
Theorem (Hoeffding): Suppose p is the true probability, and s is the sample average from n independent samples; then

P(|s − p| > ε) ≤ 2e^(−2nε²)

Guarantees a probably approximately correct estimate of the probability.

If you are willing to have an error greater than ε in δ of the cases, solve 2e^(−2nε²) < δ for n, which gives

n > −ln(δ/2) / (2ε²)

ε      δ      n
0.1    0.05   185
0.01   0.05   18,445
0.1    0.01   265
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 4 4 / 15
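The table's sample sizes follow directly from the bound; a short check of the formula:

```python
import math

def hoeffding_n(eps, delta):
    """Smallest integer n satisfying n > -ln(delta/2) / (2 eps^2),
    guaranteeing P(|s - p| > eps) <= delta."""
    return math.ceil(-math.log(delta / 2) / (2 * eps ** 2))

print(hoeffding_n(0.1, 0.05))   # → 185
print(hoeffding_n(0.01, 0.05))  # → 18445
print(hoeffding_n(0.1, 0.01))   # → 265
```

Note how halving the error ε costs a factor of four in samples, while tightening δ costs only logarithmically.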
Forward sampling in a belief network
Sample the variables one at a time; sample the parents of X before sampling X.

Given values for the parents of X, sample from the probability of X given its parents.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 7 5 / 15
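A minimal sketch of forward sampling, using a hypothetical two-variable network Fire → Smoke with made-up CPT numbers in the style of the lecture examples:

```python
import random

# Hypothetical CPTs for a tiny belief network Fire -> Smoke.
P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def forward_sample():
    # Parents are sampled before children; each variable is drawn
    # from P(X | parents(X)) using the already-sampled parent values.
    fire = random.random() < P_fire
    smoke = random.random() < P_smoke_given[fire]
    return {"fire": fire, "smoke": smoke}

random.seed(0)
samples = [forward_sample() for _ in range(10000)]
# Estimate P(smoke); exactly 0.01*0.9 + 0.99*0.01 = 0.0189.
print(sum(s["smoke"] for s in samples) / len(samples))
```

The empirical fraction of smoke-samples converges to the marginal P(smoke) as the number of samples grows.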
Rejection Sampling
To estimate a posterior probability given evidence Y1 = v1 ∧ . . . ∧ Yj = vj:

Reject any sample that assigns any Yi to a value other than vi.

The non-rejected samples are distributed according to the posterior probability:

P(α | evidence) ≈ (Σ_{sample ⊨ α} 1) / (Σ_{sample} 1)

where we consider only samples consistent with the evidence.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 8 6 / 15
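A sketch of rejection sampling on the same hypothetical Fire → Smoke network as above (CPT numbers are illustrative), estimating P(fire | smoke):

```python
import random

P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def rejection_sample_posterior(n):
    accepted = []
    for _ in range(n):
        fire = random.random() < P_fire
        smoke = random.random() < P_smoke_given[fire]
        if smoke:                  # evidence: Smoke = true; else reject
            accepted.append(fire)
    # Among accepted samples, the fraction with fire ≈ P(fire | smoke).
    return sum(accepted) / len(accepted), len(accepted)

random.seed(1)
est, used = rejection_sample_posterior(100000)
# Exact posterior is 0.009 / 0.0189 ≈ 0.476, but only about 2% of
# samples match the evidence, so most of the work is wasted.
print(est, used)
```

This illustrates the weakness of rejection sampling: when the evidence is unlikely, almost all samples are rejected.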
Rejection Sampling Example: P(ta | sm, re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

Observe Sm = true, Re = true

       Ta     Fi     Al     Sm     Le     Re
s1     false  true   false  true   false  false   ✗
s2     false  true   true   true   true   true    ✓
s3     true   false  true   false  —      —       ✗
s4     true   true   true   true   true   true    ✓
. . .
s1000  false  false  false  false  —      —       ✗

P(sm) = 0.02
P(re | sm) = 0.32
How many samples are rejected?
How many samples are used?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 9 7 / 15
Importance Sampling
Samples have weights: a real number associated with each sample that takes the evidence into account.

The probability of a proposition is the weighted average of the samples:

P(α | evidence) ≈ (Σ_{sample ⊨ α} weight(sample)) / (Σ_{sample} weight(sample))

Mix exact inference with sampling: don't sample all of the variables, but weight each sample according to P(evidence | sample).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 15 8 / 15
Importance Sampling (Likelihood Weighting)
procedure likelihood_weighting(Bn, e, Q, n):
    ans[1 : k] := 0 where k is the size of dom(Q)
    repeat n times:
        weight := 1
        for each variable Xi in order:
            if Xi = oi is observed:
                weight := weight × P(Xi = oi | parents(Xi))
            else:
                assign Xi a random sample of P(Xi | parents(Xi))
        if Q has value v:
            ans[v] := ans[v] + weight
    return ans / Σv ans[v]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 16 9 / 15
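The procedure above, specialized to a hypothetical two-variable network Fire → Smoke (made-up CPT numbers), querying P(fire | smoke = true):

```python
import random

P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def likelihood_weighting(n):
    ans = {True: 0.0, False: 0.0}
    for _ in range(n):
        weight = 1.0
        # Fire is not observed: sample it from P(fire).
        fire = random.random() < P_fire
        # Smoke is observed true: don't sample it; multiply its
        # likelihood P(smoke = true | fire) into the weight instead.
        weight *= P_smoke_given[fire]
        ans[fire] += weight
    total = ans[True] + ans[False]
    return {v: w / total for v, w in ans.items()}

random.seed(0)
post = likelihood_weighting(100000)
print(post[True])  # ≈ 0.476, the exact posterior for these CPTs
```

Unlike rejection sampling, no sample is wasted: every sample contributes its weight.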
Importance Sampling Example: P(ta | sm, re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

       Ta     Fi     Al     Le      Weight
s1     true   false  true   false   0.01 × 0.01
s2     false  true   false  false   0.9 × 0.01
s3     false  true   true   true    0.9 × 0.75
s4     true   true   true   true    0.9 × 0.75
. . .
s1000  false  false  true   true    0.01 × 0.75

P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(re | le) = 0.75
P(re | ¬le) = 0.01
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 17 10 / 15
Importance Sampling Example: P(le | sm, ta,¬re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

P(ta) = 0.02
P(fi) = 0.01
P(al | fi ∧ ta) = 0.5
P(al | fi ∧ ¬ta) = 0.99
P(al | ¬fi ∧ ta) = 0.85
P(al | ¬fi ∧ ¬ta) = 0.0001
P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(le | al) = 0.88
P(le | ¬al) = 0.001
P(re | le) = 0.75
P(re | ¬le) = 0.01
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 19 11 / 15
Computing Expectations & Proposal Distributions
Expected value of f with respect to distribution P:

E_P(f) = Σ_w f(w) · P(w) ≈ (1/n) Σ_s f(s)

where s is sampled with probability P, and there are n samples.

Rewriting with another distribution Q:

E_P(f) = Σ_w f(w) · (P(w)/Q(w)) · Q(w) ≈ (1/n) Σ_s f(s) · P(s)/Q(s)

where s is selected according to the distribution Q.
The distribution Q is called a proposal distribution.
Requirement: if P(c) > 0 then Q(c) > 0.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 20 12 / 15
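A minimal sketch of the identity above: estimate E_P(f) by sampling from a proposal Q and weighting each sample by P/Q. The target P, proposal Q, and function f are hypothetical choices for illustration:

```python
import random

P = {0: 0.8, 1: 0.2}          # target distribution
Q = {0: 0.5, 1: 0.5}          # proposal; covers P's support (Q > 0 where P > 0)
f = lambda w: 10 * w          # E_P(f) = 0.8*0 + 0.2*10 = 2.0

def importance_estimate(n):
    total = 0.0
    for _ in range(n):
        s = 0 if random.random() < Q[0] else 1   # sample s ~ Q
        total += f(s) * P[s] / Q[s]              # weight by P(s)/Q(s)
    return total / n

random.seed(0)
est = importance_estimate(100000)
print(est)  # ≈ 2.0
```

The closer Q is to P (weighted by |f|), the lower the variance of the estimate.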
Particle Filtering
Importance sampling can be seen as:

for each particle:
    for each variable:
        sample / absorb evidence / update query

where a particle is one of the samples.

Instead we could do:

for each variable:
    for each particle:
        sample / absorb evidence / update query

Why?
We gain a new operation: resampling.
It works with infinitely many variables (e.g., HMMs).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 24 13 / 15
Particle Filtering for HMMs
Start with randomly chosen particles (say 1000).

Each particle represents a history.

Initially, sample states in proportion to their probability.

Repeat:
- Absorb evidence: weight each particle by the probability of the evidence given the state of the particle.
- Resample: select each particle at random, in proportion to the weight of the particle. Some particles may be duplicated, some may be removed. All new particles have the same weight.
- Transition: sample the next state for each particle according to the transition probabilities.

To answer a query about the current state, use the set of particles as data.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 27 14 / 15
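The loop above can be sketched for a tiny two-state HMM; the states, transition and emission probabilities, and the observation sequence are all hypothetical:

```python
import random

STATES = ["rain", "sun"]
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emit = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def categorical(dist):
    # Sample a value from a {value: probability} dictionary.
    r = random.random()
    for value, p in dist.items():
        r -= p
        if r <= 0:
            return value
    return value

def particle_filter(observations, n=5000):
    particles = [random.choice(STATES) for _ in range(n)]  # uniform prior
    for obs in observations:
        # Absorb evidence: weight each particle by P(obs | state).
        weights = [emit[s][obs] for s in particles]
        # Resample in proportion to the weights (duplicates allowed).
        particles = random.choices(particles, weights=weights, k=n)
        # Transition: sample each particle's next state.
        particles = [categorical(trans[s]) for s in particles]
    return particles

random.seed(0)
final = particle_filter(["umbrella", "umbrella", "none"])
# Because the transition step runs last, this is the predictive
# distribution over the state after the final observation.
frac = sum(s == "rain" for s in final) / len(final)
print(frac)
```

Resampling concentrates particles on likely histories, which is what lets the filter run over an unbounded sequence of variables.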
Markov Chain Monte Carlo
To sample from a distribution P :
Create an (ergodic and aperiodic) Markov chain with P as its equilibrium distribution. Let T(Si+1 | Si) be the transition probability; given state s, sample state s′ from T(S | s).

After a while, the states sampled will be distributed according to P.

Ignore the first samples ("burn-in"); use the remaining samples.

Samples are not independent of each other ("autocorrelation"). Sometimes only a subset (e.g., 1/1000) of them is used ("thinning").

Gibbs sampler: sample each non-observed variable from the distribution of the variable given the current (or observed) values of the variables in its Markov blanket.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 28 15 / 15
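A minimal Gibbs-sampler sketch over two binary variables; the joint distribution table is a hypothetical example, and each conditional is derived from it (in a belief network the conditional would come from the variable's Markov blanket):

```python
import random

# Hypothetical joint distribution over (a, b), a, b in {0, 1}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def cond(var, other_val):
    # P(var = 1 | other = other_val), read off the joint table.
    if var == "a":
        p1, p0 = joint[(1, other_val)], joint[(0, other_val)]
    else:
        p1, p0 = joint[(other_val, 1)], joint[(other_val, 0)]
    return p1 / (p0 + p1)

def gibbs(n, burn_in=1000):
    a, b = 0, 0
    samples = []
    for i in range(n + burn_in):
        a = int(random.random() < cond("a", b))  # resample a given b
        b = int(random.random() < cond("b", a))  # resample b given a
        if i >= burn_in:                         # discard burn-in samples
            samples.append((a, b))
    return samples

random.seed(0)
samples = gibbs(50000)
est = sum(a for a, _ in samples) / len(samples)
print(est)  # ≈ P(a = 1) = 0.2 + 0.4 = 0.6
```

The retained (autocorrelated) samples are distributed according to the joint, so sample averages estimate marginals.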