The mind is a neural computer, fitted by natural selection
TRANSCRIPT
“The mind is a neural computer, fitted by natural selection with combinatorial algorithms for causal and probabilistic reasoning about plants, animals, objects, and people.”
. . . “In a universe with any regularities at all, decisions informed about the past are better than decisions made at random. That has always been true, and we would expect organisms, especially informavores such as humans, to have evolved acute intuitions about probability. The founders of probability, like the founders of logic, assumed they were just formalizing common sense.”
Steven Pinker, How the Mind Works, 1997, pp. 524, 343.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.1
Learning Objectives
At the end of the class you should be able to:
justify the use and semantics of probability
know how to compute marginals and apply Bayes’ theorem
build a belief network for a domain
predict the inferences for a belief network
explain the predictions of a causal model
Using Uncertain Knowledge
Agents don’t have complete knowledge about the world.
Agents need to make (informed) decisions given their uncertainty.
It isn’t enough to assume what the world is like. Example: wearing a seat belt.
An agent needs to reason about its uncertainty.
When an agent acts under uncertainty, it is gambling =⇒ probability.
Probability
Probability is an agent’s measure of belief in some proposition — subjective probability.
An agent’s belief depends on its prior belief and what it observes.
Example: an agent’s probability of a particular bird flying
  - Other agents may have different probabilities.
  - An agent’s belief in a bird’s flying ability is affected by what the agent knows about that bird.
Random Variables
A random variable starts with an upper-case letter.
The domain of a variable X, written domain(X), is the set of values X can take. (Sometimes called “range”, “frame”, or “possible values”.)
A tuple of random variables ⟨X1, . . . , Xn⟩ is a complex random variable with domain domain(X1) × · · · × domain(Xn). Often the tuple is written as X1, . . . , Xn.
Assignment X = x means variable X has value x.
A proposition is a Boolean formula made from assignments of values to variables, or inequalities (e.g., <, ≤, . . . ) between variables and values.
Possible World Semantics
A possible world specifies an assignment of one value to each random variable.
A random variable is a function from possible worlds into the domain of the random variable.
ω ⊨ X = x means variable X is assigned value x in world ω.
Logical connectives have their standard meaning:
ω ⊨ α ∧ β if ω ⊨ α and ω ⊨ β
ω ⊨ α ∨ β if ω ⊨ α or ω ⊨ β
ω ⊨ ¬α if ω ⊭ α
Let Ω be the set of all possible worlds.
Semantics of Probability
Probability defines a measure on sets of possible worlds. A probability measure is a function µ from sets of worlds into the non-negative real numbers such that:
µ(Ω) = 1
µ(S1 ∪ S2) = µ(S1) + µ(S2) if S1 ∩ S2 = ∅
Then P(α) = µ({ω | ω ⊨ α}).
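This semantics can be sketched directly in code. The worlds, variables, and measure below are hypothetical, chosen only to illustrate that P(α) is the measure of the set of worlds in which α holds.

```python
# Possible-world semantics of probability (hypothetical example).
# Each world assigns a value to every random variable.
worlds = [
    {"Shape": "circle", "Color": "orange"},
    {"Shape": "circle", "Color": "blue"},
    {"Shape": "star", "Color": "orange"},
    {"Shape": "star", "Color": "blue"},
]
# A measure: non-negative, sums to 1 over all worlds.
mu = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}

def P(alpha):
    """P(alpha) = mu({w | w satisfies alpha})."""
    return sum(mu[i] for i, w in enumerate(worlds) if alpha(w))

assert abs(sum(mu.values()) - 1.0) < 1e-9  # mu(Omega) = 1
print(round(P(lambda w: w["Shape"] == "star"), 10))    # 0.5
print(round(P(lambda w: w["Color"] == "orange"), 10))  # 0.7
```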
Semantics
Possible Worlds: [figure of the possible worlds not reproduced in this transcript]
Suppose the measure of each singleton world is 0.1.
What is the probability of circle?
What is the probability of star?
What is the probability of triangle?
What is the probability of orange?
What is the probability of orange and star?
What are the random variables?
Axioms of Probability (finite case)
Three axioms define what follows from a set of probabilities:
Axiom 1 0 ≤ P(a) for any proposition a.
Axiom 2 P(true) = 1
Axiom 3 P(a ∨ b) = P(a) + P(b) if a and b cannot both be true.
These axioms are sound and complete with respect to the semantics.
Conditioning
Probabilistic conditioning specifies how to revise beliefs based on new information.
An agent builds a probabilistic model taking all background information into account. This gives a prior probability.
All other information must be conditioned on.
If evidence e is all of the information obtained subsequently, the conditional probability P(h | e) of h given e is the posterior probability of h.
Semantics of Conditional Probability
Evidence e rules out possible worlds incompatible with e.
Evidence e induces a new measure, µe, over possible worlds:

µe(S) = c × µ(S)  if ω ⊨ e for all ω ∈ S
µe(S) = 0         if ω ⊭ e for all ω ∈ S

We can show c = 1/P(e).
The conditional probability of formula h given evidence e is

P(h | e) = µe({ω : ω ⊨ h}) = P(h ∧ e) / P(e)
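The induced measure µe can be sketched the same way as the unconditional one. The worlds and numbers below are hypothetical; condition() zeroes out worlds incompatible with the evidence and rescales the rest by c = 1/P(e).

```python
# Conditioning as a new measure over possible worlds (hypothetical numbers).
worlds = {
    ("circle", "orange"): 0.2,
    ("circle", "blue"):   0.3,
    ("star",   "orange"): 0.1,
    ("star",   "blue"):   0.4,
}

def P(alpha, mu):
    """Probability of alpha under measure mu."""
    return sum(m for w, m in mu.items() if alpha(w))

def condition(mu, e):
    """Return the measure mu_e induced by observing evidence e."""
    c = 1.0 / P(e, mu)  # renormalizing constant c = 1/P(e)
    return {w: (c * m if e(w) else 0.0) for w, m in mu.items()}

orange = lambda w: w[1] == "orange"
star = lambda w: w[0] == "star"

mu_e = condition(worlds, orange)      # observe Color = orange
print(round(P(star, mu_e), 10))       # P(star | orange) = 0.1/0.3 ≈ 0.3333333333
```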
Conditioning
Possible Worlds: [figure of the possible worlds not reproduced in this transcript]
Observe Color = orange:
Exercise
Flu    Sneeze  Snore   µ
true   true    true    0.064
true   true    false   0.096
true   false   true    0.016
true   false   false   0.024
false  true    true    0.096
false  true    false   0.144
false  false   true    0.224
false  false   false   0.336
What is:
(a) P(flu ∧ sneeze)
(b) P(flu ∧ ¬sneeze)
(c) P(flu)
(d) P(sneeze | flu)
(e) P(¬flu ∧ sneeze)
(f) P(flu | sneeze)
(g) P(sneeze | flu∧snore)
(h) P(flu | sneeze∧snore)
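Answers to exercises like this one can be checked mechanically. The dictionary below is exactly the joint distribution from the table; prob and cond are hypothetical helper names that just sum worlds and renormalize.

```python
# Check the Flu/Sneeze/Snore exercise against the joint distribution.
joint = {
    (True,  True,  True):  0.064,
    (True,  True,  False): 0.096,
    (True,  False, True):  0.016,
    (True,  False, False): 0.024,
    (False, True,  True):  0.096,
    (False, True,  False): 0.144,
    (False, False, True):  0.224,
    (False, False, False): 0.336,
}  # keys: (flu, sneeze, snore)

def prob(pred):
    """P(pred) over worlds (flu, sneeze, snore)."""
    return sum(p for w, p in joint.items() if pred(*w))

def cond(h, e):
    """P(h | e) = P(h and e) / P(e)."""
    return prob(lambda *w: h(*w) and e(*w)) / prob(e)

flu = lambda f, sz, sn: f
sneeze = lambda f, sz, sn: sz

print(round(prob(lambda f, sz, sn: f and sz), 6))  # (a) 0.16
print(round(prob(flu), 6))                         # (c) 0.2
print(round(cond(sneeze, flu), 6))                 # (d) 0.8
print(round(cond(flu, sneeze), 6))                 # (f) 0.4
```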
Chain Rule
P(f1 ∧ f2 ∧ . . . ∧ fn)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(f1 ∧ · · · ∧ fn−1)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × P(f1 ∧ · · · ∧ fn−2)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × · · · × P(f3 | f1 ∧ f2) × P(f2 | f1) × P(f1)
= ∏_{i=1}^{n} P(fi | f1 ∧ · · · ∧ fi−1)
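The derivation can be confirmed numerically on a small case: a hypothetical random joint over three binary propositions, where the conditional factors telescope back to the probability of the conjunction.

```python
# Numerical check of the chain rule for n = 3 (hypothetical joint).
import itertools
import random

random.seed(0)
# random joint over the 8 truth assignments of (f1, f2, f3), normalized
raw = {w: random.random() for w in itertools.product([True, False], repeat=3)}
z = sum(raw.values())
joint = {w: p / z for w, p in raw.items()}

def P(pred):
    return sum(p for w, p in joint.items() if pred(w))

def condP(h, e):
    """P(h | e) = P(h and e) / P(e)."""
    return P(lambda w: h(w) and e(w)) / P(e)

f1 = lambda w: w[0]
f12 = lambda w: w[0] and w[1]
f123 = lambda w: w[0] and w[1] and w[2]

lhs = P(f123)
rhs = condP(f123, f12) * condP(f12, f1) * P(f1)  # the factors telescope
assert abs(lhs - rhs) < 1e-12
print("chain rule holds:", round(lhs, 6) == round(rhs, 6))
```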
Bayes’ theorem
The chain rule and commutativity of conjunction (h ∧ e is equivalent to e ∧ h) give us:

P(h ∧ e) = P(h | e) × P(e)
         = P(e | h) × P(h).

If P(e) ≠ 0, divide the right-hand sides by P(e):

P(h | e) = P(e | h) × P(h) / P(e).

This is Bayes’ theorem.
Why is Bayes’ theorem interesting?
Often you have causal knowledge:
P(symptom | disease)
P(light is off | status of switches and switch positions)
P(alarm | fire)
P(image looks like [image] | a tree is in front of a car)
and want to do evidential reasoning:
P(disease | symptom)
P(status of switches | light is off and switch positions)
P(fire | alarm)
P(a tree is in front of a car | image looks like [image])
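A minimal sketch of this causal-to-evidential flip, with hypothetical numbers for P(symptom | disease), P(symptom | ¬disease), and the prior P(disease):

```python
# Turning causal knowledge into evidential reasoning with Bayes' theorem.
# All three numbers below are hypothetical, for illustration only.
p_symptom_given_disease = 0.9      # causal knowledge (assumed)
p_symptom_given_no_disease = 0.05  # assumed
p_disease = 0.01                   # prior (assumed)

# P(symptom) by total probability
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_no_disease * (1 - p_disease))

# Bayes' theorem: P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 4))  # ≈ 0.1538
```

Even with a strong causal link (0.9), the posterior is modest because the prior is small — a pattern the cab exercise below also exhibits.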
Exercise
A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness in the circumstances that existed on the night of the accident and concluded that the witness correctly identified each of the two colours 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue?
[From D. Kahneman, Thinking Fast and Slow, 2011, p. 166.]
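One way to check the exercise is to apply Bayes' theorem directly; every number below comes from the problem statement.

```python
# Bayes' theorem applied to the cab problem.
p_blue, p_green = 0.15, 0.85
p_says_blue_given_blue = 0.80    # witness correct
p_says_blue_given_green = 0.20   # witness wrong

# total probability of the witness saying "Blue"
p_says_blue = (p_says_blue_given_blue * p_blue
               + p_says_blue_given_green * p_green)

# posterior P(Blue | witness says Blue)
p_blue_given_says_blue = p_says_blue_given_blue * p_blue / p_says_blue
print(round(p_blue_given_says_blue, 4))  # ≈ 0.4138
```

Despite the 80%-reliable witness, the posterior stays below one half because the prior for Blue is only 0.15.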
Conditional independence
Random variable X is independent of random variable Y given random variable(s) Z if

P(X | Y, Z) = P(X | Z),

i.e., for all x ∈ dom(X), y, y′ ∈ dom(Y), and z ∈ dom(Z),

P(X = x | Y = y ∧ Z = z)
= P(X = x | Y = y′ ∧ Z = z)
= P(X = x | Z = z).

That is, knowledge of Y’s value doesn’t affect the belief in the value of X, given a value of Z.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.2
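The definition can be tested by brute force over a joint distribution. The joint below is the Flu/Sneeze/Snore table from the earlier exercise; cond_indep is a hypothetical helper that checks whether P(X = x | Y = y, Z = z) is the same for every y.

```python
# Brute-force test of conditional independence from the definition.
joint = {
    (True,  True,  True):  0.064, (True,  True,  False): 0.096,
    (True,  False, True):  0.016, (True,  False, False): 0.024,
    (False, True,  True):  0.096, (False, True,  False): 0.144,
    (False, False, True):  0.224, (False, False, False): 0.336,
}  # keys: (flu, sneeze, snore)

def P(pred):
    return sum(p for w, p in joint.items() if pred(w))

def cond_indep(x_idx, y_idx, z_idx):
    """Is variable x independent of y given z under this joint?"""
    for xv in (True, False):
        for zv in (True, False):
            vals = []
            for yv in (True, False):
                pe = P(lambda w: w[y_idx] == yv and w[z_idx] == zv)
                if pe > 0:
                    ph = P(lambda w: w[x_idx] == xv
                           and w[y_idx] == yv and w[z_idx] == zv)
                    vals.append(ph / pe)
            if vals and any(abs(v - vals[0]) > 1e-9 for v in vals):
                return False
    return True

print(cond_indep(1, 2, 0))  # Sneeze independent of Snore given Flu -> True
```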
Example
Consider a student writing an exam. What are reasonable independences among the following?
Whether the student works hard
Whether the student is intelligent
The student’s answers on the exam
The student’s mark on an exam
Example domain (diagnostic assistant)
[Figure: wiring diagram of the example house — outside power feeds circuit breakers cb1 and cb2; wires w0–w6 connect switches s1–s3 (including a two-way switch, with on/off positions) to lights l1 and l2 and power outlets p1 and p2.]
Examples of conditional independence?
Suppose you know whether there was power in w1 and whether there was power in w2; what information is relevant to whether light l1 is lit? What is independent?
Whether light l1 is lit is independent of the position of light switch s2 given what?
Every other variable may be independent of whether light l1 is lit given whether there is power in wire w0 and the status of light l1 (whether it is ok or, if not, how it is broken).
Idea of belief networks
l1 is lit (L1_lit) depends only on the status of the light (L1_st) and whether there is power in wire w0.
In a belief network, W0 and L1_st are parents of L1_lit.
[Figure: fragment of the belief network — w1, w2, s2_pos, and s2_st are parents of w0; w0 and l1_st are parents of l1_lit.]
W0 depends only on whether there is power in w1, whether there is power in w2, the position of switch s2 (S2_pos), and the status of switch s2 (S2_st).
Belief networks
Totally order the variables of interest: X1, . . . , Xn.
Theorem of probability theory (chain rule):
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | X1, . . . , Xi−1)
The parents parents(Xi) of Xi are those predecessors of Xi that render Xi independent of the other predecessors. That is, parents(Xi) ⊆ {X1, . . . , Xi−1} and
P(Xi | parents(Xi)) = P(Xi | X1, . . . , Xi−1)
So P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))
A belief network is a graph: the nodes are random variables; there is an arc from the parents of each node into that node.
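The factorization ∏ P(Xi | parents(Xi)) is easy to exercise on a toy network. The two-arc network and all of its numbers below are hypothetical (Fire → Alarm, Fire → Smoke):

```python
# Computing joint probabilities from a belief network's factors
# (hypothetical network and numbers).
from itertools import product

p_fire = {True: 0.01, False: 0.99}
p_alarm_given_fire = {True: 0.95, False: 0.05}  # P(alarm=True | fire)
p_smoke_given_fire = {True: 0.90, False: 0.02}  # P(smoke=True | fire)

def joint(fire, alarm, smoke):
    """P(Fire, Alarm, Smoke) = P(Fire) P(Alarm | Fire) P(Smoke | Fire)."""
    pa = p_alarm_given_fire[fire] if alarm else 1 - p_alarm_given_fire[fire]
    ps = p_smoke_given_fire[fire] if smoke else 1 - p_smoke_given_fire[fire]
    return p_fire[fire] * pa * ps

# the factorized joint still sums to 1 over all assignments
total = sum(joint(f, a, s) for f, a, s in product([True, False], repeat=3))
print(round(total, 10))  # 1.0
```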
Example: fire alarm belief network
Variables:
Fire: there is a fire in the building
Tampering: someone has been tampering with the fire alarm
Smoke: what appears to be smoke is coming from an upstairs window
Alarm: the fire alarm goes off
Leaving: people are leaving the building en masse.
Report: a colleague says that people are leaving the building en masse. (A noisy sensor for Leaving.)
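One natural structure for these variables, written as a parents map — the graph itself is an assumption read off the variable descriptions (tampering and fire can set off the alarm, fire causes smoke, the alarm causes leaving, leaving causes a report), not something this slide states:

```python
# A candidate parent structure for the fire-alarm domain (assumed).
parents = {
    "Tampering": [],
    "Fire": [],
    "Alarm": ["Tampering", "Fire"],
    "Smoke": ["Fire"],
    "Leaving": ["Alarm"],
    "Report": ["Leaving"],
}

def is_dag(parents):
    """Acyclicity check: repeatedly peel off nodes with no remaining parents."""
    remaining = dict(parents)
    while remaining:
        roots = [v for v, ps in remaining.items()
                 if all(p not in remaining for p in ps)]
        if not roots:
            return False  # every remaining node has a remaining parent: cycle
        for v in roots:
            del remaining[v]
    return True

print(is_dag(parents))  # True
```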
Components of a belief network
A belief network consists of:
a directed acyclic graph with nodes labeled with random variables
a domain for each random variable
a set of conditional probability tables for each variable given its parents (including prior probabilities for nodes with no parents)
Example belief network
[Figure: the full belief network for the wiring domain — nodes Outside_power, Cb1_st, Cb2_st, W0–W6, S1_pos, S2_pos, S3_pos, S1_st, S2_st, S3_st, P1, P2, L1_st, L1_lit, L2_st, L2_lit, with arcs following the circuit structure.]
Example belief network (continued)
The belief network also specifies:
The domain of the variables:
W0, . . . , W6 have domain {live, dead}
S1_pos, S2_pos, and S3_pos have domain {up, down}
S1_st has domain {ok, upside_down, short, intermittent, broken}
Conditional probabilities, including:
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = live)
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = dead)
P(S1_pos = up)
P(S1_st = upside_down)
Belief network summary
A belief network is a directed acyclic graph (DAG) where nodes are random variables.
The parents of a node n are those variables on which n directly depends.
A belief network is automatically acyclic, by construction.
A belief network is a graphical representation of dependence and independence:
  - A variable is independent of its non-descendants given its parents.
Constructing belief networks
To represent a domain in a belief network, you need to consider:
What are the relevant variables?
  - What will you observe?
  - What would you like to find out (query)?
  - What other features make the model simpler?
What values should these variables take?
What is the relationship between them? This should be expressed in terms of a directed graph, representing how each variable is generated from its predecessors.
How does the value of each variable depend on its parents? This is expressed in terms of the conditional probabilities.
Understanding Independence: Common ancestors
[Figure: fire with arcs to alarm and smoke — fire is a common ancestor.]
alarm and smoke are dependent
alarm and smoke are independent given fire
Intuitively, fire can explain alarm and smoke; learning one can affect the other by changing your belief in fire.
© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.3
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke are
independent
given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 2 1 / 20
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke are
independent
given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 3 1 / 20
Understanding Independence: Common ancestors
smokealarm
fire
alarm and smoke aredependent
alarm and smoke areindependent given fire
Intuitively, fire canexplain alarm andsmoke; learning onecan affect the other bychanging your belief infire.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.3, Page 4 1 / 20
Understanding Independence: Chain

alarm → leaving → report

alarm and report are dependent.
alarm and report are independent given leaving.

Intuitively, the only way that the alarm affects report is by affecting leaving.
Understanding Independence: Common descendants

tampering and fire are both parents of alarm.

tampering and fire are independent.
tampering and fire are dependent given alarm.

Intuitively, tampering can explain away fire.
Understanding independence: example
[Figure: an example belief network with nodes A through S.]
Understanding independence: questions
1. On which given probabilities does P(N) depend?

2. If you were to observe a value for B, which variables' probabilities will change?

3. If you were to observe a value for N, which variables' probabilities will change?

4. Suppose you had observed a value for M; if you were to then observe a value for N, which variables' probabilities will change?

5. Suppose you had observed B and Q; which variables' probabilities will change when you observe N?
What variables are affected by observing?
If you observe variable(s) Y, the variables whose posterior probability is different from their prior are:
- the ancestors of Y and
- their descendants.

Intuitively (if you have a causal belief network):
- you do abduction to possible causes and
- prediction from the causes.
d-separation
A connection is a meeting of arcs in a belief network. Whether a connection is open, given a set of variables Z, is defined as follows:
- If there are arcs A → B and B → C such that B ∉ Z, then the connection at B between A and C is open.
- If there are arcs B → A and B → C such that B ∉ Z, then the connection at B between A and C is open.
- If there are arcs A → B and C → B such that B (or a descendant of B) is in Z, then the connection at B between A and C is open.

X is d-connected to Y given Z if there is a path from X to Y along open connections.

X is d-separated from Y given Z if it is not d-connected to Y given Z.

X is independent of Y given Z for all conditional probabilities iff X is d-separated from Y given Z.
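The three open-connection rules translate directly into a small path search. Below is a minimal Python sketch (the dict encoding and function names are illustrative, not from the text): it enumerates simple undirected paths, checking each intermediate node against the chain, common-ancestor, and common-descendant cases, on the fire/alarm network from the earlier slides.

```python
def descendants(dag, node):
    """All nodes reachable from `node` along directed arcs."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for c in dag.get(n, []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(dag, x, y, z):
    """True iff x is d-separated from y given the set z.
    dag maps each node to the list of its children."""
    nodes = set(dag) | {c for cs in dag.values() for c in cs}
    nbrs = {n: set() for n in nodes}
    for p, cs in dag.items():
        for c in cs:
            nbrs[p].add(c); nbrs[c].add(p)

    def open_at(a, b, c):
        into_b = (b in dag.get(a, []), b in dag.get(c, []))
        if all(into_b):                      # a -> b <- c: common descendant
            return b in z or bool(descendants(dag, b) & z)
        return b not in z                    # chain or common ancestor

    def dfs(path):                           # simple-path DFS (small nets only)
        last = path[-1]
        if last == y:
            return True
        for n in nbrs[last] - set(path):
            if len(path) == 1 or open_at(path[-2], last, n):
                if dfs(path + [n]):
                    return True
        return False

    return not dfs([x])

# Fire/alarm network from the slides:
dag = {"tampering": ["alarm"], "fire": ["alarm", "smoke"],
       "alarm": ["leaving"], "leaving": ["report"]}
print(d_separated(dag, "alarm", "smoke", {"fire"}))      # True
print(d_separated(dag, "tampering", "fire", set()))      # True
print(d_separated(dag, "tampering", "fire", {"alarm"}))  # False
```

The three calls reproduce the three earlier slides: a common ancestor blocks when observed, and a common descendant opens when observed.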
Markov Random Field
A Markov random field is composed of
- a set of discrete-valued random variables X = {X1, X2, ..., Xn}, and
- a set of factors {f1, ..., fm}, where a factor is a non-negative function of a subset of the variables,
and defines a joint probability distribution:

P(X = x) = (1/Z) ∏_k f_k(X_k = x_k)

Z = Σ_x ∏_k f_k(X_k = x_k)

where f_k(X_k) is a factor on X_k ⊆ X, and x_k is x projected onto X_k.
Z is a normalization constant known as the partition function.
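As a concrete illustration of the definition, here is a brute-force sketch that computes the partition function Z by enumerating all assignments. The two-variable model and all factor values are made up for illustration.

```python
from itertools import product

variables = ["X1", "X2"]
# Each factor: (scope, table mapping an assignment tuple -> non-negative value).
factors = [
    (("X1",), {(0,): 1.0, (1,): 2.0}),
    (("X1", "X2"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
]

def unnorm(assign):
    """Product of all factors on a full assignment (dict var -> value)."""
    total = 1.0
    for scope, table in factors:
        total *= table[tuple(assign[v] for v in scope)]
    return total

# Partition function: sum the factor product over every full assignment.
Z = sum(unnorm(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

def prob(assign):
    return unnorm(assign) / Z

print(Z)                           # 1*(3+1) + 2*(1+3) = 12
print(prob({"X1": 1, "X2": 1}))    # 2*3/12 = 0.5
```

Brute-force enumeration is exponential in the number of variables; it is only a check of the semantics, not an inference method.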
Markov Networks and Factor graphs
A Markov network is a graphical representation of a Markov random field where the nodes are the random variables and there is an arc between any two variables that appear in a factor together.

A factor graph is a bipartite graph, which contains a variable node for each random variable and a factor node for each factor. There is an edge between a variable node and a factor node if the variable appears in the factor.

A belief network is a type of Markov random field where the factors represent conditional probabilities, there is a factor for each variable, and the directed graph is acyclic.
Independence in a Markov Network
The Markov blanket of a variable X is the set of variables that are in factors with X.

A variable is independent of the other variables given its Markov blanket.

X is connected to Y given Z if there is a path from X to Y in the Markov network that does not contain an element of Z.

X is separated from Y given Z if it is not connected.

A positive distribution is one that does not contain zero probabilities.

X is independent of Y given Z for all positive distributions iff X is separated from Y given Z.
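Separation in a Markov network is plain graph reachability: remove Z and look for a path. A minimal sketch (the chain-shaped network and names are hypothetical):

```python
def separated(nbrs, x, y, z):
    """True iff x is separated from y given z.
    nbrs maps each node to its neighbours (an undirected graph)."""
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        for m in nbrs[n]:
            if m == y:
                return False          # found a path avoiding z
            if m not in seen and m not in z:
                seen.add(m)
                stack.append(m)
    return True

# Hypothetical chain-shaped Markov network a - b - c:
nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(separated(nbrs, "a", "c", {"b"}))  # True: b blocks the only path
print(separated(nbrs, "a", "c", set()))  # False
```

Note how much simpler this is than d-separation: there are no colliders, so observing a variable can only block paths, never open them.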
Canonical Representations
The parameters of a graphical model are the numbers that define the model.

A belief network is a canonical representation: given the structure and the distribution, the parameters are uniquely determined.

A Markov random field is not a canonical representation: many different parameterizations result in the same distribution.
Representations of Conditional Probabilities
There are many representations of conditional probabilities and factors:
Tables
Decision Trees
Rules
Weighted Logical Formulae
Contextual Tables
Logistic Functions
Tabular Representation
P(D | A, B, C):

A     B     C     D     Prob
true  true  true  true  0.9
true  true  true  false 0.1
true  true  false true  0.9
true  true  false false 0.1
true  false true  true  0.2
true  false true  false 0.8
true  false false true  0.2
true  false false false 0.8
false true  true  true  0.3
false true  true  false 0.7
false true  false true  0.4
false true  false false 0.6
false false true  true  0.3
false false true  false 0.7
false false false true  0.4
false false false false 0.6
Decision Tree Representation
[Figure: a decision-tree representation of P(D | A, B, C), splitting on A, then on B or C.]
Rule Representation
0.9 : d ← a ∧ b
0.2 : d ← a ∧ ¬b
0.3 : d ← ¬a ∧ c
0.4 : d ← ¬a ∧ ¬c
Weighted Logical Formulae
d ↔ ((a ∧ b ∧ n0)
   ∨ (a ∧ ¬b ∧ n1)
   ∨ (¬a ∧ c ∧ n2)
   ∨ (¬a ∧ ¬c ∧ n3))

The ni are independent:

P(n0) = 0.9
P(n1) = 0.2
P(n2) = 0.3
P(n3) = 0.4
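Since the ni are independent, the conditional probabilities implied by the weighted formula can be recovered by enumerating the noise variables. A small sketch (variable names are illustrative) checking that under a ∧ b the formula reduces to n0, so P(d | a, b) should be P(n0) = 0.9:

```python
from itertools import product

p_n = [0.9, 0.2, 0.3, 0.4]   # P(n0), P(n1), P(n2), P(n3)

def p_d(a, b, c):
    """P(d) for fixed truth values of a, b, c, summing over the n_i."""
    total = 0.0
    for n in product([False, True], repeat=4):
        w = 1.0                      # probability of this noise assignment
        for ni, pi in zip(n, p_n):
            w *= pi if ni else 1 - pi
        d = ((a and b and n[0]) or (a and not b and n[1])
             or (not a and c and n[2]) or (not a and not c and n[3]))
        if d:
            total += w
    return total

print(p_d(True, True, True))     # ≈ 0.9  (a ∧ b branch: n0)
print(p_d(False, True, False))   # ≈ 0.4  (¬a ∧ ¬c branch: n3)
```

Each conditioning context selects exactly one disjunct, so each probability on the rule slide falls out directly.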
Contextual-Table Representation
[Figure: a contextual (context-specific) table representation of P(D | A, B, C).]
Logistic Functions
P(h | e) = P(h ∧ e) / P(e)
         = P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
         = 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
         = 1 / (1 + e^(−log(P(h ∧ e)/P(¬h ∧ e))))
         = sigmoid(log odds(h | e))

where

sigmoid(x) = 1 / (1 + e^(−x))
odds(h | e) = P(h ∧ e) / P(¬h ∧ e)
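A quick numeric check of the identity above (a conditional probability is the sigmoid of its log-odds), with made-up joint values:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

p_he, p_nhe = 0.12, 0.28            # hypothetical P(h ∧ e), P(¬h ∧ e)
posterior = p_he / (p_he + p_nhe)   # P(h | e) = 0.3
log_odds = math.log(p_he / p_nhe)
print(abs(posterior - sigmoid(log_odds)) < 1e-12)  # True
```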
Logistic Functions
A conditional probability is the sigmoid of the log-odds.
[Figure: plot of sigmoid(x) for −10 ≤ x ≤ 10, rising from 0 to 1.]

sigmoid(x) = 1 / (1 + e^(−x))

A logistic function is the sigmoid of a linear function.
Logistic Representation of Conditional Probability
P(d | A, B, C) = sigmoid(0.9† ∗ A ∗ B
                       + 0.2† ∗ A ∗ (1 − B)
                       + 0.3† ∗ (1 − A) ∗ C
                       + 0.4† ∗ (1 − A) ∗ (1 − C))

where 0.9† is sigmoid⁻¹(0.9).

P(d | A, B, C) = sigmoid(0.4†
                       + (0.2† − 0.4†) ∗ A
                       + (0.9† − 0.2†) ∗ A ∗ B
                       + ...
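Because exactly one of the four indicator products is 1 for any boolean A, B, C, the first form reproduces the earlier table for P(D | A, B, C) exactly. A sketch verifying this (sigmoid⁻¹ is written here as logit; the function names are illustrative):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):                    # sigmoid⁻¹; written 0.9† etc. on the slide
    return math.log(p / (1 - p))

def p_d(A, B, C):
    return sigmoid(logit(0.9) * A * B
                   + logit(0.2) * A * (1 - B)
                   + logit(0.3) * (1 - A) * C
                   + logit(0.4) * (1 - A) * (1 - C))

# Spot-check against the tabular representation:
for (A, B, C), expected in [((1, 1, 1), 0.9), ((1, 1, 0), 0.9),
                            ((1, 0, 1), 0.2), ((0, 1, 1), 0.3),
                            ((0, 0, 0), 0.4)]:
    assert abs(p_d(A, B, C) - expected) < 1e-9
print("matches the table")
```

For binary inputs the argument collapses to a single sigmoid⁻¹(p) term, and sigmoid(sigmoid⁻¹(p)) = p.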
Belief network inference
Main approaches to determine posterior distributions in graphical models:

Variable elimination, recursive conditioning: exploit the structure of the network to eliminate (sum out) the non-observed, non-query variables one at a time.

Stochastic simulation: random cases are generated according to the probability distributions.

Variational methods: find the closest tractable distribution to the (posterior) distribution we are interested in.

Bounding approaches: bound the conditional probabilities above and below, and iteratively reduce the bounds.

...
Factors
A factor is a representation of a function from a tuple of random variables into a number.

We write factor f on variables X1, ..., Xj as f(X1, ..., Xj).

We can assign some or all of the variables of a factor:
- f(X1 = v1, X2, ..., Xj), where v1 ∈ dom(X1), is a factor on X2, ..., Xj.
- f(X1 = v1, X2 = v2, ..., Xj = vj) is a number that is the value of f when each Xi has value vi.

The former is also written as f(X1, X2, ..., Xj)_{X1 = v1}, etc.
Example factors
r(X, Y, Z):

X Y Z  val
t t t  0.1
t t f  0.9
t f t  0.2
t f f  0.8
f t t  0.4
f t f  0.6
f f t  0.3
f f f  0.7

r(X=t, Y, Z):

Y Z  val
t t  0.1
t f  0.9
f t  0.2
f f  0.8

r(X=t, Y, Z=f):

Y  val
t  0.9
f  0.8

r(X=t, Y=f, Z=f) = 0.8
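One simple computational representation of a factor is a table mapping value tuples to numbers. The dict-based encoding below is an illustrative choice, not from the text; it reproduces the three assignments to r(X, Y, Z) above.

```python
# The factor r(X, Y, Z), keyed by (x, y, z) value tuples:
r = {("t","t","t"): 0.1, ("t","t","f"): 0.9, ("t","f","t"): 0.2,
     ("t","f","f"): 0.8, ("f","t","t"): 0.4, ("f","t","f"): 0.6,
     ("f","f","t"): 0.3, ("f","f","f"): 0.7}

def assign(factor, scope, **values):
    """Fix some variables of the factor; return (new_scope, new_table)."""
    keep = [i for i, v in enumerate(scope) if v not in values]
    table = {tuple(k[i] for i in keep): val
             for k, val in factor.items()
             if all(k[scope.index(v)] == x for v, x in values.items())}
    return [scope[i] for i in keep], table

scope = ["X", "Y", "Z"]
print(assign(r, scope, X="t")[1])                # the r(X=t, Y, Z) table
print(assign(r, scope, X="t", Z="f")[1])         # {('t',): 0.9, ('f',): 0.8}
print(assign(r, scope, X="t", Y="f", Z="f")[1])  # {(): 0.8}
```

Assigning all variables leaves a factor on no variables, i.e. a single number, matching r(X=t, Y=f, Z=f) = 0.8.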
Multiplying factors
The product of factor f1(X, Y) and f2(Y, Z), where Y are the variables in common, is the factor (f1 ∗ f2)(X, Y, Z) defined by:

(f1 ∗ f2)(X, Y, Z) = f1(X, Y) f2(Y, Z).
Multiplying factors example
f1:

A B  val
t t  0.1
t f  0.9
f t  0.2
f f  0.8

f2:

B C  val
t t  0.3
t f  0.7
f t  0.6
f f  0.4

f1 ∗ f2:

A B C  val
t t t  0.03
t t f  0.07
t f t  0.54
t f f  0.36
f t t  0.06
f t f  0.14
f f t  0.48
f f f  0.32
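The same table encoding supports multiplication directly: build the union of the two scopes, and for each joint assignment multiply the matching entries. A sketch (illustrative encoding, not from the text) that reconstructs the f1 ∗ f2 table above:

```python
from itertools import product

f1 = {("t","t"): 0.1, ("t","f"): 0.9, ("f","t"): 0.2, ("f","f"): 0.8}  # A, B
f2 = {("t","t"): 0.3, ("t","f"): 0.7, ("f","t"): 0.6, ("f","f"): 0.4}  # B, C

def multiply(fa, sa, fb, sb):
    """Pointwise product of two table factors; scopes are variable-name lists."""
    scope = sa + [v for v in sb if v not in sa]
    table = {}
    for vals in product("tf", repeat=len(scope)):
        a = dict(zip(scope, vals))           # the joint assignment
        table[vals] = (fa[tuple(a[v] for v in sa)]
                       * fb[tuple(a[v] for v in sb)])
    return scope, table

scope, f3 = multiply(f1, ["A", "B"], f2, ["B", "C"])
print(scope)                      # ['A', 'B', 'C']
print(round(f3[("t","f","t")], 2))  # 0.9 * 0.6 = 0.54
```

Note the result's size is exponential in the size of the combined scope, which is why the order of operations matters later for variable elimination.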
Summing out variables
We can sum out a variable, say X1 with range {v1, ..., vk}, from factor f(X1, ..., Xj), resulting in a factor on X2, ..., Xj defined by:

(Σ_{X1} f)(X2, ..., Xj) = f(X1 = v1, ..., Xj) + ··· + f(X1 = vk, ..., Xj)
Summing out a variable example
f3:

A B C  val
t t t  0.03
t t f  0.07
t f t  0.54
t f f  0.36
f t t  0.06
f t f  0.14
f f t  0.48
f f f  0.32

Σ_B f3:

A C  val
t t  0.57
t f  0.43
f t  0.54
f f  0.46
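Summing out is a one-pass aggregation over the table: drop the variable's position from each key and add up the values that collide. A sketch (illustrative encoding, not from the text) reproducing Σ_B f3:

```python
f3 = {("t","t","t"): 0.03, ("t","t","f"): 0.07, ("t","f","t"): 0.54,
      ("t","f","f"): 0.36, ("f","t","t"): 0.06, ("f","t","f"): 0.14,
      ("f","f","t"): 0.48, ("f","f","f"): 0.32}          # scope A, B, C

def sum_out(factor, scope, var):
    """Sum `var` out of a table factor; returns (new_scope, new_table)."""
    i = scope.index(var)
    out = {}
    for key, val in factor.items():
        rest = key[:i] + key[i+1:]           # key with var's position dropped
        out[rest] = out.get(rest, 0.0) + val
    return [v for v in scope if v != var], out

scope, g = sum_out(f3, ["A", "B", "C"], "B")
print(scope)                     # ['A', 'C']
print(round(g[("t","t")], 2))    # 0.03 + 0.54 = 0.57
print(round(g[("f","f")], 2))    # 0.14 + 0.32 = 0.46
```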
Exercise
Given factors:

s:

A  val
t  0.75
f  0.25

t:

A B  val
t t  0.6
t f  0.4
f t  0.2
f f  0.8

o:

A  val
t  0.3
f  0.1

What are the following factors over?

(a) s ∗ t
(b) Σ_A s ∗ t
(c) Σ_B s ∗ t
(d) Σ_A Σ_B s ∗ t
(e) s ∗ t ∗ o
(f) Σ_B s ∗ t ∗ o
Evidence
If we want to compute the posterior probability of Z given evidence Y1 = v1 ∧ ... ∧ Yj = vj:

P(Z | Y1 = v1, ..., Yj = vj)
  = P(Z, Y1 = v1, ..., Yj = vj) / P(Y1 = v1, ..., Yj = vj)
  = P(Z, Y1 = v1, ..., Yj = vj) / Σ_Z P(Z, Y1 = v1, ..., Yj = vj).

So the computation reduces to computing P(Z, Y1 = v1, ..., Yj = vj).

Normalize at the end, by summing out Z and dividing.
Probability of a conjunction
Suppose the variables of the belief network are X1, ..., Xn.
To compute P(Z, Y1 = v1, ..., Yj = vj), we sum out the other variables, {Z1, ..., Zk} = {X1, ..., Xn} − {Z} − {Y1, ..., Yj}.
We order the Zi into an elimination ordering.

P(Z, Y1 = v1, ..., Yj = vj)
  = Σ_{Zk} ··· Σ_{Z1} P(X1, ..., Xn)_{Y1 = v1, ..., Yj = vj}
  = Σ_{Zk} ··· Σ_{Z1} ∏_{i=1}^{n} P(Xi | parents(Xi))_{Y1 = v1, ..., Yj = vj}.
Computing sums of products
Computation in belief networks reduces to computing the sums ofproducts.
How can we compute ab + ac efficiently?
Distribute out the a giving a(b + c)
How can we compute Σ_{Z1} Π_{i=1}^{n} P(Xi | parents(Xi)) efficiently?
Distribute out those factors that don’t involve Z1.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 38 11 / 17
Variable elimination algorithm
To compute P(Z |Y1 = v1 ∧ . . . ∧ Yj = vj):
Construct a factor for each conditional probability.
Set the observed variables to their observed values.
Sum out each of the other variables (the Z1, . . . ,Zk)according to some elimination ordering.
Multiply the remaining factors. Normalize by dividing the resulting factor f(Z) by Σ_Z f(Z).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 42 12 / 17
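The steps above can be sketched in Python. This is a minimal illustration with a toy factor representation and Boolean variables only, not the book's AIPython implementation; the factor encoding (a pair of a variable tuple and a table keyed by value tuples) is an assumption made for brevity.

```python
from itertools import product

# A factor is (variables, table): table maps tuples of values to numbers.
# All variables here are Boolean for brevity.

def restrict(factor, var, val):
    """Set an observed variable to its observed value."""
    vs, table = factor
    if var not in vs:
        return factor
    i = vs.index(var)
    return (vs[:i] + vs[i+1:],
            {k[:i] + k[i+1:]: p for k, p in table.items() if k[i] == val})

def multiply(f1, f2):
    """Pointwise product of two factors."""
    (vs1, t1), (vs2, t2) = f1, f2
    vs = vs1 + tuple(v for v in vs2 if v not in vs1)
    table = {}
    for vals in product([True, False], repeat=len(vs)):
        asg = dict(zip(vs, vals))
        table[vals] = (t1[tuple(asg[v] for v in vs1)]
                       * t2[tuple(asg[v] for v in vs2)])
    return (vs, table)

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    vs, table = factor
    i = vs.index(var)
    out = {}
    for k, p in table.items():
        nk = k[:i] + k[i+1:]
        out[nk] = out.get(nk, 0.0) + p
    return (vs[:i] + vs[i+1:], out)

def normalize(factor):
    """Divide the final factor f(Z) by the sum over Z."""
    vs, table = factor
    z = sum(table.values())
    return (vs, {k: p / z for k, p in table.items()})

# Tiny network A -> B with made-up numbers: query P(A | B = true).
pA = (("A",), {(True,): 0.4, (False,): 0.6})
pBgivenA = (("A", "B"), {(True, True): 0.9, (True, False): 0.1,
                         (False, True): 0.2, (False, False): 0.8})
posterior = normalize(multiply(pA, restrict(pBgivenA, "B", True)))
# posterior is P(A | B = true) = 0.36 / 0.48 = 0.75 for A = true
```

Here `restrict` sets the observed variables, `sum_out` eliminates a variable, and `normalize` performs the final division by Σ_Z f(Z).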
Summing out a variable
To sum out a variable Zj from a product f1, . . . , fk of factors:
Partition the factors into:
  - those that don't contain Zj, say f1, . . . , fi,
  - those that contain Zj, say fi+1, . . . , fk.

We know:

  Σ_{Zj} f1 ∗ · · · ∗ fk = f1 ∗ · · · ∗ fi ∗ (Σ_{Zj} fi+1 ∗ · · · ∗ fk).

Explicitly construct a representation of the rightmost factor. Replace the factors fi+1, . . . , fk by the new factor.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 43 13 / 17
Example
[Belief network diagram: variables A, B, C, D, E, F, G]

P(E | g) = P(E ∧ g) / Σ_E P(E ∧ g)

P(E ∧ g)
  = Σ_F Σ_B Σ_C Σ_A Σ_D P(A) P(B | AC) P(C) P(D | C) P(E | B) P(F | E) P(g | ED)
  = (Σ_F P(F | E)) Σ_B P(E | B) Σ_C (P(C) (Σ_A P(A) P(B | AC)) (Σ_D P(D | C) P(g | ED)))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 44 14 / 17
Variable elimination example
[Belief network diagram: variables A, B, C, D, E, F, G, H, I]

P(A) P(B|A)              --elim A-->  f1(B)
P(C) P(D|BC) P(E|C)      --elim C-->  f2(BDE)
P(F|D) P(G|FE)
P(H|G)                   --obs H-->   f3(G)
P(I|G)                   --elim I-->  f4(G)

P(D, h) = . . . (Σ_A P(A) P(B|A)) (Σ_I P(I|G))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 53 15 / 17
Variable Elimination example
[Belief network diagram: chain A → B → C → D → E → F, with G a child of C and H a child of E]

Query: P(G | f); elimination ordering: A, H, E, D, B, C

P(G | f) ∝ Σ_C Σ_B Σ_D Σ_E Σ_H Σ_A P(A) P(B|A) P(C|B) P(D|C) P(E|D) P(f|E) P(G|C) P(H|E)

  = Σ_C (Σ_B (Σ_A P(A) P(B|A)) P(C|B)) P(G|C) (Σ_D P(D|C) (Σ_E P(E|D) P(f|E) Σ_H P(H|E)))
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 61 16 / 17
Pruning Irrelevant Variables (Belief networks)
Suppose you want to compute P(X | e1 . . . ek):
Prune any variables that have no observed or queried descendants.

Connect the parents of any observed variable.

Remove arc directions.

Remove observed variables.

Remove any variables not connected to X in the resulting (undirected) graph.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 64 17 / 17
Markov chain
A Markov chain is a special sort of belief network:
S0 S1 S2 S3 S4
What probabilities need to be specified?
P(S0) specifies initial conditions
P(Si+1|Si) specifies the dynamics
What independence assumptions are made?
P(Si+1|S0, . . . , Si) = P(Si+1|Si).
Often St represents the state at time t. Intuitively, St conveys all of the information about the history that can affect the future states.
“The future is independent of the past given the present.”
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 1 1 / 23
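As a concrete illustration, a Markov chain can be simulated by sampling S0 from P(S0) and then repeatedly sampling Si+1 from P(Si+1 | Si). The two-state chain below ("sunny"/"rainy") and all its numbers are made up for illustration, not from the slides.

```python
import random

# Hypothetical two-state chain: P(S0) and the dynamics P(S_{i+1} | S_i).
P0 = {"sunny": 0.5, "rainy": 0.5}
T = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sample(dist, rng):
    """Draw a value from a discrete distribution."""
    y, acc = rng.random(), 0.0
    for v, p in dist.items():
        acc += p
        if y <= acc:
            return v
    return v

def simulate(n, rng):
    """Sample S0, ..., S_{n-1}; only the current state matters."""
    s = sample(P0, rng)
    states = [s]
    for _ in range(n - 1):
        s = sample(T[s], rng)
        states.append(s)
    return states

states = simulate(1000, random.Random(0))
```

Because only the current state is consulted in the loop, the simulation makes exactly the independence assumption above: the future is independent of the past given the present.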
Stationary Markov chain
A Markov chain is stationary if, for all i > 0 and i′ > 0, P(Si+1 | Si) = P(Si′+1 | Si′).

We specify P(S0) and P(Si+1 | Si).
  - Simple model, easy to specify
  - Often the natural model
  - The network can extend indefinitely
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 7 2 / 23
Stationary Distribution
A distribution P over states is a stationary distribution if, for each state s, P(Si+1 = s) = P(Si = s).

Every Markov chain has a stationary distribution.

A Markov chain is ergodic if, for any two states s1 and s2, there is a non-zero probability of eventually reaching s2 from s1.

A Markov chain is periodic if there is a strict temporal regularity in visiting states: some state is visited only at times t with t mod n = m, for some fixed n and m.

An ergodic and aperiodic Markov chain has a unique stationary distribution P, and P(s) = lim_{i→∞} Pi(s), the equilibrium distribution.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 8 3 / 23
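The equilibrium distribution of a small ergodic, aperiodic chain can be found by iterating P_{i+1}(s′) = Σ_s P_i(s) P(s′ | s) until it stops changing. The two-state transition numbers below are made up for illustration.

```python
# Hypothetical two-state chain; iterate the distribution update to a fixed point.
T = {"a": {"a": 0.9, "b": 0.1},
     "b": {"a": 0.5, "b": 0.5}}

def equilibrium(T, iters=200):
    """Power iteration: start anywhere, apply the dynamics repeatedly."""
    states = list(T)
    p = {s: 1.0 / len(states) for s in states}   # any starting distribution
    for _ in range(iters):
        p = {s2: sum(p[s1] * T[s1][s2] for s1 in states) for s2 in states}
    return p

p = equilibrium(T)
# At the fixed point, flow a -> b (P(a) * 0.1) balances flow b -> a
# (P(b) * 0.5), giving P(a) = 5/6.
```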
Pagerank
Consider the Markov chain:
Domain of Si is the set of all web pages
P(S0) is uniform: P(S0 = pj) = 1/N

P(Si+1 = pj | Si = pk) = (1 − d)/N + d ∗ { 1/nk  if pk links to pj
                                           1/N   if pk has no links
                                           0     otherwise }

where there are N web pages and nk links from page pk

d ≈ 0.85 is the probability someone keeps surfing the web

This Markov chain converges to a distribution over web pages (the original computation used i = 52 iterations over 322 million links): Pagerank, the basis for Google's initial search engine.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 13 4 / 23
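The transition probabilities above can be transcribed directly into a small power-iteration computation. The three-page link graph below is a made-up toy example.

```python
def pagerank(links, d=0.85, iters=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    N = len(pages)
    p = {pg: 1.0 / N for pg in pages}          # uniform P(S0)
    for _ in range(iters):
        new = {pg: (1 - d) / N for pg in pages}
        for pk in pages:
            if links[pk]:                      # follow one of the n_k links
                for pj in links[pk]:
                    new[pj] += d * p[pk] / len(links[pk])
            else:                              # no links: jump uniformly
                for pj in pages:
                    new[pj] += d * p[pk] / N
        p = new
    return p

# Toy web: a 3-page cycle; by symmetry each page gets rank 1/3.
ranks = pagerank({"p1": ["p2"], "p2": ["p3"], "p3": ["p1"]})
```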
Hidden Markov Model
A Hidden Markov Model (HMM) is a belief network:
S0 S1 S2 S3 S4
O0 O1 O2 O3 O4
The probabilities that need to be specified:
P(S0) specifies initial conditions
P(Si+1|Si) specifies the dynamics
P(Oi |Si) specifies the sensor model
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 15 5 / 23
Filtering
Filtering: P(Si | o1, . . . , oi)

What is the current belief state based on the observation history?

P(Si | o1, . . . , oi) ∝ P(oi | Si, o1, . . . , oi−1) P(Si | o1, . . . , oi−1)
  = ??? Σ_{Si−1} P(Si, Si−1 | o1, . . . , oi−1)
  = ???

Observe O0, query S0; then observe O1, query S1; then observe O2, query S2; . . .
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 17 6 / 23
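Simplifying with the HMM's independence assumptions gives the standard recursive update, P(Si | o1..oi) ∝ P(oi | Si) Σ_{Si−1} P(Si | Si−1) P(Si−1 | o1..oi−1). A sketch with made-up two-state numbers:

```python
def filter_step(prev, trans, sensor, obs):
    """One filtering step.
    prev: P(S_{i-1} | o_1..o_{i-1});  trans[s][s2]: P(s2 | s);
    sensor[s][o]: P(o | s)."""
    states = list(prev)
    # Predict: P(S_i | o_1..o_{i-1}), summing out S_{i-1}.
    pred = {s2: sum(prev[s1] * trans[s1][s2] for s1 in states)
            for s2 in states}
    # Update: multiply by P(o_i | S_i) and normalize.
    unnorm = {s: sensor[s][obs] * pred[s] for s in states}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# Made-up two-state example.
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.3, "b": 0.7}}
sensor = {"a": {"ping": 0.9, "quiet": 0.1},
          "b": {"ping": 0.2, "quiet": 0.8}}
belief = filter_step({"a": 0.5, "b": 0.5}, trans, sensor, "ping")
# The prediction stays uniform here, so belief["a"] = 0.45 / 0.55.
```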
Example: localization
Suppose a robot wants to determine its location based on its actions and its sensor readings: Localization
This can be represented by the augmented HMM:
Loc0 Loc1 Loc2 Loc3 Loc4
Obs0 Obs1 Obs2 Obs3 Obs4
Act0 Act1 Act2 Act3
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 22 7 / 23
Example localization domain
Circular corridor, with 16 locations:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Doors at positions: 2, 4, 7, 11.
Noisy Sensors
Stochastic Dynamics
Robot starts at an unknown location and must determine where it is.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 23 8 / 23
Example Sensor Model
P(Observe Door | At Door) = 0.8
P(Observe Door | Not At Door) = 0.1
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 24 9 / 23
Example Dynamics Model
P(loct+1 = L|actiont = goRight ∧ loct = L) = 0.1
P(loct+1 = L + 1|actiont = goRight ∧ loct = L) = 0.8
P(loct+1 = L + 2|actiont = goRight ∧ loct = L) = 0.074
P(loct+1 = L′ | actiont = goRight ∧ loct = L) = 0.002 for any other location L′.

  - All location arithmetic is modulo 16.
  - The action goLeft works the same but to the left.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 25 10 / 23
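Putting the sensor and dynamics models together, one filtering step for the 16-location corridor can be sketched as follows. The numbers come from the two slides above; starting from a uniform belief is an assumption, and this is illustrative code rather than the book's implementation.

```python
DOORS = {2, 4, 7, 11}
N = 16

def go_right(belief):
    """Apply the goRight dynamics; all location arithmetic is modulo 16."""
    new = [0.0] * N
    for loc, p in enumerate(belief):
        targets = {loc % N, (loc + 1) % N, (loc + 2) % N}
        new[loc % N] += 0.1 * p            # stay put
        new[(loc + 1) % N] += 0.8 * p      # move one step right
        new[(loc + 2) % N] += 0.074 * p    # overshoot by one
        for other in range(N):             # small chance of any other location
            if other not in targets:
                new[other] += 0.002 * p
    return new

def observe_door(belief):
    """Condition on observing a door and normalize."""
    unnorm = [p * (0.8 if loc in DOORS else 0.1)
              for loc, p in enumerate(belief)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

belief = observe_door(go_right([1.0 / N] * N))
# A uniform belief stays uniform under goRight; observing a door then puts
# probability 0.8 / (4 * 0.8 + 12 * 0.1) = 2/11 on each door location.
```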
Combining sensor information
Example: we can combine information from a light sensor and the door sensor. Sensor Fusion
Loc0 Loc1 Loc2 Loc3 Loc4
D0 D1 D2 D3 D4
Act0 Act1 Act2 Act3
L0 L1 L2 L3 L4
Loct: robot location at time t
Dt: door sensor value at time t
Lt: light sensor value at time t
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 26 11 / 23
Naive Bayes Classifier: User’s request for help
H
"able" "absent" "add" "zoom". . .
H is the help page the user is interested in.
What probabilities are required?
P(hi) for each help page hi. The user is interested in one best help page, so Σ_i P(hi) = 1.

P(wj | hi) for each word wj given page hi. There can be multiple words used in a query.

Given a help query: condition on the words in the query and display the most likely help page.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 27 12 / 23
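The classifier can be sketched with a hypothetical two-page model over a three-word vocabulary; all pages, words, and probabilities below are made up for illustration. Note this sketch conditions only on the words that appear in the query (ignoring absent words).

```python
# Hypothetical help pages and made-up probabilities.
prior = {"print_help": 0.5, "zoom_help": 0.5}                 # P(h_i)
word_given_page = {                                           # P(w_j | h_i)
    "print_help": {"print": 0.8, "zoom": 0.05, "page": 0.4},
    "zoom_help":  {"print": 0.1, "zoom": 0.7,  "page": 0.3},
}

def best_page(query_words):
    """Condition on the words appearing in the query; return the most
    likely page and the posterior over pages."""
    scores = {}
    for h, ph in prior.items():
        score = ph
        for w in query_words:
            score *= word_given_page[h][w]   # naive Bayes independence
        scores[h] = score
    z = sum(scores.values())
    return max(scores, key=scores.get), {h: s / z for h, s in scores.items()}

page, post = best_page(["zoom", "page"])
# "zoom_help" scores 0.5 * 0.7 * 0.3 = 0.105 vs 0.5 * 0.05 * 0.4 = 0.01.
```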
Simple Language Models: set-of-words
Sentence: w1, w2, w3, . . .

Set-of-words model:

  ”a” ”aardvark” . . . ”zzz”

Each variable is Boolean: true when the word is in the sentence and false otherwise.

What probabilities are provided?
  - P(”a”), P(”aardvark”), . . . , P(”zzz”)

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 33 13 / 23
Simple Language Models: bag-of-words
Sentence: w1, w2, w3, . . . , wn

Bag-of-words or unigram model:

  W1 W2 W3 . . . Wn

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi) is a distribution over words for each position

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 37 14 / 23
Simple Language Models: bigram
Sentence: w1, w2, w3, . . . , wn

Bigram model:

  W1 → W2 → W3 → . . . → Wn

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi | wi−1) is a distribution over words for each position, given the previous word

How do we condition on the question “how can I phone my phone”?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 41 15 / 23
Simple Language Models: trigram
Sentence: w1, w2, w3, . . . , wn

Trigram model:

  [Chain W1, W2, W3, W4, . . . , Wn; each Wi depends on the previous two words]

Domain of each variable is the set of all words.

What probabilities are provided?
  - P(wi | wi−1, wi−2)

N-gram: P(wi | wi−1, . . . , wi−n+1) is a distribution over words given the previous n − 1 words
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 45 16 / 23
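To show how an n-gram model assigns probability to a sentence, here is a bigram sketch over a toy vocabulary. All probabilities are invented, and the start/end markers `<s>` and `</s>` are an assumed convention.

```python
bigram = {                       # P(w_i | w_{i-1}), made-up numbers
    "<s>":   {"how": 0.6, "my": 0.4},
    "how":   {"can": 1.0},
    "can":   {"i": 1.0},
    "i":     {"phone": 0.7, "my": 0.3},
    "my":    {"phone": 1.0},
    "phone": {"my": 0.5, "</s>": 0.5},
}

def sentence_prob(words):
    """Chain the conditional probabilities along the sentence."""
    p, prev = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= bigram[prev].get(w, 0.0)   # unseen bigrams get probability 0
        prev = w
    return p

p = sentence_prob("how can i phone my phone".split())
# 0.6 * 1.0 * 1.0 * 0.7 * 0.5 * 1.0 * 0.5 = 0.105
```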
Logic, Probability, Statistics, Ontology over time
From: Google Books Ngram Viewer (https://books.google.com/ngrams)
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 48 17 / 23
Predictive Typing and Error Correction
[Network diagram: words W1, W2, W3, . . . ; each word Wi has observed letters L1i, L2i, . . . , Lki]

domain(Wi) = {”a”, ”aardvark”, . . . , ”zzz”, ”⊥”, ”?”}
domain(Lji) = {”a”, ”b”, ”c”, . . . , ”z”, ”1”, ”2”, . . . }
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 49 18 / 23
Beyond N-grams
A man with a big hairy cat drank the cold milk.
Who or what drank the milk?
Simple syntax diagram:

[Parse tree: s → np vp, where the np is “a man” with the pp “with a big hairy cat”, and the vp is “drank” with the np “the cold milk”]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 50 19 / 23
Topic Model
[Figure: a topic–word network linking topics (tools, food, fish, finance) to words ("nut", "bolt", "tuna", "shark", "salmon")]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 52 20 / 23
Google’s rephil
[Figure: a bipartite topic–word network with 900,000 topics, 12,000,000 words ("Aaron's beard", "aardvark", . . . , "zzz"), and 350,000,000 links]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 54 22 / 23
Deep Belief Networks
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 55 23 / 23
Stochastic Simulation
Idea: probabilities ↔ samples
Get probabilities from samples:
X      count              X      probability
x1     n1                 x1     n1/m
...    ...          ↔     ...    ...
xk     nk                 xk     nk/m
total  m
If we could sample from a variable's (posterior) distribution, we could estimate its (posterior) probability.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 1 1 / 15
Generating samples from a distribution
For a variable X with a discrete domain or a (one-dimensional) real domain:
Totally order the values of the domain of X .
Generate the cumulative probability distribution:f (x) = P(X ≤ x).
Select a value y uniformly in the range [0, 1].
Select the smallest x such that f (x) ≥ y.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 2 2 / 15
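The procedure above is inverse-CDF sampling. A minimal sketch for a discrete domain, using a hypothetical four-value distribution:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample(values, probs):
    """Inverse-CDF sampling: build f(x) = P(X <= x) over the ordered
    values, pick y uniformly in [0, 1], and return the smallest x
    with f(x) >= y."""
    cdf = list(accumulate(probs))
    y = random.uniform(0, 1)
    # min() guards against float round-off leaving cdf[-1] slightly below 1.
    return values[min(bisect_left(cdf, y), len(values) - 1)]

# Hypothetical distribution over four values.
vals = ["v1", "v2", "v3", "v4"]
probs = [0.1, 0.4, 0.3, 0.2]
print(sample(vals, probs))
```

Over many draws, the frequency of each value approaches its probability.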
Cumulative Distribution
[Figure: the distribution P(X) over values v1, . . . , v4 (left) and its cumulative distribution f(X), a step function rising from 0 to 1 (right)]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 3 3 / 15
Hoeffding’s inequality
Theorem (Hoeffding): Suppose p is the true probability, and s is the sample average from n independent samples; then

P(|s − p| > ε) ≤ 2e^(−2nε²)

Guarantees a probably approximately correct estimate of the probability.

If you are willing to have an error greater than ε in δ of the cases, solve 2e^(−2nε²) < δ for n, which gives

n > −ln(δ/2) / (2ε²)

ε      δ      n
0.1    0.05   185
0.01   0.05   18,445
0.1    0.01   265
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 4 4 / 15
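The table's sample sizes follow directly from the bound; a short check of the formula:

```python
import math

def hoeffding_n(eps, delta):
    """Smallest integer n satisfying n > -ln(delta/2) / (2 eps^2),
    guaranteeing P(|s - p| > eps) <= delta."""
    return math.ceil(-math.log(delta / 2) / (2 * eps ** 2))

print(hoeffding_n(0.1, 0.05))   # → 185
print(hoeffding_n(0.01, 0.05))  # → 18445
print(hoeffding_n(0.1, 0.01))   # → 265
```

Note how halving the error ε costs a factor of four in samples, while tightening δ costs only logarithmically.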
Forward sampling in a belief network
Sample the variables one at a time; sample the parents of X before sampling X.

Given values for the parents of X, sample from the probability of X given its parents.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 7 5 / 15
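A minimal sketch of forward sampling, using a hypothetical two-variable network Fire → Smoke with made-up CPT numbers in the style of the lecture examples:

```python
import random

# Hypothetical CPTs for a tiny belief network Fire -> Smoke.
P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def forward_sample():
    # Parents are sampled before children; each variable is drawn
    # from P(X | parents(X)) using the already-sampled parent values.
    fire = random.random() < P_fire
    smoke = random.random() < P_smoke_given[fire]
    return {"fire": fire, "smoke": smoke}

random.seed(0)
samples = [forward_sample() for _ in range(10000)]
# Estimate P(smoke); exactly 0.01*0.9 + 0.99*0.01 = 0.0189.
print(sum(s["smoke"] for s in samples) / len(samples))
```

The empirical fraction of smoke-samples converges to the marginal P(smoke) as the number of samples grows.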
Rejection Sampling
To estimate a posterior probability given evidence Y1 = v1 ∧ . . . ∧ Yj = vj:

Reject any sample that assigns any Yi to a value other than vi.

The non-rejected samples are distributed according to the posterior probability:

P(α | evidence) ≈ (Σ_{sample ⊨ α} 1) / (Σ_{sample} 1)

where we consider only samples consistent with the evidence.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 8 6 / 15
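A sketch of rejection sampling on the same hypothetical Fire → Smoke network as above (CPT numbers are illustrative), estimating P(fire | smoke):

```python
import random

P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def rejection_sample_posterior(n):
    accepted = []
    for _ in range(n):
        fire = random.random() < P_fire
        smoke = random.random() < P_smoke_given[fire]
        if smoke:                  # evidence: Smoke = true; else reject
            accepted.append(fire)
    # Among accepted samples, the fraction with fire ≈ P(fire | smoke).
    return sum(accepted) / len(accepted), len(accepted)

random.seed(1)
est, used = rejection_sample_posterior(100000)
# Exact posterior is 0.009 / 0.0189 ≈ 0.476, but only about 2% of
# samples match the evidence, so most of the work is wasted.
print(est, used)
```

This illustrates the weakness of rejection sampling: when the evidence is unlikely, almost all samples are rejected.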
Rejection Sampling Example: P(ta | sm, re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

Observe Sm = true, Re = true

       Ta     Fi     Al     Sm     Le     Re
s1     false  true   false  true   false  false   ✗
s2     false  true   true   true   true   true    ✓
s3     true   false  true   false  —      —       ✗
s4     true   true   true   true   true   true    ✓
. . .
s1000  false  false  false  false  —      —       ✗

P(sm) = 0.02
P(re | sm) = 0.32
How many samples are rejected?
How many samples are used?
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 9 7 / 15
Importance Sampling
Samples have weights: a real number associated with each sample that takes the evidence into account.

The probability of a proposition is the weighted average of the samples:

P(α | evidence) ≈ (Σ_{sample ⊨ α} weight(sample)) / (Σ_{sample} weight(sample))

Mix exact inference with sampling: don't sample all of the variables, but weight each sample according to P(evidence | sample).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 15 8 / 15
Importance Sampling (Likelihood Weighting)
procedure likelihood_weighting(Bn, e, Q, n):
    ans[1 : k] := 0 where k is the size of dom(Q)
    repeat n times:
        weight := 1
        for each variable Xi in order:
            if Xi = oi is observed:
                weight := weight × P(Xi = oi | parents(Xi))
            else:
                assign Xi a random sample of P(Xi | parents(Xi))
        if Q has value v:
            ans[v] := ans[v] + weight
    return ans / Σv ans[v]
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 16 9 / 15
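The procedure above, specialized to a hypothetical two-variable network Fire → Smoke (made-up CPT numbers), querying P(fire | smoke = true):

```python
import random

P_fire = 0.01
P_smoke_given = {True: 0.9, False: 0.01}   # P(smoke | fire)

def likelihood_weighting(n):
    ans = {True: 0.0, False: 0.0}
    for _ in range(n):
        weight = 1.0
        # Fire is not observed: sample it from P(fire).
        fire = random.random() < P_fire
        # Smoke is observed true: don't sample it; multiply its
        # likelihood P(smoke = true | fire) into the weight instead.
        weight *= P_smoke_given[fire]
        ans[fire] += weight
    total = ans[True] + ans[False]
    return {v: w / total for v, w in ans.items()}

random.seed(0)
post = likelihood_weighting(100000)
print(post[True])  # ≈ 0.476, the exact posterior for these CPTs
```

Unlike rejection sampling, no sample is wasted: every sample contributes its weight.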
Importance Sampling Example: P(ta | sm, re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

       Ta     Fi     Al     Le      Weight
s1     true   false  true   false   0.01 × 0.01
s2     false  true   false  false   0.9 × 0.01
s3     false  true   true   true    0.9 × 0.75
s4     true   true   true   true    0.9 × 0.75
. . .
s1000  false  false  true   true    0.01 × 0.75

P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(re | le) = 0.75
P(re | ¬le) = 0.01
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 17 10 / 15
Importance Sampling Example: P(le | sm, ta,¬re)
[Belief network: Ta → Al, Fi → Al, Fi → Sm, Al → Le, Le → Re]

P(ta) = 0.02
P(fi) = 0.01
P(al | fi ∧ ta) = 0.5
P(al | fi ∧ ¬ta) = 0.99
P(al | ¬fi ∧ ta) = 0.85
P(al | ¬fi ∧ ¬ta) = 0.0001
P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(le | al) = 0.88
P(le | ¬al) = 0.001
P(re | le) = 0.75
P(re | ¬le) = 0.01
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 19 11 / 15
Computing Expectations & Proposal Distributions
Expected value of f with respect to distribution P:

E_P(f) = Σ_w f(w) · P(w) ≈ (1/n) Σ_s f(s)

where s is sampled with probability P, and there are n samples.

Rewriting with another distribution Q:

E_P(f) = Σ_w f(w) · (P(w)/Q(w)) · Q(w) ≈ (1/n) Σ_s f(s) · P(s)/Q(s)

where s is selected according to the distribution Q.
The distribution Q is called a proposal distribution.
Requirement: if P(c) > 0 then Q(c) > 0.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 20 12 / 15
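A minimal sketch of the identity above: estimate E_P(f) by sampling from a proposal Q and weighting each sample by P/Q. The target P, proposal Q, and function f are hypothetical choices for illustration:

```python
import random

P = {0: 0.8, 1: 0.2}          # target distribution
Q = {0: 0.5, 1: 0.5}          # proposal; covers P's support (Q > 0 where P > 0)
f = lambda w: 10 * w          # E_P(f) = 0.8*0 + 0.2*10 = 2.0

def importance_estimate(n):
    total = 0.0
    for _ in range(n):
        s = 0 if random.random() < Q[0] else 1   # sample s ~ Q
        total += f(s) * P[s] / Q[s]              # weight by P(s)/Q(s)
    return total / n

random.seed(0)
est = importance_estimate(100000)
print(est)  # ≈ 2.0
```

The closer Q is to P (weighted by |f|), the lower the variance of the estimate.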
Particle Filtering
Importance sampling can be seen as:

for each particle:
    for each variable:
        sample / absorb evidence / update query

where a particle is one of the samples.

Instead we could do:

for each variable:
    for each particle:
        sample / absorb evidence / update query

Why?
We gain a new operation: resampling.
It works with infinitely many variables (e.g., HMMs).
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 24 13 / 15
Particle Filtering for HMMs
Start with randomly chosen particles (say 1000).

Each particle represents a history.

Initially, sample states in proportion to their probability.

Repeat:
- Absorb evidence: weight each particle by the probability of the evidence given the state of the particle.
- Resample: select each particle at random, in proportion to the weight of the particle. Some particles may be duplicated, some may be removed. All new particles have the same weight.
- Transition: sample the next state for each particle according to the transition probabilities.

To answer a query about the current state, use the set of particles as data.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 27 14 / 15
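The loop above can be sketched for a tiny two-state HMM; the states, transition and emission probabilities, and the observation sequence are all hypothetical:

```python
import random

STATES = ["rain", "sun"]
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emit = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def categorical(dist):
    # Sample a value from a {value: probability} dictionary.
    r = random.random()
    for value, p in dist.items():
        r -= p
        if r <= 0:
            return value
    return value

def particle_filter(observations, n=5000):
    particles = [random.choice(STATES) for _ in range(n)]  # uniform prior
    for obs in observations:
        # Absorb evidence: weight each particle by P(obs | state).
        weights = [emit[s][obs] for s in particles]
        # Resample in proportion to the weights (duplicates allowed).
        particles = random.choices(particles, weights=weights, k=n)
        # Transition: sample each particle's next state.
        particles = [categorical(trans[s]) for s in particles]
    return particles

random.seed(0)
final = particle_filter(["umbrella", "umbrella", "none"])
# Because the transition step runs last, this is the predictive
# distribution over the state after the final observation.
frac = sum(s == "rain" for s in final) / len(final)
print(frac)
```

Resampling concentrates particles on likely histories, which is what lets the filter run over an unbounded sequence of variables.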
Markov Chain Monte Carlo
To sample from a distribution P :
Create an (ergodic and aperiodic) Markov chain with P as its equilibrium distribution. Let T(Si+1 | Si) be the transition probability; given state s, sample state s′ from T(S | s).

After a while, the states sampled will be distributed according to P.

Ignore the first samples ("burn-in"); use the remaining samples.

Samples are not independent of each other ("autocorrelation"). Sometimes only a subset (e.g., 1/1000) of them is used ("thinning").

Gibbs sampler: sample each non-observed variable from the distribution of the variable given the current (or observed) values of the variables in its Markov blanket.
c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 28 15 / 15
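A minimal Gibbs-sampler sketch over two binary variables; the joint distribution table is a hypothetical example, and each conditional is derived from it (in a belief network the conditional would come from the variable's Markov blanket):

```python
import random

# Hypothetical joint distribution over (a, b), a, b in {0, 1}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def cond(var, other_val):
    # P(var = 1 | other = other_val), read off the joint table.
    if var == "a":
        p1, p0 = joint[(1, other_val)], joint[(0, other_val)]
    else:
        p1, p0 = joint[(other_val, 1)], joint[(other_val, 0)]
    return p1 / (p0 + p1)

def gibbs(n, burn_in=1000):
    a, b = 0, 0
    samples = []
    for i in range(n + burn_in):
        a = int(random.random() < cond("a", b))  # resample a given b
        b = int(random.random() < cond("b", a))  # resample b given a
        if i >= burn_in:                         # discard burn-in samples
            samples.append((a, b))
    return samples

random.seed(0)
samples = gibbs(50000)
est = sum(a for a, _ in samples) / len(samples)
print(est)  # ≈ P(a = 1) = 0.2 + 0.4 = 0.6
```

The retained (autocorrelated) samples are distributed according to the joint, so sample averages estimate marginals.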