
ST3453: Stochastic Models in Space and Time I

Lecturer: Jason Wyse
LaTeX: James O'Donnell

December 3, 2014

Contents

1 Examples of Stochastic Processes
1.1 Importance
1.2 Gambler's Ruin
1.3 Social Mobility
1.4 Fancy a Drink Tonight?
1.5 Modelling Evolutionary Divergence
2 The Markov Property and Markov Chains
2.1 Definition of Markov Chain
2.2 The Markov Property
2.3 Transition Probabilities and the Transition Matrix
2.4 Multistep Transition Probabilities
3 Properties of Markov Chains
3.1 Decomposability
3.2 Periodicity
3.3 Stability
3.4 Long-Run Regularity
3.5 Computing Stable Distributions
3.6 Detailed Balance
4 Poisson Processes
4.1 Assumptions of the Poisson Process
4.2 Probability Law of Poisson Process
4.3 Moments of the Poisson Distribution
4.4 Times of First Arrival
4.5 Memoryless Property of the Exponential Distribution
4.6 Time to Occurrence of rth Event
4.7 Summary of Inter-Arrival Times
4.8 General Poisson Process
4.9 Compound Poisson Processes
5 Some Continuous Time Processes
5.1 Brownian Motion
5.2 Gaussian Processes
5.3 Brownian Motion With Drift
5.4 Finance Applications
6 Applications of Stochastic Processes: Bayesian Model Estimation Through Markov Chain Monte Carlo
6.1 Likelihood and Maximum Likelihood
6.2 Prior Distributions
6.3 Posterior Distributions
6.4 Posterior Quantities of Interest
6.5 MCMC: The Key Ideas
6.6 The Gibbs Sampling Algorithm
6.7 The Metropolis-Hastings Algorithm
7 Spatial Processes
A Tutorials
A.1 Tutorial 1
A.2 Tutorial 2

1 Examples of Stochastic Processes

1.1 Importance

Stochastic models comprise some of the most powerful methods available to data analysts in the description of observed real-life processes. In this module we will study Markov models. The importance of these models is highlighted by the fact that:

1. a huge number of physical, biological, economic and social phenomena can be modelled naturally using them;

2. there is a well developed body of theory and methods which allows us to do this modelling in a correct way.

On a very crude level one could describe Markov models as models which use the information observed so far to give an idea of what to expect next.

Let's begin by considering some examples and giving informal notions of some key concepts.

1.2 Gambler’s Ruin

Consider playing a game where on any play of the game you win €1 with probability p = 0.4 or alternatively lose €1 with probability 1 − p = 0.6. Suppose that you decide to stop if your fortune reaches €N. If you reach €0 you can't play anymore. On one play of the game the expected winnings are

(€1)p + (−€1)(1 − p) = €(2p − 1) = −€0.2

so that the casino has a margin on the game.

Let x_n be the amount of money you have after n plays of the game. If you play again, you either have x_n + 1 or x_n − 1 after the next play of the game, so x_{n+1} only depends on x_n:

x_{n+1} = x_n + 1 with probability p, or x_n − 1 with probability 1 − p.

So x_n has what we call the Markov property. This means that given the current state x_n, any other information about the past is irrelevant for predicting the next state x_{n+1}. If you are still playing at time n, i.e. 0 < x_n < N and x_n = i, then

P(x_{n+1} = i + 1 | x_n = i, x_{n−1} = i_{n−1}, ..., x_0 = i_0) = p = 0.4

where i_{n−1}, ..., i_0 are past values of your fortune. Note the use here of the conditioning event: it is the probability that x_{n+1} = i + 1 conditional on the past fortunes.

Recall that

P(B | A) = P(B ∩ A) / P(A)

We say that x_n is a discrete time Markov chain (indexed by n = 0, 1, 2, ...) with transition probabilities p(i, j) if for any j, i, i_{n−1}, ..., i_0:

P(x_{n+1} = j | x_n = i, x_{n−1} = i_{n−1}, ..., x_0 = i_0) = p(i, j)

P(x_{n+1} = j | x_n = i) = p(i, j)

The Markov property means that we can forget about the past; only the present is useful for predicting the future. In formulating the p(i, j) above we have assumed they are temporally homogeneous, that is,

p(i, j) = P(x_{n+1} = j | x_n = i)

does not depend on n.

So the transition probabilities really determine the rules of the game. Usually, we put this information in a matrix.

0 < i < N:  p(i, i+1) = 0.4,  p(i, i−1) = 0.6
i = 0 or i = N:  p(0, 0) = 1,  p(N, N) = 1
p(i, i ± k) = 0 for k > 1

If we stop playing when we reach €5, the transition matrix is

P =
[ 1   0   0   0   0   0  ]
[ .6  0   .4  0   0   0  ]
[ 0   .6  0   .4  0   0  ]
[ 0   0   .6  0   .4  0  ]
[ 0   0   0   .6  0   .4 ]
[ 0   0   0   0   0   1  ]

This matrix could also be represented pictorially using a graph diagram:

DIAGRAM
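As an aside not in the notes, the ruin game is easy to simulate; a minimal sketch in Python, where the win probability p = 0.4 and stopping fortune N = 5 follow the text, and the starting fortune of €3 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, start = 0.4, 5, 3  # win probability, stopping fortune, starting fortune (illustrative)

def play(x):
    """Play until the fortune hits 0 (ruin) or N (stop); return the final fortune."""
    while 0 < x < N:
        x += 1 if rng.random() < p else -1
    return x

games = 100_000
wins = sum(play(start) == N for _ in range(games))
# Classical gambler's ruin formula: P(reach N before 0) = ((q/p)^start - 1)/((q/p)^N - 1)
print(f"simulated: {wins / games:.3f}, exact: {(1.5**start - 1) / (1.5**N - 1):.3f}")
```

The simulated proportion should agree with the classical absorption-probability formula to within Monte Carlo error.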

1.3 Social Mobility

Suppose x_n is a family's social class in the nth generation, assuming this to be either 1 = 'lower class', 2 = 'middle', 3 = 'upper'. In this very simplified version of sociology, changes of status are a Markov chain with transition matrix:

P =
[ .7  .2  .1 ]
[ .3  .5  .2 ]
[ .2  .4  .4 ]

The graph diagram is as follows:

DIAGRAM

Critiques of this simple model of social mobility:

1. limited – as there are only three states;

2. time (temporal) homogeneity – we may expect probabilities to change with time;

3. the model doesn't learn from the external environment (economy, etc.).

Nonetheless, using this simple model we can ask (and later answer) questions like: 'Does the proportion of families in the three classes approach a limit?'

Note that in any transition matrix P the sum of each row is 1,

Σ_j p(i, j) = 1   (a stochastic matrix)

and also p(i, j) ≥ 0, as the entries are all conditional probabilities.

If we consider a family in the 'middle' class in generation n, x_n = 2, then

P(x_{n+1} = 1 | x_n = 2) + P(x_{n+1} = 2 | x_n = 2) + P(x_{n+1} = 3 | x_n = 2) = 1

since the family will either stay middle class or move in the next generation.
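These two stochastic-matrix conditions are cheap to verify numerically; a minimal sketch with numpy, using the social mobility matrix above:

```python
import numpy as np

P = np.array([[.7, .2, .1],
              [.3, .5, .2],
              [.2, .4, .4]])

print(P.sum(axis=1))   # every row sums to 1: a stochastic matrix
print((P >= 0).all())  # all entries are valid (conditional) probabilities
```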

1.4 Fancy a Drink Tonight?

Notorious behaviour in the first night of college.

x_0 → yesterday

1 = 'quiet night'
2 = 'had a few'
3 = 'absolute mad one'

p_today =
[ .2  .2  .6 ]
[ .2  .4  .4 ]
[ .1  .4  .5 ]

p_tomorrow =
[ .2  .2  .6 ]
[ .3  .5  .2 ]
[ .7  .2  .1 ]

It is a time inhomogeneous Markov chain: for example

p_today(1, 1) = .2 = p_tomorrow(1, 1)

but

p_today(2, 2) = .4
p_tomorrow(2, 2) = .5

1.5 Modelling Evolutionary Divergence

The fundamental description of all living organisms is their genome sequence. This is a string of 4 characters:

A – adenine, C – cytosine, G – guanine, T – thymine.

In DNA terminology these are the bases. DNA is a double-stranded helix (Watson & Crick); complementary base pairs: A with T, C with G.

E. coli → 4.6 × 10^6 base pairs.
Homo sapiens → 3 × 10^9 base pairs.

Briefly, evolution of organisms occurs because of 'mutations' in these base pairs, which amount to copying errors when DNA replicates. Looking at mutations is the key when talking about evolution. Modern man started our divergence from apes about 5–6 million years ago.

Modern evolutionary models are largely based on Markov processes in continuous time. At any site on the genome we have a stochastic variable x(t) (t is time) taking one of the following values:

{1, 2, 3, 4} → {A, C, T, G}

The Markov models say

P(x(t + s) = j | x(s) = i) = P(x(t) = j | x(0) = i) = p_{i,j}(t)

so that the probabilities have the Markov property. The transition matrix is

p(t) =
[ P(A|A, t)  P(C|A, t)  P(G|A, t)  P(T|A, t) ]
[    ...        ...        ...        ...    ]
[ P(A|T, t)  P(C|T, t)  P(G|T, t)  P(T|T, t) ]

Using some further assumptions on the structure of the probabilities, the Jukes-Cantor model for base substitution gives

p_{i,i}(t) = (1/4)(1 + 3e^{−4αt})
p_{i,j}(t) = (1/4)(1 − e^{−4αt}),  i ≠ j

for a parameter α to be estimated from data.
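The Jukes-Cantor probabilities can be written down directly; a small sketch, with t = 1.0 and α = 0.1 as purely illustrative values (in practice α is fitted to data):

```python
import numpy as np

def jukes_cantor(t, alpha):
    """Jukes-Cantor transition matrix: p_ii(t) = (1 + 3e^{-4at})/4,
    p_ij(t) = (1 - e^{-4at})/4 for i != j."""
    off = 0.25 * (1 - np.exp(-4 * alpha * t))
    P = np.full((4, 4), off)
    np.fill_diagonal(P, 0.25 * (1 + 3 * np.exp(-4 * alpha * t)))
    return P

P = jukes_cantor(t=1.0, alpha=0.1)   # illustrative inputs
print(P.round(4))
print(P.sum(axis=1))   # rows sum to 1 for any t and alpha
```

At t = 0 the matrix reduces to the identity (no time, no substitution), and as t → ∞ every row tends to the uniform distribution (1/4, 1/4, 1/4, 1/4).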

2 The Markov Property and Markov Chains

(Discrete time, finite # states)

2.1 Definition of Markov Chain

Consider a stochastic process {x_n, n = 0, 1, 2, ...} that can take on a finite number of values. Let these values be denoted by the set {1, 2, ..., k}. The process is in state i at time n if x_n = i. Since the time index n is discrete, we say that x_n is a discrete time process. Since x_n can take only a finite number of values, we say it is a finite state process.

We assume there is a fixed probability that the process will be in state j at time n + 1, given it is in state i at time n:

P(x_{n+1} = j | x_n = i, x_{n−1} = i_{n−1}, ..., x_0 = i_0) = p(i, j)

for all states i_0, i_1, ..., i_{n−1}, i, j and for all n ≥ 0. If this is the case, then we say that x_n is a Markov process.

2.2 The Markov Property

If x_n is a Markov chain then it has the Markov property.

This says that the conditional distribution of any future state x_{n+1}, given all past states x_0, x_1, ..., x_n, depends only on x_n. It is independent of the earlier states x_0, ..., x_{n−1}, i.e.,

P(x_{n+1} | x_n, x_{n−1}, ..., x_0) = P(x_{n+1} | x_n)

2.3 Transition Probabilities and the Transition Matrix

The one-step transition probabilities p(i, j) give the probability that the chain x_n goes from state i (x_n = i) to state j in one step (x_{n+1} = j).

As the p(i, j) are probabilities, it is clear that

p(i, j) ≥ 0 for all 1 ≤ i, j ≤ k

and since the chain either stays where it is or transitions to a different state,

Σ_{j=1}^k p(i, j) = 1

Let P denote the matrix of one-step transition probabilities,

P =
[ p(1, 1)  p(1, 2)  ...  p(1, k) ]
[ p(2, 1)  p(2, 2)  ...  p(2, k) ]
[   ...      ...    ...    ...   ]
[ p(k, 1)  p(k, 2)  ...  p(k, k) ]

Then we refer to P as the one-step transition matrix of the Markov chain {x_n, n ≥ 0}.

Example:
Let {X_t, t ≥ 1} be independent identically distributed (iid). (Recall that this means: for all t, E(X_t) = μ with μ ∈ R; for all t, Var(X_t) = σ² with σ ∈ R; and for all t ≠ k, Cov(X_t, X_k) = 0.) Suppose

P(X_t = ℓ) = a_ℓ,  ℓ = 0, ±1, ±2, ...

Now let S_0 = 0 and S_n = Σ_{t=1}^n X_t.

Exercise: Show that S_n is a Markov chain (MC).

P(S_{n+1} = j | S_n = i, S_{n−1} = i_{n−1}, ..., S_0 = 0)
= P(S_n + X_{n+1} = j | S_n = i, S_{n−1} = i_{n−1}, ..., S_0 = 0)
= P(X_{n+1} = j − i | S_n = i, S_{n−1} = i_{n−1}, ..., S_0 = 0)
= P(X_{n+1} = j − i)    (iid)
= a_{j−i}

So S_n satisfies the Markov property. The process S_n is called a random walk.
End Example
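A random walk of this kind is a few lines to simulate; a sketch with ±1 increments (the step-up probability p = 0.4 is borrowed from the gambler's ruin example, and the walk length n = 10 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.4, 10   # step-up probability and walk length (illustrative)

X = np.where(rng.random(n) < p, 1, -1)    # iid steps: +1 w.p. p, -1 w.p. 1 - p
S = np.concatenate(([0], np.cumsum(X)))   # S_0 = 0, S_n = X_1 + ... + X_n

print("steps X_t:", X)
print("walk  S_n:", S)
```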

A simple random walk (SRW) is a process {S_n, n ≥ 0} with S_0 = 0 and

S_n = Σ_{t=1}^n X_t

where the X_t are iid with

P(X_t = 1) = p,  P(X_t = −1) = q = 1 − p

for 0 < p < 1. One can show that |S_n| (the distance of the SRW from the origin) is a Markov process.

Consider P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, ..., |S_1| = i_1) and let i_0 = 0. Let j = max{k : 0 ≤ k ≤ n, i_k = 0}, implying that S_j = 0.

Since we know S_j = 0:

P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, ..., |S_1| = i_1) = P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, ..., |S_j| = 0)

There are two possible values of the sequence S_{j+1}, ..., S_n for which |S_{j+1}| = i_{j+1}, ..., |S_n| = i. Since the process does not cross zero between times j + 1 and n, these are i_{j+1}, ..., i and −i_{j+1}, ..., −i.

Assume i_{j+1} > 0. To obtain the first sequence, note that in the n − j steps there are i more up steps (+1) than down steps (−1). Let d_s be the number of down steps. Then

(d_s + i) [up steps] + d_s [down steps] = n − j  ⟹  d_s = (n − j − i)/2

So the probability of this sequence is

p^{(n−j−i)/2 + i} q^{(n−j−i)/2} = p^{(n−j+i)/2} q^{(n−j−i)/2}

Similarly, the second sequence has probability

p^{(n−j−i)/2} q^{(n−j+i)/2}

Thus,

P(S_n = i | |S_n| = i, ..., |S_{j+1}| = i_{j+1})
= p^{(n−j+i)/2} q^{(n−j−i)/2} / [ p^{(n−j+i)/2} q^{(n−j−i)/2} + p^{(n−j−i)/2} q^{(n−j+i)/2} ]
= p^i / (p^i + q^i)

dividing above and below by p^{(n−j−i)/2} q^{(n−j−i)/2}. Similarly,

P(S_n = −i | |S_n| = i, ..., |S_{j+1}| = i_{j+1}) = q^i / (p^i + q^i)

From this, conditioning on whether S_n = i or S_n = −i:

P(|S_{n+1}| = i + 1 | |S_n| = i, |S_{n−1}| = i_{n−1}, ..., |S_1| = i_1)
= P(S_{n+1} = i + 1 | S_n = i) P(S_n = i | |S_n| = i, ..., |S_1| = i_1)
  + P(S_{n+1} = −(i + 1) | S_n = −i) P(S_n = −i | |S_n| = i, ..., |S_1| = i_1)
= p · p^i/(p^i + q^i) + q · q^i/(p^i + q^i)
= (p^{i+1} + q^{i+1}) / (p^i + q^i)

So {|S_n|, n ≥ 1} is a Markov chain, with transition probabilities

p(i, i+1) = (p^{i+1} + q^{i+1}) / (p^i + q^i)
p(i, i−1) = (p^i(1 − p) + q^i(1 − q)) / (p^i + q^i)

for all i > 0, and

p(0, 1) = 1

2.4 Multistep Transition Probabilities

The probability p(i, j) = P(x_{n+1} = j | x_n = i) gives the probability of going from state i to state j in one step. How do we compute the probability of going from i to j in m steps,

P(x_{n+m} = j | x_n = i) = p^m(i, j)?

Recall the social mobility example with one-step transition matrix

P =
[ .7  .2  .1 ]
[ .3  .5  .2 ]
[ .2  .4  .4 ]

where 1 = 'lower', 2 = 'middle', 3 = 'upper'.

If my grandmother was upper class (state 3) and my parents were middle class (state 2), what is the probability that I will be lower class (state 1)?

The Markov property tells us that the probability of this is

p(3, 2)p(2, 1) = (.4)(.3) = .12

Let's convince ourselves of this:

P(x_2 = 1, x_1 = 2 | x_0 = 3)
= P(x_2 = 1, x_1 = 2, x_0 = 3) / P(x_0 = 3)
= [P(x_2 = 1, x_1 = 2, x_0 = 3) / P(x_1 = 2, x_0 = 3)] · [P(x_1 = 2, x_0 = 3) / P(x_0 = 3)]
= P(x_2 = 1 | x_1 = 2, x_0 = 3) P(x_1 = 2 | x_0 = 3)
= P(x_2 = 1 | x_1 = 2) P(x_1 = 2 | x_0 = 3)    (Markov, so drop x_0 = 3)
= p(2, 1) p(3, 2)
= p(3, 2) p(2, 1)

If my parents were middle class, what is the probability that my children will be upper class? To answer this, we have to consider the three possible classes which I could have:

P(x_2 = 3 | x_0 = 2) = Σ_{ℓ=1}^3 P(x_2 = 3, x_1 = ℓ | x_0 = 2)
= Σ_{ℓ=1}^3 p(2, ℓ) p(ℓ, 3)
= (.3)(.1) + (.5)(.2) + (.2)(.4)
= .21

What is the probability that my children will be middle class, given my parents are upper class?

P(x_2 = 2 | x_0 = 3) = Σ_{ℓ=1}^3 P(x_2 = 2, x_1 = ℓ | x_0 = 3)
= Σ_{ℓ=1}^3 p(3, ℓ) p(ℓ, 2)
= (.2)(.2) + (.4)(.5) + (.4)(.4)
= 0.4

Of course this approach for two-step probabilities applies in general:

p^2(i, j) = P(x_{n+2} = j | x_n = i) = Σ_{ℓ=1}^k p(i, ℓ) p(ℓ, j)
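These two-step calculations are exactly matrix multiplication; a quick numpy check, where states 1, 2, 3 ('lower', 'middle', 'upper') become indices 0, 1, 2:

```python
import numpy as np

P = np.array([[.7, .2, .1],
              [.3, .5, .2],
              [.2, .4, .4]])
P2 = P @ P   # two-step transition matrix

print(round(P2[1, 2], 4))   # P(x2 = 3 | x0 = 2) = 0.21
print(round(P2[2, 1], 4))   # P(x2 = 2 | x0 = 3) = 0.4
```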

If we think of transition matrices P in general, the term p^2(i, j) can be seen as the dot product of the ith row of P with the jth column of P, i.e. the (i, j)th entry of P^2:

p^2(i, j) = [p(i, 1) ... p(i, k)] · [p(1, j) ... p(k, j)]

The Chapman-Kolmogorov Equation:
The Chapman-Kolmogorov equation is crucial in understanding multi-step transition probabilities of Markov chains. It states that:

p^{m+n}(i, j) = Σ_{ℓ=1}^k p^m(i, ℓ) p^n(ℓ, j)

Proof:
To prove this equation we break things down according to the state at time m.

P(x_{m+n} = j | x_0 = i) = Σ_{ℓ=1}^k P(x_{m+n} = j, x_m = ℓ | x_0 = i)

Now use the definition of conditional probability for the term in the sum:

P(x_{m+n} = j, x_m = ℓ | x_0 = i)
= P(x_{m+n} = j, x_m = ℓ, x_0 = i) / P(x_0 = i)
= [P(x_{m+n} = j, x_m = ℓ, x_0 = i) / P(x_m = ℓ, x_0 = i)] · [P(x_m = ℓ, x_0 = i) / P(x_0 = i)]
= P(x_{m+n} = j | x_m = ℓ, x_0 = i) P(x_m = ℓ | x_0 = i)

By the Markov property the first term on the RHS is P(x_{m+n} = j | x_m = ℓ), so that:

P(x_{m+n} = j, x_m = ℓ | x_0 = i) = P(x_{m+n} = j | x_m = ℓ) P(x_m = ℓ | x_0 = i)
= p^n(ℓ, j) p^m(i, ℓ)
= p^m(i, ℓ) p^n(ℓ, j)

Thus,

p^{m+n}(i, j) = Σ_{ℓ=1}^k p^m(i, ℓ) p^n(ℓ, j)

QED

Take n = 1 in this equation:

p^{m+1}(i, j) = Σ_{ℓ=1}^k p^m(i, ℓ) p(ℓ, j)

which is the ith row of the m-step transition matrix multiplied by the jth column of P. So the (m+1)-step transition matrix is given by P^{m+1}.

The m-step transition matrix is equal to the 1-step transition matrix to the power of m.

Example:
Let X_n be the weather on day n in Dublin, either 'rainy' = 1 or 'not rainy' = 2, with transition matrix

P =
[ .8  .2 ]
[ .6  .4 ]

The day after tomorrow? The two-step transition matrix is

P^2 =
[ .76  .24 ]
[ .72  .28 ]

so if it is rainy today, there is a 76% chance it is rainy the day after tomorrow.

P^10 =
[ .750  .250 ]
[ .749  .251 ]

P^20 =
[ .75  .25 ]
[ .75  .25 ]   (approx.)

Note the apparent converging behaviour of the entries of the transition matrix as n gets larger and larger. More formally,

lim_{n→∞} P^n =
[ 3/4  1/4 ]
[ 3/4  1/4 ]

(3/4, 1/4) is called a stationary distribution of the chain.
End Example
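This convergence can be checked numerically; a minimal sketch with numpy, whose `matrix_power` computes P^n directly:

```python
import numpy as np

# Dublin weather chain: state 1 = 'rainy', state 2 = 'not rainy'
P = np.array([[.8, .2],
              [.6, .4]])

for n in (2, 10, 20):
    print(f"P^{n}:")
    print(np.linalg.matrix_power(P, n).round(3))   # rows approach (0.75, 0.25)
```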

Consider the general 2-state chain, x_n ∈ {1, 2}, with transition matrix

P =
[ 1−a   a  ]
[  b   1−b ]

where 0 ≤ a ≤ 1 and 0 ≤ b ≤ 1.

What is P^n in general? In other words, what is the limiting behaviour as n → ∞?

If we can write P = QΛQ^{−1}, where Λ is a diagonal matrix and Q is a matrix to be found, then

P^n = (QΛQ^{−1})(QΛQ^{−1}) ... (QΛQ^{−1}) = QΛ^n Q^{−1}

So we need the eigendecomposition of P, i.e. the eigenvalues and eigenvectors of P.

|P − λI| = (1 − a − λ)(1 − b − λ) − ab
= (1 − a)(1 − b) − λ[(1 − a) + (1 − b)] + λ² − ab
= λ² − λ(2 − a − b) + (1 − a − b)

⟹ λ_{1,2} = [(2 − a − b) ± √((2 − a − b)² − 4(1 − a − b))] / 2
= [(2 − a − b) ± √(4 − 4(a + b) + (a + b)² − 4 + 4(a + b))] / 2
= [(2 − a − b) ± (a + b)] / 2
= 1 or 1 − a − b

Matrix Λ =
[ 1     0     ]
[ 0  1 − a − b ]

Next step: eigenvectors. For λ = 1:

P [y_1 ; y_2] = [y_1 ; y_2]  ⟹  (1 − a)y_1 + a y_2 = y_1  ⟹  a y_2 = a y_1  ⟹  y_1 = y_2 = y

So the first eigenvector is [y ; y].

For the second eigenvector, with λ = 1 − a − b:

P [z_1 ; z_2] = (1 − a − b)[z_1 ; z_2]  ⟹  (1 − a)z_1 + a z_2 = (1 − a − b)z_1  ⟹  a z_2 = −b z_1  ⟹  z_2 = −(b/a) z_1

So, writing z_1 = z, the second eigenvector is [z ; −(b/a)z].

Now,

Q =
[ y     z     ]
[ y  −(b/a)z  ]

and, since det Q = −(b/a)yz − yz = −yz(a + b)/a,

Q^{−1} = −a/(yz(a + b)) ·
[ −(b/a)z  −z ]
[ −y        y ]

Then P^n = QΛ^nQ^{−1}. Multiplying out (the arbitrary constants y and z cancel) and writing r = 1 − a − b:

P^n =
[ b/(a+b) + (a/(a+b)) r^n    a/(a+b) − (a/(a+b)) r^n ]
[ b/(a+b) − (b/(a+b)) r^n    a/(a+b) + (b/(a+b)) r^n ]

What happens as n → ∞? If |1 − a − b| < 1, then (1 − a − b)^n → 0 as n → ∞. Now

−1 < 1 − a − b < 1  ⟺  −1 < a + b − 1 < 1  ⟺  0 < a + b < 2

Thus, if 0 < a + b < 2, then

lim_{n→∞} P^n =
[ b/(a+b)  a/(a+b) ]
[ b/(a+b)  a/(a+b) ]

We will see that (b/(a+b), a/(a+b)) is the stationary distribution of the two-state chain. By Chapman-Kolmogorov,

p^{n+1}(i, j) = Σ_{ℓ=1}^2 p^n(i, ℓ) p(ℓ, j)

Letting n → ∞ and writing π(j) for the long-run probability of state j,

π(j) = Σ_{ℓ=1}^2 π(ℓ) p(ℓ, j)

i.e.

[π(1) π(2)] = [π(1) π(2)] ·
[ p(1, 1)  p(1, 2) ]
[ p(2, 1)  p(2, 2) ]

Solving for π gives (b/(a+b), a/(a+b)).
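The closed form for P^n derived above can be checked against direct matrix powers; a sketch using a = 0.2, b = 0.6 (the Dublin weather values) as illustrative inputs:

```python
import numpy as np

a, b, n = 0.2, 0.6, 12   # illustrative values
P = np.array([[1 - a, a],
              [b, 1 - b]])
r = 1 - a - b

# Closed form derived above, with r = 1 - a - b
Pn = np.array([[b + a * r**n, a - a * r**n],
               [b - b * r**n, a + b * r**n]]) / (a + b)

print(np.allclose(Pn, np.linalg.matrix_power(P, n)))   # the two agree
```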

3 Properties of Markov Chains

In this chapter we will look at properties which can be used to classify the behaviour of Markov chains.

3.1 Decomposability

Definition: Closed Set of States
A set of states A is closed if

P(X_{n+1} ∈ A | X_n = x) = 1

for all states x ∈ A.

If A is closed, then starting from any value in A, we always stay in A.

Suppose we have two disjoint closed sets A_1 and A_2. If we start the chain in A_1, i.e. x_0 ∈ A_1, then the states outside of A_1 are immaterial, and the process (chain) can be analysed solely through its movement in A_1. This is the idea of decomposability.

A Markov chain is indecomposable if its set of states does not contain two or more disjoint closed sets of states.

DIAGRAM

If a transition from state i to state j is possible, then we write i → j, i.e. there is some m such that p^m(i, j) > 0.

If additionally there exists n such that p^n(j, i) > 0, in which case it is possible to transition either way, we say that i communicates with j and write i ↔ j.

(We will show later that communication is an equivalence relation.)

The ideas of communication and indecomposability are closely related.

If for every pair of states i and j, at least one of i → j or j → i holds, then the chain's set of states is indecomposable.

Proof:
To see this, suppose there are two disjoint closed sets of states A_1 and A_2. Take any two states i ∈ A_1 and j ∈ A_2, and suppose that i → j or j → i is possible.

Assuming i → j, there is an m such that

p^m(i, j) > 0

But this contradicts the fact that A_1 is closed,

P(x_{n+m} ∈ A_1 | x_n = i ∈ A_1) = 1

(and the case j → i contradicts the closure of A_2 in the same way). Hence, the chain's states are indecomposable.
QED

3.2 Periodicity

Some Markov chains exhibit periodic behaviour. Suppose that the states are indecomposable and consider, say, the two-step transition probabilities p^2(i, A) (the probability of going from state i to the set of states A in two steps).

It is possible that the states decompose into two closed sets under this transition probability. That is, there are two disjoint sets B_1 and B_2 such that

p^2(i, B_1) = 1 for all i ∈ B_1
p^2(i, B_2) = 1 for all i ∈ B_2

Example:
In the simple random walk

p(i, i+1) = p,  p(i, i−1) = q = 1 − p

DIAGRAM

If i is an odd integer, then the next state will be even, and the state after that will be odd again. Similarly, if i is even, in two steps the state will be even again. So if we let B_1 be the even integers and B_2 be the odd integers, then

p^2(i, B_1) = 1 for all i even
p^2(i, B_2) = 1 for all i odd

In general, the periodic behaviour can be summarised by the following:

Let d ≥ 1 be the largest integer such that the states can be decomposed into d disjoint subsets B_1, ..., B_d, each of which is closed under the d-step transition probability. The Markov chain cycles among the B_1, ..., B_d. If the starting state is in B_1, the next state will be in, say, B_2, and so on until the chain transitions back to B_1.
End Example

Decomposability, Periodicity → Stability

3.3 Stability

Suppose X_0 = x. We want to know what statements can be made about the chain after a large number of subsequent movements. A key question is that of stability.

Regardless of the initial state, can the states visited by the chain be represented by some limiting distribution after a large number of steps? If we think of the m-step transition probabilities p^m(x, A) (the probability of being in A after m steps), the question we're asking is:

Is there a limiting distribution π(A) such that p^m(x, A) → π(A) as m → ∞?

If the answer is yes, we say that the chain is stable. Stability here is a property of the chain's transition probabilities, since p^m(x, A) → π(A) regardless of the arbitrary starting state x. It can be seen that if the chain is decomposable or periodic, we can't have stability.

Decomposability: let A_1, A_2 be disjoint closed sets of states. Then for every m ≥ 1,

p^m(x, A_1) = 1 if x ∈ A_1, and 0 if x ∈ A_2

so the limit of p^m(x, A_1) depends on the starting state x. Periodicity: if the chain cycles between sets B_1 and B_2, then p^m(x, B_1) alternates between 1 and 0 as m alternates between even and odd, so there is no limiting value for this probability as m → ∞.

At the very least the chain must be indecomposable and aperiodic to be stable.

Exercise:
Find an example of a decomposable chain and a periodic chain.

3.4 Long-Run Regularity

If the chain is stable it has some long-run regularity properties. No matter what state we start from, the proportion of time the chain spends in a set of states A will be π(A).

Count the number of times x_1, ..., x_m is in A. Let

f(x_t) = 1 if x_t ∈ A, 0 otherwise

Then

(1/m) Σ_{t=1}^m f(x_t)

is the proportion of time spent in A. We have

E[f(x_t)] = 1 · p^t(x, A) + 0 · (1 − p^t(x, A)) = p^t(x, A)

So the expected proportion of time spent in A is

(1/m) Σ_{t=1}^m p^t(x, A)

As time ticks on, as m → ∞, then since the chain is stable, p^m(x, A) → π(A). If the numbers in a sequence satisfy α_t → α, then their running averages satisfy

(1/m) Σ_{t=1}^m α_t → α

so that

(1/m) Σ_{t=1}^m p^t(x, A) → π(A)

We also have the law of large numbers:

P( | (1/m) Σ_{t=1}^m f(x_t) − π(A) | > δ ) → 0

as m gets large(r).
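This long-run proportion can be illustrated by simulation; a sketch using the Dublin weather chain from the earlier example, with A = {rainy} so that π(A) = 3/4:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[.8, .2],
              [.6, .4]])   # Dublin weather chain; stable distribution (3/4, 1/4)

m, x, rainy = 100_000, 0, 0
for _ in range(m):
    x = rng.choice(2, p=P[x])   # one step of the chain
    rainy += (x == 0)           # count time spent in A = {rainy}

print(f"proportion of time rainy: {rainy / m:.3f}")   # close to pi(A) = 0.75
```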

3.5 Computing Stable Distributions

Terminology: 'stable distribution' is also called the stationary distribution, and 'indecomposable' is also called irreducible.

How do we compute a stable distribution? By the Markov property and Chapman-Kolmogorov:

p^{m+1}(x, A) = Σ_{ℓ=1}^k p^m(x, ℓ) p(ℓ, A)

By assumption, p^{m+1}(x, A) → π(A) as m → ∞, and also p^m(x, ℓ) → π(ℓ) as m → ∞, so π(·) must satisfy the equation

π(A) = Σ_{ℓ=1}^k π(ℓ) p(ℓ, A)

With k states {1, ..., k}, the stable distribution [π(1) ... π(k)] satisfies

[π(1) ... π(k)] = [π(1) ... π(k)] P

i.e., taking the jth column of P and multiplying by π(·),

π(j) = π(1)p(1, j) + π(2)p(2, j) + ... + π(k)p(k, j)

Solving this matrix equation gives the stable distribution.

Example:
Two-state chain:

P =
[ 1−a   a  ]
[  b   1−b ]

Find the stable distribution (π(1), π(2)).

[π_1 π_2] = [π_1 π_2] P

π_1 = (1 − a)π_1 + bπ_2
π_2 = aπ_1 + (1 − b)π_2

⟹ bπ_2 = aπ_1 ⟹ π_2 = (a/b)π_1

π_1 + π_2 = 1 ⟹ π_1 + (a/b)π_1 = 1 ⟹ π_1 (a + b)/b = 1

π_1 = b/(a + b)
π_2 = 1 − b/(a + b) = a/(a + b)

⟹ [b/(a+b), a/(a+b)]

End Example

Quick Recap:
Stable (stationary) distributions: for a k × k transition matrix P, solve

πP = π

where π = [π_1 ... π_k] is the stable distribution. This gives k − 1 linearly independent equations; together with

Σ_{ℓ=1}^k π_ℓ = 1

we can solve for π.

Example:
Weather on day n in Dublin, 1 = 'rainy', 2 = 'not rainy':

P =
[ .8  .2 ]
[ .6  .4 ]

What is the stable distribution? Here a = .2 and b = .6, so by the two-state formula

(b/(a+b), a/(a+b)) = (.6/.8, .2/.8) = (3/4, 1/4)

From first principles, πP = π:

[π_1 π_2] ·
[ .8  .2 ]
[ .6  .4 ]
= [π_1 π_2]

[.8π_1 + .6π_2, .2π_1 + .4π_2] = [π_1 π_2]

⟹ .8π_1 + .6π_2 = π_1   (I)
   .2π_1 + .4π_2 = π_2   (II)

π_1 + π_2 = 1 ⟹ π_2 = 1 − π_1, so from (I):

.8π_1 + .6(1 − π_1) = π_1
.8π_1 + .6 − .6π_1 = π_1
.6 = .8π_1

π_1 = 6/8 = 3/4
π_2 = 1/4

End Example

Example:
Social mobility. Recall the transition matrix:

P =
[ .7  .2  .1 ]
[ .3  .5  .2 ]
[ .2  .4  .4 ]

Does the proportion of families falling into the three social classes approach a stable limit?

πP = π:

[π_1 π_2 π_3] P = [π_1 π_2 π_3]

⟹ .7π_1 + .3π_2 + .2π_3 = π_1   (I)
   .2π_1 + .5π_2 + .4π_3 = π_2   (II)
   .1π_1 + .2π_2 + .4π_3 = π_3   (III)

π_1 + π_2 + π_3 = 1

Solving yields:

π_1 = 22/47, π_2 = 16/47, π_3 = 9/47

End Example
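The same system of equations can be solved numerically; a sketch where one redundant equation of (Pᵀ − I)π = 0 is replaced by the normalisation constraint:

```python
import numpy as np

P = np.array([[.7, .2, .1],
              [.3, .5, .2],
              [.2, .4, .4]])
k = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi = 0; drop one (redundant) equation and
# replace it with sum(pi) = 1 to get a unique solution
A = np.vstack([(P.T - np.eye(k))[:-1], np.ones(k)])
pi = np.linalg.solve(A, np.array([0., 0., 1.]))

print(pi)   # [22/47, 16/47, 9/47]
```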

Theorem:
For an indecomposable, aperiodic chain with transition probabilities p(x, A) such that any two states x and y communicate, the system of equations

π(j) = Σ_{ℓ=1}^k p(ℓ, j) π(ℓ),  j = 1, ..., k − 1

Σ_{ℓ=1}^k π(ℓ) = 1

gives a set of k linearly independent equations with unique solution π.

3.6 Detailed Balance

π(·) is said to satisfy detailed balance if

π(x)p(x, y) = π(y)p(y, x)  for all x, y

This is a stronger condition than πP = π. If we sum over x on each side of the above:

Σ_{x=1}^k π(x)p(x, y) = Σ_{x=1}^k π(y)p(y, x) = π(y) Σ_{x=1}^k p(y, x) = π(y)

so detailed balance implies πP = π.

A good way to think of detailed balance is as follows: imagine a beach where π(x) gives the amount of sand at mound x. A transition of the chain means a fraction p(x, y) of the sand at x is transferred to y. Detailed balance says that the amount of sand going from x to y in one step is completely balanced by the amount going back from y to x:

π(x)p(x, y) = π(y)p(y, x)

In contrast, πP = π says that after all transfers of sand, the amount that ends up on each mound is the same as the amount that started there.

Example:
A graph is defined by giving two things:

1. a set of vertices V (finite);

2. an adjacency matrix A(u, v), which is 1 if there is an edge connecting u and v, and 0 otherwise.

('Actors' = group of nodes.)

For the graph on vertices {1, 2, 3, 4} with edges 1–2, 1–3 and 2–3 (vertex 4 isolated):

A =
[ 0  1  1  0 ]
[ 1  0  1  0 ]
[ 1  1  0  0 ]
[ 0  0  0  0 ]

The adjacency matrix can be used to describe the topology of the graph. By convention A(v, v) = 0 for all v ∈ V.

The degree of any vertex is equal to the number of neighbours it has,

d(u) = Σ_v A(u, v)

since each neighbour of u contributes 1 to this sum. Now consider a random walk X_n on this graph. Define the transition probability by

p(u, v) = A(u, v)/d(u)

i.e., if x_n = u, then we jump uniformly at random to one of its neighbours at time n + 1.

Page 24: Stochastic Models in Space and Time I

Symmetric random walk.

Consider p(u, v) = A(u,v)

d(u)

. This says

d(u)p(u, v) = A(u, v)

since A is a symmetric matrix (non-directed graph).

⇡(u) = p(u, v) = ⇡(u)p(v, u)

If we take ⇡(u) = c for some positive constant c

⇡(u)p(u, v) = cd(u)p(u, v)

= cA(u, v)

= cA(v, u)

= cd(v)p(v.u)

So the RW on the graph satisfies detailed balance. Its stable distribution is

⇡(u) = cd(u)X

v2V

⇡(v) = 1

)c =

1Pv2V d(v)

and ⇡(u) =d(u)Pv2V d(v)

End Example
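The two claims of the example, detailed balance and \pi(u) = d(u)/\sum_v d(v), can be verified directly. A small sketch (pure Python; the isolated vertex 4 of the example graph is dropped, since d(4) = 0 leaves p(4, \cdot) undefined):

```python
# Adjacency matrix of the connected part (vertices 1, 2, 3 of the example).
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
d = [sum(row) for row in A]                 # degrees d(u)
total = sum(d)

# Transition probabilities p(u, v) = A(u, v) / d(u).
p = [[A[u][v] / d[u] for v in range(3)] for u in range(3)]
pi = [d[u] / total for u in range(3)]       # claimed stable distribution

# Detailed balance: pi(u) p(u, v) == pi(v) p(v, u) for every pair.
db = all(abs(pi[u] * p[u][v] - pi[v] * p[v][u]) < 1e-12
         for u in range(3) for v in range(3))

# Stationarity: (pi P)(v) == pi(v) for every v.
stat = all(abs(sum(pi[u] * p[u][v] for u in range(3)) - pi[v]) < 1e-12
           for v in range(3))
print(db, stat, pi)
```

On this triangle every vertex has degree 2, so the stable distribution comes out uniform.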

4 Poisson Processes

Think of e-mail messages arriving at a server. This is an example of events arriving randomly in an interval of time. The number of events (e-mail arrivals) that occur over an interval of time, say 1 hour, is a discrete random variable. This discrete RV is often modelled using a Poisson process. The length of the interval between arrivals will be modelled by an exponential distribution.

4.1 Assumptions of the Poisson Process

First, we assume that we observe the process for a fixed period of time t. The number of events that occur in this fixed interval (0, t] is a random variable X. X will be discrete and its probability law will depend on the manner in which events occur.

We make the following assumptions about the way in which events occur:

1. In a sufficiently short length of time \Delta t, either 0 or 1 events occur in that time (two or more simultaneous occurrences are impossible).

2. The probability of exactly one event occurring in this short time interval of length \Delta t is \lambda\Delta t. So, the probability of exactly one event occurring in the interval is proportional to the length of the interval.

3. Any non-overlapping intervals of length \Delta t are independent Bernoulli trials.

These three assumptions are the assumptions for a Poisson process with parameter \lambda.

4.2 Probability Law of Poisson Process

Suppose our interval of length t, (0, t], is divided into n = t/\Delta t non-overlapping pieces of equal length \Delta t.

(Diagram: the interval (0, t] cut into n sub-intervals of length \Delta t.)

By assumption 3, these smaller intervals are independent Bernoulli trials. Each of these has probability of success (an event occurring) equal to p = \lambda\Delta t (from assumption 2). Then the probability of no event occurring in a sub-interval is q = 1 - \lambda\Delta t.

Then X, the number of events in the interval of length t, is binomial(n, p = \lambda\Delta t = \lambda t/n):

P(X = k) = \binom{n}{k}\left(\frac{\lambda t}{n}\right)^k\left(1 - \frac{\lambda t}{n}\right)^{n-k}
= \frac{n!}{k!(n-k)!}\frac{(\lambda t)^k}{n^k}\left(1 - \frac{\lambda t}{n}\right)^n\left(1 - \frac{\lambda t}{n}\right)^{-k}
= \frac{(\lambda t)^k}{k!}\left(1 - \frac{\lambda t}{n}\right)^n\left(1 - \frac{\lambda t}{n}\right)^{-k}\frac{n(n-1)\cdots(n-k+1)}{n^k}

Now, examine the limiting case as \Delta t \to 0, which is the same as n \to \infty:

\lim_{n\to\infty}\left(1 - \frac{\lambda t}{n}\right)^{-k} = 1

\lim_{n\to\infty}\left(1 - \frac{\lambda t}{n}\right)^{n} = e^{-\lambda t}

\frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} = 1\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right) \to 1

\Rightarrow \lim_{\Delta t \to 0} P(X = k) = \frac{(\lambda t)^k}{k!}e^{-\lambda t}

so in the limit as \Delta t \to 0 we retrieve the Poisson probability law.
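The binomial-to-Poisson limit can also be seen numerically. A small sketch (pure Python; the values \lambda t = 3 and k = 2 are assumed illustrative choices):

```python
from math import comb, exp, factorial

lam_t = 3.0   # lambda * t (illustrative)
k = 2

def binom_pmf(n, k, p):
    """Binomial probability C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

poisson = lam_t**k / factorial(k) * exp(-lam_t)
for n in (10, 100, 1000, 10000):
    approx = binom_pmf(n, k, lam_t / n)      # p = lambda t / n
    print(n, approx, abs(approx - poisson))
print("Poisson limit:", poisson)
```

As n grows, the binomial(n, \lambda t/n) probability converges to the Poisson value, mirroring the derivation above.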

Recall the Taylor expansion of e^x:

e^x = 1 + x + \frac{x^2}{2!} + \dots = \sum_{j=0}^{\infty}\frac{x^j}{j!}, valid for all x \in \mathbb{R}.

If we sum the probability law over all possible values we get

\sum_{k=0}^{\infty} P(X = k) = \sum_{k=0}^{\infty}\frac{(\lambda t)^k}{k!}e^{-\lambda t} = e^{-\lambda t}e^{\lambda t} = 1,

as we'd expect.

4.3 Moments of the Poisson Distribution

For any integer \ell \ge 1 we can show that

E[X(X-1)\cdots(X-\ell+1)] = (\lambda t)^{\ell}

To see this, observe that X(X-1)\cdots(X-\ell+1) = 0 if X \le \ell - 1. Hence

E[X(X-1)\cdots(X-\ell+1)] = \sum_{k=\ell}^{\infty} \frac{(\lambda t)^k}{k!}e^{-\lambda t}\times k(k-1)\cdots(k-\ell+1)
= \sum_{k=\ell}^{\infty} \frac{(\lambda t)^{k-\ell}(\lambda t)^{\ell}}{(k-\ell)!}e^{-\lambda t}
= e^{-\lambda t}(\lambda t)^{\ell}\sum_{j=0}^{\infty}\frac{(\lambda t)^j}{j!}
= e^{-\lambda t}(\lambda t)^{\ell}e^{\lambda t}
= (\lambda t)^{\ell}

So then,

E[X] = (\lambda t)^1 = \lambda t
E[X(X-1)] = (\lambda t)^2
Var[X] = E[X(X-1)] - (E[X])^2 + E[X] = (\lambda t)^2 - (\lambda t)^2 + \lambda t = \lambda t

Example: Assume molecules of a rare gas occur at an average rate of \alpha per cubic metre. If it is reasonable to assume that these molecules of the gas are distributed independently in the air, then the number of molecules in a cubic metre of air is a Poisson random variable with rate parameter \alpha. If we wanted to be 100(1-\delta)\% confident of finding at least one molecule of the gas in a sample of air, what sample size of air would we need to take?

Let the sample size be s cubic metres and let the number of molecules be X, which is Poisson distributed with rate \alpha s. So we would require

P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{(\alpha s)^0 e^{-\alpha s}}{0!} = 1 - e^{-\alpha s} \ge 1 - \delta

So e^{-\alpha s} \le \delta, i.e. -\alpha s \le \log\delta, giving

s \ge -\frac{1}{\alpha}\log\delta

cubic metres of air as the sample size we would need to take. End Example
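A one-line check of the sample-size formula, with assumed illustrative values \alpha = 2 molecules per cubic metre and \delta = 0.01 (i.e. 99% confidence):

```python
from math import log, exp

alpha = 2.0      # mean molecules per cubic metre (assumed, illustrative)
delta = 0.01     # we want P(at least one molecule) >= 1 - delta

s = -log(delta) / alpha                      # s >= -(1/alpha) log(delta)
print(f"required sample: {s:.4f} m^3")
print("check P(X = 0) at the boundary:", exp(-alpha * s))  # equals delta
```

At the boundary sample size, P(X = 0) = e^{-\alpha s} is exactly \delta, as the derivation requires.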

Recall the assumptions for a Poisson process with parameter \lambda. The exponential random variable is easily defined on this process. In a Poisson process events occur independently at random and at a uniform rate per unit of time. Assume that we begin to observe the Poisson process at time zero and let T be the time of the first event. T is a continuous random variable and its range is R_T = \{t : t \ge 0\}. Let t be any fixed positive number and consider the event \{T > t\}, that the time of the first event is greater than t. This event occurs if there are zero events in the fixed interval (0, t]. The probability of zero events occurring is

P(X = 0) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}

These events are equivalent and so have equal probability, so

P(T > t) = e^{-\lambda t} = 1 - F_T(t)

where P(T \le t) = F_T(t) is the distribution function and P(T > t) = 1 - F_T(t) the survival function. From this we find the distribution function for T:

F_T(t) = 1 - e^{-\lambda t}, \quad t > 0

and its density function

f_T(t) = \frac{d}{dt}F_T(t) = \lambda e^{-\lambda t}

i.e. the exponential density function.

So the time to the first event in a Poisson process is exponentially distributed with parameter \lambda.

The expected value for an exponential random variable is

E(T) = \int_0^{\infty} t\lambda e^{-\lambda t}\,dt = \left.-\frac{(\lambda t + 1)e^{-\lambda t}}{\lambda}\right|_0^{\infty} = \frac{1}{\lambda}

and the moment generating function is

m_T(t) = E(e^{tT}) = \int_0^{\infty} e^{ts}\lambda e^{-\lambda s}\,ds = \int_0^{\infty} \lambda e^{-s(\lambda - t)}\,ds = \left.\frac{-\lambda e^{-s(\lambda - t)}}{\lambda - t}\right|_0^{\infty} = \frac{\lambda}{\lambda - t}, \quad t < \lambda

Moments can be read off from the MGF:

E(T) = \left.\frac{d}{dt}m_T(t)\right|_{t=0}, \qquad E(T^j) = \left.\frac{d^j}{dt^j}m_T(t)\right|_{t=0}

The moment generating function above can be used to verify that

Var(T) = \frac{1}{\lambda^2}

and so the standard deviation is the same as the mean.

Example: Students arrive at a lecture at a rate of 2 per minute. If I observe for 3 minutes, what is the probability of no students arriving?

P(X = 0) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-6} \approx 0.0025

So the probability of observing no students arriving in this interval is incredibly small. End Example


4.4 Times of First Arrival

4.5 Memoryless Property of the Exponential Distribution

The exponential probability law has the memoryless property. If T is exponential with parameter \lambda and a and b are positive constants, then

P(T > a + b \mid T > a) = \frac{P(T > a + b)}{P(T > a)} = \frac{e^{-\lambda(a+b)}}{e^{-\lambda a}} = e^{-\lambda b} = P(T > b)

The exponential distribution is the only continuous probability law with the memoryless property. There are some similarities between the exponential and geometric probability distributions. If X_1, \dots, X_n are independent Bernoulli trials, the number of trials to first success is a geometric random variable. The geometric counts the number of trials to first success, while the exponential represents the time to first event in a Poisson process. If Y is a geometric RV with parameter p, then

P(Y > n) = (1 - p)^n

In deriving the Poisson process we set p = \lambda\Delta t = \lambda t/n, having subdivided (0, t] into n pieces of length \Delta t. But then the events \{Y > n\} and \{T > t\} are equivalent and

P(T > t) = \lim_{n\to\infty} P(Y > n) = \lim_{n\to\infty}\left(1 - \frac{\lambda t}{n}\right)^n = e^{-\lambda t}

so the exponential distribution is the limit of the geometric distribution function.

4.6 Time to Occurrence of rth Event

Suppose we begin observing a Poisson process at time zero and let T_r be the time to occurrence of the r-th event, r \ge 1. This random variable is analogous to a negative binomial random variable. Again let t be any fixed number and consider the event \{T_r > t\} (time to r-th event greater than t). \{T_r > t\} is equivalent to the event \{X \le r - 1\}, where X is the number of events in (0, t], since T_r can only exceed t if there are r - 1 or fewer events in (0, t]. X is Poisson with parameter \lambda t, so

P(T_r > t) = P(X \le r - 1) = \sum_{k=0}^{r-1}\frac{(\lambda t)^k}{k!}e^{-\lambda t}

and the distribution function for T_r is

F_{T_r}(t) = P(T_r \le t) = 1 - P(T_r > t) = 1 - \sum_{k=0}^{r-1}\frac{(\lambda t)^k}{k!}e^{-\lambda t}

T_r is called an Erlang random variable with parameters r and \lambda. The density function for T_r is

f_{T_r}(t) = \frac{d}{dt}F_{T_r}(t)
= \frac{d}{dt}\left(1 - e^{-\lambda t} - \lambda t e^{-\lambda t} - \frac{(\lambda t)^2}{2!}e^{-\lambda t} - \dots - \frac{(\lambda t)^{r-1}}{(r-1)!}e^{-\lambda t}\right)
= \lambda e^{-\lambda t} - \lambda e^{-\lambda t} + \lambda^2 t e^{-\lambda t} - \lambda^2 t e^{-\lambda t} + \frac{\lambda^3 t^2}{2!}e^{-\lambda t} - \dots - \frac{\lambda^{r-1} t^{r-2}}{(r-2)!}e^{-\lambda t} + \frac{\lambda^r t^{r-1}}{(r-1)!}e^{-\lambda t}
= \frac{\lambda^r t^{r-1}}{(r-1)!}e^{-\lambda t}, \quad t > 0
= \frac{\lambda^r t^{r-1}}{\Gamma(r)}e^{-\lambda t}, \quad t > 0

since \frac{\beta^{\alpha}}{\Gamma(\alpha)}t^{\alpha-1}e^{-\beta t} is the density of the gamma distribution.

(Revise the gamma distribution.)

The Erlang probability law is a particular case of the gamma distribution. So the time to the r-th occurrence in a Poisson process is gamma distributed with shape parameter r and rate \lambda.

Example: The instants at which telephone calls are made to a call centre form a Poisson process with \lambda = 120/hour.

Let T_{10} be the time to the tenth call made starting from 9am, using minutes as the unit of time. Then T_{10} is gamma distributed with shape r = 10 and rate 2/min. The expected time of the 10th call is

E(T_{10}) = \frac{10}{2} = 5 \text{ mins}

so at 9.05am.

The probability that the tenth call occurs before 9.05am is

P(T_{10} < 5) = 1 - \sum_{k=0}^{9}\frac{(5 \cdot 2)^k}{k!}e^{-5(2)} = 1 - \sum_{k=0}^{9}\frac{10^k}{k!}e^{-10} = .542

The probability that the tenth call is received between 9.05am and 9.07am is

P(5 < T_{10} \le 7) = \left(1 - \sum_{k=0}^{9}\frac{14^k}{k!}e^{-14}\right) - \left(1 - \sum_{k=0}^{9}\frac{10^k}{k!}e^{-10}\right) = .349

End Example
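The two probabilities above are finite Poisson tail sums and are easy to reproduce. A sketch in pure Python:

```python
from math import exp, factorial

def erlang_cdf(t, r, lam):
    """P(T_r <= t) = 1 - sum_{k=0}^{r-1} (lam t)^k / k! * e^{-lam t}."""
    return 1.0 - sum((lam * t)**k / factorial(k) for k in range(r)) * exp(-lam * t)

lam, r = 2.0, 10                             # rate 2 per minute, tenth call
p_before_905 = erlang_cdf(5, r, lam)                        # ~ 0.542
p_between = erlang_cdf(7, r, lam) - erlang_cdf(5, r, lam)   # ~ 0.349
print(round(p_before_905, 3), round(p_between, 3))
```

Note the identity being used: the Erlang CDF is one minus a Poisson CDF, so no gamma-function integration is needed.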

4.7 Summary of Inter-Arrival Times

We have seen some results concerning the distributions of the times between occurrences in a Poisson process:

1. The distribution of the time to the first event is exponential(\lambda);

2. Times between events are exponential(\lambda);

3. The time to the r-th event is gamma distributed, shape = r, rate = \lambda (scale = 1/rate).

So far we have assumed that the rate of occurrence \lambda is constant. This process is called a time homogeneous Poisson process.

Let X(t) denote the number of events in (0, t]. Then the Poisson process is said to have independent increments. Let T_1, T_2, T_3, T_4, \dots denote the arrival times of the process, and define T_0 = 0; then X(T_1) - X(T_0), X(T_2) - X(T_1), X(T_3) - X(T_2), \dots are independent random variables. This is since X(t + s) - X(s), t \ge 0, is a rate \lambda Poisson process and is independent of X(r), 0 \le r < s.

4.8 General Poisson Process

Let X(t) be the number of events in (0, t]. We say that X(t) is a Poisson process with rate \lambda(t) if:

1. X(0) = 0;

2. X(t) has independent increments;

3. X(t) - X(s), for s < t, is Poisson distributed with mean \int_s^t \lambda(r)\,dr.

Note the case where \lambda(r) = \lambda, a constant. Then the mean of the increment X(t) - X(s) is

\int_s^t \lambda(r)\,dr = \int_s^t \lambda\,dr = \lambda(t - s)

which is just the Poisson process we have studied up until now.

For a time homogeneous process we have shown that the times between arrivals follow an exponential distribution. If \lambda(t) depends explicitly on t then, in general, this isn't the case.

Let T_1 be the time to the first arrival. Since X(t) is Poisson with mean \int_0^t \lambda(r)\,dr,

P(T_1 > t) = P(X(t) = 0) = \frac{\left[\int_0^t \lambda(r)dr\right]^0 \exp\left(-\int_0^t \lambda(r)dr\right)}{0!} = \exp\left(-\int_0^t \lambda(r)\,dr\right)

What is the distribution of T_1? The cumulative distribution function is

F_{T_1}(t) = P(T_1 \le t) = 1 - P(T_1 > t) = 1 - \exp\left(-\int_0^t \lambda(r)\,dr\right)

and the density function of T_1 is

f_{T_1}(t) = \frac{d}{dt}F_{T_1}(t) = \left[\frac{d}{dt}\int_0^t \lambda(r)\,dr\right]\exp\left(-\int_0^t \lambda(r)\,dr\right) = \lambda(t)\exp\left(-\int_0^t \lambda(r)\,dr\right)

If we call \mu(t) = \int_0^t \lambda(r)\,dr, then we can see f_{T_1}(t) = \lambda(t)e^{-\mu(t)} will not in general be an exponential distribution.
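For a concrete non-homogeneous case, take \lambda(t) = 2t (an assumed illustrative rate), so \mu(t) = t^2 and P(T_1 > t) = e^{-t^2}: the first arrival time is Weibull rather than exponential. Inverting F_{T_1} gives a simple simulator whose output can be checked against the survival function:

```python
import random
from math import log, sqrt, exp

random.seed(1)

def first_arrival():
    # lambda(t) = 2t  =>  mu(t) = t^2,  F(t) = 1 - exp(-t^2)
    # Inversion: T1 = sqrt(-log(1 - U)) for U ~ Uniform[0, 1).
    return sqrt(-log(1.0 - random.random()))

n = 100_000
emp = sum(first_arrival() > 1.0 for _ in range(n)) / n
print("empirical P(T1 > 1):", emp, "  exact exp(-1):", exp(-1.0))
```

The empirical survival probability at t = 1 matches e^{-\mu(1)} = e^{-1}, while an exponential with the same mean would give a different tail.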

Aside: time homogeneous case. If \lambda(t) = \lambda then

\mu(t) = \int_0^t \lambda(r)\,dr = \lambda\int_0^t dr = \lambda t, \qquad f_{T_1}(t) = \lambda e^{-\lambda t}

When \lambda(t) depends explicitly on t, i.e. is non-constant, we term this a time non-homogeneous Poisson process. An example is a change point:

\lambda(t) = \begin{cases}\lambda_1, & t < \tau \\ \lambda_2, & t \ge \tau\end{cases}

Showing that a Poisson process satisfies the Markov property in general follows from the independent increments property (2). However, a Poisson process is a continuous time process, so we need to formally say what we mean by the Markov property in continuous time. In discrete time we observe our process at time points 0, 1, 2, 3, \dots, n, n+1, \dots. For continuous time we observe the process at arbitrary points in time in \mathbb{R}^+:

0 = s_0 < s_1 < s_2 < \dots < s_k < s < t < t_1 < \dots < t_n

with states i_0, i_1, i_2, \dots, i_k, i, j, j_1, \dots, j_n. We say that the Markov property holds if for these arbitrary points in time:

P(X(t) = j, X(t_1) = j_1, \dots, X(t_n) = j_n \mid X(s_0) = i_0, \dots, X(s_k) = i_k, X(s) = i)
= P(X(t) = j, X(t_1) = j_1, \dots, X(t_n) = j_n \mid X(s) = i)

Compare to the discrete time definition.

For the Poisson process

P(X(t) = j \mid X(s) = i) = \frac{P(X(t) = j, X(s) = i)}{P(X(s) = i)}
= \frac{P(X(t) - X(s) = j - i)P(X(s) = i)}{P(X(s) = i)} \quad \text{(independent increments)}
= P(X(t) - X(s) = j - i)
= \frac{\left(\int_s^t \lambda(r)dr\right)^{j-i}\exp\left(-\int_s^t \lambda(r)dr\right)}{(j-i)!}

Therefore the Poisson process satisfies the Markov property. We will denote P(X(t) = j \mid X(s) = i) by P_{s,t}(i, j) for continuous time processes.

In the next chapter we will meet examples where the states form a continuous random variable.

4.9 Compound Poisson Processes

A compound Poisson process associates an independent, identically distributed variable Y with each arrival of the Poisson process. The Y_i are assumed independent of the Poisson process of arrivals, and independent of each other.

Example 1: Consider messages arriving at a central computer before being transmitted over the internet. If we imagine a large number of users at separate terminals, we can assume that messages arrive at the central computer according to a Poisson process. If we let Y_i be the size (in bytes) of the i-th message, then again it's reasonable to assume the Y_i s are iid and independent of the Poisson process of arrivals. End Example

Example 2: Claims coming in to a large insurance company. Assume claims arrive according to a Poisson process and the sizes of claims (Y_i) can be assumed independent of each other. The compound process will give an idea of total liability. End Example

It is natural to consider the sum of all the Y_i s up to time t. At time t there are X(t) events of the Poisson process, with associated values Y_1, Y_2, \dots, Y_{X(t)}:

S(t) = Y_1 + Y_2 + \dots + Y_{X(t)}

where we set S(t) = 0 if X(t) = 0. In Example 1, S(t) = total information (bytes) transmitted; in Example 2, S(t) = total liability for the company.

We have the following result:

Theorem: Let Y_1, \dots, Y_{X(t)} be iid and S(t) = \sum_{i=1}^{X(t)} Y_i. Then:

1. If E[Y_i] < \infty and E[X(t)] < \infty, then

E[S(t)] = E[X(t)]E[Y]

2. If E[Y_i^2] < \infty and E[X(t)^2] < \infty, then

Var[S(t)] = E[X(t)]Var[Y] + Var[X(t)]E[Y]^2

Proof: When X(t) = n, then S(t) = Y_1 + \dots + Y_n, and E[S(t) \mid X(t) = n] = nE[Y]. Breaking things down according to the value of X(t):

E[S(t)] = \sum_{n=0}^{\infty} E[S(t) \mid X(t) = n]P(X(t) = n) = \sum_{n=0}^{\infty} nE(Y)P(X(t) = n) = E[Y]\sum_{n=0}^{\infty} nP(X(t) = n) = E[Y]E[X(t)]

For 2, again if X(t) = n, then

Var[S(t) \mid X(t) = n] = Var[Y_1 + \dots + Y_n] = n\,Var[Y]

Hence,

E[S(t)^2] = \sum_{n=0}^{\infty} E[S(t)^2 \mid X(t) = n]P(X(t) = n)
= \sum_{n=0}^{\infty}\left(n\,Var[Y] + E[S(t) \mid X(t) = n]^2\right)P(X(t) = n)
= \sum_{n=0}^{\infty}\left(n\,Var[Y] + n^2 E[Y]^2\right)P(X(t) = n)
= Var[Y]E[X(t)] + E[Y]^2 E[X(t)^2]

and so

Var[S(t)] = E[S(t)^2] - E[S(t)]^2
= Var[Y]E[X(t)] + E[Y]^2E[X(t)^2] - E[Y]^2E[X(t)]^2
= Var[Y]E[X(t)] + E[Y]^2\left(E[X(t)^2] - E[X(t)]^2\right)
= Var[Y]E[X(t)] + E[Y]^2 Var[X(t)]

QED

Example: Suppose the number of customers at an off-licence in a day is Poisson with mean 81, and the amount that each customer spends has mean €10 with a standard deviation of €6. The expected revenue in one day is 81(10) = €810. Since for a Poisson count Var[X(t)] = E[X(t)] = 81, the variance of the total revenue is

(81)(6^2) + (81)(10^2) = 81(36 + 100) = 11016

End Example
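The off-licence numbers can be checked against a simulation. A sketch assuming, as in the example, Poisson(81) customer counts; the spends are (arbitrarily, for illustration) taken normal with mean 10 and standard deviation 6, since only the first two moments of Y enter the formulas:

```python
import random
from math import exp

random.seed(7)

def poisson(mean):
    """Knuth's method: count uniforms until their product drops below e^{-mean}."""
    limit, prod, k = exp(-mean), 1.0, 0
    while True:
        prod *= random.random()
        if prod < limit:
            return k
        k += 1

n_days = 20_000
revenues = []
for _ in range(n_days):
    n_cust = poisson(81)
    revenues.append(sum(random.gauss(10, 6) for _ in range(n_cust)))

m = sum(revenues) / n_days
v = sum((r - m) ** 2 for r in revenues) / (n_days - 1)
theory_var = 81 * 36 + 81 * 100              # E[X]Var[Y] + Var[X]E[Y]^2
print("theory: mean 810, var", theory_var)
print(f"simulated: mean {m:.1f}, var {v:.0f}")
```

The simulated mean and variance land close to the theoretical 810 and 11016, whatever distribution is chosen for Y with those two moments.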

5 Some Continuous Time Processes

5.1 Brownian Motion

Consider the simple symmetric random walk (2.3) which takes a step to either the left or right (up or down) with equal probability, i.e., S_n = \sum_{j=1}^n X_j where the X_j s are iid random variables with

X_j = \begin{cases}+1, & p = \frac{1}{2} \\ -1, & p = \frac{1}{2}\end{cases}

If we think about speeding up this process, i.e., looking at it in smaller and smaller time intervals with smaller and smaller increments to the left and right, we'll get a continuous time process.

(Diagram: a sample path S_t of the simple random walk, taking values in \{-2, -1, 1, 2\} at times 1, 2, 3, 4.)

In this regard, consider the symmetric random walk taking steps over short intervals of length \Delta t, with steps of size \Delta x. Let X(t) be the value of the process at time t, and imagine we have n = t/\Delta t time intervals.

(Diagram: the rescaled walk X(t), taking values that are multiples of \Delta x at times \Delta t, 2\Delta t, 3\Delta t, 4\Delta t.)

Then,

X(t) = \Delta x X_1 + \Delta x X_2 + \dots + \Delta x X_{[t/\Delta t]} = \Delta x[X_1 + X_2 + \dots + X_{[t/\Delta t]}]

Consider the mean and variance of X(t):

E[X(t)] = \Delta x\left(\frac{t}{\Delta t}\right)E[X_1] = 0

since E[X_1] = \frac{1}{2}(1) + \frac{1}{2}(-1) = 0, and

Var[X(t)] = (\Delta x)^2\left(\frac{t}{\Delta t}\right)Var(X_1) = (\Delta x)^2\left(\frac{t}{\Delta t}\right)

since Var(X_1) = E[X_1^2] = \frac{1}{2}(1)^2 + \frac{1}{2}(-1)^2 = 1. Now we want to take the limit as \Delta x and \Delta t tend to 0. Let

\Delta x = c\sqrt{\Delta t}

where c is some positive constant, so

Var[X(t)] = c^2\Delta t\left(\frac{t}{\Delta t}\right) = c^2 t

The process that we're left with in the limit is Brownian motion.

Observe some more properties of this process:

1. Since X(t) = \Delta x(X_1 + X_2 + \dots + X_{[t/\Delta t]}), by the Central Limit Theorem X(t) follows a normal distribution with mean 0 and variance c^2 t.

2. As the distribution of the change in position of the random walk is independent over non-overlapping time intervals, \{X(t), t \ge 0\} has independent increments.

3. This process also has stationary increments, since the change in the process value over a given time interval of length t, which is N(0, c^2 t), depends only on the length of the interval.

The standard Brownian motion (c = 1) is sometimes called the Wiener process. It is one of the most widely used processes in applied probability.
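The "speeding up" construction above can be carried out directly. A minimal sketch (c = 1 assumed) simulating many scaled-walk paths up to time t = 1 and checking that Var[X(t)] \approx c^2 t:

```python
import random

random.seed(0)

dt = 0.001                                   # small time step
dx = dt ** 0.5                               # dx = c * sqrt(dt) with c = 1
t = 1.0
n_steps = int(t / dt)
n_paths = 5_000

# X(t) = dx * (X_1 + ... + X_{t/dt}) with iid +/-1 steps.
finals = []
for _ in range(n_paths):
    s = sum(1 if random.random() < 0.5 else -1 for _ in range(n_steps))
    finals.append(dx * s)

mean = sum(finals) / n_paths
var = sum((x - mean) ** 2 for x in finals) / (n_paths - 1)
print(f"sample mean (should be near 0): {mean:.4f}")
print(f"sample var (should be near c^2 t = 1): {var:.4f}")
```

A histogram of `finals` would also look normal, in line with the Central Limit Theorem argument in property 1.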

The independent increments assumption implies that the change in the value of the process between times s and t + s, i.e. X(t + s) - X(s), is independent of the process values before time s:

P(X(t+s) \le a \mid X(s) = x, X(u), 0 \le u < s) = P(X(t+s) - X(s) \le a - x \mid X(s) = x, X(u), 0 \le u < s)
= P(X(t+s) - X(s) \le a - x) \quad \text{(independence)}
= P(X(t+s) \le a \mid X(s) = x)

So this tells us that Brownian motion satisfies the Markov property (we showed earlier that a simple random walk satisfies the Markov property).

Let X(t) be standard Brownian motion; then X(t) \sim N(0, t). So the density of X(t) is

f_t(x) = \frac{1}{\sqrt{2\pi t}}e^{-\frac{x^2}{2t}}

Since Brownian motion has stationary and independent increments we can write down the joint distribution of X(t_1), X(t_2), \dots, X(t_n). This is:

f(x_1, \dots, x_n) = f_{t_1}(x_1)f_{t_2 - t_1}(x_2 - x_1)f_{t_3 - t_2}(x_3 - x_2)\cdots f_{t_n - t_{n-1}}(x_n - x_{n-1})

Using this we can compute many probabilities of interest.

Quick Recap: Brownian Motion

Limit of the SRW under 'speeding up': \{X(t), t \ge 0\} with X(t) \sim N(0, c^2 t); standard Brownian motion has c = 1, so X(t) \sim N(0, t).

Independent increments:

P(X(t+s) \le a \mid X(s) = x, X(u), 0 \le u < s) = P(X(t+s) \le a \mid X(s) = x)

\Longrightarrow Brownian motion satisfies the Markov property.

For x(t_1), \dots, x(t_n) with t_1 < t_2 < \dots < t_n:

f(x_1, \dots, x_n) = f_{t_1}(x_1)f_{t_2-t_1}(x_2 - x_1)\cdots f_{t_n - t_{n-1}}(x_n - x_{n-1})

For example, the conditional distribution of X(s) given that X(t) = B, where s < t, is

f_{s|t}(x|B) = \frac{f_{s,t}(x, B)}{f_t(B)} = \frac{f_s(x)f_{t-s}(B - x)}{f_t(B)}
= \frac{\frac{1}{\sqrt{2\pi s}}e^{-\frac{x^2}{2s}} \cdot \frac{1}{\sqrt{2\pi(t-s)}}e^{-\frac{(B-x)^2}{2(t-s)}}}{\frac{1}{\sqrt{2\pi t}}e^{-\frac{B^2}{2t}}}
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2}\left[\frac{x^2}{s} + \frac{(B-x)^2}{t-s} - \frac{B^2}{t}\right]\right)
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2}\left[\frac{x^2}{s} + \frac{B^2 - 2Bx + x^2}{t-s} - \frac{B^2}{t}\right]\right)
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2}\left[\left(\frac{1}{s} + \frac{1}{t-s}\right)x^2 - \frac{2B}{t-s}x + B^2\left(\frac{1}{t-s} - \frac{1}{t}\right)\right]\right)
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2}\left[\frac{t}{s(t-s)}x^2 - \frac{2B}{t-s}x + \frac{sB^2}{t(t-s)}\right]\right)
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2\frac{s(t-s)}{t}}\left[x^2 - \frac{2Bs}{t}x + \frac{s^2B^2}{t^2}\right]\right)
= \frac{1}{\sqrt{2\pi\frac{s(t-s)}{t}}}\exp\left(-\frac{1}{2\frac{s(t-s)}{t}}\left(x - \frac{Bs}{t}\right)^2\right)

which is the density of a normal distribution with mean Bs/t and variance s(t-s)/t. So this tells us that

E[X(s) \mid X(t) = B] = \frac{Bs}{t}, \qquad Var[X(s) \mid X(t) = B] = \frac{s(t-s)}{t} \quad \text{(independent of B)}

Interestingly, the variance here does not depend on B. If we set \alpha = s/t, then since s < t we have 0 < \alpha < 1, the mean is \alpha X(t), and the variance is \alpha(1 - \alpha)t.

When we consider the process only between 0 and 1 conditional on X(1) = 0, this new process is known as the Brownian bridge.

(Diagram, unfinished in the original notes: a Brownian bridge path pinned at X(0) = 0 and X(1) = 0.)

This is used in the analysis of empirical distribution functions.

5.2 Gaussian Processes

A stochastic process \{X(t), t \ge 0\} is called a Gaussian process if (X(t_1), \dots, X(t_n)), t_1 < \dots < t_n, has a multivariate normal distribution for all t_1, \dots, t_n.

Recall that the multivariate normal distribution is defined for a random vector x = (X(t_1), \dots, X(t_n)) by

f_x(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)

where \Sigma is an n \times n covariance matrix and \mu = (\mu_1, \dots, \mu_n) is the mean vector.

Example: If X_1, \dots, X_n \sim N(\mu, \sigma^2) iid, then

\Sigma = \mathrm{diag}(\sigma^2, \dots, \sigma^2), \qquad \mu = (\mu, \dots, \mu)

f_x(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}\underbrace{(x - \mu)^T\Sigma^{-1}(x - \mu)}_{\text{Mahalanobis distance}}\Big)

with |\Sigma| = (\sigma^2)^n and \Sigma^{-1} = \mathrm{diag}(1/\sigma^2, \dots, 1/\sigma^2), so

(x - \mu)^T\Sigma^{-1}(x - \mu) = \frac{1}{\sigma^2}(x - \mu)^T I(x - \mu) = \frac{1}{\sigma^2}(x - \mu)^T(x - \mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

\Rightarrow f_X(x) = \frac{1}{(2\pi)^{n/2}(\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)

This is the likelihood function for an iid normal sample. Recall that the likelihood function is:

L(\mu, \sigma^2) = \prod_{i=1}^{n}f(x_i)

End Example

Recall that the joint density function of X(t_1), \dots, X(t_n) for Brownian motion was

f(x_1, \dots, x_n) = f_{t_1}(x_1)f_{t_2-t_1}(x_2 - x_1)f_{t_3-t_2}(x_3 - x_2)\cdots f_{t_n - t_{n-1}}(x_n - x_{n-1})

It follows from this that Brownian motion is a Gaussian process.

5.3 Brownian Motion With Drift

We say that \{X(t), t \ge 0\} is a Brownian motion process with drift coefficient \mu if:

1. X(0) = 0;

2. \{X(t), t \ge 0\} has stationary and independent increments;

3. X(t) is normally distributed with mean \mu t and variance t.

(Diagram omitted in the original notes.)

It can be written as

X(t) = \mu t + W(t)

where W(t) is a standard Brownian motion.

5.4 Finance Applications

Alas, no time.

6 Applications of Stochastic Processes: Bayesian Model Estimation Through Markov Chain Monte Carlo

6.1 Likelihood and Maximum Likelihood

Likelihood and maximum likelihood were proposed by R. A. Fisher in 1921.

When one assumes a specific probability law/distribution for observed data, we can form what is called the likelihood function. Maximum likelihood finds the parameter values which maximise the likelihood. Assume X_1, \dots, X_n are a random sample of a random variable X, which we assume has density f(x|\theta), where \theta are the unknown parameter(s) (or, if X is discrete, a probability mass function). Then the likelihood function is

\pi(x|\theta) = f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta) = \prod_{i=1}^{n}f(x_i|\theta)

This can be thought of as the probability of observing the given random sample with parameters \theta.

Example: Suppose that the time to failure of a vital component in an electronic device is exponentially distributed. A sample of n failure times is x = (x_1, \dots, x_n). The likelihood function is:

\pi(x|\lambda) = \prod_{i=1}^{n}\lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_{i=1}^{n} x_i}

End Example

Maximum likelihood proceeds by maximising the likelihood with respect to the unknown parameter \theta. Usually, we work with the log-likelihood

\log\pi(x|\theta) = \log\left[\prod_{i=1}^{n}f(x_i|\theta)\right] = \sum_{i=1}^{n}\log f(x_i|\theta)

Then take the gradient of the log-likelihood and set this equal to zero:

\nabla_{\theta}\log\pi(x|\theta) = 0

The value \hat{\theta} which satisfies this is the maximum likelihood estimate.

Example: For the exponential sample above,

\log\pi(x|\lambda) = n\log\lambda - \lambda\sum_i x_i

\frac{d}{d\lambda}\log\pi(x|\lambda) = \frac{n}{\lambda} - \sum_i x_i

\frac{n}{\hat{\lambda}} - \sum_i x_i = 0 \implies \hat{\lambda} = \frac{n}{\sum_i x_i} = \frac{1}{\bar{x}}

End Example
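A quick numerical check of \hat{\lambda} = 1/\bar{x}; the true rate \lambda = 2.5 below is an assumed illustrative value:

```python
import random

random.seed(42)

true_lam = 2.5                               # assumed, illustrative
xs = [random.expovariate(true_lam) for _ in range(50_000)]

lam_hat = len(xs) / sum(xs)                  # MLE: n / sum(x_i) = 1 / xbar
print(f"true lambda {true_lam}, MLE {lam_hat:.3f}")
```

With a large sample the estimate sits close to the true rate, as the consistency of the MLE suggests.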

Example: Assume X_1, \dots, X_n \sim Bernoulli(p). What is the MLE of p?

f(x|p) = p^x(1-p)^{1-x}

\pi(x|p) = \prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{\sum_i(1-x_i)} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}

\log\pi(x|p) = \sum_i x_i\log p + \left(n - \sum_i x_i\right)\log(1-p)

\frac{d}{dp}\log\pi(x|p) = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1-p}

Setting this to zero at \hat{p}:

\frac{\sum_i x_i}{\hat{p}} - \frac{n - \sum_i x_i}{1 - \hat{p}} = 0

\frac{1 - \hat{p}}{\hat{p}} = \frac{n - \sum_i x_i}{\sum_i x_i}

\frac{1}{\hat{p}} - 1 = \frac{n}{\sum_i x_i} - 1

\hat{p} = \frac{\sum_i x_i}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}

End Example

Example: X_1, \dots, X_n \sim N(\mu, \sigma^2). What are the maximum likelihood estimates of \mu and \sigma^2?

f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}

\pi(x; \mu, \sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}}e^{-\frac{1}{2\sigma^2}\sum_i(x_i-\mu)^2}

\log\pi(x; \mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

MLE for \mu:

\frac{d}{d\mu}\log\pi(x; \mu, \sigma^2) = -\frac{2}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)\times(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)

Setting this to zero at \hat{\mu}:

\frac{1}{\sigma^2}\sum_i(x_i - \hat{\mu}) = 0 \implies \sum_i x_i - n\hat{\mu} = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}

MLE for \sigma^2 (differentiating with respect to \sigma):

\frac{d}{d\sigma}\log\pi(x; \mu, \sigma^2) = -\frac{n}{2}\cdot\frac{4\pi\sigma}{2\pi\sigma^2} + \frac{2}{2\sigma^3}\sum_i(x_i - \mu)^2 = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_i(x_i - \mu)^2

Setting this to zero at \hat{\sigma}:

-\frac{n}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3}\sum_i(x_i - \mu)^2 = 0

-\hat{\sigma}^2 n + \sum_i(x_i - \mu)^2 = 0

\hat{\sigma}^2 n = \sum_i(x_i - \mu)^2

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

This is a biased estimator. Recall that

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

has E(s^2) = \sigma^2, whereas here E(\hat{\sigma}^2) = ? (Exercise: show also that E(s) \ne \sigma.)

End Example

Example: Let X_1, \dots, X_n \sim Gamma(\alpha, \beta). What are the MLEs of \alpha and \beta? Write \theta = (\alpha, \beta).

f(x|\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}

\pi(x|\theta) = \prod_{i=1}^{n}f(x_i|\theta) = \prod_{i=1}^{n}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x_i^{\alpha-1}e^{-\beta x_i} = \frac{\beta^{n\alpha}}{[\Gamma(\alpha)]^n}\left[\prod_{i=1}^{n}x_i\right]^{\alpha-1}e^{-\beta\sum_{i=1}^{n} x_i}

\log\pi(x|\theta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\log\left[\prod_{i=1}^{n}x_i\right] - \beta\sum_i x_i
= n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_i\log x_i - \beta\sum_i x_i

ML for \beta:

\frac{\partial}{\partial\beta}\log\pi(x|\theta) = \frac{n\alpha}{\beta} - \sum_i x_i \implies \frac{n\hat{\alpha}}{\hat{\beta}} = \sum_i x_i \implies \hat{\beta} = \frac{n\hat{\alpha}}{\sum_i x_i}

ML for \alpha:

\frac{\partial}{\partial\alpha}\log\pi(x|\theta) = n\log\beta - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_i\log x_i

\implies \hat{\alpha} is the solution of:

n\log\hat{\beta} + \sum_i\log x_i = n\frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})}

There is no closed form solution for the MLEs; use numerical methods to solve for \hat{\alpha}, \hat{\beta}. This can be done quite easily using the R function optim.

End Example
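The notes suggest R's optim; an equivalent sketch in Python using only the standard library. The digamma function \Gamma'/\Gamma is approximated by a central difference of math.lgamma (an assumption of this sketch, not part of the notes), and the score equation in \alpha is solved by bisection after profiling out \hat{\beta} = n\alpha/\sum_i x_i:

```python
import random
from math import lgamma, log

random.seed(3)

# Simulated Gamma(shape=3, rate=2) data: each draw is a sum of 3 Exp(2) variables.
xs = [sum(random.expovariate(2.0) for _ in range(3)) for _ in range(20_000)]
n, sx, slx = len(xs), sum(xs), sum(log(x) for x in xs)

def digamma(a, h=1e-6):
    """Central-difference approximation of Gamma'(a)/Gamma(a)."""
    return (lgamma(a + h) - lgamma(a - h)) / (2 * h)

def score(a):
    """Profile score in alpha: n log(beta_hat(a)) - n psi(a) + sum log x_i."""
    beta_hat = n * a / sx
    return n * log(beta_hat) - n * digamma(a) + slx

# The profile score is decreasing in alpha, so bisection finds its zero, the MLE.
lo, hi = 0.1, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
alpha_hat = 0.5 * (lo + hi)
beta_hat = n * alpha_hat / sx
print(f"alpha_hat {alpha_hat:.3f} (true 3), beta_hat {beta_hat:.3f} (true 2)")
```

With 20,000 simulated observations the estimates land close to the true shape 3 and rate 2.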

6.2 Prior Distributions

In finding the maximum likelihood estimates in the previous section, only the observed sample values x_1, \dots, x_n are used to construct the estimate of \theta. ML does not require any other information to estimate \theta than the sample values. If we did have some prior information about the possible values that \theta may take, such as expert opinion, it would have been impossible to incorporate this. In many situations such information will be available. We can use this information to inform a prior distribution for \theta, and then use the Bayesian approach for estimation. The prior distribution of a parameter \theta is a probability function/density expressing our degree of belief about the value of \theta prior to observing a sample of a random variable X, whose distribution function depends on \theta. The prior distribution makes use of information available above and beyond what's in the random sample.

Example: Suppose we have a brand new 50 cent coin and we want to estimate the probability \theta of a head. We know \theta has to lie between 0 and 1. A prior for \theta could be uniform over the interval from 0 to 1:

\pi(\theta) = \begin{cases}1, & \theta\in(0,1) \\ 0, & \text{otherwise}\end{cases}

This corresponds to an assumption of total ignorance; we feel that all values of \theta are equally likely. On the other hand, one may feel justified in assuming a priori \theta \in (.4, .6), since the coin appears quite symmetric. Then the following prior corresponds to a belief that any value in (.4, .6) is equally likely:

\pi(\theta) = \begin{cases}5, & \theta\in(.4,.6) \\ 0, & \text{otherwise}\end{cases}

Finally, we may allow only the values .4, .5, .6, with .5 twice as likely, giving the discrete prior

\pi(.4) = \frac{1}{4}, \quad \pi(.5) = \frac{1}{2}, \quad \pi(.6) = \frac{1}{4}

Note in this example that the priors are different and depend on the assumptions we are willing to make regarding the unknown \theta. Often these assumptions will be informed using expert opinion on the problem. End Example

\pi(\theta) expresses prior beliefs about where \theta may lie in the parameter space \Theta. For example: Normal(\mu, \sigma^2) has \Theta = \mathbb{R}\times\mathbb{R}^+; Bernoulli(\theta) has \Theta = (0, 1).

Prior choice is a subjective task. The final result of a Bayesian technique is generally dependent on the prior assumed. Hence, care should be taken when eliciting priors.

6.3 Posterior Distributions

Having observed a sample x = (x_1, \dots, x_n) we can write down the likelihood for x given the value of \theta:

Likelihood = \pi(x|\theta) = \prod_{i=1}^{n}f(x_i|\theta)

By taking a prior on \theta we are in essence acting as if the probability law of X is itself random, through its dependence on \theta. Hence, we speak of the likelihood as the distribution of x conditional on \theta.

Given a prior density \pi(\theta) for \theta and the conditional density of the elements of a sample (the likelihood) \pi(x|\theta), the joint density for the sample and parameter is simply the product of these two functions,

\pi(x, \theta) = \pi(x|\theta)\pi(\theta)

from the definition of conditional probability:

\pi(x|\theta) = \frac{\pi(x, \theta)}{\pi(\theta)}

i.e. the joint density is the product of the likelihood and the prior. Then the marginal density of the sample values, which does not depend on \theta, is given by the integral of the joint density over the space \Theta. Thus,

\pi(x) = \int_{\Theta}\pi(x, \theta)\,d\theta = \int_{\Theta}\pi(x|\theta)\pi(\theta)\,d\theta

This is called the marginal likelihood of the sample.

The posterior density for \theta is the conditional density of \theta given the sample values. Thus,

\pi(\theta|x) = \frac{\pi(x, \theta)}{\pi(x)} = \frac{\pi(x|\theta)\pi(\theta)}{\pi(x)}

The prior density expresses our degree of belief about \theta before any experiment, while the posterior expresses our beliefs given the result of the sample. Notice that the marginal likelihood \pi(x) is the normalising constant of \pi(x|\theta)\pi(\theta), i.e.,

\int_{\Theta}\frac{\pi(x|\theta)\pi(\theta)}{\pi(x)}\,d\theta = 1

(the denominator makes it a proper density). But the marginal doesn't depend explicitly on \theta. We will often write:

\pi(\theta|x) \propto \pi(x|\theta)\pi(\theta)

Posterior \propto likelihood \times prior

In many cases \pi(x) will not be available analytically. This is what leads us to numerical methods such as Markov chain Monte Carlo (MCMC).
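Even when \pi(x) has no closed form, posterior \propto likelihood \times prior can be evaluated on a grid and normalised numerically. A sketch for the coin example with a uniform prior on \theta; the data (7 heads in 10 tosses) are an assumed illustrative sample:

```python
# Grid approximation of the posterior for a Bernoulli parameter theta.
n_heads, n_tosses = 7, 10                    # assumed, illustrative data
G = 10_001
grid = [i / (G - 1) for i in range(G)]

def prior(theta):                            # uniform prior on (0, 1)
    return 1.0

# Unnormalised posterior: likelihood * prior at each grid point.
unnorm = [theta**n_heads * (1 - theta)**(n_tosses - n_heads) * prior(theta)
          for theta in grid]
Z = sum(unnorm) / (G - 1)                    # Riemann approximation of pi(x)
post = [u / Z for u in unnorm]

# Posterior mean on the grid; the exact posterior is Beta(8, 4) with mean 8/12.
mean = sum(t * p for t, p in zip(grid, post)) / (G - 1)
print(f"posterior mean {mean:.4f}  (exact {8/12:.4f})")
```

With a uniform prior this reproduces the conjugate Beta(8, 4) answer; with a non-conjugate prior the same grid recipe still works, which is the point of the remark above.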

Example: Suppose $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$. Assume a prior for $\mu$ which is $N(\xi, \tau^2)$, and a prior for $\sigma^2$ which is Inv-Gamma$(\alpha, \beta)$.

$$Y \sim \text{Gamma}(\underbrace{\alpha}_{\text{shape}},\, \underbrace{\beta}_{\text{rate}}) \quad \text{where scale} = \frac{1}{\text{rate}}. \quad \text{(Note that R uses scale} = 1/\beta.)$$
$$\Rightarrow \quad \frac{1}{Y} \sim \text{Inv-Gamma}(\alpha, \beta) \quad \text{where} \quad f_{1/Y}(t) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{-(\alpha+1)}\, e^{-\beta/t}.$$


It is a good exercise to derive this:

\begin{align*}
F_Y(t) &= \mathbb{P}(Y \le t) \\
F_{1/Y}(t) &= \mathbb{P}\!\left(\tfrac{1}{Y} \le t\right) = \mathbb{P}\!\left(Y \ge \tfrac{1}{t}\right) = 1 - \mathbb{P}\!\left(Y \le \tfrac{1}{t}\right) = 1 - F_Y\!\left(\tfrac{1}{t}\right) \\
f_{1/Y}(t) &= \frac{d}{dt} F_{1/Y}(t) = -\frac{d}{dt} F_Y\!\left(\tfrac{1}{t}\right) = \frac{-(-1)}{t^2}\, f_Y\!\left(\tfrac{1}{t}\right) \\
&= \frac{1}{t^2} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\tfrac{1}{t}\right)^{\alpha-1} e^{-\beta(1/t)} \\
&= \frac{1}{t^2} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{-\alpha+1}\, e^{-\beta/t} \\
&= \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{-(\alpha+1)}\, e^{-\beta/t}.
\end{align*}
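The change-of-variables result above can be checked by simulation. Below is a minimal sketch using only the Python standard library; note that `random.gammavariate(shape, scale)` uses the scale parameterisation, so a draw from a Gamma with rate $\beta$ uses scale $1/\beta$. (The variable names are ours, not from the notes.)

```python
import random

# If Y ~ Gamma(shape=alpha, rate=beta), then 1/Y ~ Inv-Gamma(alpha, beta),
# whose mean is beta/(alpha - 1) for alpha > 1.
random.seed(1)
alpha, beta = 5.0, 4.0

# random.gammavariate takes (shape, scale); a rate-beta draw uses scale 1/beta.
draws = [1.0 / random.gammavariate(alpha, 1.0 / beta) for _ in range(200_000)]

empirical_mean = sum(draws) / len(draws)
theoretical_mean = beta / (alpha - 1.0)
```

The empirical mean of the simulated $1/Y$ values should sit close to the inverse-gamma mean $\beta/(\alpha-1)$.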

Now, back to the example:
\begin{align*}
\pi(\mu) &= \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left(-\frac{1}{2\tau^2}(\mu - \xi)^2\right) \\
\pi(\sigma^2) &= \frac{\beta^\alpha}{\Gamma(\alpha)}\, (\sigma^2)^{-(\alpha+1)}\, e^{-\beta/\sigma^2} \\
\pi(x|\mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)
\end{align*}


\begin{align*}
\text{Posterior} &\propto \text{Likelihood} \times \text{Prior} \\
\pi(\mu, \sigma^2 | x) &\propto \pi(x|\mu, \sigma^2)\,\pi(\mu)\,\pi(\sigma^2) \qquad \text{(prior independence)} \\
&\propto (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right) \times (2\pi\tau^2)^{-1/2} \exp\!\left(-\frac{1}{2\tau^2}(\mu - \xi)^2\right) \\
&\qquad \times \frac{\beta^\alpha}{\Gamma(\alpha)}\, (\sigma^2)^{-(\alpha+1)} \exp\!\left(-\frac{\beta}{\sigma^2}\right) \\
&\propto (\sigma^2)^{-(n/2 + \alpha + 1)} \exp\!\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(x_i - \mu)^2 + 2\beta\right]\right) \times \exp\!\left(-\frac{1}{2\tau^2}(\mu - \xi)^2\right) \\
&\propto (\sigma^2)^{-(n/2 + \alpha + 1)} \exp\!\left(-\frac{1}{2}\left[\frac{1}{\sigma^2}\left(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\right) + \frac{2\beta}{\sigma^2} + \frac{1}{\tau^2}\left(\mu^2 - 2\mu\xi + \xi^2\right)\right]\right) \\
&\propto (\sigma^2)^{-(n/2 + \alpha + 1)} \exp\!\left(-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 - 2\left(\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}\right)\mu + \frac{\sum_i x_i^2 + 2\beta}{\sigma^2} + \frac{\xi^2}{\tau^2}\right]\right)
\end{align*}
End Example

Computing a marginal likelihood $\pi(x)$ is only possible in the simplest of cases/models. Suppose $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, and take the prior on $\mu$ to be $N(\xi, \tau^2)$.

\begin{align*}
\pi(x|\mu, \sigma^2) &= \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right) \\
\pi(\mu) &= (2\pi\tau^2)^{-1/2} \exp\!\left(-\frac{1}{2\tau^2}(\mu - \xi)^2\right)
\end{align*}

\begin{align*}
\pi(x) &= \int_{-\infty}^{+\infty} \pi(x|\mu, \sigma^2)\,\pi(\mu)\,d\mu \\
&= \int_{-\infty}^{+\infty} \underbrace{(2\pi\sigma^2)^{-n/2}(2\pi\tau^2)^{-1/2}}_{C} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 - \frac{1}{2\tau^2}(\mu - \xi)^2\right) d\mu \\
&= C \int_{-\infty}^{+\infty} \exp\!\left(-\frac{1}{2\sigma^2}\left[\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\right] - \frac{1}{2\tau^2}\left[\mu^2 - 2\xi\mu + \xi^2\right]\right) d\mu \\
&= C \int_{-\infty}^{+\infty} \exp\!\left(-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 - 2\left(\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}\right)\mu + \underbrace{\frac{\sum_i x_i^2}{\sigma^2} + \frac{\xi^2}{\tau^2}}_{C_2}\right]\right) d\mu \\
&= C \exp\!\left(-\frac{C_2}{2}\right) \int_{-\infty}^{+\infty} \exp\!\left(-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 - 2\left(\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}\right)\mu\right]\right) d\mu \\
&= C' \int_{-\infty}^{+\infty} \exp\!\left(-\frac{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}{2}\left[\mu^2 - 2\,\frac{\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\,\mu\right]\right) d\mu \qquad \left(C' = C\,e^{-C_2/2}\right) \\
&= C' \int_{-\infty}^{+\infty} \exp\!\left(-\frac{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}{2}\left(\left[\mu - \frac{\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right]^2 - \left[\frac{\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right]^2\right)\right) d\mu \\
&= C' \exp\!\left(\frac{\left(\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}\right)^2}{2\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)}\right) \int_{-\infty}^{+\infty} \exp\!\left(-\frac{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}{2}\left(\mu - \frac{\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right)^2\right) d\mu \\
&= C' \exp\!\left(\frac{\left(\frac{\sum_i x_i}{\sigma^2} + \frac{\xi}{\tau^2}\right)^2}{2\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)}\right) \sqrt{\frac{2\pi}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}}
\end{align*}
The last integral is over the full range of a normal density with variance $\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}$, so it equals $\sqrt{2\pi/\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)}$.
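The closed form just derived can be sanity-checked against brute-force numerical integration over $\mu$. A sketch in pure Python (the function names and test data are ours, not from the notes):

```python
import math

# Closed-form marginal likelihood pi(x) for N(mu, sigma2) data, sigma2 known,
# with prior mu ~ N(xi, tau2), following the completed-square derivation.
def marginal_closed_form(x, sigma2, xi, tau2):
    n = len(x)
    C = (2 * math.pi * sigma2) ** (-n / 2) * (2 * math.pi * tau2) ** (-0.5)
    A = n / sigma2 + 1.0 / tau2                      # coefficient of mu^2
    B = sum(x) / sigma2 + xi / tau2                  # coefficient of mu
    C2 = sum(v * v for v in x) / sigma2 + xi * xi / tau2
    return C * math.exp(-0.5 * C2 + B * B / (2 * A)) * math.sqrt(2 * math.pi / A)

# Trapezoidal integration of likelihood * prior over mu, for comparison.
def marginal_numeric(x, sigma2, xi, tau2, lo=-10.0, hi=10.0, steps=20_000):
    n = len(x)
    def integrand(mu):
        lik = (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
            -sum((v - mu) ** 2 for v in x) / (2 * sigma2))
        pri = (2 * math.pi * tau2) ** (-0.5) * math.exp(-(mu - xi) ** 2 / (2 * tau2))
        return lik * pri
    h = (hi - lo) / steps
    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + k * h) for k in range(1, steps))
    return total * h

x = [1.2, 0.7, 2.1, 1.5]
exact = marginal_closed_form(x, sigma2=1.0, xi=0.0, tau2=4.0)
numeric = marginal_numeric(x, sigma2=1.0, xi=0.0, tau2=4.0)
```

The two values should agree to several significant figures, since the derivation is exact and only the quadrature introduces error.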


6.4 Posterior Quantities of Interest

There are many quantities of interest that we may want to obtain from a Bayesian analysis. For example, the posterior mean $\theta^*$ is a widely used Bayesian estimator. The mode of the posterior, $\hat{\theta}$, is called the maximum a posteriori (MAP) estimate of $\theta$.

If $\theta$ is of dimension $p$, $\theta = (\theta_1, \ldots, \theta_p)$, we may be interested in the marginal density of $\theta_j$:
$$\pi(\theta_j|x) = \int_{\Theta_{-j}} \pi(\theta|x)\,d\theta_{-j}, \qquad j = 1, \ldots, p,$$
where $\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_p)$ is $\theta$ with the $j$th element removed.

Consider the posterior expectation of $\theta$:
$$\theta^* = \int_{\Theta} \theta\,\pi(\theta|x)\,d\theta = \mathbb{E}_{\theta|x}[\theta] = \int_{\Theta} \theta\,\frac{\pi(x|\theta)\,\pi(\theta)}{\pi(x)}\,d\theta.$$
This calculation requires knowing $\pi(x)$, which will be intractable in most cases.

This is a big problem!

We will face these integrals in each problem we look at.

What if we could simulate values of $\theta$, say $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}$, from $\pi(\theta|x)$? Instead of doing these integrals analytically, we could approximate them numerically:
$$\mathbb{E}_{\theta|x}[\theta] = \int_{\Theta} \theta\,\pi(\theta|x)\,d\theta \approx \frac{1}{N}\sum_{k=1}^{N} \theta^{(k)}.$$
In fact we could use the same approach to approximate the posterior expectation of any function $g(\theta)$ of $\theta$:
$$\mathbb{E}_{\theta|x}[g(\theta)] = \int_{\Theta} g(\theta)\,\pi(\theta|x)\,d\theta \approx \frac{1}{N}\sum_{k=1}^{N} g(\theta^{(k)}).$$

The main idea of Markov chain Monte Carlo is to approximately generate samples from the posterior $\pi(\theta|x)$, and then use these to approximate integrals.
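The Monte Carlo averaging idea can be seen in a case where the posterior is directly sampleable. As an illustrative assumption (not from the notes), take Bernoulli$(\theta)$ data with a Uniform$(0,1)$ prior, whose posterior is the standard conjugate Beta form:

```python
import random

# Bernoulli(theta) data with a Uniform(0,1) prior gives the conjugate posterior
# theta | x ~ Beta(1 + #successes, 1 + #failures).
random.seed(2)
successes, failures = 7, 3
a, b = 1 + successes, 1 + failures

N = 100_000
samples = [random.betavariate(a, b) for _ in range(N)]

post_mean_mc = sum(samples) / N                 # (1/N) sum over theta^(k)
post_mean_exact = a / (a + b)                   # exact Beta mean, for comparison
# E[g(theta)] with g(t) = t^2, via the same sample average:
post_second_moment = sum(t * t for t in samples) / N
```

The sample average converges to the exact posterior mean at the usual $O(1/\sqrt{N})$ Monte Carlo rate; MCMC applies the same averaging, just with dependent draws.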

6.5 MCMC: The Key Ideas

The key idea of MCMC is simple. We want to generate samples from $\pi(\theta|x)$ but we can't do this directly. However, suppose we can construct a Markov chain (through its transition probabilities) with state space $\Theta$ (all values of $\theta$) which is straightforward to simulate from, and whose stable (stationary) distribution is the posterior $\pi(\theta|x)$:
$$\theta^{(0)},\, \theta^{(1)},\, \theta^{(2)},\, \theta^{(3)},\, \ldots,\, \theta^{(t)},\, \ldots,\, \theta^{(N)}.$$


6.6 The Gibbs Sampling Algorithm

Julian Besag (1974) discussion paper in JRSSB.

Let $\theta = (\theta_1, \ldots, \theta_p)$, and suppose we want to obtain inferences from $\pi(\theta|x)$, but sampling from it directly isn't easy. We can recast the problem as one of iterative sampling from appropriate conditional distributions.

Consider the full conditional densities
$$\pi(\theta_j \,|\, x, \theta_{-j}), \qquad j = 1, \ldots, p,$$
where $\theta_{-j} = \{\theta_i : i \ne j\}$. These are the densities of the individual components given the data and specified values of the other components of $\theta$. They can typically be recognised as standard densities in $\theta_j$, e.g. normal, gamma, etc.

Suppose we have an arbitrary set of starting values $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$. For the unknowns we implement the following iterative procedure:

1st iteration:
\begin{align*}
&\text{draw } \theta_1^{(1)} \text{ from } \pi(\theta_1 \,|\, \theta_2^{(0)}, \ldots, \theta_p^{(0)}, x) \\
&\text{draw } \theta_2^{(1)} \text{ from } \pi(\theta_2 \,|\, \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_p^{(0)}, x) \\
&\text{draw } \theta_3^{(1)} \text{ from } \pi(\theta_3 \,|\, \theta_1^{(1)}, \theta_2^{(1)}, \theta_4^{(0)}, \ldots, \theta_p^{(0)}, x) \\
&\qquad \vdots \\
&\text{draw } \theta_p^{(1)} \text{ from } \pi(\theta_p \,|\, \theta_1^{(1)}, \ldots, \theta_{p-1}^{(1)}, x)
\end{align*}
2nd iteration:
$$\text{draw } \theta_1^{(2)} \text{ from } \pi(\theta_1 \,|\, \theta_2^{(1)}, \ldots, \theta_p^{(1)}, x), \quad \ldots$$

Now suppose this procedure is continued through $t$ iterations. The resulting sampled vector $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_p^{(t)})$ is a realisation of a Markov chain with transition probabilities
$$p(\theta^{(t)}, \theta^{(t+1)}) = \prod_{j=1}^{p} \pi(\theta_j^{(t+1)} \,|\, \theta_\ell^{(t+1)},\, \ell < j;\; \theta_\ell^{(t)},\, \ell > j;\; x),$$
the transition (Gibbs) kernel.

Then as $t \to \infty$, $(\theta_1^{(t)}, \ldots, \theta_p^{(t)})$ tends in distribution to a random vector whose joint density is $\pi(\theta|x)$. (We throw away the initial part of the chain, called the burn-in.) In particular, $\theta_j^{(t)}$ tends in distribution to a random quantity whose density is $\pi(\theta_j|x)$.

Example: A popular application of the Gibbs sampler is in finite mixture models used for model-based clustering. In R, see the package mclust (Raftery).


For a Gaussian finite mixture, the density of an observation $x$ is given by
$$f_X(x) = \sum_{g=1}^{G} w_g\, f(x \,|\, \mu_g, \sigma_g^2)$$
where the $w_g$ are the mixture weights with
$$\sum_{g=1}^{G} w_g = 1$$
and $f(x|\mu_g, \sigma_g^2)$ is the $N(\mu_g, \sigma_g^2)$ density. The likelihood for $n$ observations $x_1, \ldots, x_n$ is
$$\pi(x|\theta) = \prod_{i=1}^{n} \left(\sum_{g=1}^{G} w_g\, f(x_i \,|\, \mu_g, \sigma_g^2)\right).$$

The likelihood is very difficult to work with. Thus we usually complete the data with component labels $z = (z_1, \ldots, z_n)$, which tell us which component each observation belongs to: if $z_i = g$, then $x_i$ arises from a $N(\mu_g, \sigma_g^2)$. Of course the labels give the clustering of the data, but they can't be observed directly. We can, however, include them as unknowns in the Gibbs sampler. (Compare K-means.)

The likelihood of the complete data is
\begin{align*}
\pi(x, z|\theta) &= \prod_{g=1}^{G} \prod_{i:\, z_i = g} w_g\, \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\!\left(-\frac{(x_i - \mu_g)^2}{2\sigma_g^2}\right) \\
&= \prod_{g=1}^{G} w_g^{n_g}\, (2\pi\sigma_g^2)^{-n_g/2} \exp\!\left(-\frac{1}{2\sigma_g^2}\sum_{i:\, z_i = g}(x_i - \mu_g)^2\right)
\end{align*}
where $n_g = \#\{i : z_i = g\}$.

Priors: weights. The standard assumption is that the weights follow a symmetric Dirichlet distribution on the simplex $\sum_{g=1}^{G} w_g = 1$. (Recall the Beta$(\alpha, \beta)$ density $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$; the Dirichlet is its multivariate generalisation.)
$$\pi(w_1, \ldots, w_G) = \frac{\Gamma(\delta + \delta + \cdots + \delta)}{\Gamma(\delta)\Gamma(\delta)\cdots\Gamma(\delta)}\, w_1^{\delta-1} w_2^{\delta-1} \cdots w_G^{\delta-1} = \frac{\Gamma(G\delta)}{\Gamma(\delta)^G} \prod_{g=1}^{G} w_g^{\delta - 1}.$$


Usually one assumes that the means $\mu_g$ arise a priori from a $N(\xi, \tau^2)$, independently:
$$\pi(\mu_1, \ldots, \mu_G) = \prod_{g=1}^{G} \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left(-\frac{1}{2\tau^2}(\mu_g - \xi)^2\right).$$
Finally, we'll assume that the variances arise from an inverse gamma distribution, independently:
$$\pi(\sigma_1^2, \ldots, \sigma_G^2) = \prod_{g=1}^{G} \frac{\beta^\alpha}{\Gamma(\alpha)}\, (\sigma_g^2)^{-(\alpha+1)} \exp\!\left(-\frac{\beta}{\sigma_g^2}\right).$$

\begin{align*}
\pi(\theta, z \,|\, x) &\propto \pi(x, z|\theta)\,\pi(\theta) \\
&\propto \prod_{g=1}^{G} \underbrace{w_g^{n_g}\, (2\pi\sigma_g^2)^{-n_g/2} \exp\!\left(-\frac{1}{2\sigma_g^2}\sum_{i:\, z_i = g}(x_i - \mu_g)^2\right)}_{\text{likelihood}}\; \underbrace{w_g^{\delta-1}}_{\substack{\text{prior} \\ \text{weights}}}\; \underbrace{\exp\!\left(-\frac{1}{2\tau^2}(\mu_g - \xi)^2\right)}_{\text{prior means}} \\
&\qquad \times \underbrace{(\sigma_g^2)^{-(\alpha+1)} \exp\!\left(-\frac{\beta}{\sigma_g^2}\right)}_{\text{prior variances}}
\end{align*}

The next step in implementing a Gibbs sampler for this model is to derive the full conditionals. We want to iteratively sample the labels, weights, means and variances.

Labels full conditional:
$$\mathbb{P}(z_i = k \,|\, \text{everything else}) \propto w_k\, (2\pi\sigma_k^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma_k^2}(x_i - \mu_k)^2\right) \propto \frac{w_k}{\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^2}(x_i - \mu_k)^2\right).$$
We compute this for each value of $k = 1, \ldots, G$, then renormalise to get a discrete distribution for the label, which we can sample from.
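The compute-then-renormalise step can be sketched as follows, for a hypothetical three-component setting (the function names and parameter values are ours, not from the notes):

```python
import math
import random

# Unnormalised label full conditional: (w_k / sigma_k) * exp(-(x_i - mu_k)^2 / (2 sigma_k^2)),
# renormalised to a proper discrete distribution over k = 0, ..., G-1.
def label_probs(x_i, w, mu, sigma2):
    un = [w[k] / math.sqrt(sigma2[k]) * math.exp(-(x_i - mu[k]) ** 2 / (2 * sigma2[k]))
          for k in range(len(w))]
    s = sum(un)
    return [u / s for u in un]

# Sample a label from the renormalised distribution by inverting the CDF.
def sample_label(x_i, w, mu, sigma2, rng):
    probs = label_probs(x_i, w, mu, sigma2)
    u, cum = rng.random(), 0.0
    for k, pk in enumerate(probs):
        cum += pk
        if u < cum:
            return k
    return len(probs) - 1

rng = random.Random(3)
w, mu, sigma2 = [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [1.0, 1.0, 0.5]
p = label_probs(2.9, w, mu, sigma2)
k = sample_label(2.9, w, mu, sigma2, rng)
```

For an observation near $\mu_3 = 3$, almost all of the renormalised mass falls on the third component, as expected.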

Weights full conditional:
$$\pi(w_1, \ldots, w_G \,|\, \text{everything else}) \propto \prod_{g=1}^{G} w_g^{n_g + \delta - 1},$$
which is the form of a Dirichlet$(n_1 + \delta,\, n_2 + \delta,\, \ldots,\, n_G + \delta)$ distribution.

\begin{align*}
\pi(\mu_g \,|\, \text{everything else}) &\propto \exp\!\left(-\frac{1}{2\sigma_g^2}\sum_{i:\, z_i = g}(x_i - \mu_g)^2 - \frac{1}{2\tau^2}(\mu_g - \xi)^2\right) \\
&\propto \exp\!\left(-\frac{1}{2}\left[\left(\frac{n_g}{\sigma_g^2} + \frac{1}{\tau^2}\right)\mu_g^2 - 2\left(\frac{\sum_{i:\, z_i = g} x_i}{\sigma_g^2} + \frac{\xi}{\tau^2}\right)\mu_g\right]\right) \\
&\propto \exp\!\left(-\frac{\frac{n_g}{\sigma_g^2} + \frac{1}{\tau^2}}{2}\left[\mu_g - \frac{\frac{\sum_{i:\, z_i = g} x_i}{\sigma_g^2} + \frac{\xi}{\tau^2}}{\frac{n_g}{\sigma_g^2} + \frac{1}{\tau^2}}\right]^2\right)
\end{align*}


So the full conditional for $\mu_g$ is
$$N\!\left(\frac{\frac{\sum_{i:\, z_i = g} x_i}{\sigma_g^2} + \frac{\xi}{\tau^2}}{\frac{n_g}{\sigma_g^2} + \frac{1}{\tau^2}},\;\; \frac{1}{\frac{n_g}{\sigma_g^2} + \frac{1}{\tau^2}}\right).$$

Finally, the full conditional for $\sigma_g^2$ is
$$\pi(\sigma_g^2 \,|\, \text{everything else}) \propto (\sigma_g^2)^{-(n_g/2 + \alpha + 1)} \exp\!\left(-\frac{1}{\sigma_g^2}\left[\frac{1}{2}\sum_{i:\, z_i = g}(x_i - \mu_g)^2 + \beta\right]\right),$$
which is an inverse gamma distribution,
$$\text{Inv-Gamma}\!\left(\frac{n_g}{2} + \alpha,\;\; \frac{1}{2}\sum_{i:\, z_i = g}(x_i - \mu_g)^2 + \beta\right).$$
End Example
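The same mechanics apply to the earlier single-component normal model: with $\mu \sim N(\xi, \tau^2)$ and $\sigma^2 \sim$ Inv-Gamma$(\alpha, \beta)$, the full conditionals are the $G = 1$ cases of the $\mu_g$ and $\sigma_g^2$ conditionals above. A two-block Gibbs sketch in pure Python (helper names and test data are ours):

```python
import random

# Two-block Gibbs sampler for mu, sigma2 under mu ~ N(xi, tau2),
# sigma2 ~ Inv-Gamma(alpha, beta); data x_1, ..., x_n ~ N(mu, sigma2).
def gibbs_normal(x, xi, tau2, alpha, beta, iters=5000, seed=4):
    rng = random.Random(seed)
    n, sx = len(x), sum(x)
    mu, sigma2 = 0.0, 1.0                       # arbitrary starting values theta^(0)
    chain = []
    for _ in range(iters):
        # mu | sigma2, x ~ N(m, 1/prec) with prec = n/sigma2 + 1/tau2
        prec = n / sigma2 + 1.0 / tau2
        m = (sx / sigma2 + xi / tau2) / prec
        mu = rng.gauss(m, (1.0 / prec) ** 0.5)
        # sigma2 | mu, x ~ Inv-Gamma(n/2 + alpha, 0.5*sum((x_i - mu)^2) + beta)
        rate = 0.5 * sum((v - mu) ** 2 for v in x) + beta
        sigma2 = 1.0 / rng.gammavariate(n / 2 + alpha, 1.0 / rate)  # scale = 1/rate
        chain.append((mu, sigma2))
    return chain

data_rng = random.Random(0)
x = [data_rng.gauss(1.0, 1.0) for _ in range(50)]
chain = gibbs_normal(x, xi=0.0, tau2=100.0, alpha=2.0, beta=2.0)
post_mu = sum(m for m, _ in chain[1000:]) / (len(chain) - 1000)  # discard burn-in
```

With a vague prior ($\tau^2 = 100$), the posterior mean of $\mu$ should land close to the sample mean of the data.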

6.7 The Metropolis-Hastings Algorithm

This algorithm constructs a Markov chain $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}, \ldots$ by defining the transition probability from $\theta^{(t)}$ to $\theta^{(t+1)}$ as follows.

Let $q(\theta, \theta')$ denote a proposal distribution such that if $\theta = \theta^{(t)}$, then $\theta'$ is a proposed next value for the chain, i.e. $\theta'$ is a proposed value for $\theta^{(t+1)}$. However, a further randomization then takes place: with some probability $\alpha(\theta, \theta')$ we accept the proposal, setting $\theta^{(t+1)} = \theta'$; otherwise $\theta^{(t+1)} = \theta^{(t)}$. This construction defines a Markov chain with transition probabilities given by
$$p(\theta, \theta') = q(\theta, \theta')\,\alpha(\theta, \theta') + I(\theta' = \theta)\left[1 - \int q(\theta, \theta'')\,\alpha(\theta, \theta'')\,d\theta''\right]$$
where $I(\cdot)$ is an indicator function. If we now set
$$\alpha(\theta, \theta') = \min\left\{1,\; \frac{\pi(\theta'|x)\,q(\theta', \theta)}{\pi(\theta|x)\,q(\theta, \theta')}\right\},$$
then one can show that
$$\pi(\theta|x)\,p(\theta, \theta') = \pi(\theta'|x)\,p(\theta', \theta).$$
This is called the detailed balance condition, and it is a sufficient condition to ensure that $\pi(\theta|x)$ is the stable distribution of the chain. Note that only the functional form of the posterior is required: the intractable normalising constant $\pi(x)$ cancels in the acceptance ratio.

In practice we often take $q(\theta, \theta')$ to be a normal distribution $N(\theta, \sigma^2_{\text{prop}} I)$, where $I$ is the identity matrix. The behaviour of the chain depends on the value of $\sigma^2_{\text{prop}}$; generally we tune it to give an acceptance rate of roughly 25-40%.
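A minimal random-walk Metropolis sketch. As an illustrative assumption (not from the notes), the target is an unnormalised standard normal; since the normal proposal is symmetric, $q$ cancels in the acceptance ratio:

```python
import math
import random

# Random-walk Metropolis-Hastings: propose theta' ~ N(theta, sigma_prop^2),
# accept with probability min(1, pi(theta')/pi(theta)); symmetric q cancels.
def metropolis_hastings(log_target, theta0, sigma_prop, iters, seed=5):
    rng = random.Random(seed)
    theta, chain, accepted = theta0, [], 0
    for _ in range(iters):
        prop = rng.gauss(theta, sigma_prop)
        if math.log(rng.random()) < log_target(prop) - log_target(theta):
            theta, accepted = prop, accepted + 1
        chain.append(theta)
    return chain, accepted / iters

# Only the functional form of the target is needed (normalising constant omitted).
log_target = lambda t: -0.5 * t * t
chain, acc_rate = metropolis_hastings(log_target, theta0=3.0, sigma_prop=2.4, iters=50_000)
mean = sum(chain) / len(chain)
```

With proposal standard deviation around 2.4 on this target, the acceptance rate typically falls in the tuning range mentioned above, and the chain mean settles near the target mean of zero.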


7 Spatial Processes

A Tutorials

A.1 Tutorial 1

Problems 1

1. Five white and five black balls, with two urns of five balls each. Let $X_n = $ the number of white balls in the left urn. At each step we pick one ball at random from each urn and drop it into the other urn.
$$\mathbb{P}(X_{n+1} = i+1 \,|\, X_n = i) = \frac{5-i}{5} \times \frac{5-i}{5} = \frac{(5-i)^2}{25}$$
(take white from the right urn and black from the left);
$$\mathbb{P}(X_{n+1} = i-1 \,|\, X_n = i) = \frac{5-(5-i)}{5} \times \frac{i}{5} = \frac{i}{5} \times \frac{i}{5} = \frac{i^2}{25}$$
(take black from the right urn and white from the left);
$$\mathbb{P}(X_{n+1} = i \,|\, X_n = i) = \frac{i}{5} \times \frac{5-i}{5} + \frac{5-i}{5} \times \frac{i}{5} = \frac{2i(5-i)}{25}$$
(take two balls of the same colour from the urns). Thus, we have the transition probabilities of $X_n$.
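These transition probabilities can be tabulated and checked with exact arithmetic; a small sketch (the 0-5 state labelling is ours):

```python
from fractions import Fraction as F

# Transition matrix for X_n = number of white balls in the left urn (states 0..5),
# using p(i,i+1) = (5-i)^2/25, p(i,i-1) = i^2/25, p(i,i) = 2i(5-i)/25.
P = [[F(0)] * 6 for _ in range(6)]
for i in range(6):
    if i < 5:
        P[i][i + 1] = F((5 - i) ** 2, 25)   # gain a white ball on the left
    if i > 0:
        P[i][i - 1] = F(i ** 2, 25)         # lose a white ball on the left
    P[i][i] = F(2 * i * (5 - i), 25)        # composition unchanged
row_sums = [sum(row) for row in P]
```

Every row sums to one exactly, confirming the three cases exhaust the possibilities.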

Extension: the Bernoulli-Laplace model of diffusion. There are $b$ black balls and $2m - b$ white balls, drawn in the same way, but with $m$ balls in each urn. Let $X_n = $ the number of black balls in the left urn.
$$\mathbb{P}(X_{n+1} = i+1 \,|\, X_n = i) = \frac{m-i}{m} \cdot \frac{b-i}{m} = \frac{(m-i)(b-i)}{m^2}$$
(draw black from the right urn, which holds $b - i$ of the $b$ black balls, and white from the left).

Exercise: find $\mathbb{P}(X_{n+1} = i-1 \,|\, X_n = i)$ and $\mathbb{P}(X_{n+1} = i \,|\, X_n = i)$.


2. (Gambler's ruin, $N = 4$.)
$$p(i, i+1) = 0.4, \qquad p(i, i-1) = 0.6$$
Stop if $i$ reaches $4$ ($p(4,4) = 1$) or if $i$ reaches $0$ ($p(0,0) = 1$). Since the games are independent,
$$p^3(1, 4) = (0.4)^3 = 0.064$$
(the path $1 \to 2 \to 3 \to 4$, i.e. three wins), and
$$p^3(1, 0) = 0.6 + 0.4 \times (0.6)^2 = 0.744$$
(either lose immediately, $1 \to 0$, and remain absorbed, or follow $1 \to 2 \to 1 \to 0$: win, lose, lose).

Alternative method:
$$P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.6 & 0 & 0.4 & 0 & 0 \\ 0 & 0.6 & 0 & 0.4 & 0 \\ 0 & 0 & 0.6 & 0 & 0.4 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
Compute $P^3$, the matrix of 3-step transition probabilities, and simply read off the required values.
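The alternative method can be carried out directly; a minimal sketch:

```python
# 3-step transition probabilities for the gambler's-ruin chain via P^3.
P = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P3 = matmul(matmul(P, P), P)
# Reading off p^3(1, 0) and p^3(1, 4) reproduces the path-counting values.
```

The entries $P^3[1][0]$ and $P^3[1][4]$ match the $0.744$ and $0.064$ obtained by enumerating paths.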

3. General two-state chain, with state space $S = \{1, 2\}$ and
$$P = \begin{bmatrix} 1-a & a \\ b & 1-b \end{bmatrix}.$$
Use the Markov property to show that
$$\mathbb{P}(X_{n+1} = 1) - \frac{b}{a+b} = (1 - a - b)\left[\mathbb{P}(X_n = 1) - \frac{b}{a+b}\right].$$
Now,
\begin{align*}
\mathbb{P}(X_{n+1} = 1) &= \mathbb{P}(X_n = 1)\,\mathbb{P}(X_{n+1} = 1 \,|\, X_n = 1) + \mathbb{P}(X_n = 2)\,\mathbb{P}(X_{n+1} = 1 \,|\, X_n = 2) \\
&= \mathbb{P}(X_n = 1)(1 - a) + \mathbb{P}(X_n = 2)\,b \\
&= \mathbb{P}(X_n = 1)(1 - a) + \big(1 - \mathbb{P}(X_n = 1)\big)\,b \\
&= (1 - a - b)\,\mathbb{P}(X_n = 1) + b,
\end{align*}
so that
\begin{align*}
\mathbb{P}(X_{n+1} = 1) - \frac{b}{a+b} &= (1 - a - b)\,\mathbb{P}(X_n = 1) + \frac{b(a+b) - b}{a+b} \\
&= (1 - a - b)\left[\mathbb{P}(X_n = 1) - \frac{b}{a+b}\right].
\end{align*}
Iterating,
\begin{align*}
\mathbb{P}(X_n = 1) - \frac{b}{a+b} &= (1 - a - b)\left[\mathbb{P}(X_{n-1} = 1) - \frac{b}{a+b}\right] \\
&= (1 - a - b)^2\left[\mathbb{P}(X_{n-2} = 1) - \frac{b}{a+b}\right] \\
&= \cdots = (1 - a - b)^n\left[\mathbb{P}(X_0 = 1) - \frac{b}{a+b}\right],
\end{align*}
and hence
$$\lim_{n \to \infty} \mathbb{P}(X_n = 1) = \frac{b}{a+b} + \lim_{n \to \infty} (1 - a - b)^n\left[\mathbb{P}(X_0 = 1) - \frac{b}{a+b}\right].$$
If $|1 - a - b| < 1 \iff 0 < a + b < 2$, then
$$\lim_{n \to \infty} \mathbb{P}(X_n = 1) = \frac{b}{a+b}.$$
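The recursion and its limit are easy to check numerically; a sketch with arbitrary illustrative values of $a$ and $b$:

```python
# Iterate p_{n+1} = (1 - a - b) p_n + b and compare with the limit b/(a+b).
a, b = 0.3, 0.5
p = 1.0                      # P(X_0 = 1)
for _ in range(100):
    p = (1 - a - b) * p + b

limit = b / (a + b)
```

Since $|1 - a - b| = 0.2 < 1$ here, the geometric factor $(1-a-b)^n$ is negligible after 100 steps and $p$ agrees with $b/(a+b)$ to machine precision.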

A.2 Tutorial 2

Problems 2

1. $$\pi_1 = \left(\frac{11}{47},\, \frac{19}{47},\, \frac{17}{47}\right), \qquad \pi_2 = \left(\frac{1}{3},\, \frac{1}{3},\, \frac{1}{3}\right)$$

2. $$\pi_1 = (0.4,\, 0.6), \qquad \pi_2 = \left(\frac{6}{35},\, \frac{7}{35},\, \frac{22}{35}\right)$$


3. A machine is subject to shocks of types $i = 1, 2, 3$, arriving as independent Poisson processes with rates $\lambda_i$. Part 1 fails on a shock of type 1 or 3; part 2 fails on a shock of type 2 or 3. Let $U$ and $V$ be the failure times of parts 1 and 2, respectively.

(a) $\mathbb{P}(U > s, V > t)$. For $U > s$ and $V > t$ we need:

i. no shocks of type 1 before time $s$;
ii. no shocks of type 2 before time $t$;
iii. no shocks of type 3 before time $\max(t, s)$.

Shocks arrive according to a Poisson process, so the time to first arrival of type $i$ is exponential with rate $\lambda_i$. Recall that if $T \sim \exp(\lambda)$ then $F_T(t) = \mathbb{P}(T \le t) = 1 - e^{-\lambda t}$ and $S_T(t) = \mathbb{P}(T > t) = e^{-\lambda t}$. Hence i., ii., iii. have probabilities $e^{-\lambda_1 s}$, $e^{-\lambda_2 t}$ and $e^{-\lambda_3 \max(t,s)}$ respectively, and by independence
$$\mathbb{P}(U > s, V > t) = e^{-\lambda_1 s}\, e^{-\lambda_2 t}\, e^{-\lambda_3 \max(t,s)} = e^{-\lambda_1 s - \lambda_2 t - \lambda_3 \max(t,s)}.$$

(b) $U$ and $V$ are times to first arrival in (merged) Poisson processes, hence exponential:
$$U \sim \text{exponential}(\lambda_1 + \lambda_3), \qquad V \sim \text{exponential}(\lambda_2 + \lambda_3).$$

(c) Are $U$ and $V$ independent? If they were, then $f_{U,V}(s, t) = f_U(s)\,f_V(t)$, and so
\begin{align*}
\mathbb{P}(U > s, V > t) &= \int_s^\infty \int_t^\infty f_{U,V}(s_1, t_1)\,ds_1\,dt_1 \\
&= \int_s^\infty \int_t^\infty (\lambda_1 + \lambda_3)\,e^{-(\lambda_1+\lambda_3)s_1}\,(\lambda_2 + \lambda_3)\,e^{-(\lambda_2+\lambda_3)t_1}\,ds_1\,dt_1 \\
&= (\lambda_1 + \lambda_3)(\lambda_2 + \lambda_3)\left[\int_s^\infty e^{-(\lambda_1+\lambda_3)s_1}\,ds_1\right]\left[\int_t^\infty e^{-(\lambda_2+\lambda_3)t_1}\,dt_1\right] \\
&= \left[-e^{-(\lambda_1+\lambda_3)s_1}\right]_s^\infty \left[-e^{-(\lambda_2+\lambda_3)t_1}\right]_t^\infty \\
&= e^{-(\lambda_1+\lambda_3)s}\, e^{-(\lambda_2+\lambda_3)t} \\
&= e^{-\lambda_1 s - \lambda_2 t - \lambda_3 (s+t)} \\
&\ne e^{-\lambda_1 s - \lambda_2 t - \lambda_3 \max(t,s)} = \mathbb{P}(U > s, V > t).
\end{align*}
So $U$ and $V$ are not independent.

4. $T_1, \ldots, T_n$ are independent exponentials with rates $\lambda_1, \ldots, \lambda_n$.

(a) Show $T^* = \min(T_1, \ldots, T_n) \sim \exp\!\left(\sum_{j=1}^{n} \lambda_j\right)$.

The distribution of $T^*$ is determined by $F_{T^*}(t) = \mathbb{P}(T^* \le t)$. Consider $\mathbb{P}(T^* > t)$: if $T^* > t$, then each $T_j$ must be greater than $t$, so
$$\mathbb{P}(T^* > t) = \prod_{j=1}^{n} \mathbb{P}(T_j > t) = \prod_{j=1}^{n} e^{-\lambda_j t} = e^{-\left(\sum_{j=1}^{n}\lambda_j\right)t},$$
which is the survival function of an exponential random variable with rate $\sum_{j=1}^{n}\lambda_j$.


(b) Show
$$\mathbb{P}(T_i < T_j) = \frac{\lambda_i}{\lambda_i + \lambda_j}, \qquad i \ne j.$$
\begin{align*}
\mathbb{P}(T_i < T_j) &= \int_0^\infty f_{T_i}(t)\,\mathbb{P}(T_j > t)\,dt \\
&= \int_0^\infty \lambda_i\, e^{-\lambda_i t - \lambda_j t}\,dt \\
&= \lambda_i \int_0^\infty e^{-(\lambda_i + \lambda_j)t}\,dt \\
&= \frac{\lambda_i}{\lambda_i + \lambda_j} \int_0^\infty (\lambda_i + \lambda_j)\,e^{-(\lambda_i+\lambda_j)t}\,dt \qquad \text{(density of } \exp(\lambda_i + \lambda_j)\text{)} \\
&= \frac{\lambda_i}{\lambda_i + \lambda_j} \times 1 = \frac{\lambda_i}{\lambda_i + \lambda_j}.
\end{align*}

(c) With $T_1, \ldots, T_n$ as above, show
$$\mathbb{P}\big(T_i = \min(T_1, \ldots, T_n)\big) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}.$$
\begin{align*}
\mathbb{P}\big(T_i = \min(T_1, \ldots, T_n)\big) &= \int_0^\infty f_{T_i}(t) \prod_{j \ne i} \mathbb{P}(T_j > t)\,dt \\
&= \int_0^\infty \lambda_i\, e^{-\lambda_i t} \prod_{j \ne i} e^{-\lambda_j t}\,dt \\
&= \lambda_i \int_0^\infty e^{-\left(\sum_{j=1}^{n}\lambda_j\right)t}\,dt \\
&= \lambda_i \left[-\frac{1}{\sum_{j=1}^{n}\lambda_j}\, e^{-\left(\sum_j \lambda_j\right)t}\right]_0^\infty = \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j}.
\end{align*}
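Parts (a)-(c) together describe an exponential race in which clock $i$ wins with probability proportional to its rate; simulation confirms this (pure standard library, with illustrative rates of our choosing):

```python
import random

# Simulate which of three independent exponential clocks rings first,
# and compare the winning frequencies with lambda_i / sum(lambda_j).
random.seed(6)
lam = [1.0, 2.0, 3.0]
N = 100_000
wins = [0, 0, 0]
for _ in range(N):
    times = [random.expovariate(l) for l in lam]
    wins[times.index(min(times))] += 1

est = [w / N for w in wins]
theory = [l / sum(lam) for l in lam]   # 1/6, 2/6, 3/6
```

Each estimated frequency should match its theoretical value to within Monte Carlo error.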

5. $X_1 \sim$ Poisson$(\mu_1)$ and $X_2 \sim$ Poisson$(\mu_2)$, independent. Show $X_1 + X_2 \sim$ Poisson$(\mu_1 + \mu_2)$.
\begin{align*}
\mathbb{P}(X_1 + X_2 = k) &= \sum_{m=0}^{k} \mathbb{P}(X_1 = m)\,\mathbb{P}(X_2 = k - m) \\
&= \sum_{m=0}^{k} \frac{\mu_1^m e^{-\mu_1}}{m!} \cdot \frac{\mu_2^{k-m} e^{-\mu_2}}{(k-m)!} \\
&= e^{-(\mu_1+\mu_2)} \sum_{m=0}^{k} \frac{1}{m!\,(k-m)!}\, \mu_1^m \mu_2^{k-m} \\
&= \frac{e^{-(\mu_1+\mu_2)}}{k!} \sum_{m=0}^{k} \frac{k!}{m!\,(k-m)!}\, \mu_1^m \mu_2^{k-m} \\
&= \frac{e^{-(\mu_1+\mu_2)}}{k!} \sum_{m=0}^{k} \binom{k}{m} \mu_1^m \mu_2^{k-m} \\
&= \frac{e^{-(\mu_1+\mu_2)}}{k!}\, (\mu_1 + \mu_2)^k \qquad \text{(binomial theorem)},
\end{align*}
which is the probability mass function of Poisson$(\mu_1 + \mu_2)$.
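The convolution identity can be verified term by term for any fixed $k$; a small sketch (the parameter values are ours):

```python
import math

def pois_pmf(k, mu):
    # Poisson(mu) probability mass function at k.
    return mu ** k * math.exp(-mu) / math.factorial(k)

mu1, mu2, k = 1.3, 2.2, 4

# Convolution sum from the derivation vs. the Poisson(mu1 + mu2) pmf directly.
conv = sum(pois_pmf(m, mu1) * pois_pmf(k - m, mu2) for m in range(k + 1))
direct = pois_pmf(k, mu1 + mu2)
```

The two quantities agree to floating-point precision, as the binomial-theorem step guarantees.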

6. Later

7. Later
