Probability Model / Applied Analysis 2019
by
Reika Fukuizumi
Probability models are essential in the mathematical analysis of random phenomena. In these lectures, we focus on Markov chains as basic models of random time evolution. Starting with fundamental concepts in probability theory (random variables, probability distributions, etc.), we study the fundamentals of Markov chains (transition probability, recurrence, stationary distributions, etc.). Background knowledge of elementary probability is required.
The lectures will follow the book "Essence of Probability Models" by Nobuaki Obata (published by Makino Shoten in 2012, in Japanese). Other references, including English and French texts, are for example:
1. W. Feller, "An Introduction to Probability Theory and Its Applications," Vol. 1, Wiley, 1957.
2. R. Durrett, "Probability: Theory and Examples," fourth edition, Cambridge University Press, 2010.
3. T. Bodineau, "PROMENADE ALÉATOIRE : Chaînes de Markov et martingales," MAP432, École Polytechnique, 2013.
To obtain this course's credit, you are required to choose by yourself five problems among the problems given during the lectures, and to submit answers to the five problems as a report.
The report should be handed in to the report box beside the educational affairs office of GSIS. Deadline: January 27, 2020.
No exception will be made.
Sections 1-4 of this course consist of a review of what we have already learned in the undergraduate program (at least at Tohoku University we offer a course "Mathematical Statistics").
Updated: Dec. 20, 2019. Email: [email protected]
1. Probability Spaces and Random Variables
Ω: the sample space, consisting of elementary events (or sample points).
F: the set of events.
Definition (Probability). A function P : A ∈ F ↦ P(A) ∈ [0, 1] is said to be a probability on F (or with domain F) if the following (P1)-(P3) are satisfied:

(P1) 0 ≤ P(A) ≤ 1 for any A ∈ F.
(P2) P(∅) = 0 and P(Ω) = 1.
(P3) For A1, A2, ... ∈ F (an infinite sequence) with Aj ∩ Ak = ∅ whenever j ≠ k,

P(⋃_{n=1}^∞ An) = ∑_{n=1}^∞ P(An).
Examples: Coin toss, dice throwing, random cut.
Definition (Probability Space). Let Ω be a non-empty set, and P be a probability onF . We call (Ω,F ,P) a probability space.
Theorem 1.1. Let A1, A2, ... be a sequence of events.

(1) If A1 ⊂ A2 ⊂ A3 ⊂ · · · , then

P(⋃_{n=1}^∞ An) = lim_{n→∞} P(An).

(2) If A1 ⊃ A2 ⊃ A3 ⊃ · · · , then

P(⋂_{n=1}^∞ An) = lim_{n→∞} P(An).
Definition (Discrete random variables). A random variable X is called discrete if the number of values that X takes is finite or countably infinite. To be more precise, for a discrete random variable X there exist a (finite or infinite) sequence of real numbers a1, a2, ... and corresponding nonnegative numbers p1, p2, ... such that

P(X = ai) = pi, pi ≥ 0, ∑_i pi = 1.

In this case,

µX := ∑_i pi δ_{ai}

is called the (probability) distribution of X. Here, for a Borel set B ⊂ R,

δa(B) = 1 if a ∈ B, and 0 otherwise,

is the Dirac measure at a ∈ R. Obviously,

P(a ≤ X ≤ b) = ∑_{i: a ≤ ai ≤ b} pi.
Examples. Coin toss, Waiting time.
Definition (Continuous random variables). A random variable X is called continuous if P(X = a) = 0 for all a ∈ R. If there exists a function f(x) such that

P(a ≤ X ≤ b) = ∫_a^b f(x) dx, a < b, a, b ∈ R,

we say that X admits a probability density function f(x), and denote f(x) by fX(x). Note that

∫_{−∞}^{+∞} fX(x) dx = 1, fX(x) ≥ 0.

In this case,

µX(dx) := fX(x) dx

is called the (probability) distribution of X.
It is useful to consider the distribution function

FX(x) := P(X ∈ (−∞, x]) = ∫_{−∞}^x fX(t) dt, x ∈ R.

Then, if FX is continuous and piecewise differentiable, we have

fX(x) = (d/dx) FX(x).
Remark.
(1) A continuous random variable does not necessarily admit a probability density function. But many continuous random variables in practical applications admit probability density functions.
(2) There is a random variable which is neither discrete nor continuous. But most random variables in practical applications are either discrete or continuous.
Examples. Random cut.
Definition (Mean value). The mean or expectation value of a random variable X is defined by

m = E[X] := ∫_{−∞}^{+∞} x µX(dx)
  = ∑_i ai pi, if X is discrete,
  = ∫_{−∞}^{+∞} x fX(x) dx, if X is continuous and admits a probability density function fX(x).
Proposition 1.1. For a (measurable) function ϕ(x) we have

E[ϕ(X)] = ∫_{−∞}^{+∞} ϕ(x) µX(dx).

For example,

• (m-th moment) E[X^m] = ∫_{−∞}^{+∞} x^m µX(dx).
• (characteristic function) E[e^{itX}] = ∫_{−∞}^{+∞} e^{itx} µX(dx), t ∈ R.
Definition (Variance). The variance of a random variable X is defined by

σ² = V[X] = E[(X − E[X])²] = E[X²] − E[X]²,

i.e.

σ² = V[X] = ∫_{−∞}^{+∞} (x − E[X])² µX(dx) = ∫_{−∞}^{+∞} x² µX(dx) − (∫_{−∞}^{+∞} x µX(dx))².
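As a quick numerical illustration (not part of the original notes), the defining sums for m and σ² can be evaluated directly; the fair die used here is a hypothetical example.

```python
# Mean and variance of a discrete distribution computed from the defining sums
# m = sum_i a_i p_i and sigma^2 = sum_i a_i^2 p_i - m^2.
# The fair die (a_i = 1..6, p_i = 1/6) is a hypothetical example.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

m = sum(a * p for a, p in zip(values, probs))
var = sum(a * a * p for a, p in zip(values, probs)) - m * m
```

For the die this gives m = 7/2 and σ² = 35/12, consistent with the definitions above.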
Examples. Waiting time, random cut.
2. Probability Distributions
We introduce some classical examples of one-dimensional distributions.
• Discrete distributions

– Bernoulli distribution
For 0 ≤ p ≤ 1, the distribution (1 − p)δ0 + pδ1 is called the Bernoulli distribution with success probability p.

m = p, σ² = p(1 − p)
– Binomial distribution B(n, p)
For 0 ≤ p ≤ 1 and n ≥ 1, the distribution

∑_{k=0}^n C(n, k) p^k (1 − p)^{n−k} δk,

where C(n, k) denotes the binomial coefficient, is called the binomial distribution. The quantity C(n, k) p^k (1 − p)^{n−k} is the probability that n coin tosses, with probability p for heads and q = 1 − p for tails, result in k heads and n − k tails.

m = np, σ² = np(1 − p)
[Figure: binomial distributions; one panel with size n = 40 fixed and p = 0.3, 0.5, 0.7; one panel with success probability p = 0.5 fixed and sizes n = 20, 40.]
– Geometric distribution
For 0 ≤ p ≤ 1, the distribution

∑_{k=1}^{+∞} p(1 − p)^{k−1} δk

is called the geometric distribution with success probability p.

m = 1/p, σ² = (1 − p)/p²
These follow from the computation of the probability generating function:

G(z) = ∑_{k=1}^∞ p(1 − p)^{k−1} z^k = pz/(1 − (1 − p)z),

G′(z) = p/(1 − (1 − p)z)², G″(z) = 2p(1 − p)/(1 − (1 − p)z)³,

m = G′(1) = 1/p, σ² = G″(1) + G′(1) − G′(1)² = (1 − p)/p².
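These values can also be checked numerically (a sketch, not in the original notes) by truncating the series defining m and σ²; the value p = 0.3 is arbitrary.

```python
# Truncated-series check of m = 1/p and sigma^2 = (1-p)/p^2 for the
# geometric distribution sum_k p(1-p)^{k-1} delta_k; p = 0.3 is arbitrary.
p = 0.3
ks = range(1, 2000)  # the tail beyond k = 2000 is negligible
m = sum(k * p * (1 - p) ** (k - 1) for k in ks)
second_moment = sum(k * k * p * (1 - p) ** (k - 1) for k in ks)
var = second_moment - m * m
```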
[Figure: geometric distributions with success probability p = 0.1, 0.25, 0.5, 0.75.]
Remark. In some of the literature, the geometric distribution with parameter p is defined by

∑_{k=0}^{+∞} p(1 − p)^k δk.

In this case, the mean is (1 − p)/p and the variance is (1 − p)/p².
– Poisson distribution
For λ > 0 the distribution

∑_{k=0}^{+∞} e^{−λ} (λ^k / k!) δk

is called the Poisson distribution with parameter λ.

m = λ, σ² = λ
[Figure: Poisson distributions with parameter λ = 1, 4, 10.]
Problem 1. Find the mean value and variance of the discrete distributions introduced above.
• Continuous distributions

– Uniform distribution
For a finite interval [a, b], the function

f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise,

becomes a density function, which determines the uniform distribution on [a, b].
m = ∫_a^b x dx/(b − a) = (a + b)/2, σ² = ∫_a^b x² dx/(b − a) − m² = (b − a)²/12.
[Figure: uniform distributions on [−1, 1], [0, 1], [1, 4].]
– Exponential distribution
The exponential distribution with parameter λ > 0 is defined by the density function

f(x) = λe^{−λx} for x ≥ 0, and 0 otherwise.

The mean value and variance are

m = 1/λ, σ² = 1/λ².
[Figure: exponential distributions with rate λ = 1, 5, 10.]
– Normal distribution N(m, σ²)
For m ∈ R and σ > 0, we see that

f(x) = (1/√(2πσ²)) exp(−(x − m)²/(2σ²))

becomes a density function. The distribution defined by this density function is called the normal distribution or Gaussian distribution. The case m = 0 and σ² = 1, i.e. N(0, 1), is called the standard normal distribution or the standard Gaussian distribution.
[Figure: normal densities; one panel with variance 1 and means m = 0, 1, 2; one panel with mean 0 and standard deviations sd = 1, 2, 3.]
Recall:

∫_0^{+∞} e^{−tx²} dx = √π/(2√t).

Thus,

(1/√(2πσ²)) ∫_{−∞}^{+∞} x exp(−(x − m)²/(2σ²)) dx = m,

(1/√(2πσ²)) ∫_{−∞}^{+∞} (x − m)² exp(−(x − m)²/(2σ²)) dx = σ².
Problem 2. Choose randomly a point A from the disc with radius one and let X be the radius of the inscribed circle with center A.
(1) Find the probability P(X ≤ x), x ≥ 0.
(2) Find the probability density function fX(x) of X.
(3) Calculate the mean and variance of X.
(4) Calculate the mean and variance of the area of the inscribed circle: S = πX².
3. Independence and Dependence
• Independent events and conditional probability
Definition (Pairwise independence). A (finite or infinite) sequence of events A1, A2, ... is called pairwise independent if any pair of events Ai1, Ai2 (i1 ≠ i2) satisfies

P(Ai1 ∩ Ai2) = P(Ai1)P(Ai2).
Definition (Independence). A (finite or infinite) sequence of events A1, A2, ....is called independent if any choice of finitely many events Ai1 , ....., Ain(i1 < i2 <· · · < in) satisfies
P(Ai1 ∩ Ai2 ∩ · · · ∩ Ain) = P(Ai1)P(Ai2) · · ·P(Ain).
Example. Drawing randomly a card from a deck of 52 cards.
Remark. It is allowed to consider whether the sequence of events A, A is independent or not. If they are independent, by definition we have P(A) = P(A)P(A), from which P(A) = 0 or P(A) = 1 follows. Notice that P(A) = 0 does not imply A = ∅ (empty event). Similarly, P(A) = 1 does not imply A = Ω (whole event).
Definition (Conditional probability). For two events A, B, the conditional probability of A relative to B (or on the hypothesis B, or for given B) is defined by

P(A|B) = P(A ∩ B)/P(B), whenever P(B) > 0.
Theorem. Let A, B be events with P(A) > 0 and P(B) > 0. A and B are independent iff

P(A|B) = P(A), and P(B|A) = P(B).
• Independent random variables
Definition. A (finite or infinite) sequence of random variables X1, X2, ... is independent (resp. pairwise independent) if so is the sequence of events {X1 ≤ a1}, {X2 ≤ a2}, ... for any a1, a2, ... ∈ R.
In other words, a (finite or infinite) sequence of random variables X1, X2, ... is independent if for any finitely many Xi1, ..., Xin (i1 < i2 < ... < in) and constant numbers a1, ..., an, the joint probability factorizes:

P(Xi1 ≤ a1, Xi2 ≤ a2, ..., Xin ≤ an) = P(Xi1 ≤ a1)P(Xi2 ≤ a2) · · · P(Xin ≤ an). (0.1)
A similar assertion holds for pairwise independence. If the random variables X1, X2, ... are discrete, (0.1) may be replaced with

P(Xi1 = a1, Xi2 = a2, ..., Xin = an) = P(Xi1 = a1)P(Xi2 = a2) · · · P(Xin = an).
Example. Choose at random a point from the rectangle.
Problem 3.
– (1) A box contains four balls with numbers 112, 121, 211, 222. We draw a ball at random and let X1 be the first digit, X2 the second digit, and X3 the last digit. For i = 1, 2, 3 we define an event Ai by Ai = {Xi = 1}. Show that A1, A2, A3 are pairwise independent but not independent.
– (2) Two dice are tossed. Let A be the event that the first die gives a 4, B be the event that the sum is 6, and C be the event that the sum is 7. Calculate P(B|A) and P(C|A), and study the independence among A, B, C.
Example (Bernoulli trials). This is a model of coin-toss and is the most fundamental stochastic process. A sequence of random variables (or a discrete-time stochastic process) X1, X2, ..., Xn, ... is called the Bernoulli trials with success probability p (0 ≤ p ≤ 1) if they are independent and have the same distribution:

P(Xn = 1) = p, P(Xn = 0) = 1 − p.

By definition of independence, we have

P(X1 = a1, X2 = a2, ..., Xn = an) = ∏_{k=1}^n P(Xk = ak)

for all a1, ..., an ∈ {0, 1}.
In general, the quantity on the left-hand side is called a finite dimensional distribution of the stochastic process {Xn}. The total set of finite dimensional distributions characterizes a stochastic process.
• Covariance and correlation coefficients
Recall that the mean of a real-valued (1-dim) random variable X is defined by

m = E(X) = ∫_{−∞}^{+∞} x µX(dx).

If X = (X1, ..., Xn) ∈ Rⁿ, for a measurable function ϕ : Rⁿ → R,

E(ϕ(X)) = ∫_{Rⁿ} ϕ(x) µX(dx), dx = dx1 dx2 ... dxn.
Theorem. For two random variables X, Y and two constant numbers a, b it holdsthat
E(aX + bY ) = aE(X) + bE(Y ).
Theorem. If random variables X1, X2, ...., Xn are independent, we have
E(X1X2 · · ·Xn) = E(X1)E(X2) · · ·E(Xn).
Remark. E(XY) = E(X)E(Y) is not a sufficient condition for the random variables X and Y to be independent.
Definition (Covariance). The covariance of two random variables X, Y is defined by

Cov(X, Y) = σXY = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

In particular, σXX = σX² becomes the variance of X. The correlation coefficient of two random variables X, Y is defined by

ρXY = σXY/(σX σY)

whenever σX > 0 and σY > 0.
Definition. X, Y are called uncorrelated if σXY = 0. They are called positively(resp. negatively) correlated if σXY > 0 (resp. σXY < 0).
Theorem. If two random variables X, Y are independent, they are uncorrelated.
Remark. The converse of this Theorem is not true in general. (see Problem 5.)
Theorem. Let X1, X2, ..., Xn be independent random variables. Then

V[∑_{k=1}^n Xk] = ∑_{k=1}^n V(Xk).
Theorem. |ρXY | ≤ 1 for two random variables X, Y with σX > 0, σY > 0.
Problem 4. Throw two dice and let L be the larger spot and S the smaller. (If double spots, set L = S.)
– (1) Show the joint probability of (L, S) by a table.
– (2) Calculate the correlation coefficient ρLS and explain the meaning of the sign of ρLS.

Problem 5. Let X and Y be random variables such that

P(X = a) = p1, P(X = b) = q1 = 1 − p1,
P(Y = c) = p2, P(Y = d) = q2 = 1 − p2,

where a, b, c, d are constant numbers and 0 < p1 < 1, 0 < p2 < 1. Show that X, Y are independent if and only if σXY = 0. Explain the significance of this case. [Hint: In general, uncorrelated random variables are not necessarily independent.]
4. Limit Theorems
Let {Xk} be Bernoulli trials with success probability 1/2, and consider the binomial process defined by

Sn = ∑_{k=1}^n Xk.

Since Sn counts the number of heads during the first n trials,

Sn/n = (1/n) ∑_{k=1}^n Xk

gives the relative frequency of heads during the first n trials. The following figure shows 40 randomly chosen samples, illustrating that the relative frequency of heads Sn/n tends to 1/2. It is our question how to describe this phenomenon mathematically. A naive formula

lim_{n→∞} Sn/n = 1/2

is not acceptable.
[Figure: 40 sample paths of Sn/n for n up to 1000.]
Theorem (Weak law of large numbers). Let X1, X2, ... be identically distributed random variables with mean m and variance σ². (This means that Xi has a finite variance.) If X1, X2, ... are uncorrelated, for any ε > 0 we have

lim_{n→∞} P(|(1/n) ∑_{k=1}^n Xk − m| ≥ ε) = 0.

We say that (1/n) ∑_{k=1}^n Xk converges to m in probability.
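The convergence can be observed numerically; the following sketch (not part of the original notes) simulates coin tosses with p = 1/2, a hypothetical choice, and checks that the relative frequency is close to m = 1/2.

```python
import random

# Monte Carlo illustration of the law of large numbers for coin tosses:
# X_k in {0, 1} with p = 1/2, so (1/n) sum X_k should be close to m = 1/2.
# Sample size and seed are arbitrary choices for this sketch.
random.seed(1)
n = 200_000
heads = sum(random.random() < 0.5 for _ in range(n))
freq = heads / n
```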
Remark. In much of the literature the weak law of large numbers is stated under the assumption that X1, X2, ... are independent. It is noticeable that the same result holds under the weaker assumption of being uncorrelated.
Theorem (Chebyshev inequality). Let X be a random variable with mean m and variance σ². Then, for any ε > 0 we have

P(|X − m| ≥ ε) ≤ σ²/ε².
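A small empirical sketch of the inequality (not in the original notes): for X uniform on [0, 1] we have m = 1/2 and σ² = 1/12, and the simulated tail probability stays below the Chebyshev bound; ε = 0.4 and the sample size are arbitrary.

```python
import random

# Empirical illustration of Chebyshev's inequality for X uniform on [0, 1]:
# m = 1/2 and sigma^2 = 1/12, so P(|X - 1/2| >= eps) <= 1/(12 eps^2).
# The exact tail probability for eps = 0.4 is 0.2.
random.seed(0)
eps = 0.4
n = 100_000
hits = sum(abs(random.random() - 0.5) >= eps for _ in range(n))
empirical = hits / n
bound = (1 / 12) / eps ** 2
```

Note that the bound (about 0.52 here) can be far from the true tail probability (0.2); Chebyshev trades sharpness for generality.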
Theorem (Strong law of large numbers). Let X1, X2, ... be identically distributed random variables with mean m. (This means that Xi has a mean but is not assumed to have a finite variance.) If X1, X2, ... are pairwise independent, we have

P(lim_{n→∞} (1/n) ∑_{k=1}^n Xk = m) = 1.

In other words,

lim_{n→∞} (1/n) ∑_{k=1}^n Xk = m, a.s.
Remark. Kolmogorov proved the strong law of large numbers under the assumption that X1, X2, ... are independent. In much of the literature, the strong law of large numbers is stated in Kolmogorov's form. The Theorem above is due to N. Etemadi (1981), where the assumption is relaxed to pairwise independence and the proof is more elementary.
Now, consider independent identically distributed random variables X1, X2, ... with mean m. Let a > m, and take ε = a − m in the weak law of large numbers. Then

lim_{n→∞} P((1/n) ∑_{k=1}^n Xk ≥ a) = 0.
In fact, we can see that this convergence is exponential.
Theorem (Cramér). Let X1, X2, ... be independent identically distributed random variables. Assume that ψ(t) := E(e^{tX1}) < +∞ for all t ∈ R. Then, for any a > m = E(X1) and n = 1, 2, ...,

P((1/n) ∑_{k=1}^n Xk ≥ a) ≤ e^{−I(a)n},

with I(a) = sup_{t∈R} {at − log ψ(t)}.
Theorem (Central Limit Theorem). Let Z1, Z2, ... be independent identically distributed (iid) random variables with mean 0 and variance 1. Then, for any x ∈ R it holds that

lim_{n→∞} P((1/√n) ∑_{k=1}^n Zk ≤ x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

In short,

(1/√n) ∑_{k=1}^n Zk → N(0, 1) weakly as n → ∞.
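The statement can be checked by simulation (a sketch, not in the original notes): take Zk = (Uk − 1/2)·√12 with Uk uniform on [0, 1], which has mean 0 and variance 1, and compare the empirical distribution function with the standard normal CDF; n, the number of trials, and x = 1 are arbitrary choices.

```python
import math
import random

# Monte Carlo check of the CLT: Z_k = (U_k - 1/2) * sqrt(12), U_k uniform on
# [0, 1], has mean 0 and variance 1.  Compare the empirical value of
# P((1/sqrt(n)) sum Z_k <= x) with the standard normal CDF at x.
random.seed(2)
n, trials, x = 100, 10_000, 1.0
count = 0
for _ in range(trials):
    s = sum((random.random() - 0.5) * math.sqrt(12) for _ in range(n))
    if s / math.sqrt(n) <= x:
        count += 1
empirical = count / trials
phi = 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
```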
Remark (The theorem of de Moivre-Laplace, a special case of CLT). Let X1, X2, ... be Bernoulli trials with success probability p. Set

Zk = (Xk − m)/σ, m = E(Xk) = p, σ² = V(Xk) = p(1 − p).

Thus, Z1, Z2, ... are iid random variables with mean 0 and variance 1. Applying the central limit theorem to these Zk, we have

lim_{n→∞} P((1/√n) ∑_{k=1}^n Zk ≤ x) = lim_{n→∞} P(∑_{k=1}^n Xk ≤ nm + xσ√n) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Setting y = nm + xσ√n, we see that

(1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt = (1/√(2πnσ²)) ∫_{−∞}^y e^{−(θ−nm)²/(2nσ²)} dθ,

which implies for large n

∑_{k=1}^n Xk ∼ N(nm, nσ²) = N(np, np(1 − p)).

On the other hand, we know that ∑_{k=1}^n Xk obeys B(n, p), of which the mean value and variance are given by np and np(1 − p). Consequently, for large n we have

B(n, p) ∼ N(np, np(1 − p)):

their distribution functions are almost the same for large n.
Problem 6 (Monte Carlo simulation). Let f(x) be a continuous function on the interval [0, 1] and consider the integral

∫_0^1 f(x) dx. (0.2)

(1) Let X be a random variable obeying the uniform distribution on [0, 1]. Give expressions of the mean value E(f(X)) and variance V(f(X)) of the random variable f(X).
(2) Let x1, x2, ... be a sequence of random numbers taken from [0, 1]. Explain that the arithmetic mean

(1/n) ∑_{k=1}^n f(xk)

is a good approximation of the integral (0.2).
(3) By using a computer, verify the above fact for f(x) = √(1 − x²).
5. Markov Chains
Recall the conditional probability (see Section 3): For two events A, B, the conditional probability of A relative to B (or on the hypothesis B, or for given B) is defined by

P(A|B) = P(A ∩ B)/P(B), whenever P(B) > 0,

i.e.

P(A ∩ B) = P(B)P(A|B).
Theorem. For events A1, A2, ..., An, we have
P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) · · ·P(An|A1 ∩ A2 ∩ .... ∩ An−1).
Remark. Tree diagrams are useful in the computation of probabilities.
Markov chains
Let S be a finite or countable set. Consider a discrete time stochastic process {Xn : n = 0, 1, 2, ...} taking values in S. This S is called a state space and is not necessarily a subset of R in general. In the following we often meet the cases of

S = {0, 1}, S = {1, 2, ..., N}, S = {0, 1, 2, ...}.
Definition (Markov chains). Let {Xn : n = 0, 1, 2, ...} be a discrete time stochastic process over S. It is called a Markov chain over S if
P(Xm = j|Xn1 = i1, Xn2 = i2, · · · , Xnk = ik, Xn = i) = P(Xm = j|Xn = i) (0.3)
holds for any 0 ≤ n1 < n2 < · · · < nk < n < m and i1, i2, ....., ik, i, j ∈ S.
Remark. The property (0.3) is called the Markov property. The Markov property is weaker than independence.
Theorem (multiplication rule). Let {Xn} be a Markov chain over S. Then, for any 0 ≤ n1 < n2 < · · · < nk and i1, i2, · · · , ik ∈ S, we have

P(Xn1 = i1, Xn2 = i2, · · · , Xnk = ik)
= P(Xn1 = i1)P(Xn2 = i2|Xn1 = i1)P(Xn3 = i3|Xn2 = i2) · · · P(Xnk = ik|Xnk−1 = ik−1).
Definition (Transition probability). For a Markov chain {Xn} over S,

P(Xn+1 = j|Xn = i)

is called the transition probability at time n from a state i to j. If this is independent of n, the Markov chain is called time homogeneous.
Hereafter a Markov chain is always assumed to be time homogeneous. In this case thetransition probability is denoted by
pi,j = p(i, j) := P(Xn+1 = j|Xn = i)
and P := [pi,j] is called the transition matrix.
Definition. A matrix P = [pi,j] with index set S × S is called a stochastic matrix if

pi,j ≥ 0, and ∑_{j∈S} pi,j = 1, i ∈ S.
Theorem. The transition matrix of a Markov chain is a stochastic matrix. Conversely,given a stochastic matrix we can construct a Markov chain of which the transition matrixcoincides with the given stochastic matrix.
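The constructive direction can be sketched in code (not part of the original notes): given a stochastic matrix, one realizes the Markov chain by sampling each step from the row of the current state. The 3-state matrix below is a hypothetical example.

```python
import random

# Simulating the Markov chain determined by a stochastic matrix: the next
# state j is drawn with probability P[i][j], where i is the current state.
random.seed(3)
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

def step(i):
    # draw the next state j with probability P[i][j]
    return random.choices(range(len(P[i])), weights=P[i])[0]

def sample_path(i0, n):
    path = [i0]
    for _ in range(n):
        path.append(step(path[-1]))
    return path

path = sample_path(0, 10)
```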
Example 5.1 (2-state Markov chain). A Markov chain over the state space {0, 1} is determined by the transition probabilities:

p(0, 1) = p, p(0, 0) = 1 − p, p(1, 0) = q, p(1, 1) = 1 − q.

The transition matrix is

P = [ 1−p   p  ]
    [  q   1−q ]

[Transition diagram: 0 → 1 with probability p, 0 → 0 with 1 − p, 1 → 0 with q, 1 → 1 with 1 − q.]
Example 5.2 (3-state Markov chain). An animal is healthy, sick or dead, and changes its state every day. Consider a Markov chain on {H, S, D} described by the following transition diagram:

[Transition diagram: H → H with a, H → S with b; S → H with p, S → S with r, S → D with q; D → D with 1.]

The transition matrix is

    [ a  b  0 ]
    [ p  r  q ]
    [ 0  0  1 ]

where a + b = 1 and p + q + r = 1.
Example 5.3 (Random walk on Z¹). The transition probabilities are given by

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

The transition matrix is the two-sided infinite tridiagonal matrix with q on the subdiagonal, 0 on the diagonal, and p on the superdiagonal.
Example 5.4 (Random walk with absorbing barriers). Let A > 0 and B > 0. The state space of a random walk with absorbing barriers at −A and B is S = {−A, −A + 1, ..., B − 1, B}. Then the transition probabilities are given as follows.

For −A < i < B,

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

For i = −A or i = B,

p(−A, j) = 1 if j = −A, and 0 otherwise;
p(B, j) = 1 if j = B, and 0 otherwise.

In matrix form, the transition matrix is tridiagonal with q below and p above the diagonal, except that the first row is (1, 0, ..., 0) and the last row is (0, ..., 0, 1).
Example 5.5 (Random walk with reflecting barriers). Let A > 0 and B > 0. The state space of a random walk with reflecting barriers at −A and B is S = {−A, −A + 1, ..., B − 1, B}. The transition probabilities are given as follows.

For −A < i < B,

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

For i = −A or i = B,

p(−A, j) = 1 if j = −A + 1, and 0 otherwise;
p(B, j) = 1 if j = B − 1, and 0 otherwise.
Let S be a state space as before. In general, a row vector π = [..., πi, ...] indexed by S is called a distribution on S if

πi ≥ 0, ∑_{i∈S} πi = 1.

For a Markov chain {Xn} over S we set

π(n) = [..., πi(n), ...], πi(n) = P(Xn = i),

which becomes a distribution on S. We call π(n) the distribution of Xn. In particular, π(0), the distribution of X0, is called the initial distribution. We often take

π(0) = [..., 0, 1, 0, ...],

where 1 occurs at the i-th position. In this case the Markov chain {Xn} starts from the state i.
For a Markov chain {Xn} with a transition matrix P = [pij], the N-step transition probability is defined by

pN(i, j) = P(Xn+N = j|Xn = i), i, j ∈ S, N = 0, 1, 2, ...

The right-hand side is independent of n since our Markov chain is assumed to be time homogeneous.
Theorem (Chapman-Kolmogorov equation). For 0 ≤ r ≤ n, we have

pn(i, j) = ∑_{k∈S} pr(i, k) p_{n−r}(k, j).
Recall P = [pij], the transition matrix (independent of n). We have

P(Xm = i, Xm+1 = i1, · · · , Xm+n−1 = in−1, Xm+n = j)
= P(Xm = i)P(Xm+1 = i1|Xm = i) · · · P(Xm+n = j|Xm+n−1 = in−1)
= P(Xm = i) p(i, i1) p(i1, i2) · · · p(in−1, j).

Taking the sum with respect to i1, · · · , in−1 ∈ S on both sides, we obtain the following important result.
Theorem. For m, n ≥ 0 and i, j ∈ S, we have

P(Xm+n = j|Xm = i) = pn(i, j) = (Pⁿ)ij.

Theorem. We have

π(n) = π(n − 1)P, n ≥ 1,

or equivalently,

πj(n) = ∑_i πi(n − 1) pij.
Remark. Therefore, π(n) = π(0)Pⁿ.
Example 5.6 (2-state Markov chain). Let {Xn} be the Markov chain introduced in Example 5.1. The transition matrix has the eigenvalues λ1 = 1 and λ2 = 1 − p − q, and λ1 ≠ λ2 if p + q > 0. We consider this case, i.e., the case that P has two distinct eigenvalues. By a standard argument, we obtain

Pⁿ = 1/(p + q) [ q + prⁿ    p(1 − rⁿ) ]
               [ q(1 − rⁿ)  p + qrⁿ   ]

where we put r = 1 − p − q.

Now, let π(0) = [π0(0), π1(0)] be the distribution of X0. Then the distribution of Xn is given by

π(n) = [P(Xn = 0), P(Xn = 1)] = [π0(0), π1(0)]Pⁿ.
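The closed form for Pⁿ can be verified against a direct matrix power (a numerical sketch, not in the original notes; the values of p, q and n are arbitrary).

```python
# Numerical check of the closed form
#   P^n = (1/(p+q)) [[q + p r^n, p(1 - r^n)], [q(1 - r^n), p + q r^n]],
# with r = 1 - p - q, against a direct matrix power.
p, q = 0.25, 0.2
r = 1 - p - q
P = [[1 - p, p], [q, 1 - q]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

n = 7
Pn = [[1.0, 0.0], [0.0, 1.0]]  # identity, then multiply by P n times
for _ in range(n):
    Pn = matmul(Pn, P)

closed = [[(q + p * r ** n) / (p + q), p * (1 - r ** n) / (p + q)],
          [q * (1 - r ** n) / (p + q), (p + q * r ** n) / (p + q)]]
```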
Problem 7. There are two parties, say A and B, and a constant fraction of their supporters switches sides at every election. Suppose that just before an election, 25% of the supporters of A change to support B and 20% of the supporters of B change to support A. At the beginning, 85% of the voters support A and 15% support B.

(1) When will the party B command a majority?
(2) Find the final ratio of supporters after many elections if the same situation continues.
Problem 8. Study the n-step transition probability of the three-state Markov chain introduced in Example 5.2. Explain that every animal dies within finite time if b > 0 and q > 0.
Problem 9. Let {Xn} be a Markov chain on {0, 1} given by the transition matrix

P = [ 1−p   p  ]
    [  q   1−q ]

with the initial distribution π(0) = [q/(p + q), p/(p + q)]. Calculate the following statistical quantities:

E(Xn), V(Xn), Cov(Xm+n, Xn).
6. Stationary distributions
Definition. Let {Xn} be a Markov chain over S with transition probability matrix P. A distribution π on S is called stationary (or invariant) if

π = πP, (0.4)

or equivalently if

πj = ∑_{i∈S} πi pij, j ∈ S. (0.5)

Thus, in order to find a stationary distribution of a Markov chain, we need to solve the linear system (0.4) (or equivalently (0.5)) together with the conditions:

∑_{i∈S} πi = 1, and πi ≥ 0 for all i ∈ S.
Examples. 2-state Markov chain, Random walk on Z1.
Theorem. A Markov chain over a finite state space S has a stationary distribution.
(For the proof see the textbooks.)
Remark. Note that the stationary distribution mentioned in the above theorem is notnecessarily unique.
Definition. We say that a state j can be reached from a state i if there exists some n ≥ 0 such that pn(i, j) > 0. By definition every state i can be reached from itself. We say that two states i and j intercommunicate if i can be reached from j and j can be reached from i, i.e., there exist m ≥ 0 and n ≥ 0 such that pn(i, j) > 0 and pm(j, i) > 0. For i, j ∈ S we introduce a binary relation i ∼ j when they intercommunicate. Then ∼ becomes an equivalence relation on S:

(i) i ∼ i;
(ii) i ∼ j =⇒ j ∼ i;
(iii) i ∼ j, j ∼ k =⇒ i ∼ k.

In fact, (i) and (ii) are obvious by definition, and (iii) is verified by the Chapman-Kolmogorov equation. Thereby the state space S is partitioned into disjoint equivalence classes. In each equivalence class any two states intercommunicate with each other.
Definition. A Markov chain is called irreducible if every state can be reached from everyother state, i.e., if there is only one equivalence class of intercommunicating states.
Theorem. An irreducible Markov chain on a finite state space S admits a unique sta-tionary distribution π = [πi]. Moreover, πi > 0 for all i ∈ S.
Now we recall the example of the 2-state Markov chain. If p + q > 0, the Markov chain above has a unique stationary distribution. Consider the case of p = q = 1, i.e., the transition matrix becomes

P = [ 0  1 ]
    [ 1  0 ]

The stationary distribution is unique. But for a given initial distribution π(0) it is not necessarily true that π(n) converges to the stationary distribution.
Roughly speaking, we need to avoid the periodic transition in order to have the con-vergence to a stationary distribution.
Definition. For a state i ∈ S,

GCD{n ≥ 1; P(Xn = i|X0 = i) > 0}

is called the period of i. (When the set in the right-hand side is empty, the period is not defined.) A state i ∈ S is called aperiodic if its period is one.
Theorem. For an irreducible Markov chain, every state has a common period.
Theorem. Let π be the stationary distribution of an irreducible Markov chain on a finite state space (it is unique). If {Xn} is aperiodic, for any j ∈ S we have

lim_{n→∞} P(Xn = j) = πj.
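The convergence π(n) = π(0)Pⁿ → π can be watched numerically (a sketch, not part of the original notes): for the irreducible aperiodic 2-state chain with p = q = 1/4, an arbitrary choice with 0 < p + q < 2, the stationary distribution is [1/2, 1/2].

```python
# Power iteration pi(n) = pi(n-1) P converging to the stationary distribution
# of an irreducible aperiodic 2-state chain (here p = q = 1/4, so pi = [1/2, 1/2]).
p = q = 0.25
P = [[1 - p, p], [q, 1 - q]]
pi = [1.0, 0.0]  # start deterministically from state 0
for _ in range(200):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]
```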
Problem 10. Find all stationary distributions of the Markov chain determined by thetransition diagram below. Then discuss convergence of distributions.
[Transition diagram: five states 1-5 with the indicated transition probabilities.]
Problem 11. Let {Xn} be the Markov chain introduced in Example 5.2.

(1) Show that if q > 0 and b > 0, the stationary distribution is unique and given by π = [0, 0, 1].

Next, for n = 1, 2, ... let Hn denote the probability of starting from H and terminating at D at the n-th step. Similarly, for n = 1, 2, ... let Sn denote the probability of starting from S and terminating at D at the n-th step.

(2) Show that Hn and Sn satisfy the following linear system:

Hn = aHn−1 + bSn−1,
Sn = pHn−1 + rSn−1,

where n ≥ 2, H1 = 0, S1 = q.

(3) Let H and S denote the life times starting from the states H and S, respectively. Solving the linear system in (2), prove the following identities for the mean life times:

E[H] = (b + p + q)/(bq), E[S] = (b + p)/(bq).
Example (PageRank). The hyperlinks among N websites give rise to a digraph (directed graph) G on N vertices. It is natural to consider a Markov chain on G, defined by the transition matrix P = [pi,j], where

pi,j = 1/deg(i), if i → j and deg(i) ≠ 0,
pi,j = 0, if i does not link to j and i ≠ j,
pi,j = 1, if j = i and deg(i) = 0,

where deg(i) = |{j; i → j}| is the out-degree of i.

There exists a stationary distribution, but it is not necessarily unique. Taking 0 ≤ d ≤ 1 we modify the transition matrix:

Q = [qi,j], qi,j = d pi,j + ε, ε = (1 − d)/N.

If 0 ≤ d < 1, the Markov chain determined by Q necessarily has a unique stationary distribution. Choosing a suitable d < 1, we may understand the stationary distribution π = [πi] as the page rank among the websites. In real applications d should not be close to 0, and d ≈ 0.85 is often taken.
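A minimal sketch of this construction (not in the original notes), on a hypothetical 4-page link graph: build q_{i,j} = d·p_{i,j} + (1 − d)/N and iterate π ← πQ until it stabilizes.

```python
# PageRank sketch on a hypothetical 4-page link graph.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # hypothetical digraph
N, d = 4, 0.85
eps = (1 - d) / N

def q(i, j):
    deg = len(links[i])
    if deg == 0:
        pij = 1.0 if j == i else 0.0   # page with no out-links: stay put
    else:
        pij = 1.0 / deg if j in links[i] else 0.0
    return d * pij + eps

pi = [1.0 / N] * N
for _ in range(100):
    pi = [sum(pi[i] * q(i, j) for i in range(N)) for j in range(N)]
```

In this toy graph, page 2 collects the most incoming links and ends up with the highest rank.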
7. Recurrence
Definition. Let S be a state space and i ∈ S. The first hitting time or first passage time to i is defined by

Ti = inf{n ≥ 1; Xn = i}.

If {n ≥ 1; Xn = i} is an empty set, we define Ti = +∞. A state i ∈ S is called recurrent if P(Ti < +∞|X0 = i) = 1. It is called transient if P(Ti = +∞|X0 = i) > 0.
Theorem. A state i ∈ S is recurrent if and only if

∑_{n=0}^∞ pn(i, i) = ∞.

If a state i ∈ S is transient, we have

∑_{n=0}^∞ pn(i, i) < ∞,

and moreover,

∑_{n=0}^∞ pn(i, i) = 1/(1 − P(Ti < ∞|X0 = i)).
Examples. Random walk on Z¹, Z² and Z³. The following notation will be used: let {an} and {bn} be sequences of positive numbers. We write an ∼ bn if lim_{n→∞} an/bn = 1. In this case, there exist two constant numbers c1 > 0 and c2 > 0 such that c1 an ≤ bn ≤ c2 an. Hence ∑_n an and ∑_n bn converge or diverge at the same time.
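The identity for a transient state can be checked numerically (a sketch, not in the original notes) for the asymmetric random walk on Z with p = 0.7, an arbitrary choice: here p_{2n}(0, 0) = C(2n, n)(pq)ⁿ, and the return probability 1 − |p − q| is the standard value for this walk, quoted here without proof.

```python
from math import comb, isclose

# Check of sum_n p_n(i, i) = 1/(1 - P(T_i < inf | X_0 = i)) for a transient
# chain: asymmetric random walk on Z with p = 0.7, q = 0.3.
# p_{2n}(0, 0) = C(2n, n) (p q)^n; odd-step returns are impossible.
p = 0.7
q = 1 - p
series = sum(comb(2 * n, n) * (p * q) ** n for n in range(500))
return_prob = 1 - abs(p - q)  # standard return probability for this walk
```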
Definition. If a state i ∈ S is recurrent, i.e., P(Ti < ∞|X0 = i) = 1, the mean recurrence time is defined by

E(Ti|X0 = i) = ∑_{n=1}^∞ n P(Ti = n|X0 = i).

The state i is called positive recurrent if E(Ti|X0 = i) < ∞, and null recurrent otherwise.
Theorem. The states in an equivalence class are all positive recurrent, or all null recur-rent, or all transient. In particular, for an irreducible Markov chain, the states are allpositive recurrent, or all null recurrent, or all transient.
Theorem. For an irreducible Markov chain on a finite state space S, every state ispositive recurrent.
Example. The mean recurrence time of the one-dimensional isotropic random walk is infinite, i.e., the one-dimensional isotropic random walk is null recurrent.
Problem 12. Let {Xn} be the 2-state Markov chain described by the transition diagram of Example 5.1 (0 → 1 with probability p, 1 → 0 with probability q), where p > 0 and q > 0. For a state i ∈ S let Ti = inf{n ≥ 1; Xn = i} be the first hitting time to i.

(1) Calculate

P(T0 = 1|X0 = 0), P(T0 = 2|X0 = 0), P(T0 = 3|X0 = 0), P(T0 = 4|X0 = 0).
(2) Find P(T0 = n|X0 = 0) and calculate

∑_{n=1}^∞ P(T0 = n|X0 = 0), ∑_{n=1}^∞ n P(T0 = n|X0 = 0).
8. Bienayme-Galton-Watson Branching Process
Consider a simplified family tree where each individual gives birth to offspring (children) and dies. The number of offspring is random. We are interested in whether the family survives or not.

Let Xn be the number of individuals of the n-th generation. Then {Xn : n = 0, 1, 2, ...} becomes a discrete-time stochastic process. We assume that the number of children born from each individual obeys a common probability distribution and is independent of individuals and of generation. Under this assumption {Xn} becomes a Markov chain.
Let us find the transition probability. Let Y be the number of children born from an individual and set

P(Y = k) = pk, k = 0, 1, 2, ...

The sequence p0, p1, p2, ... describes the distribution of the number of children born from an individual. In fact, what we need is the condition

pk ≥ 0, ∑_{k=0}^∞ pk = 1.
We refer to {p0, p1, ...} as the offspring distribution. Let Y1, Y2, ... be independent identically distributed random variables, of which the distribution is the same as Y. Then, we define the transition probability by

p(i, j) = P(Xn+1 = j|Xn = i) = P(∑_{k=1}^i Yk = j), i ≥ 1, j ≥ 0,

and

p(0, j) = 1 if j = 0, and 0 if j ≥ 1.

The above Markov chain {Xn} over the state space {0, 1, 2, ...} is called the Bienayme-Galton-Watson branching process with offspring distribution {pk : k = 0, 1, 2, ...}. For simplicity we assume that X0 = 1. When p0 + p1 = 1, the family tree is reduced to just a path without branching, so the situation is much simpler (Problem 13).
Let {Xn} be the Bienayme-Galton-Watson branching process with offspring distribution {pk : k = 0, 1, 2, ...}. Let p(i, j) = P(Xn+1 = j|Xn = i) be the transition probability. We assume that X0 = 1. Define the generating function of the offspring distribution by

f(s) = ∑_{k=0}^∞ pk s^k.
The series in the right-hand side converges for |s| ≤ 1. We set

f0(s) = s, f1(s) = f(s), fn(s) = f(fn−1(s)).
Lemma. ∑_{j=0}^∞ p(i, j) s^j = [f(s)]^i, i = 1, 2, ...

Lemma. Let pn(i, j) be the n-step transition probability of the Bienayme-Galton-Watson branching process. Then, we have

∑_{j=0}^∞ pn(i, j) s^j = [fn(s)]^i, i = 1, 2, ...
Theorem. Assume that the mean value of the offspring distribution is finite:

m = ∑_{k=0}^∞ k pk < ∞.

Then, we have

E[Xn] = mⁿ.

In conclusion, as n → ∞ the mean number of individuals in the n-th generation, E(Xn), decreases and converges to 0 if m < 1, diverges to infinity if m > 1, and stays constant if m = 1. This suggests that extinction of the family occurs when m < 1.
The event that {Xn = 0} occurs for some n ≥ 1 means that the family died out at or before the n-th generation. Thus,

q = P(⋃_{n=1}^∞ {Xn = 0}) = lim_{n→∞} P(Xn = 0)

is the probability of extinction of the family. If q = 1, the family almost surely dies out in some generation. If q < 1, the survival probability 1 − q is positive. We are interested in whether q = 1 or not.
Lemma. Let f(s) be the generating function of the offspring distribution. Then we have

q = lim_{n→∞} fn(0).

Therefore, q satisfies the equation

q = f(q).
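The characterization q = lim fn(0) suggests a simple fixed-point iteration (a sketch, not in the original notes). A hypothetical Poisson(2) offspring distribution is used, for which f(s) = exp(λ(s − 1)) and m = λ = 2 > 1, so q < 1 is expected.

```python
import math

# Extinction probability as q = lim f_n(0): fixed-point iteration of the
# offspring generating function, here f(s) = exp(lam * (s - 1)) for a
# hypothetical Poisson(lam) offspring distribution with lam = 2.
lam = 2.0

def f(s):
    return math.exp(lam * (s - 1))

q = 0.0
for _ in range(200):
    q = f(q)  # q increases monotonically to the smallest fixed point
```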
Assume that the offspring distribution satisfies the condition p0 + p1 < 1.
Lemma. The generating function f(s) satisfies the following properties.
(1) f(s) is increasing, i.e., f(s1) ≤ f(s2) for 0 ≤ s1 ≤ s2 ≤ 1.
(2) f(s) is strictly convex, i.e., if 0 ≤ s1 < s2 ≤ 1 and 0 < θ < 1 we have

f(θs1 + (1 − θ)s2) < θf(s1) + (1 − θ)f(s2).
Lemma.
(1) If m ≤ 1, we have f(s) > s for 0 ≤ s < 1.
(2) If m > 1, there exists a unique s such that 0 ≤ s < 1 and f(s) = s.
Theorem. The extinction probability q of the Bienayme-Galton-Watson branching pro-cess as above coincides with the smallest s such that s = f(s), 0 ≤ s ≤ 1. Moreover, ifm ≤ 1 we have q = 1, and if m > 1 we have q < 1.
The Bienayme-Galton-Watson branching process is called subcritical, critical and supercritical if m < 1, m = 1 and m > 1, respectively. The survival is determined only by the mean value m of the offspring distribution. The situation changes dramatically at m = 1 and, following the terminology of statistical physics, we call it a phase transition.
Problem 13 (One-child policy). Consider the Bienayme-Galton-Watson branching process with offspring distribution satisfying p0 + p1 = 1. Calculate the probabilities

q1 = P(X1 = 0), q2 = P(X1 ≠ 0, X2 = 0), ..., qn = P(X1 ≠ 0, ..., Xn−1 ≠ 0, Xn = 0), ...

and find the extinction probability

P(Xn = 0 occurs for some n ≥ 1).
Problem 14. Let b, p be constant numbers such that b > 0, 0 < p < 1 and b + p < 1. Suppose that the offspring distribution is given by

pk = b p^{k−1}, k = 1, 2, ..., p0 = 1 − ∑_{k=1}^∞ pk.

(1) Find the generating function f(s) of the offspring distribution.
(2) Set m = 1 and find fn(s).