Probability Model / Applied Analysis 2019
by
Reika Fukuizumi
Probability models are essential in the mathematical analysis of random phenomena. In these lectures, we focus on Markov chains as basic models of random time evolution. Starting with fundamental concepts in probability theory (random variables, probability distributions, etc.), we study the fundamentals of Markov chains (transition probability, recurrence, stationary distributions, etc.). Background knowledge of elementary probability is required.
The lectures will follow the book "Essence of Probability Models" by Nobuaki Obata (published by Makino Shoten in 2012, in Japanese). Other references, including English and French texts, are for example:
1. W. Feller, "An Introduction to Probability Theory and Its Applications," Vol. 1, Wiley, 1957.
2. R. Durrett, "Probability: Theory and Examples," fourth edition, Cambridge University Press, 2010.
3. T. Bodineau, "PROMENADE ALÉATOIRE : Chaînes de Markov et martingales," MAP432, École Polytechnique, 2013.
To obtain this course's credit, you are required to choose by yourself five problems among the problems given during the lectures, and to submit answers to the five problems as a report.
The report should be handed in to the report box beside the educational affairs office of GSIS. Deadline: January 27, 2020.
No exception will be made.
Sections 1-4 of this course consist of a review of what we have already learned in the undergraduate program (at least at Tohoku University we offer a course "Mathematical Statistics").
Updated: Dec. 20, 2019. Email: [email protected]
1. Probability Spaces and Random Variables
Ω: the sample space, consisting of elementary events (or sample points).
F: the set of events.
Definition (Probability). A function P : A ∈ F ↦ P(A) ∈ [0, 1] is said to be a probability on F (or with domain F) if the following (P1)-(P3) are satisfied:

(P1) 0 ≤ P(A) ≤ 1 for any A ∈ F.
(P2) P(∅) = 0 and P(Ω) = 1.
(P3) For A1, A2, ... ∈ F (an infinite sequence) with Aj ∩ Ak = ∅ whenever j ≠ k,

P(⋃_{n=1}^∞ An) = ∑_{n=1}^∞ P(An).
Examples: Coin toss, dice throwing, random cut.
Definition (Probability Space). Let Ω be a non-empty set, and P be a probability onF . We call (Ω,F ,P) a probability space.
Theorem 1.1. Let A1, A2, ... be a sequence of events.

(1) If A1 ⊂ A2 ⊂ A3 ⊂ · · · , then

P(⋃_{n=1}^∞ An) = lim_{n→∞} P(An).

(2) If A1 ⊃ A2 ⊃ A3 ⊃ · · · , then

P(⋂_{n=1}^∞ An) = lim_{n→∞} P(An).
Definition (Discrete random variables). A random variable X is called discrete if the number of values that X takes is finite or countably infinite. To be more precise, for a discrete random variable X there exist a (finite or infinite) sequence of real numbers a1, a2, ... and corresponding nonnegative numbers p1, p2, ... such that

P(X = ai) = pi, pi ≥ 0, ∑_i pi = 1.

In this case,

µX := ∑_i pi δ_{ai}

is called the (probability) distribution of X. Here, for a Borel set B ⊂ R,

δa(B) = 1 if a ∈ B, and 0 otherwise,

is the Dirac measure at a ∈ R. Obviously,

P(a ≤ X ≤ b) = ∑_{i: a ≤ ai ≤ b} pi.
Examples. Coin toss, Waiting time.
Definition (Continuous random variables). A random variable X is called continuous if P(X = a) = 0 for all a ∈ R. If there exists a function f(x) such that

P(a ≤ X ≤ b) = ∫_a^b f(x) dx, a < b, a, b ∈ R,

we say that X admits a probability density function f(x), and denote f(x) by fX(x). Note that

∫_{−∞}^{+∞} fX(x) dx = 1, fX(x) ≥ 0.

In this case,

µX(dx) := fX(x) dx

is called the (probability) distribution of X.
It is useful to consider the distribution function

FX(x) := P(X ∈ (−∞, x]) = ∫_{−∞}^x fX(t) dt, x ∈ R.

Then, if FX is continuous and piecewise differentiable, we have

fX(x) = (d/dx) FX(x).
Remark.
(1) A continuous random variable does not necessarily admit a probability density function. But many continuous random variables in practical applications admit probability density functions.
(2) There is a random variable which is neither discrete nor continuous. But most random variables in practical applications are either discrete or continuous.
Examples. Random cut.
Definition (Mean value). The mean or expectation value of a random variable X is defined by

m = E[X] := ∫_{−∞}^{+∞} x µX(dx)
  = ∑_i ai pi, if X is discrete,
  = ∫_{−∞}^{+∞} x fX(x) dx, if X is continuous and admits a probability density function fX(x).
Proposition 1.1. For a (measurable) function ϕ(x) we have

E[ϕ(X)] = ∫_{−∞}^{+∞} ϕ(x) µX(dx).

For example,

• (m-th moment) E[X^m] = ∫_{−∞}^{+∞} x^m µX(dx).
• (characteristic function) E[e^{itX}] = ∫_{−∞}^{+∞} e^{itx} µX(dx), t ∈ R.
Definition (Variance). The variance of a random variable X is defined by

σ² = V[X] = E[(X − E[X])²] = E[X²] − E[X]²,

i.e.

σ² = V[X] = ∫_{−∞}^{+∞} (x − E[X])² µX(dx) = ∫_{−∞}^{+∞} x² µX(dx) − (∫_{−∞}^{+∞} x µX(dx))².
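As a quick numerical illustration (not part of the original notes), the defining sums for m and σ² can be evaluated directly; the fair die used here is a hypothetical example.

```python
# Mean and variance of a discrete distribution computed from the defining sums
# m = sum_i a_i p_i and sigma^2 = sum_i a_i^2 p_i - m^2.
# The fair die (a_i = 1..6, p_i = 1/6) is a hypothetical example.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

m = sum(a * p for a, p in zip(values, probs))
var = sum(a * a * p for a, p in zip(values, probs)) - m * m
```

For the die this gives m = 7/2 and σ² = 35/12, consistent with the definitions above.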
Examples. Waiting time, random cut.
2. Probability Distributions
We introduce some classical examples of one-dimensional distributions.
• Discrete distributions

– Bernoulli distribution
For 0 ≤ p ≤ 1, the distribution (1 − p)δ0 + pδ1 is called the Bernoulli distribution with success probability p.

m = p, σ² = p(1 − p)
– Binomial distribution B(n, p)
For 0 ≤ p ≤ 1 and n ≥ 1, the distribution

∑_{k=0}^n C(n, k) p^k (1 − p)^{n−k} δk,

where C(n, k) denotes the binomial coefficient, is called the binomial distribution. The quantity C(n, k) p^k (1 − p)^{n−k} is the probability that n coin tosses, with probability p for heads and q = 1 − p for tails, result in k heads and n − k tails.

m = np, σ² = np(1 − p)
[Figure: binomial distributions; one panel with size n = 40 fixed and p = 0.3, 0.5, 0.7; one panel with success probability p = 0.5 fixed and sizes n = 20, 40.]
– Geometric distribution
For 0 ≤ p ≤ 1, the distribution

∑_{k=1}^{+∞} p(1 − p)^{k−1} δk

is called the geometric distribution with success probability p.

m = 1/p, σ² = (1 − p)/p²
These follow from the computation of the probability generating function:

G(z) = ∑_{k=1}^∞ p(1 − p)^{k−1} z^k = pz/(1 − (1 − p)z),

G′(z) = p/(1 − (1 − p)z)², G″(z) = 2p(1 − p)/(1 − (1 − p)z)³,

m = G′(1) = 1/p, σ² = G″(1) + G′(1) − G′(1)² = (1 − p)/p².
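These values can also be checked numerically (a sketch, not in the original notes) by truncating the series defining m and σ²; the value p = 0.3 is arbitrary.

```python
# Truncated-series check of m = 1/p and sigma^2 = (1-p)/p^2 for the
# geometric distribution sum_k p(1-p)^{k-1} delta_k; p = 0.3 is arbitrary.
p = 0.3
ks = range(1, 2000)  # the tail beyond k = 2000 is negligible
m = sum(k * p * (1 - p) ** (k - 1) for k in ks)
second_moment = sum(k * k * p * (1 - p) ** (k - 1) for k in ks)
var = second_moment - m * m
```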
[Figure: geometric distributions with success probability p = 0.1, 0.25, 0.5, 0.75.]
Remark. In some of the literature, the geometric distribution with parameter p is defined by

∑_{k=0}^{+∞} p(1 − p)^k δk.

In this case, the mean is (1 − p)/p and the variance is (1 − p)/p².
– Poisson distribution
For λ > 0 the distribution

∑_{k=0}^{+∞} e^{−λ} (λ^k / k!) δk

is called the Poisson distribution with parameter λ.

m = λ, σ² = λ
[Figure: Poisson distributions with parameter λ = 1, 4, 10.]
Problem 1. Find the mean value and variance of the discrete distributions introduced above.
• Continuous distributions

– Uniform distribution
For a finite interval [a, b], the function

f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise,

becomes a density function, which determines the uniform distribution on [a, b].
m = ∫_a^b x dx/(b − a) = (a + b)/2, σ² = ∫_a^b x² dx/(b − a) − m² = (b − a)²/12.
[Figure: uniform distributions on [−1, 1], [0, 1], [1, 4].]
– Exponential distribution
The exponential distribution with parameter λ > 0 is defined by the density function

f(x) = λe^{−λx} for x ≥ 0, and 0 otherwise.

The mean value and variance are

m = 1/λ, σ² = 1/λ².
[Figure: exponential distributions with rate λ = 1, 5, 10.]
– Normal distribution N(m, σ²)
For m ∈ R and σ > 0, we see that

f(x) = (1/√(2πσ²)) exp(−(x − m)²/(2σ²))

becomes a density function. The distribution defined by this density function is called the normal distribution or Gaussian distribution. The case m = 0 and σ² = 1, i.e. N(0, 1), is called the standard normal distribution or the standard Gaussian distribution.
[Figure: normal densities; one panel with variance 1 and means m = 0, 1, 2; one panel with mean 0 and standard deviations sd = 1, 2, 3.]
Recall:

∫_0^{+∞} e^{−tx²} dx = √π/(2√t).

Thus,

(1/√(2πσ²)) ∫_{−∞}^{+∞} x exp(−(x − m)²/(2σ²)) dx = m,

(1/√(2πσ²)) ∫_{−∞}^{+∞} (x − m)² exp(−(x − m)²/(2σ²)) dx = σ².
Problem 2. Choose randomly a point A from the disc with radius one and let X be the radius of the inscribed circle with center A.
(1) Find the probability P(X ≤ x), x ≥ 0.
(2) Find the probability density function fX(x) of X.
(3) Calculate the mean and variance of X.
(4) Calculate the mean and variance of the area of the inscribed circle: S = πX².
3. Independence and Dependence
• Independent events and conditional probability
Definition (Pairwise independence). A (finite or infinite) sequence of events A1, A2, ... is called pairwise independent if any pair of events Ai1, Ai2 (i1 ≠ i2) satisfies

P(Ai1 ∩ Ai2) = P(Ai1)P(Ai2).
Definition (Independence). A (finite or infinite) sequence of events A1, A2, ....is called independent if any choice of finitely many events Ai1 , ....., Ain(i1 < i2 <· · · < in) satisfies
P(Ai1 ∩ Ai2 ∩ · · · ∩ Ain) = P(Ai1)P(Ai2) · · ·P(Ain).
Example. Drawing randomly a card from a deck of 52 cards.
Remark. It is allowed to consider whether the sequence of events A, A is independent or not. If they are independent, by definition we have P(A) = P(A)P(A), from which P(A) = 0 or P(A) = 1 follows. Notice that P(A) = 0 does not imply A = ∅ (empty event). Similarly, P(A) = 1 does not imply A = Ω (whole event).
Definition (Conditional probability). For two events A, B, the conditional probability of A relative to B (or on the hypothesis B, or for given B) is defined by

P(A|B) = P(A ∩ B)/P(B), whenever P(B) > 0.
Theorem. Let A, B be events with P(A) > 0 and P(B) > 0. A and B are independent iff

P(A|B) = P(A), and P(B|A) = P(B).
• Independent random variables
Definition. A (finite or infinite) sequence of random variables X1, X2, ... is independent (resp. pairwise independent) if so is the sequence of events {X1 ≤ a1}, {X2 ≤ a2}, ... for any a1, a2, ... ∈ R.
In other words, a (finite or infinite) sequence of random variables X1, X2, ... is independent if for any finitely many Xi1, ..., Xin (i1 < i2 < ... < in) and constant numbers a1, ..., an, the joint probability factorizes:

P(Xi1 ≤ a1, Xi2 ≤ a2, ..., Xin ≤ an) = P(Xi1 ≤ a1)P(Xi2 ≤ a2) · · · P(Xin ≤ an). (0.1)
A similar assertion holds for pairwise independence. If the random variables X1, X2, ... are discrete, (0.1) may be replaced with

P(Xi1 = a1, Xi2 = a2, ..., Xin = an) = P(Xi1 = a1)P(Xi2 = a2) · · · P(Xin = an).
Example. Choose at random a point from the rectangle.
Problem 3.
– (1) A box contains four balls with numbers 112, 121, 211, 222. We draw a ball at random and let X1 be the first digit, X2 the second digit, and X3 the last digit. For i = 1, 2, 3 we define an event Ai by Ai = {Xi = 1}. Show that A1, A2, A3 are pairwise independent but not independent.
– (2) Two dice are tossed. Let A be the event that the first die gives a 4, B be the event that the sum is 6, and C be the event that the sum is 7. Calculate P(B|A) and P(C|A), and study the independence among A, B, C.
Example (Bernoulli trials). This is a model of coin-toss and is the most fundamental stochastic process. A sequence of random variables (or a discrete-time stochastic process) X1, X2, ..., Xn, ... is called the Bernoulli trials with success probability p (0 ≤ p ≤ 1) if they are independent and have the same distribution:

P(Xn = 1) = p, P(Xn = 0) = 1 − p.

By definition of independence, we have

P(X1 = a1, X2 = a2, ..., Xn = an) = ∏_{k=1}^n P(Xk = ak)

for all a1, ..., an ∈ {0, 1}.
In general, the quantity on the left-hand side is called a finite dimensional distribution of the stochastic process {Xn}. The total set of finite dimensional distributions characterizes a stochastic process.
• Covariance and correlation coefficients
Recall that the mean of a real-valued (1-dim) random variable X is defined by

m = E(X) = ∫_{−∞}^{+∞} x µX(dx).

If X = (X1, ..., Xn) ∈ Rⁿ, for a measurable function ϕ : Rⁿ → R,

E(ϕ(X)) = ∫_{Rⁿ} ϕ(x) µX(dx), dx = dx1 dx2 ... dxn.
Theorem. For two random variables X, Y and two constant numbers a, b it holdsthat
E(aX + bY ) = aE(X) + bE(Y ).
Theorem. If random variables X1, X2, ...., Xn are independent, we have
E(X1X2 · · ·Xn) = E(X1)E(X2) · · ·E(Xn).
Remark. E(XY) = E(X)E(Y) is not a sufficient condition for the random variables X and Y to be independent.
Definition (Covariance). The covariance of two random variables X, Y is defined by

Cov(X, Y) = σXY = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

In particular, σXX = σX² becomes the variance of X. The correlation coefficient of two random variables X, Y is defined by

ρXY = σXY/(σX σY)

whenever σX > 0 and σY > 0.
Definition. X, Y are called uncorrelated if σXY = 0. They are called positively(resp. negatively) correlated if σXY > 0 (resp. σXY < 0).
Theorem. If two random variables X, Y are independent, they are uncorrelated.
Remark. The converse of this Theorem is not true in general. (see Problem 5.)
Theorem. Let X1, X2, ..., Xn be independent random variables. Then

V[∑_{k=1}^n Xk] = ∑_{k=1}^n V(Xk).
Theorem. |ρXY | ≤ 1 for two random variables X, Y with σX > 0, σY > 0.
Problem 4. Throw two dice and let L be the larger spot and S the smaller. (If double spots, set L = S.)
– (1) Show the joint probability of (L, S) by a table.
– (2) Calculate the correlation coefficient ρLS and explain the meaning of the sign of ρLS.

Problem 5. Let X and Y be random variables such that

P(X = a) = p1, P(X = b) = q1 = 1 − p1,
P(Y = c) = p2, P(Y = d) = q2 = 1 − p2,

where a, b, c, d are constant numbers and 0 < p1 < 1, 0 < p2 < 1. Show that X, Y are independent if and only if σXY = 0. Explain the significance of this case. [Hint: In general, uncorrelated random variables are not necessarily independent.]
4. Limit Theorems
Let {Xk} be Bernoulli trials with success probability 1/2, and consider the binomial process defined by

Sn = ∑_{k=1}^n Xk.

Since Sn counts the number of heads during the first n trials,

Sn/n = (1/n) ∑_{k=1}^n Xk

gives the relative frequency of heads during the first n trials. The following figure shows 40 randomly chosen samples, illustrating that the relative frequency of heads Sn/n tends to 1/2. It is our question how to describe this phenomenon mathematically. A naive formula

lim_{n→∞} Sn/n = 1/2

is not acceptable.
[Figure: 40 sample paths of Sn/n for n up to 1000.]
Theorem (Weak law of large numbers). Let X1, X2, ... be identically distributed random variables with mean m and variance σ². (This means that Xi has a finite variance.) If X1, X2, ... are uncorrelated, for any ε > 0 we have

lim_{n→∞} P(|(1/n) ∑_{k=1}^n Xk − m| ≥ ε) = 0.

We say that (1/n) ∑_{k=1}^n Xk converges to m in probability.
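The convergence can be observed numerically; the following sketch (not part of the original notes) simulates coin tosses with p = 1/2, a hypothetical choice, and checks that the relative frequency is close to m = 1/2.

```python
import random

# Monte Carlo illustration of the law of large numbers for coin tosses:
# X_k in {0, 1} with p = 1/2, so (1/n) sum X_k should be close to m = 1/2.
# Sample size and seed are arbitrary choices for this sketch.
random.seed(1)
n = 200_000
heads = sum(random.random() < 0.5 for _ in range(n))
freq = heads / n
```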
Remark. In much of the literature the weak law of large numbers is stated under the assumption that X1, X2, ... are independent. It is noticeable that the same result holds under the weaker assumption of being uncorrelated.
Theorem (Chebyshev inequality). Let X be a random variable with mean m and variance σ². Then, for any ε > 0 we have

P(|X − m| ≥ ε) ≤ σ²/ε².
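A small empirical sketch of the inequality (not in the original notes): for X uniform on [0, 1] we have m = 1/2 and σ² = 1/12, and the simulated tail probability stays below the Chebyshev bound; ε = 0.4 and the sample size are arbitrary.

```python
import random

# Empirical illustration of Chebyshev's inequality for X uniform on [0, 1]:
# m = 1/2 and sigma^2 = 1/12, so P(|X - 1/2| >= eps) <= 1/(12 eps^2).
# The exact tail probability for eps = 0.4 is 0.2.
random.seed(0)
eps = 0.4
n = 100_000
hits = sum(abs(random.random() - 0.5) >= eps for _ in range(n))
empirical = hits / n
bound = (1 / 12) / eps ** 2
```

Note that the bound (about 0.52 here) can be far from the true tail probability (0.2); Chebyshev trades sharpness for generality.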
Theorem (Strong law of large numbers). Let X1, X2, ... be identically distributed random variables with mean m. (This means that Xi has a mean but is not assumed to have a finite variance.) If X1, X2, ... are pairwise independent, we have

P(lim_{n→∞} (1/n) ∑_{k=1}^n Xk = m) = 1.

In other words,

lim_{n→∞} (1/n) ∑_{k=1}^n Xk = m, a.s.
Remark. Kolmogorov proved the strong law of large numbers under the assumption that X1, X2, ... are independent. In much of the literature, the strong law of large numbers is stated in Kolmogorov's form. The Theorem above is due to N. Etemadi (1981), where the assumption is relaxed to pairwise independence and the proof is more elementary.
Now, consider independent identically distributed random variables X1, X2, ... with mean m. Let a > m, and take ε = a − m in the weak law of large numbers. Then

lim_{n→∞} P((1/n) ∑_{k=1}^n Xk ≥ a) = 0.
In fact, we can see that this convergence is exponential.
Theorem (Cramér). Let X1, X2, ... be independent identically distributed random variables. Assume that ψ(t) := E(e^{tX1}) < +∞ for all t ∈ R. Then, for any a > m = E(X1) and n = 1, 2, ...,

P((1/n) ∑_{k=1}^n Xk ≥ a) ≤ e^{−I(a)n},

with I(a) = sup_{t∈R} {at − log ψ(t)}.
Theorem (Central Limit Theorem). Let Z1, Z2, ... be independent identically distributed (iid) random variables with mean 0 and variance 1. Then, for any x ∈ R it holds that

lim_{n→∞} P((1/√n) ∑_{k=1}^n Zk ≤ x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

In short,

(1/√n) ∑_{k=1}^n Zk → N(0, 1) weakly as n → ∞.
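The statement can be checked by simulation (a sketch, not in the original notes): take Zk = (Uk − 1/2)·√12 with Uk uniform on [0, 1], which has mean 0 and variance 1, and compare the empirical distribution function with the standard normal CDF; n, the number of trials, and x = 1 are arbitrary choices.

```python
import math
import random

# Monte Carlo check of the CLT: Z_k = (U_k - 1/2) * sqrt(12), U_k uniform on
# [0, 1], has mean 0 and variance 1.  Compare the empirical value of
# P((1/sqrt(n)) sum Z_k <= x) with the standard normal CDF at x.
random.seed(2)
n, trials, x = 100, 10_000, 1.0
count = 0
for _ in range(trials):
    s = sum((random.random() - 0.5) * math.sqrt(12) for _ in range(n))
    if s / math.sqrt(n) <= x:
        count += 1
empirical = count / trials
phi = 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
```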
Remark (The theorem of de Moivre-Laplace, a special case of CLT). Let X1, X2, ... be Bernoulli trials with success probability p. Set

Zk = (Xk − m)/σ, m = E(Xk) = p, σ² = V(Xk) = p(1 − p).

Thus, Z1, Z2, ... are iid random variables with mean 0 and variance 1. Applying the central limit theorem to these Zk, we have

lim_{n→∞} P((1/√n) ∑_{k=1}^n Zk ≤ x) = lim_{n→∞} P(∑_{k=1}^n Xk ≤ nm + xσ√n) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Setting y = nm + xσ√n, we see that

(1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt = (1/√(2πnσ²)) ∫_{−∞}^y e^{−(θ−nm)²/(2nσ²)} dθ,

which implies for large n

∑_{k=1}^n Xk ∼ N(nm, nσ²) = N(np, np(1 − p)).

On the other hand, we know that ∑_{k=1}^n Xk obeys B(n, p), of which the mean value and variance are given by np and np(1 − p). Consequently, for large n we have

B(n, p) ∼ N(np, np(1 − p)):

their distribution functions are almost the same for large n.
Problem 6 (Monte Carlo simulation). Let f(x) be a continuous function on the interval [0, 1] and consider the integral

∫_0^1 f(x) dx. (0.2)

(1) Let X be a random variable obeying the uniform distribution on [0, 1]. Give expressions of the mean value E(f(X)) and variance V(f(X)) of the random variable f(X).
(2) Let x1, x2, ... be a sequence of random numbers taken from [0, 1]. Explain that the arithmetic mean

(1/n) ∑_{k=1}^n f(xk)

is a good approximation of the integral (0.2).
(3) By using a computer, verify the above fact for f(x) = √(1 − x²).
5. Markov Chains
Recall the conditional probability (see Section 3): For two events A, B, the conditional probability of A relative to B (or on the hypothesis B, or for given B) is defined by

P(A|B) = P(A ∩ B)/P(B), whenever P(B) > 0,

i.e.

P(A ∩ B) = P(B)P(A|B).
Theorem. For events A1, A2, ..., An, we have
P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) · · ·P(An|A1 ∩ A2 ∩ .... ∩ An−1).
Remark. Tree diagrams are useful in the computation of probabilities.
Markov chains
Let S be a finite or countable set. Consider a discrete time stochastic process {Xn : n = 0, 1, 2, ...} taking values in S. This S is called a state space and is not necessarily a subset of R in general. In the following we often meet the cases of

S = {0, 1}, S = {1, 2, ..., N}, S = {0, 1, 2, ...}.
Definition (Markov chains). Let {Xn : n = 0, 1, 2, ...} be a discrete time stochastic process over S. It is called a Markov chain over S if
P(Xm = j|Xn1 = i1, Xn2 = i2, · · · , Xnk = ik, Xn = i) = P(Xm = j|Xn = i) (0.3)
holds for any 0 ≤ n1 < n2 < · · · < nk < n < m and i1, i2, ....., ik, i, j ∈ S.
Remark. The property (0.3) is called the Markov property. The Markov property is weaker than independence.
Theorem (multiplication rule). Let {Xn} be a Markov chain over S. Then, for any 0 ≤ n1 < n2 < · · · < nk and i1, i2, · · · , ik ∈ S, we have

P(Xn1 = i1, Xn2 = i2, · · · , Xnk = ik)
= P(Xn1 = i1)P(Xn2 = i2|Xn1 = i1)P(Xn3 = i3|Xn2 = i2) · · · P(Xnk = ik|Xnk−1 = ik−1).
Definition (Transition probability). For a Markov chain {Xn} over S,

P(Xn+1 = j|Xn = i)

is called the transition probability at time n from a state i to j. If this is independent of n, the Markov chain is called time homogeneous.
Hereafter a Markov chain is always assumed to be time homogeneous. In this case thetransition probability is denoted by
pi,j = p(i, j) := P(Xn+1 = j|Xn = i)
and P := [pi,j] is called the transition matrix.
Definition. A matrix P = [pi,j] with index set S × S is called a stochastic matrix if

pi,j ≥ 0, and ∑_{j∈S} pi,j = 1, i ∈ S.
Theorem. The transition matrix of a Markov chain is a stochastic matrix. Conversely,given a stochastic matrix we can construct a Markov chain of which the transition matrixcoincides with the given stochastic matrix.
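The constructive direction can be sketched in code (not part of the original notes): given a stochastic matrix, one realizes the Markov chain by sampling each step from the row of the current state. The 3-state matrix below is a hypothetical example.

```python
import random

# Simulating the Markov chain determined by a stochastic matrix: the next
# state j is drawn with probability P[i][j], where i is the current state.
random.seed(3)
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

def step(i):
    # draw the next state j with probability P[i][j]
    return random.choices(range(len(P[i])), weights=P[i])[0]

def sample_path(i0, n):
    path = [i0]
    for _ in range(n):
        path.append(step(path[-1]))
    return path

path = sample_path(0, 10)
```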
Example 5.1 (2-state Markov chain). A Markov chain over the state space {0, 1} is determined by the transition probabilities:

p(0, 1) = p, p(0, 0) = 1 − p, p(1, 0) = q, p(1, 1) = 1 − q.

The transition matrix is

P = [ 1−p   p  ]
    [  q   1−q ]

[Transition diagram: 0 → 1 with probability p, 0 → 0 with 1 − p, 1 → 0 with q, 1 → 1 with 1 − q.]
Example 5.2 (3-state Markov chain). An animal is healthy, sick or dead, and changes its state every day. Consider a Markov chain on {H, S, D} described by the following transition diagram:

[Transition diagram: H → H with a, H → S with b; S → H with p, S → S with r, S → D with q; D → D with 1.]

The transition matrix is

    [ a  b  0 ]
    [ p  r  q ]
    [ 0  0  1 ]

where a + b = 1 and p + q + r = 1.
Example 5.3 (Random walk on Z¹). The transition probabilities are given by

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

The transition matrix is the two-sided infinite tridiagonal matrix with q on the subdiagonal, 0 on the diagonal, and p on the superdiagonal.
Example 5.4 (Random walk with absorbing barriers). Let A > 0 and B > 0. The state space of a random walk with absorbing barriers at −A and B is S = {−A, −A + 1, ..., B − 1, B}. Then the transition probabilities are given as follows.

For −A < i < B,

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

For i = −A or i = B,

p(−A, j) = 1 if j = −A, and 0 otherwise;
p(B, j) = 1 if j = B, and 0 otherwise.

In matrix form, the transition matrix is tridiagonal with q below and p above the diagonal, except that the first row is (1, 0, ..., 0) and the last row is (0, ..., 0, 1).
Example 5.5 (Random walk with reflecting barriers). Let A > 0 and B > 0. The state space of a random walk with reflecting barriers at −A and B is S = {−A, −A + 1, ..., B − 1, B}. The transition probabilities are given as follows.

For −A < i < B,

p(i, j) = p if j = i + 1, q = 1 − p if j = i − 1, and 0 otherwise.

For i = −A or i = B,

p(−A, j) = 1 if j = −A + 1, and 0 otherwise;
p(B, j) = 1 if j = B − 1, and 0 otherwise.
Let S be a state space as before. In general, a row vector π = [..., πi, ...] indexed by S is called a distribution on S if

πi ≥ 0, ∑_{i∈S} πi = 1.

For a Markov chain {Xn} over S we set

π(n) = [..., πi(n), ...], πi(n) = P(Xn = i),

which becomes a distribution on S. We call π(n) the distribution of Xn. In particular, π(0), the distribution of X0, is called the initial distribution. We often take

π(0) = [..., 0, 1, 0, ...],

where 1 occurs at the i-th position. In this case the Markov chain {Xn} starts from the state i.
For a Markov chain {Xn} with a transition matrix P = [pij], the N-step transition probability is defined by

pN(i, j) = P(Xn+N = j|Xn = i), i, j ∈ S, N = 0, 1, 2, ...

The right-hand side is independent of n since our Markov chain is assumed to be time homogeneous.
Theorem (Chapman-Kolmogorov equation). For 0 ≤ r ≤ n, we have

pn(i, j) = ∑_{k∈S} pr(i, k) p_{n−r}(k, j).
Recall P = [pij], the transition matrix (independent of n). We have

P(Xm = i, Xm+1 = i1, · · · , Xm+n−1 = in−1, Xm+n = j)
= P(Xm = i)P(Xm+1 = i1|Xm = i) · · · P(Xm+n = j|Xm+n−1 = in−1)
= P(Xm = i) p(i, i1) p(i1, i2) · · · p(in−1, j).

Taking the sum with respect to i1, · · · , in−1 ∈ S on both sides, we obtain the following important result.
Theorem. For m, n ≥ 0 and i, j ∈ S, we have

P(Xm+n = j|Xm = i) = pn(i, j) = (Pⁿ)ij.

Theorem. We have

π(n) = π(n − 1)P, n ≥ 1,

or equivalently,

πj(n) = ∑_i πi(n − 1) pij.
Remark. Therefore, π(n) = π(0)Pⁿ.
Example 5.6 (2-state Markov chain). Let {Xn} be the Markov chain introduced in Example 5.1. The transition matrix has the eigenvalues λ1 = 1 and λ2 = 1 − p − q, and λ1 ≠ λ2 if p + q > 0. We consider this case, i.e., the case that P has two distinct eigenvalues. By a standard argument, we obtain

Pⁿ = 1/(p + q) [ q + prⁿ    p(1 − rⁿ) ]
               [ q(1 − rⁿ)  p + qrⁿ   ]

where we put r = 1 − p − q.

Now, let π(0) = [π0(0), π1(0)] be the distribution of X0. Then the distribution of Xn is given by

π(n) = [P(Xn = 0), P(Xn = 1)] = [π0(0), π1(0)]Pⁿ.
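The closed form for Pⁿ can be verified against a direct matrix power (a numerical sketch, not in the original notes; the values of p, q and n are arbitrary).

```python
# Numerical check of the closed form
#   P^n = (1/(p+q)) [[q + p r^n, p(1 - r^n)], [q(1 - r^n), p + q r^n]],
# with r = 1 - p - q, against a direct matrix power.
p, q = 0.25, 0.2
r = 1 - p - q
P = [[1 - p, p], [q, 1 - q]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

n = 7
Pn = [[1.0, 0.0], [0.0, 1.0]]  # identity, then multiply by P n times
for _ in range(n):
    Pn = matmul(Pn, P)

closed = [[(q + p * r ** n) / (p + q), p * (1 - r ** n) / (p + q)],
          [q * (1 - r ** n) / (p + q), (p + q * r ** n) / (p + q)]]
```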
Problem 7. There are two parties, say A and B, and a constant fraction of their supporters switches sides at every election. Suppose that just before an election, 25% of the supporters of A change to support B and 20% of the supporters of B change to support A. At the beginning, 85% of the voters support A and 15% support B.

(1) When will the party B command a majority?
(2) Find the final ratio of supporters after many elections if the same situation continues.
Problem 8. Study the n-step transition probability of the three-state Markov chain introduced in Example 5.2. Explain that every animal dies within finite time if b > 0 and q > 0.
Problem 9. Let {Xn} be a Markov chain on {0, 1} given by the transition matrix

P = [ 1−p   p  ]
    [  q   1−q ]

with the initial distribution π(0) = [q/(p + q), p/(p + q)]. Calculate the following statistical quantities:

E(Xn), V(Xn), Cov(Xm+n, Xn).
6. Stationary distributions
Definition. Let {Xn} be a Markov chain over S with transition probability matrix P. A distribution π on S is called stationary (or invariant) if

π = πP, (0.4)

or equivalently if

πj = ∑_{i∈S} πi pij, j ∈ S. (0.5)

Thus, in order to find a stationary distribution of a Markov chain, we need to solve the linear system (0.4) (or equivalently (0.5)) together with the conditions:

∑_{i∈S} πi = 1, and πi ≥ 0 for all i ∈ S.
Examples. 2-state Markov chain, Random walk on Z1.
Theorem. A Markov chain over a finite state space S has a stationary distribution.
(For the proof see the textbooks.)
Remark. Note that the stationary distribution mentioned in the above theorem is notnecessarily unique.
Definition. We say that a state j can be reached from a state i if there exists some n ≥ 0 such that pn(i, j) > 0. By definition every state i can be reached from itself. We say that two states i and j intercommunicate if i can be reached from j and j can be reached from i, i.e., there exist m ≥ 0 and n ≥ 0 such that pn(i, j) > 0 and pm(j, i) > 0. For i, j ∈ S we introduce a binary relation i ∼ j when they intercommunicate. Then ∼ becomes an equivalence relation on S:

(i) i ∼ i;
(ii) i ∼ j =⇒ j ∼ i;
(iii) i ∼ j, j ∼ k =⇒ i ∼ k.

In fact, (i) and (ii) are obvious by definition, and (iii) is verified by the Chapman-Kolmogorov equation. Thereby the state space S is partitioned into disjoint equivalence classes. In each equivalence class any two states intercommunicate with each other.
Definition. A Markov chain is called irreducible if every state can be reached from everyother state, i.e., if there is only one equivalence class of intercommunicating states.
Theorem. An irreducible Markov chain on a finite state space S admits a unique sta-tionary distribution π = [πi]. Moreover, πi > 0 for all i ∈ S.
Now we recall the example of the 2-state Markov chain. If p + q > 0, the Markov chain above has a unique stationary distribution. Consider the case of p = q = 1, i.e., the transition matrix becomes

P = [ 0  1 ]
    [ 1  0 ]

The stationary distribution is unique. But for a given initial distribution π(0) it is not necessarily true that π(n) converges to the stationary distribution.
Roughly speaking, we need to avoid the periodic transition in order to have the con-vergence to a stationary distribution.
Definition. For a state i ∈ S,

GCD{n ≥ 1; P(Xn = i|X0 = i) > 0}

is called the period of i. (When the set in the right-hand side is empty, the period is not defined.) A state i ∈ S is called aperiodic if its period is one.
Theorem. For an irreducible Markov chain, every state has a common period.
Theorem. Let π be the stationary distribution of an irreducible Markov chain on a finite state space (it is unique). If {Xn} is aperiodic, for any j ∈ S we have

lim_{n→∞} P(Xn = j) = πj.
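The convergence π(n) = π(0)Pⁿ → π can be watched numerically (a sketch, not part of the original notes): for the irreducible aperiodic 2-state chain with p = q = 1/4, an arbitrary choice with 0 < p + q < 2, the stationary distribution is [1/2, 1/2].

```python
# Power iteration pi(n) = pi(n-1) P converging to the stationary distribution
# of an irreducible aperiodic 2-state chain (here p = q = 1/4, so pi = [1/2, 1/2]).
p = q = 0.25
P = [[1 - p, p], [q, 1 - q]]
pi = [1.0, 0.0]  # start deterministically from state 0
for _ in range(200):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]
```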
Problem 10. Find all stationary distributions of the Markov chain determined by thetransition diagram below. Then discuss convergence of distributions.
[Transition diagram: five states 1-5 with the indicated transition probabilities.]
Problem 11. Let {Xn} be the Markov chain introduced in Example 5.2.

(1) Show that if q > 0 and b > 0, the stationary distribution is unique and given by π = [0, 0, 1].

Next, for n = 1, 2, ... let Hn denote the probability of starting from H and terminating at D at the n-th step. Similarly, for n = 1, 2, ... let Sn denote the probability of starting from S and terminating at D at the n-th step.

(2) Show that Hn and Sn satisfy the following linear system:

Hn = aHn−1 + bSn−1,
Sn = pHn−1 + rSn−1,

where n ≥ 2, H1 = 0, S1 = q.

(3) Let H and S denote the life times starting from the states H and S, respectively. Solving the linear system in (2), prove the following identities for the mean life times:

E[H] = (b + p + q)/(bq), E[S] = (b + p)/(bq).
Example (PageRank). The hyperlinks among N websites give rise to a digraph (directed graph) G on N vertices. It is natural to consider a Markov chain on G, defined by the transition matrix P = [pi,j], where

pi,j = 1/deg(i), if i → j and deg(i) ≠ 0,
pi,j = 0, if i does not link to j and i ≠ j,
pi,j = 1, if j = i and deg(i) = 0,

where deg(i) = |{j; i → j}| is the out-degree of i.

There exists a stationary distribution, but it is not necessarily unique. Taking 0 ≤ d ≤ 1 we modify the transition matrix:

Q = [qi,j], qi,j = d pi,j + ε, ε = (1 − d)/N.

If 0 ≤ d < 1, the Markov chain determined by Q necessarily has a unique stationary distribution. Choosing a suitable d < 1, we may understand the stationary distribution π = [πi] as the page rank among the websites. In real applications d should not be close to 0, and d ≈ 0.85 is often taken.
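A minimal sketch of this construction (not in the original notes), on a hypothetical 4-page link graph: build q_{i,j} = d·p_{i,j} + (1 − d)/N and iterate π ← πQ until it stabilizes.

```python
# PageRank sketch on a hypothetical 4-page link graph.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # hypothetical digraph
N, d = 4, 0.85
eps = (1 - d) / N

def q(i, j):
    deg = len(links[i])
    if deg == 0:
        pij = 1.0 if j == i else 0.0   # page with no out-links: stay put
    else:
        pij = 1.0 / deg if j in links[i] else 0.0
    return d * pij + eps

pi = [1.0 / N] * N
for _ in range(100):
    pi = [sum(pi[i] * q(i, j) for i in range(N)) for j in range(N)]
```

In this toy graph, page 2 collects the most incoming links and ends up with the highest rank.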
7. Recurrence
Definition. Let S be a state space and i ∈ S. The first hitting time or first passage time to i is defined by

Ti = inf{n ≥ 1; Xn = i}.

If {n ≥ 1; Xn = i} is an empty set, we define Ti = +∞. A state i ∈ S is called recurrent if P(Ti < +∞|X0 = i) = 1. It is called transient if P(Ti = +∞|X0 = i) > 0.
Theorem. A state i ∈ S is recurrent if and only if

∑_{n=0}^∞ pn(i, i) = ∞.

If a state i ∈ S is transient, we have

∑_{n=0}^∞ pn(i, i) < ∞,

and moreover,

∑_{n=0}^∞ pn(i, i) = 1/(1 − P(Ti < ∞|X0 = i)).
Examples. Random walk on Z¹, Z² and Z³. The following notation will be used: let {an} and {bn} be sequences of positive numbers. We write an ∼ bn if lim_{n→∞} an/bn = 1. In this case, there exist two constant numbers c1 > 0 and c2 > 0 such that c1 an ≤ bn ≤ c2 an. Hence ∑_n an and ∑_n bn converge or diverge at the same time.
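The identity for a transient state can be checked numerically (a sketch, not in the original notes) for the asymmetric random walk on Z with p = 0.7, an arbitrary choice: here p_{2n}(0, 0) = C(2n, n)(pq)ⁿ, and the return probability 1 − |p − q| is the standard value for this walk, quoted here without proof.

```python
from math import comb, isclose

# Check of sum_n p_n(i, i) = 1/(1 - P(T_i < inf | X_0 = i)) for a transient
# chain: asymmetric random walk on Z with p = 0.7, q = 0.3.
# p_{2n}(0, 0) = C(2n, n) (p q)^n; odd-step returns are impossible.
p = 0.7
q = 1 - p
series = sum(comb(2 * n, n) * (p * q) ** n for n in range(500))
return_prob = 1 - abs(p - q)  # standard return probability for this walk
```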
Definition. If a state i ∈ S is recurrent, i.e., P(Ti < ∞|X0 = i) = 1, the mean recurrence time is defined by

E(Ti|X0 = i) = ∑_{n=1}^∞ n P(Ti = n|X0 = i).

The state i is called positive recurrent if E(Ti|X0 = i) < ∞, and null recurrent otherwise.
Theorem. The states in an equivalence class are all positive recurrent, or all null recur-rent, or all transient. In particular, for an irreducible Markov chain, the states are allpositive recurrent, or all null recurrent, or all transient.
Theorem. For an irreducible Markov chain on a finite state space S, every state ispositive recurrent.
Example. The mean recurrence time of the one-dimensional isotropic random walk is infinite, i.e., the one-dimensional isotropic random walk is null recurrent.
Problem 12. Let {Xn} be the 2-state Markov chain described by the transition diagram of Example 5.1 (0 → 1 with probability p, 1 → 0 with probability q), where p > 0 and q > 0. For a state i ∈ S let Ti = inf{n ≥ 1; Xn = i} be the first hitting time to i.

(1) Calculate

P(T0 = 1|X0 = 0), P(T0 = 2|X0 = 0), P(T0 = 3|X0 = 0), P(T0 = 4|X0 = 0).
(2) Find P(T0 = n|X0 = 0) and calculate

∑_{n=1}^∞ P(T0 = n|X0 = 0), ∑_{n=1}^∞ n P(T0 = n|X0 = 0).
8. Bienayme-Galton-Watson Branching Process
Consider a simplified family tree where each individual gives birth to offspring (children) and dies. The number of offspring is random. We are interested in whether the family survives or not.

Let Xn be the number of individuals of the n-th generation. Then {Xn : n = 0, 1, 2, ...} becomes a discrete-time stochastic process. We assume that the number of children born from each individual obeys a common probability distribution and is independent of individuals and of generation. Under this assumption {Xn} becomes a Markov chain.
Let us find the transition probability. Let Y be the number of children born from an individual and set

P(Y = k) = pk, k = 0, 1, 2, ...

The sequence p0, p1, p2, ... describes the distribution of the number of children born from an individual. In fact, what we need is the condition

pk ≥ 0, ∑_{k=0}^∞ pk = 1.
We refer to {p0, p1, ...} as the offspring distribution. Let Y1, Y2, ... be independent identically distributed random variables, of which the distribution is the same as Y. Then, we define the transition probability by

p(i, j) = P(Xn+1 = j|Xn = i) = P(∑_{k=1}^i Yk = j), i ≥ 1, j ≥ 0,

and

p(0, j) = 1 if j = 0, and 0 if j ≥ 1.

The above Markov chain {Xn} over the state space {0, 1, 2, ...} is called the Bienayme-Galton-Watson branching process with offspring distribution {pk : k = 0, 1, 2, ...}. For simplicity we assume that X0 = 1. When p0 + p1 = 1, the family tree is reduced to just a path without branching, so the situation is much simpler (Problem 13).
Let {Xn} be the Bienayme-Galton-Watson branching process with offspring distribution {pk : k = 0, 1, 2, ...}. Let p(i, j) = P(Xn+1 = j|Xn = i) be the transition probability. We assume that X0 = 1. Define the generating function of the offspring distribution by

f(s) = ∑_{k=0}^∞ pk s^k.
The series in the right-hand side converges for |s| ≤ 1. We set

f0(s) = s, f1(s) = f(s), fn(s) = f(fn−1(s)).
Lemma. ∑_{j=0}^∞ p(i, j) s^j = [f(s)]^i, i = 1, 2, ...

Lemma. Let pn(i, j) be the n-step transition probability of the Bienayme-Galton-Watson branching process. Then, we have

∑_{j=0}^∞ pn(i, j) s^j = [fn(s)]^i, i = 1, 2, ...
Theorem. Assume that the mean value of the offspring distribution is finite:

m = ∑_{k=0}^∞ k pk < ∞.

Then, we have

E[Xn] = mⁿ.

In conclusion, as n → ∞ the mean number of individuals in the n-th generation, E(Xn), decreases and converges to 0 if m < 1, diverges to infinity if m > 1, and stays constant if m = 1. This suggests that extinction of the family occurs when m < 1.
The event that {Xn = 0} occurs for some n ≥ 1 means that the family died out at or before the n-th generation. Thus,

q = P(⋃_{n=1}^∞ {Xn = 0}) = lim_{n→∞} P(Xn = 0)

is the probability of extinction of the family. If q = 1, the family almost surely dies out in some generation. If q < 1, the survival probability 1 − q is positive. We are interested in whether q = 1 or not.
Lemma. Let f(s) be the generating function of the offspring distribution. Then we have

q = lim_{n→∞} fn(0).

Therefore, q satisfies the equation

q = f(q).
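The characterization q = lim fn(0) suggests a simple fixed-point iteration (a sketch, not in the original notes). A hypothetical Poisson(2) offspring distribution is used, for which f(s) = exp(λ(s − 1)) and m = λ = 2 > 1, so q < 1 is expected.

```python
import math

# Extinction probability as q = lim f_n(0): fixed-point iteration of the
# offspring generating function, here f(s) = exp(lam * (s - 1)) for a
# hypothetical Poisson(lam) offspring distribution with lam = 2.
lam = 2.0

def f(s):
    return math.exp(lam * (s - 1))

q = 0.0
for _ in range(200):
    q = f(q)  # q increases monotonically to the smallest fixed point
```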
Assume that the offspring distribution satisfies the condition p0 + p1 < 1.
Lemma. The generating function f(s) satisfies the following properties.
(1) f(s) is increasing, i.e., f(s1) ≤ f(s2) for 0 ≤ s1 ≤ s2 ≤ 1.
(2) f(s) is strictly convex, i.e., if 0 ≤ s1 < s2 ≤ 1 and 0 < θ < 1 we have

f(θs1 + (1 − θ)s2) < θf(s1) + (1 − θ)f(s2).
Lemma.
(1) If m ≤ 1, we have f(s) > s for 0 ≤ s < 1.
(2) If m > 1, there exists a unique s such that 0 ≤ s < 1 and f(s) = s.
Theorem. The extinction probability q of the Bienayme-Galton-Watson branching pro-cess as above coincides with the smallest s such that s = f(s), 0 ≤ s ≤ 1. Moreover, ifm ≤ 1 we have q = 1, and if m > 1 we have q < 1.
The Bienayme-Galton-Watson branching process is called subcritical, critical and supercritical if m < 1, m = 1 and m > 1, respectively. The survival is determined only by the mean value m of the offspring distribution. The situation changes dramatically at m = 1 and, following the terminology of statistical physics, we call it a phase transition.
Problem 13 (One-child policy). Consider the Bienayme-Galton-Watson branching process with offspring distribution satisfying p0 + p1 = 1. Calculate the probabilities

q1 = P(X1 = 0), q2 = P(X1 ≠ 0, X2 = 0), ..., qn = P(X1 ≠ 0, ..., Xn−1 ≠ 0, Xn = 0), ...

and find the extinction probability

P(Xn = 0 occurs for some n ≥ 1).
Problem 14. Let b, p be constant numbers such that b > 0, 0 < p < 1 and b + p < 1. Suppose that the offspring distribution is given by

pk = b p^{k−1}, k = 1, 2, ..., p0 = 1 − ∑_{k=1}^∞ pk.

(1) Find the generating function f(s) of the offspring distribution.
(2) Set m = 1 and find fn(s).