Statistical Inference, Lecture 5: Properties of a Random Sample
MING GAO
DaSE @ ECNU (for course-related communications)
Apr. 14, 2020
Outline
1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's t and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 2 / 101
Random vector

Definition
The r.v.s X1, · · · , Xn are called a random sample of size n from the population f(x) if X1, · · · , Xn are mutually independent r.v.s and the marginal pdf or pmf of each Xi is the same function f(x).

That is, X1, · · · , Xn are independent and identically distributed (abbreviated i.i.d.) r.v.s with pdf or pmf f(x).

From the definition, the joint pdf or pmf of X1, · · · , Xn is given by

f(x1, · · · , xn) = ∏_{i=1}^n f(xi).

In particular, if the population pdf or pmf is a member of a parametric family with pdf or pmf f(x|θ), then the joint pdf or pmf is

f(x1, · · · , xn|θ) = ∏_{i=1}^n f(xi|θ).
Sample pdf: exponential

Let X1, · · · , Xn be a random sample from an exponential(β) population. The joint pdf of the sample is

f(x1, · · · , xn|β) = ∏_{i=1}^n f(xi|β) = (1/β^n) e^{−(x1+···+xn)/β}.

With replacement sampling

Suppose a value is chosen from the population in such a way that each of the N values is equally likely (prob. = 1/N) to be chosen. The value chosen at any stage is "replaced" in the population and is available for choice again at the next stage.
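As a quick numerical sanity check (not part of the original slides; assuming NumPy is available), the product of the marginal exponential pdfs can be compared against the closed form above:

```python
import numpy as np

# Minimal sketch: for an exponential(beta) sample, the product of the
# marginal pdfs equals (1/beta^n) * exp(-(x1+...+xn)/beta).
rng = np.random.default_rng(0)
beta = 2.0
x = rng.exponential(beta, size=5)            # a sample x1, ..., x5

marginal = (1.0 / beta) * np.exp(-x / beta)  # f(xi | beta) for each i
joint_as_product = np.prod(marginal)
joint_closed_form = beta ** (-len(x)) * np.exp(-x.sum() / beta)

assert np.isclose(joint_as_product, joint_closed_form)
```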
Without replacement sampling

The first value drawn is recorded as X1 = y, where y ∈ {x1, x2, · · · , xN}. A second value is then chosen from the remaining N − 1 values, so

P(X2 = x) = ∑_{i=1}^N P(X2 = x|X1 = xi) P(X1 = xi) = (N − 1) · (1/(N − 1)) · (1/N) = 1/N.

Similar arguments can be used to show that each of the Xi has the same marginal distribution, although the Xi are not independent.

The nonzero probability in the conditional distribution of Xi given X1, · · · , Xi−1 is 1/(N − i + 1), which is close to 1/N if i ≤ n is small compared with N.
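A small Monte Carlo sketch (assuming NumPy; the population values are illustrative) of the marginal claim, P(X2 = x) ≈ 1/N even though X1 and X2 are dependent:

```python
import numpy as np

# Draw twice WITHOUT replacement from N = 10 values, many times, and
# check that the second draw is still uniform over the population.
rng = np.random.default_rng(1)
population = np.arange(10)
N, trials = len(population), 20_000

second = np.array([rng.choice(population, size=2, replace=False)[1]
                   for _ in range(trials)])
freq = np.bincount(second, minlength=N) / trials

assert np.allclose(freq, 1.0 / N, atol=0.01)  # each value has prob. ≈ 1/N
```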
Sampling distribution

Definition
Let X1, · · · , Xn be a random sample of size n from a population, and let T(X1, · · · , Xn) be a real-valued or vector-valued function. Then the r.v. or random vector Y = T(X1, · · · , Xn) is called a statistic. The pdf or pmf of the statistic Y is called the sampling distribution of Y.

Sample mean and sample variance

The sample mean and sample variance are the statistics defined by

X̄ = (X1 + · · · + Xn)/n = (1/n) ∑_{i=1}^n Xi,
S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².

The sample standard deviation is the statistic defined by S = √S².
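The two statistics above computed directly (assuming NumPy; the data vector is illustrative). Note that `ddof=1` is what gives the n − 1 denominator in S²:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

xbar = x.sum() / n                         # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)     # sample variance, n-1 denominator

assert np.isclose(xbar, np.mean(x))
assert np.isclose(s2, np.var(x, ddof=1))
assert np.isclose(np.sqrt(s2), np.std(x, ddof=1))  # sample std S
```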
Theorem

Theorem
Let x1, · · · , xn be any numbers and x̄ = (1/n) ∑_{i=1}^n xi. Then

a. min_a ∑_{i=1}^n (xi − a)² = ∑_{i=1}^n (xi − x̄)²;
b. (n − 1)s² = ∑_{i=1}^n (xi − x̄)² = ∑_{i=1}^n xi² − n x̄².

Proof.

∑_{i=1}^n (xi − a)² = ∑_{i=1}^n (xi − x̄ + x̄ − a)²
= ∑_{i=1}^n (xi − x̄)² + 2 ∑_{i=1}^n (xi − x̄)(x̄ − a) + ∑_{i=1}^n (x̄ − a)²
= ∑_{i=1}^n (xi − x̄)² + n(x̄ − a)²,

since ∑_{i=1}^n (xi − x̄) = 0 makes the cross term vanish. The sum is therefore minimized at a = x̄, proving part (a); expanding ∑_{i=1}^n (xi − x̄)² gives part (b).
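Both identities can be checked numerically (assuming NumPy; the numbers are illustrative):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 8.0])
n, xbar = len(x), x.mean()

def sse(a):
    # sum of squared deviations about a
    return ((x - a) ** 2).sum()

# Part (a): a = xbar minimizes the sum of squares (checked on a grid).
grid = np.linspace(xbar - 5, xbar + 5, 1001)
assert all(sse(a) >= sse(xbar) - 1e-12 for a in grid)

# Part (b): (n-1)s^2 = sum(xi^2) - n*xbar^2.
s2 = x.var(ddof=1)
assert np.isclose((n - 1) * s2, (x ** 2).sum() - n * xbar ** 2)
```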
Lemma

Lemma
Let X1, · · · , Xn be a random sample from a population and let g(x) be a function s.t. E(g(X1)) and Var(g(X1)) exist. Then

a. E(∑_{i=1}^n g(Xi)) = n · E(g(X1));
b. Var(∑_{i=1}^n g(Xi)) = n · Var(g(X1)).

Note that, for i ≠ j,

E[(g(Xi) − E(g(Xi)))(g(Xj) − E(g(Xj)))] = Cov(g(Xi), g(Xj)) = 0,

so the cross terms in the variance of the sum vanish.
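A Monte Carlo sketch of the lemma (assuming NumPy), with the illustrative choice g(x) = x², for which g(Xi) is chi-squared(1) when Xi ~ N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(reps, n))  # i.i.d. N(0,1) samples
g = samples ** 2                                # g(Xi) = Xi^2 for every draw

sums = g.sum(axis=1)                            # sum_i g(Xi), per replicate
# Part (a): E(sum g(Xi)) = n * E(g(X1)); part (b): same for variances,
# because the cross-covariances vanish by independence.
assert np.isclose(sums.mean(), n * g[:, 0].mean(), rtol=0.05)
assert np.isclose(sums.var(), n * g[:, 0].var(), rtol=0.05)
```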
Theorem

Theorem
Let X1, · · · , Xn be a random sample from a population with mean µ and variance σ² < ∞. Then

a. E(X̄) = µ;
b. Var(X̄) = σ²/n;
c. E(S²) = σ².

Proof.

E(S²) = E( (1/(n − 1)) [∑_{i=1}^n Xi² − n X̄²] ) = (1/(n − 1)) (n E(X1²) − n E(X̄²))
= (1/(n − 1)) ( n(σ² + µ²) − n(σ²/n + µ²) ) = σ².
Unbiased statistics

Definition
If the following holds:

E[T(X1, X2, · · · , Xn)] = θ,

then the statistic T(X1, X2, · · · , Xn) is an unbiased estimator of the parameter θ. Otherwise, T(X1, X2, · · · , Xn) is a biased estimator of θ.

The statistic X̄ is an unbiased estimator of µ, and S² is an unbiased estimator of σ².

The use of n − 1 in the definition of S² may have seemed unintuitive. If S² were defined as the usual average of the squared deviations, with n rather than n − 1 in the denominator, then E(S²) would be ((n − 1)/n) σ² and S² would not be an unbiased estimator of σ².
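A simulation sketch of the bias claim (assuming NumPy): averaging the n-divisor estimator over many samples lands near ((n − 1)/n)σ², while the n − 1 divisor lands near σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma2 = 4, 200_000, 9.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

biased = x.var(axis=1, ddof=0).mean()     # divisor n: biased downward
unbiased = x.var(axis=1, ddof=1).mean()   # divisor n-1: unbiased

assert np.isclose(unbiased, sigma2, rtol=0.02)
assert np.isclose(biased, (n - 1) / n * sigma2, rtol=0.02)
```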
Theorem

Let X1, · · · , Xn be a random sample from a population with mgf M_X(t). Then the mgf of the sample mean is

M_X̄(t) = [M_X(t/n)]^n.

Example
Let X1, · · · , Xn be a random sample from a N(µ, σ²) population. Then the mgf of the sample mean is

M_X̄(t) = [exp( tµ/n + σ²(t/n)²/2 )]^n
= exp[ n( tµ/n + σ²(t/n)²/2 ) ]
= exp[ tµ + (σ²/n)t²/2 ].

Thus, X̄ has a N(µ, σ²/n) distribution.
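A simulation sketch of the conclusion (assuming NumPy; µ, σ, n are illustrative): the sample mean of n normal draws has mean µ and variance σ²/n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 3.0, 2.0, 16, 200_000
# Each row is one sample of size n; average across rows gives Xbar.
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

assert np.isclose(xbar.mean(), mu, atol=0.01)          # E(Xbar) = mu
assert np.isclose(xbar.var(), sigma ** 2 / n, rtol=0.02)  # Var = sigma^2/n
```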
Convolution theorem

If X and Y are independent continuous r.v.s with pdfs f_X(x) and f_Y(y), then the pdf of Z = X + Y is

f_Z(z) = ∫_{−∞}^{∞} f_X(w) f_Y(z − w) dw.

Proof.
Let W = X. The Jacobian of the transformation from (X, Y) to (Z, W) is 1. Thus, we obtain the joint pdf of (Z, W) as

f_{Z,W}(z, w) = f_{X,Y}(w, z − w) = f_X(w) f_Y(z − w).

Integrating out w, we obtain the marginal pdf of Z as given in the conclusion.
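A numerical sketch of the convolution formula (assuming NumPy): for two independent N(0, 1) r.v.s the integral should reproduce the N(0, 2) density, checked here at a single illustrative point z:

```python
import numpy as np

def phi(t):
    # standard normal pdf
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

w = np.linspace(-10.0, 10.0, 200_001)
dw = w[1] - w[0]
z = 1.3

# Riemann sum approximating  f_Z(z) = integral of f_X(w) f_Y(z - w) dw
f_z = (phi(w) * phi(z - w)).sum() * dw
# N(0, 2) density at z, the known distribution of a sum of two N(0, 1)s
target = np.exp(-z ** 2 / 4) / np.sqrt(4 * np.pi)

assert np.isclose(f_z, target, atol=1e-6)
```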
Sum of Cauchy random variables

Let U and V be independent Cauchy r.v.s, where U ∼ Cauchy(0, σ) and V ∼ Cauchy(0, τ); that is,

f_U(u) = (1/(πσ)) · 1/(1 + (u/σ)²), and f_V(v) = (1/(πτ)) · 1/(1 + (v/τ)²).

Based on the convolution theorem, the pdf of Z = U + V is given by

f_Z(z) = ∫_{−∞}^{∞} (1/(πσ)) (1/(1 + (w/σ)²)) · (1/(πτ)) (1/(1 + ((z − w)/τ)²)) dw
= (1/(π(σ + τ))) · 1/(1 + (z/(σ + τ))²).

The sum of two independent Cauchy r.v.s is again a Cauchy, with the scale parameters adding.

If Z1, · · · , Zn are i.i.d. Cauchy(0, 1) r.v.s, then ∑_{i=1}^n Zi is Cauchy(0, n) and Z̄ is Cauchy(0, 1).
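A simulation sketch of the last bullet (assuming NumPy): the sample mean of i.i.d. Cauchy(0, 1) draws does not concentrate; it keeps the spread of a single draw. The interquartile range of Cauchy(0, 1) is 2 (quartiles at ±1), and the sample mean's IQR should match:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 50_000
# Each row is a sample of n Cauchy(0,1) draws; take its mean.
zbar = rng.standard_cauchy(size=(reps, n)).mean(axis=1)

q1, q3 = np.quantile(zbar, [0.25, 0.75])
# Same spread as one draw: Zbar is Cauchy(0, 1), so IQR ≈ 2.
assert np.isclose(q3 - q1, 2.0, atol=0.05)
```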
Sampling from the Normal Distribution: Properties of the Sample Mean and Variance
Theorem

Let X1, · · · , Xn be a random sample from a N(µ, σ²) distribution, and let X̄ = (1/n) ∑_{i=1}^n Xi and S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)². Then

a. X̄ and S² are independent r.v.s;
b. X̄ ∼ N(µ, σ²/n);
c. (n − 1)S²/σ² has a χ²(n − 1) distribution.

Proof.
Part (b) has been established. Next, we prove Parts (a) and (c). Without loss of generality, assume that µ = 0 and σ = 1.
For Part (a), we show that X̄ and S² are functions of independent random vectors.
After introducing a lemma, we prove Part (c).
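A simulation sketch of the theorem (assuming NumPy; parameters are illustrative): X̄ and S² are uncorrelated (they are in fact independent), and (n − 1)S²/σ² has mean n − 1 and variance 2(n − 1), matching a χ²(n − 1) distribution:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 5.0, 2.0, 8, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)
q = (n - 1) * s2 / sigma ** 2          # should be chi-squared(n-1)

assert abs(np.corrcoef(xbar, s2)[0, 1]) < 0.01   # part (a), via correlation
assert np.isclose(q.mean(), n - 1, rtol=0.02)    # E chi2_{n-1} = n-1
assert np.isclose(q.var(), 2 * (n - 1), rtol=0.05)  # Var = 2(n-1)
```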
Part (a) proof Cont'd

Note that we can write S² as a function of the n − 1 deviations X2 − X̄, · · · , Xn − X̄:

S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²
= (1/(n − 1)) ( (X1 − X̄)² + ∑_{i=2}^n (Xi − X̄)² )
= (1/(n − 1)) ( [∑_{i=2}^n (Xi − X̄)]² + ∑_{i=2}^n (Xi − X̄)² ),

where the last step uses X1 − X̄ = −∑_{i=2}^n (Xi − X̄), since the deviations sum to 0. Thus, S² can be written as a function only of (X2 − X̄, · · · , Xn − X̄). We now show that these r.v.s are independent of X̄. The joint pdf of the sample X1, · · · , Xn (with µ = 0, σ = 1) is given by

f(x1, · · · , xn) = (1/(2π)^{n/2}) e^{−(1/2) ∑_{i=1}^n xi²}.
Part (a) proof Cont'd

Make the transformation

y1 = x̄,        x1 = y1 − ∑_{i=2}^n yi,
y2 = x2 − x̄,   x2 = y1 + y2,
· · ·
yn = xn − x̄,   xn = y1 + yn.

This is a linear transformation; the Jacobian of (x1, · · · , xn) with respect to (y1, · · · , yn) is n. We have

f(y1, · · · , yn) = (n/(2π)^{n/2}) e^{−(1/2)(y1 − ∑_{i=2}^n yi)²} e^{−(1/2) ∑_{i=2}^n (y1 + yi)²}
= [ (n/(2π))^{1/2} e^{−n y1²/2} ] [ (n^{1/2}/(2π)^{(n−1)/2}) e^{−(1/2)[∑_{i=2}^n yi² + (∑_{i=2}^n yi)²]} ].

Since the joint pdf of Y1, · · · , Yn factors, Y1 is independent of (Y2, · · · , Yn). That is, X̄ and S² are independent r.v.s.
Facts about chi-squared r.v.s

We use the notation χ²_p to denote a chi-squared r.v. with p degrees of freedom.

a. If Z is a N(0, 1) r.v., then Z² ∼ χ²_1; that is, the square of a standard normal r.v. is a chi-squared r.v.
b. If X1, · · · , Xn are independent and Xi ∼ χ²_{pi}, then X1 + · · · + Xn ∼ χ²_{p1+···+pn}; that is, independent chi-squared variables add to a chi-squared variable, and the degrees of freedom also add.

Proof.
Part (a) was established in Chapter 2. Since a χ²_p r.v. is a gamma(p/2, 2) r.v., part (b) follows from the earlier example on sums of independent gamma r.v.s.
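Both facts can be illustrated by simulation (assuming NumPy): Z² for standard normal Z has the χ²_1 moments (mean 1, variance 2), and a sum of independent chi-squared variables has the moments of a chi-squared with the degrees of freedom added:

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 200_000

# Fact (a): Z^2 behaves like chi-squared(1).
z2 = rng.normal(size=reps) ** 2
assert np.isclose(z2.mean(), 1.0, rtol=0.03)   # E chi2_1 = 1
assert np.isclose(z2.var(), 2.0, rtol=0.05)    # Var chi2_1 = 2

# Fact (b): chi2_3 + chi2_4 behaves like chi2_7 (df add: 3 + 4 = 7).
total = rng.chisquare(3, reps) + rng.chisquare(4, reps)
assert np.isclose(total.mean(), 7.0, rtol=0.02)   # E chi2_7 = 7
assert np.isclose(total.var(), 14.0, rtol=0.05)   # Var chi2_7 = 14
```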
Part (c) proof

We employ an induction argument to establish the distribution of S², using the notation X̄_k and S²_k to denote the sample mean and variance based on the first k observations.

It is straightforward to establish that

(n − 1)S²_n = (n − 2)S²_{n−1} + ((n − 1)/n)(Xn − X̄_{n−1})².

Now consider n = 2: we have S²_2 = (1/2)(X2 − X1)². Since the distribution of (X2 − X1)/√2 is N(0, 1), fact (a) above shows that S²_2 ∼ χ²_1.

Proceeding with the induction, we assume that for n = k, (k − 1)S²_k ∼ χ²_{k−1}.

For n = k + 1, we have the following form:

k S²_{k+1} = (k − 1)S²_k + (k/(k + 1))(X_{k+1} − X̄_k)².
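The recursion driving the induction can be checked numerically (assuming NumPy; the data vector is illustrative):

```python
import numpy as np

# Check: (n-1)*S_n^2 = (n-2)*S_{n-1}^2 + ((n-1)/n)*(X_n - Xbar_{n-1})^2.
x = np.array([2.0, 7.0, 1.0, 5.0, 9.0])
n = len(x)

s2_n = x.var(ddof=1)          # sample variance of all n observations
s2_prev = x[:-1].var(ddof=1)  # sample variance of the first n-1
xbar_prev = x[:-1].mean()     # sample mean of the first n-1

lhs = (n - 1) * s2_n
rhs = (n - 2) * s2_prev + (n - 1) / n * (x[-1] - xbar_prev) ** 2
assert np.isclose(lhs, rhs)
```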
Part (c) proof Cont'd

According to the induction hypothesis, if \frac{k}{k+1}(X_{k+1} - \bar{X}_k)^2 \sim \chi_1^2 and is independent of S_k^2, it will follow from Part (b) that kS_{k+1}^2 \sim \chi_k^2.

The vector (X_{k+1}, \bar{X}_k) is independent of S_k^2. Furthermore, \sqrt{k/(k+1)}(X_{k+1} - \bar{X}_k) \sim N(0, 1), since (taking \sigma^2 = 1 without loss of generality)

Var(X_{k+1} - \bar{X}_k) = \frac{k+1}{k}.

Therefore, \frac{k}{k+1}(X_{k+1} - \bar{X}_k)^2 \sim \chi_1^2.

Based on the following lemma, we will give a second way to establish the theorem.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 20 / 101
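The variance computation above can be checked by simulation (an illustration, not from the slides; k, the seed, and the sample size are arbitrary choices):

```python
import numpy as np

# Monte Carlo check that Var(X_{k+1} - Xbar_k) = (k+1)/k for i.i.d. N(0,1)
# draws (sigma^2 = 1 without loss of generality, as in the proof).
rng = np.random.default_rng(1)
k = 4
x = rng.normal(size=(200_000, k + 1))
d = x[:, k] - x[:, :k].mean(axis=1)   # X_{k+1} - Xbar_k, one value per replicate
emp_var = d.var()                     # should be close to (k+1)/k = 1.25
```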
Lemma

Let X_j \sim N(\mu_j, \sigma_j^2), j = 1, \cdots, n, be independent. For constants a_{ij} and b_{rj} (j = 1, \cdots, n; i = 1, \cdots, k; r = 1, \cdots, m), where k + m \le n, define

U_i = \sum_{j=1}^n a_{ij}X_j, and V_r = \sum_{j=1}^n b_{rj}X_j.

The r.v.s U_i and V_r are independent if and only if Cov(U_i, V_r) = 0. Furthermore, Cov(U_i, V_r) = \sum_{j=1}^n a_{ij}b_{rj}\sigma_j^2.

The random vectors (U_1, \cdots, U_k) and (V_1, \cdots, V_m) are independent if and only if U_i is independent of V_r for all pairs i, r (i = 1, \cdots, k; r = 1, \cdots, m).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 21 / 101
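The covariance formula in the lemma can be checked by simulation (an illustration, not from the slides; the coefficients, variances, and seed are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of Cov(U, V) = sum_j a_j b_j sigma_j^2 for U = sum a_j X_j
# and V = sum b_j X_j built from independent normals X_j.
rng = np.random.default_rng(2)
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5, 1.0, -1.0])
sigma2 = np.array([1.0, 4.0, 9.0])

x = rng.normal(0.0, np.sqrt(sigma2), size=(500_000, 3))
u, v = x @ a, x @ b
emp_cov = np.cov(u, v)[0, 1]
theory = float(np.sum(a * b * sigma2))   # 0.5 - 8.0 - 4.5 = -12.0
```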
Prove the theorem

We can use the above lemma to provide an alternative proof of the independence of \bar{X} and S^2 in normal sampling.

Since we can write S^2 as a function of the n - 1 deviations (X_2 - \bar{X}, \cdots, X_n - \bar{X}), we must show that these r.v.s are uncorrelated with \bar{X}. As an illustration of the application of the lemma, we write

\bar{X} = \sum_{i=1}^n \frac{1}{n} X_i,

X_j - \bar{X} = \sum_{i=1}^n \big(\delta_{ij} - \frac{1}{n}\big) X_i,

where \delta_{ij} = 1 if i = j and \delta_{ij} = 0 otherwise.

It is then easy to show that \bar{X} and X_j - \bar{X} are independent.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 22 / 101
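In coefficient form the check is a one-line computation (an illustration, assuming equal variances; n and j are arbitrary choices):

```python
import numpy as np

# The lemma reduces independence of Xbar and X_j - Xbar to a coefficient
# computation: with a_i = 1/n and b_i = delta_ij - 1/n (equal variances),
# Cov(Xbar, X_j - Xbar) is proportional to sum_i a_i * b_i, which is zero.
n, j = 6, 2
a = np.full(n, 1.0 / n)              # coefficients defining Xbar
b = -np.full(n, 1.0 / n)
b[j] += 1.0                          # coefficients defining X_j - Xbar
cov_coeff = float(np.sum(a * b))     # = 1/n - n*(1/n^2) = 0 for any j
```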
Sampling from the Normal Distribution / The Derived Distributions: Student's T and Snedecor's F

Outline

1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Student's T distribution

If X_1, \cdots, X_n are a random sample from a N(\mu, \sigma^2), we know that \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1). Most of the time, however, \sigma is unknown. We can use S in place of \sigma. In this case,

\frac{\bar{X}-\mu}{S/\sqrt{n}} = \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} \doteq \frac{N(0,1)}{\sqrt{\chi_{n-1}^2/(n-1)}}.

Definition

Let X_1, \cdots, X_n be a random sample from a N(\mu, \sigma^2). The quantity \frac{\bar{X}-\mu}{S/\sqrt{n}} has Student's t distribution with p = n - 1 degrees of freedom, with pdf

f_T(t) = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2}} \Big[\frac{1}{1 + t^2/p}\Big]^{(p+1)/2}, i.e., T \sim t_p.
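The definition can be checked by simulation (an illustration, not from the slides; n, \mu, \sigma, the seed, and the use of scipy are my own choices):

```python
import numpy as np
from scipy import stats

# Simulation: the studentized mean (Xbar - mu)/(S/sqrt(n)) computed from
# normal samples should match Student's t with n-1 degrees of freedom.
rng = np.random.default_rng(3)
n, mu, sigma = 5, 10.0, 2.0
x = rng.normal(mu, sigma, size=(100_000, n))
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))
ks = stats.kstest(t_stat, stats.t(df=n - 1).cdf).statistic  # small if they agree
```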
Student's T distribution Cont'd

Notice that if p = 1, then t_p becomes the pdf of the Cauchy distribution, which occurs for samples of size 2. If we start with U and V defined above (U \sim N(0, 1) and V \sim \chi_p^2, independent), their joint pdf is

f_{U,V}(u, v) = \frac{1}{(2\pi)^{1/2}} e^{-u^2/2} \frac{1}{\Gamma(p/2)2^{p/2}} v^{(p/2)-1} e^{-v/2}.

Now make the transformation

t = \frac{u}{\sqrt{v/p}}, and w = v.

The Jacobian of the transformation is (w/p)^{1/2}, and the marginal pdf of T is given by

f_T(t) = \int_{-\infty}^{\infty} f_{U,V}(t(w/p)^{1/2}, w)(w/p)^{1/2} dw = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2}} \Big[\frac{1}{1 + t^2/p}\Big]^{(p+1)/2},

i.e., T \sim t_p.
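The derived density can be checked numerically (an illustration, not from the slides; p = 5 and the scipy comparison are my own choices):

```python
import numpy as np
from math import gamma, pi, sqrt
from scipy import stats, integrate

# Check that the derived density integrates to 1 and agrees with scipy's
# t pdf, here for p = 5 degrees of freedom.
p = 5

def f_t(t):
    const = gamma((p + 1) / 2) / (gamma(p / 2) * sqrt(p * pi))
    return const * (1.0 + t * t / p) ** (-(p + 1) / 2)

total, _ = integrate.quad(f_t, -np.inf, np.inf)
grid = np.linspace(-4.0, 4.0, 81)
max_gap = float(np.max(np.abs([f_t(t) - stats.t.pdf(t, df=p) for t in grid])))
```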
Student's T distribution Cont'd

Student's t has no mgf because it does not have moments of all orders. In fact, if there are p degrees of freedom, then there are only p - 1 moments; e.g., a t_1 has no mean and a t_2 has no variance. Note that T_1 \sim t_1 is a Cauchy r.v., so its mean does not exist. Let T_p be a r.v. with a t_p distribution. Then

E(T_p) = 0, if p > 1,

Var(T_p) = \frac{p}{p-2}, if p > 2.
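These moment formulas match what scipy reports (an illustration; the choice p = 10 is arbitrary):

```python
from scipy import stats

# The stated moments: E(T_p) = 0 for p > 1 and Var(T_p) = p/(p-2) for p > 2.
m10 = stats.t(df=10).mean()   # 0
v10 = stats.t(df=10).var()    # 10/(10-2) = 1.25
```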
F distribution

Let X_1, \cdots, X_n be a random sample from a N(\mu_X, \sigma_X^2), and let Y_1, \cdots, Y_m be a random sample from a N(\mu_Y, \sigma_Y^2).

If we were interested in comparing the variability of the populations, one quantity of interest would be the ratio \sigma_X^2/\sigma_Y^2. Information about this ratio is contained in S_X^2/S_Y^2, the ratio of sample variances.

Definition

Let X_1, \cdots, X_n be a random sample from a N(\mu_X, \sigma_X^2), and let Y_1, \cdots, Y_m be a random sample from a N(\mu_Y, \sigma_Y^2). The random variable F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} has Snedecor's F distribution with p = n - 1 and q = m - 1 degrees of freedom, with pdf

f_F(x) = \frac{\Gamma(\frac{p+q}{2})}{\Gamma(p/2)\Gamma(q/2)} \Big(\frac{p}{q}\Big)^{p/2} \frac{x^{(p/2)-1}}{(1 + (p/q)x)^{(p+q)/2}}, i.e., F \sim F_{n-1,m-1}.
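The definition can be checked by simulation (an illustration, not from the slides; the sample sizes, variances, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Simulation: (S_X^2/sigma_X^2)/(S_Y^2/sigma_Y^2) from two normal samples
# should follow Snedecor's F with (n-1, m-1) degrees of freedom.
rng = np.random.default_rng(4)
n, m = 8, 12
x = rng.normal(0.0, 3.0, size=(50_000, n))    # sigma_X^2 = 9
y = rng.normal(5.0, 0.5, size=(50_000, m))    # sigma_Y^2 = 0.25
f_stat = (x.var(axis=1, ddof=1) / 9.0) / (y.var(axis=1, ddof=1) / 0.25)
ks = stats.kstest(f_stat, stats.f(dfn=n - 1, dfd=m - 1).cdf).statistic
```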
F distribution Cont'd

Properties

A variance ratio may have an F distribution even if the parent populations are not normal; it is enough that their pdfs are symmetric functions.

The F distribution can be derived as the distribution of \frac{U/p}{V/q}, where U and V are independent, U \sim \chi_p^2 and V \sim \chi_q^2.

If X \sim F_{p,q}, then \frac{(p/q)X}{1+(p/q)X} \sim Beta(p/2, q/2).

If X \sim F_{p,q}, then \frac{1}{X} \sim F_{q,p}.

If X \sim t_q, then X^2 \sim F_{1,q}.
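Two of these properties can be verified numerically via a change of variables on the pdf (an illustration, not from the slides; p, q, and the grid are arbitrary choices):

```python
import numpy as np
from scipy import stats

# (i) if X ~ F(p, q) then 1/X ~ F(q, p): density of 1/X at y is f_X(1/y)/y^2;
# (ii) if T ~ t_q then T^2 ~ F(1, q): density of T^2 at x is t-pdf(sqrt(x))/sqrt(x).
p, q = 4, 7
xs = np.linspace(0.1, 5.0, 50)

gap_recip = np.max(np.abs(stats.f.pdf(1.0 / xs, p, q) / xs**2 - stats.f.pdf(xs, q, p)))
gap_square = np.max(np.abs(stats.t.pdf(np.sqrt(xs), q) / np.sqrt(xs) - stats.f.pdf(xs, 1, q)))
```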
Order Statistics

Order statistics

The order statistics of a random sample X_1, \cdots, X_n are the sample values placed in ascending order. They are denoted by X_{(1)}, \cdots, X_{(n)}, and satisfy X_{(1)} \le \cdots \le X_{(n)}:

X_{(1)} = \min_{1 \le i \le n} X_i,
X_{(2)} = second smallest X_i,
\cdots,
X_{(n)} = \max_{1 \le i \le n} X_i.

Sample range: R = X_{(n)} - X_{(1)}.

Sample median: M = X_{((n+1)/2)} if n is odd, and M = \frac{1}{2}(X_{(n/2)} + X_{(n/2+1)}) if n is even.
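These definitions correspond to a simple sort (an illustration, not from the slides; the sample values are arbitrary):

```python
import numpy as np

# Order statistics, sample range, and sample median via sorting (even n here,
# so the median averages the two middle order statistics).
x = np.array([3.1, 0.7, 2.2, 5.0, 1.4, 2.9])
x_ord = np.sort(x)                    # X_(1) <= ... <= X_(n)
sample_range = x_ord[-1] - x_ord[0]   # 5.0 - 0.7
median = np.median(x)                 # (X_(3) + X_(4)) / 2
```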
Distribution of order statistics (discrete case)

Theorem

Let X_1, X_2, \cdots, X_n be a random sample from a discrete distribution with pmf f_X(x_i) = p_i, where x_1 < x_2 < \cdots are the possible values of X in ascending order. Define

P_0 = 0, P_1 = p_1, \cdots, P_i = p_1 + p_2 + \cdots + p_i.

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics from the sample. Then

P(X_{(j)} \le x_i) = \sum_{k=j}^n \binom{n}{k} P_i^k (1 - P_i)^{n-k}

and

P(X_{(j)} = x_i) = \sum_{k=j}^n \binom{n}{k} [P_i^k (1 - P_i)^{n-k} - P_{i-1}^k (1 - P_{i-1})^{n-k}].
Proof of the theorem

Fix i and let Y be a r.v. that counts the number of X_1, X_2, \cdots, X_n that are less than or equal to x_i. For each of X_1, X_2, \cdots, X_n, call the event {X_j \le x_i} a "success" and {X_j > x_i} a "failure". Then Y is the number of successes in n trials; thus, Y \sim Binomial(n, P_i).

The event {X_{(j)} \le x_i} is equivalent to the event {Y \ge j}; that is, at least j of the sample values are less than or equal to x_i, i.e.,

P(X_{(j)} \le x_i) = P(Y \ge j).

For the other case,

P(X_{(j)} = x_i) = P(X_{(j)} \le x_i) - P(X_{(j)} \le x_{i-1}).
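The theorem can be checked by simulation for a small discrete distribution (an illustration, not from the slides; the support, probabilities, n, j, i, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Monte Carlo check of P(X_(j) <= x_i) = P(Y >= j) with Y ~ Binomial(n, P_i),
# for a distribution on {1, 2, 3} and x_i = 2.
rng = np.random.default_rng(5)
vals = np.array([1, 2, 3])
probs = np.array([0.2, 0.5, 0.3])
n, j = 5, 3
Pi = probs[:2].sum()                           # P_i = P(X <= 2) = 0.7

theory = stats.binom.sf(j - 1, n, Pi)          # P(Y >= j)
samples = rng.choice(vals, p=probs, size=(100_000, n))
emp = np.mean(np.sort(samples, axis=1)[:, j - 1] <= 2)
```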
Distribution of order statistics (continuous case)

Theorem

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics of a random sample, X_1, X_2, \cdots, X_n, from a continuous population with cdf F_X(x) and pdf f_X(x). Then the pdf of X_{(j)} is

f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}.

Proof.

We first find the cdf of X_{(j)} and then differentiate it to obtain the pdf. Let Y be a r.v. that counts the number of X_1, X_2, \cdots, X_n that are less than or equal to x. Thus, Y \sim Binomial(n, F_X(x)) (note that we can write P_i = F_X(x_i) in the above theorem). The event {X_{(j)} \le x} is equivalent to the event {Y \ge j}; that is, at least j of the sample values are less than or equal to x,
Proof Cont'd

Proof.

i.e.,

F_{X_{(j)}}(x) = P(Y \ge j) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1 - F_X(x)]^{n-k},

and the pdf of X_{(j)} is

f_{X_{(j)}}(x) = \frac{d}{dx} F_{X_{(j)}}(x)
= \sum_{k=j}^n \binom{n}{k} \big( k [F_X(x)]^{k-1} [1 - F_X(x)]^{n-k} f_X(x) - (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x) \big)
Proof Cont'd

f_{X_{(j)}}(x) = j \binom{n}{j} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}
+ \sum_{k=j+1}^n \binom{n}{k} k [F_X(x)]^{k-1} [1 - F_X(x)]^{n-k} f_X(x)
- \sum_{k=j}^{n-1} \binom{n}{k} (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x)

= \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}
+ \sum_{k=j}^{n-1} \binom{n}{k+1} (k+1) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x)
- \sum_{k=j}^{n-1} \binom{n}{k} (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x).

Since \binom{n}{k+1}(k+1) = \binom{n}{k}(n-k), the last two sums cancel, leaving the claimed pdf.
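As a numeric check of the resulting pdf (an illustration, not from the slides; the standard normal population and the choices n = 6, j = 2 are my own):

```python
import numpy as np
from math import factorial
from scipy import stats, integrate

# Check that the order-statistic pdf integrates to 1 for a standard normal
# population with n = 6, j = 2.
n, j = 6, 2
c = factorial(n) / (factorial(j - 1) * factorial(n - j))

def f_oj(x):
    F = stats.norm.cdf(x)
    return c * stats.norm.pdf(x) * F ** (j - 1) * (1.0 - F) ** (n - j)

total, _ = integrate.quad(f_oj, -np.inf, np.inf)
```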
Uniform order statistic pdf

Example

Let X_1, X_2, \cdots, X_n be i.i.d. uniform(0, 1), so f_X(x) = 1 for x \in (0, 1) and F_X(x) = x for x \in (0, 1). We obtain that the pdf of the jth order statistic is

f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} x^{j-1} (1-x)^{n-j} = \frac{\Gamma(n+1)}{\Gamma(j)\Gamma(n-j+1)} x^{j-1} (1-x)^{(n-j+1)-1}.

Thus, the jth order statistic from a uniform(0, 1) sample has a Beta(j, n-j+1) distribution. From this we can deduce that

E(X_{(j)}) = \frac{j}{n+1} and Var(X_{(j)}) = \frac{j(n-j+1)}{(n+1)^2(n+2)}.
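The Beta connection can be checked by simulation (an illustration, not from the slides; n, j, the seed, and the sample size are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Simulation: the j-th order statistic of n i.i.d. uniform(0,1) draws should
# be Beta(j, n-j+1), with mean j/(n+1).
rng = np.random.default_rng(6)
n, j = 10, 3
u = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, j - 1]
ks = stats.kstest(u, stats.beta(j, n - j + 1).cdf).statistic
emp_mean = u.mean()                 # should be near j/(n+1) = 3/11
```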
Joint distribution of order statistics (continuous case)

Theorem

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics of a random sample, X_1, X_2, \cdots, X_n, from a continuous population with cdf F_X(x) and pdf f_X(x). Then the joint pdf of X_{(i)} and X_{(j)}, 1 \le i < j \le n, is

f_{X_{(i)},X_{(j)}}(u, v) = \frac{n!}{(i-1)!(j-1-i)!(n-j)!} f_X(u) f_X(v) [F_X(u)]^{i-1} [F_X(v) - F_X(u)]^{j-1-i} [1 - F_X(v)]^{n-j}

for -\infty < u < v < \infty.
Joint distribution of order statistics (continuous case)

Theorem

The joint pdf of three or more order statistics can be derived using similar but even more involved arguments. Perhaps the most useful is the joint pdf of all n order statistics:

f_{X_{(1)},\cdots,X_{(n)}}(x_1, \cdots, x_n) =
  n! f_X(x_1) \cdots f_X(x_n), if x_1 < \cdots < x_n;
  0, otherwise.

This joint pdf and the techniques from the previous chapter can be used to derive marginal and conditional distributions and distributions of other functions of the order statistics.
Convergence Concepts / Convergence in Probability

Outline

1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Convergence in probability

Definition

A sequence of r.v.s, X_1, X_2, \cdots, converges in probability to a r.v. X if, for every \epsilon > 0,

\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0, or \lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1;

we write X_n \xrightarrow{p} X.

The r.v.s X_1, X_2, \cdots are typically not i.i.d. r.v.s, as in a random sample;

the distribution of X_n changes as the subscript changes;

the distribution of X_n converges to some limiting distribution as the subscript becomes large.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 39 / 101
Weak law of large numbers

Theorem

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. Define X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for every ε > 0,

lim_{n→∞} P(|X̄_n − µ| < ε) = 1,

that is, X̄_n converges in probability to µ.

Proof.

Using Chebychev's inequality, for every ε > 0,

P(|X̄_n − µ| < ε) = 1 − P(|X̄_n − µ| ≥ ε) = 1 − P((X̄_n − µ)² ≥ ε²)
                  ≥ 1 − E(X̄_n − µ)²/ε² = 1 − σ²/(nε²) → 1 as n → ∞.
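The WLLN can be checked empirically. The sketch below (all parameters, seeds, and the exponential(1) population are illustrative choices, not from the lecture) estimates P(|X̄_n − µ| ≥ ε) by Monte Carlo for growing n and compares it with the Chebychev bound σ²/(nε²) used in the proof:

```python
import numpy as np

# Monte Carlo check of the WLLN for an exponential(1) population (mu = sigma = 1).
# The seed, epsilon, and sample sizes are illustrative only.
rng = np.random.default_rng(0)
mu, sigma, eps, reps = 1.0, 1.0, 0.1, 2_000

probs, bounds = [], []
for n in (10, 100, 1_000):
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    probs.append(np.mean(np.abs(xbar - mu) >= eps))   # estimate of P(|Xbar_n - mu| >= eps)
    bounds.append(sigma**2 / (n * eps**2))            # Chebychev bound sigma^2 / (n eps^2)
    print(f"n={n:5d}  P(|Xbar_n - mu| >= eps) ~ {probs[-1]:.4f}  Chebychev bound {bounds[-1]:.4f}")
```

The estimated tail probability shrinks toward 0 while always staying below the (loose) Chebychev bound, which is exactly the mechanism of the proof above.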
Consistency of S²

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. If we define

S²_n = (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)²,

can we prove a WLLN for S²_n?
Using Chebychev's inequality, we have

P(|S²_n − σ²| ≥ ε) ≤ E(S²_n − σ²)²/ε² = Var(S²_n)/ε²,

where the last equality holds because E(S²_n) = σ². Thus, a sufficient condition for S²_n to converge in probability to σ² is that Var(S²_n) → 0 as n → ∞.
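A quick simulation makes the sufficient condition concrete. In the sketch below (a normal(0, 1) population with σ² = 1; seed, ε, and sample sizes are illustrative assumptions), Var(S²_n) visibly shrinks with n and so does the tail probability P(|S²_n − σ²| ≥ ε):

```python
import numpy as np

# Simulation sketch: Var(S_n^2) shrinks with n, so S_n^2 -> sigma^2 in probability.
# Normal(0, 1) population (sigma^2 = 1); all parameters are illustrative only.
rng = np.random.default_rng(1)
sigma2, eps, reps = 1.0, 0.2, 5_000

var_s2, tail = [], []
for n in (10, 100, 1_000):
    s2 = rng.normal(size=(reps, n)).var(axis=1, ddof=1)   # S_n^2 uses the 1/(n-1) divisor
    var_s2.append(s2.var())
    tail.append(np.mean(np.abs(s2 - sigma2) >= eps))
    print(f"n={n:5d}  Var(S_n^2) ~ {var_s2[-1]:.5f}  P(|S_n^2 - sigma^2| >= eps) ~ {tail[-1]:.4f}")
```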
Consistency of S?

Theorem

Let X_1, X_2, ··· converge in probability to a r.v. X, and let h be a continuous function. Then h(X_1), h(X_2), ··· converges in probability to h(X).

Consistency of S

If S²_n is a consistent estimator of σ², then the sample standard deviation S_n = √(S²_n) = h(S²_n), with h(t) = √t continuous, is a consistent estimator of σ.
Note that S_n is, in fact, a biased estimator of σ, but the bias disappears asymptotically.
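The last remark can be illustrated numerically: for a normal population with σ = 1, E(S_n) sits below 1 at every finite n but climbs toward it. This is a minimal sketch (seed, sample sizes, and replication count are illustrative assumptions):

```python
import numpy as np

# The continuous mapping h(t) = sqrt(t) carries consistency of S_n^2 over to S_n,
# while S_n stays biased for sigma at every finite n. Illustrative simulation.
rng = np.random.default_rng(2)
reps = 20_000

mean_s = []
for n in (5, 20, 100):
    s = rng.normal(size=(reps, n)).std(axis=1, ddof=1)   # S_n = sqrt(S_n^2), true sigma = 1
    mean_s.append(s.mean())
    print(f"n={n:4d}  E(S_n) ~ {mean_s[-1]:.4f}  (true sigma = 1)")
```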
Convergence Concepts: Almost Sure Convergence

Almost Sure Convergence

A type of convergence that is stronger than convergence in probability is almost sure convergence.

Definition

A sequence of r.v.s X_1, X_2, ··· converges almost surely to a r.v. X if, for every ε > 0,

P(lim_{n→∞} |X_n − X| < ε) = 1; we write X_n →a.s. X.

Although the two definitions look similar, they are very different statements;
Given a sample space Ω, X_n(ω) and X(ω) are all functions defined on Ω. X_n →a.s. X indicates that the functions X_n(ω) converge to X(ω) for all ω ∈ Ω except perhaps for ω ∈ N, where N ⊂ Ω and P(N) = 0.
Almost Sure Convergence

Example

Let the sample space Ω be the closed interval [0, 1] with the uniform pdf. Define r.v.s

X_n(ω) = ω + ωⁿ, and X(ω) = ω.

For every ω ∈ [0, 1), ωⁿ → 0 as n → ∞, so X_n(ω) → ω = X(ω). However, X_n(1) = 2 for every n, so X_n(1) does not converge to 1 = X(1).
But since the convergence occurs on the set [0, 1) and P([0, 1)) = 1, we conclude that

X_n →a.s. X.
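The pointwise behavior in this example is easy to tabulate. The short sketch below (the grid of ω values and of n is an illustrative choice) shows X_n(ω) = ω + ωⁿ settling to X(ω) = ω for every ω < 1, while the single exceptional point ω = 1 stays at 2:

```python
# Pointwise check of X_n(w) = w + w**n against X(w) = w at a few sample points.
# Convergence holds for every w in [0, 1); it fails only at the single point w = 1,
# a set of probability zero under the uniform pdf on [0, 1].
for w in (0.0, 0.5, 0.9, 1.0):
    vals = [w + w**n for n in (1, 10, 100, 1000)]
    print(w, [round(v, 6) for v in vals])
```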
Convergence in probability, not a.s.

Example

Let the sample space Ω be the closed interval [0, 1] with the uniform pdf. Define r.v.s

X_1(ω) = ω + I_[0,1](ω),    X_2(ω) = ω + I_[0,1/2](ω),    X_3(ω) = ω + I_[1/2,1](ω),
X_4(ω) = ω + I_[0,1/3](ω),  X_5(ω) = ω + I_[1/3,2/3](ω),  X_6(ω) = ω + I_[2/3,1](ω), ···

Let X(ω) = ω. Since P(|X_n − X| ≥ ε) equals the length of the n-th indicator interval, which shrinks to 0, X_n →p X.
For every ω, however, the value of X_n(ω) alternates between the values ω and 1 + ω infinitely often. Thus,

X_n does not converge to X almost surely.
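The moving-indicator construction can be enumerated explicitly. In the sketch below, `interval` is a hypothetical helper (not from the lecture) that returns the subinterval attached to X_n: block m splits [0, 1] into m pieces, so the interval length (= the probability that X_n differs from X) shrinks, yet any fixed ω keeps landing inside some interval infinitely often:

```python
# Sketch of the moving-indicator construction: block m = 1, 2, ... splits [0, 1]
# into m subintervals, and the n-th r.v. adds the indicator of one of them.
def interval(n):
    """Subinterval [a, b] attached to X_n (n = 1, 2, ...); hypothetical helper."""
    m, start = 1, 1
    while start + m <= n:     # advance past completed blocks
        start += m
        m += 1
    k = n - start             # position within block m
    return k / m, (k + 1) / m

omega = 0.3  # any fixed sample point
hits = [n for n in range(1, 200) if interval(n)[0] <= omega <= interval(n)[1]]
lengths = [interval(n)[1] - interval(n)[0] for n in range(1, 200)]
print(hits[:6])        # X_n(omega) = omega + 1 at these n -- infinitely often
print(lengths[-1])     # P(|X_n - X| >= eps) = interval length -> 0
```

The two printed facts are exactly the two halves of the example: vanishing exceedance probability (convergence in probability) but no pathwise convergence.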
Strong law of large numbers

Theorem

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. Define X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for every ε > 0,

P(lim_{n→∞} |X̄_n − µ| < ε) = 1, i.e., X̄_n →a.s. µ.

For both the weak and strong laws of large numbers we have assumed a finite variance. Both laws in fact hold without this assumption; the only moment condition needed is E|X_i| < ∞;
Almost sure convergence, being the stronger criterion, implies convergence in probability;
If a sequence converges in probability, it is possible to find a subsequence that converges almost surely.
Convergence Concepts: Convergence in Distribution

Convergence in distribution

Definition

A sequence of r.v.s X_1, X_2, ··· converges in distribution to a r.v. X if

lim_{n→∞} F_{X_n}(x) = F_X(x)

at all points x where F_X(x) is continuous.

Example

If the r.v.s X_1, X_2, ··· are i.i.d. uniform(0, 1) and X_(n) = max_{1≤i≤n} X_i, let us examine whether X_(n) converges in distribution.
As n → ∞, we expect X_(n) to get close to 1. Since X_(n) must necessarily be less than 1, we have for any ε > 0

P(|X_(n) − 1| ≥ ε) = P(X_(n) ≥ 1 + ε) + P(X_(n) ≤ 1 − ε) = 0 + P(X_(n) ≤ 1 − ε).
Example Cont'd

Next, using the fact that we have an i.i.d. sample, we can write

P(X_(n) ≤ 1 − ε) = P(X_i ≤ 1 − ε, i = 1, ···, n) = (1 − ε)ⁿ,

which goes to 0. So we have proved that X_(n) converges to 1 in probability. However, if we take ε = t/n, we then have

P(X_(n) ≤ 1 − t/n) = (1 − t/n)ⁿ → e^{−t},

which, upon rearranging, yields

P(n(1 − X_(n)) ≤ t) → 1 − e^{−t};

that is, the r.v. n(1 − X_(n)) converges in distribution to an exponential(1) r.v.
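This limit is easy to see in simulation. The sketch below (sample size, replication count, and seed are illustrative assumptions) compares the empirical cdf of n(1 − X_(n)) with the exponential(1) cdf 1 − e^{−x}:

```python
import numpy as np

# Empirical check that n(1 - X_(n)) for uniform(0, 1) samples is approximately
# exponential(1). Sample size and replication count are illustrative.
rng = np.random.default_rng(3)
n, reps = 200, 20_000
t = n * (1.0 - rng.uniform(size=(reps, n)).max(axis=1))   # n(1 - X_(n))

errs = []
for x in (0.5, 1.0, 2.0):
    emp, theory = np.mean(t <= x), 1.0 - np.exp(-x)
    errs.append(abs(emp - theory))
    print(f"x={x}  empirical cdf {emp:.4f}  1 - e^(-x) = {theory:.4f}")
```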
Central limit theorem

Theorem

Let X_1, X_2, ··· be a sequence of i.i.d. r.v.s whose mgfs exist in a neighborhood of 0 (i.e., M_{X_i}(t) exists for |t| < h, for some positive h). Let E(X_i) = µ and Var(X_i) = σ² > 0. Define

X̄_n = (1/n) Σ_{i=1}^n X_i.

Let G_n(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x, −∞ < x < ∞,

lim_{n→∞} G_n(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.
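Before the proof, the statement itself can be checked numerically: G_n(x) should be close to Φ(x) for moderate n. A minimal sketch, assuming a uniform(0, 1) population (so µ = 1/2, σ² = 1/12) and illustrative n, seed, and replication count; Φ is built from the error function:

```python
import numpy as np
from math import erf, sqrt

# G_n(x) for uniform(0, 1) summands versus the standard normal cdf Phi(x).
rng = np.random.default_rng(4)
n, reps = 200, 20_000
mu, sigma = 0.5, sqrt(1.0 / 12.0)                        # uniform(0, 1) moments
z = (rng.uniform(size=(reps, n)).mean(axis=1) - mu) * sqrt(n) / sigma

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

diffs = []
for x in (-1.0, 0.0, 1.0):
    gn = np.mean(z <= x)                                 # empirical G_n(x)
    diffs.append(abs(gn - Phi(x)))
    print(f"x={x:+.1f}  G_n(x) ~ {gn:.4f}  Phi(x) = {Phi(x):.4f}")
```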
The proof of the central limit theorem

Proof.

We will show that, for |t| < h, the mgf of √n(X̄_n − µ)/σ converges to e^{t²/2}, the mgf of a N(0, 1) r.v.
Define Y_i = (X_i − µ)/σ, and let M_Y(t) denote the common mgf of the Y_i's, which exists for |t| < σh. We have

M_{√n(X̄_n−µ)/σ}(t) = M_{Σ_{i=1}^n Y_i/√n}(t) = M_{Σ_{i=1}^n Y_i}(t/√n) = (M_Y(t/√n))ⁿ.

We now expand M_Y(t/√n) in a Taylor series around 0. We have

M_Y(t/√n) = Σ_{k=0}^∞ M_Y^{(k)}(0) (t/√n)ᵏ / k!,
Proof Cont'd

where M_Y^{(k)}(0) = (dᵏ/dtᵏ) M_Y(t)|_{t=0}. Since the mgf exists for |t| < σh, the power series expansion is valid for |t| < √n σh.
Using the facts that M_Y^{(0)}(0) = 1, M_Y^{(1)}(0) = 0, and M_Y^{(2)}(0) = 1 (the Y_i have mean 0 and variance 1), we have

M_Y(t/√n) = 1 + (t/√n)²/2! + R_Y(t/√n),

where R_Y is the remainder term in the Taylor expansion,

R_Y(t/√n) = Σ_{k=3}^∞ M_Y^{(k)}(0) (t/√n)ᵏ / k!.

An application of Taylor's Theorem (5.5.21) shows that, for fixed t ≠ 0, we have

lim_{n→∞} R_Y(t/√n) / (t/√n)² = 0.
Proof Cont'd

Since t is fixed, we also have

lim_{n→∞} n R_Y(t/√n) = t² lim_{n→∞} R_Y(t/√n) / (t/√n)² = 0,

and this is also true at t = 0 since R_Y(0/√n) = 0. Thus, for any fixed t, we can write

lim_{n→∞} (M_Y(t/√n))ⁿ = lim_{n→∞} [1 + (t/√n)²/2! + R_Y(t/√n)]ⁿ
                        = lim_{n→∞} [1 + (1/n)(t²/2 + n R_Y(t/√n))]ⁿ
                        = e^{t²/2}.

Note that e^{t²/2} is the mgf of the N(0, 1) distribution.
Strong form of the central limit theorem

Theorem

Let X_1, X_2, ··· be a sequence of i.i.d. r.v.s with E(X_i) = µ and 0 < Var(X_i) = σ² < ∞. Define

X̄_n = (1/n) Σ_{i=1}^n X_i.

Let G_n(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x,

lim_{n→∞} G_n(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.

In particular, none of the assumptions about mgfs are needed: the use of characteristic functions can replace them.
This theorem is general enough for almost all statistical purposes; it holds whenever the parent distribution has finite variance.
Normal approximation to the negative binomial

Let X_1, X_2, ··· be a random sample from a negative binomial(r, p) distribution. Recall that

E(X) = r(1 − p)/p, and Var(X) = r(1 − p)/p²,

and the central limit theorem tells us that

√n(X̄ − r(1 − p)/p) / √(r(1 − p)/p²) → N(0, 1).

For r = 10, p = 1/2, and n = 30, the exact and approximate calculations are

P(X̄ ≤ 11) = P(Σ_{i=1}^{30} X_i ≤ 330) = ··· = 0.8916  (exact),

P(X̄ ≤ 11) = P(√30(X̄ − 10)/√20 ≤ √30(11 − 10)/√20) ≈ 0.8888  (CLT approximation).
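Both numbers on the slide can be reproduced. A sketch, assuming the negative binomial pmf P(X = x) = C(r+x−1, x) p^r (1−p)^x for x = 0, 1, 2, ···: the sum of 30 i.i.d. negative binomial(10, 1/2) r.v.s is negative binomial(300, 1/2), so the exact probability is a finite cdf sum, computed here via the pmf recursion P(k+1)/P(k) = (r+k)(1−p)/(k+1); the approximation is Φ(√30/√20):

```python
from math import erf, sqrt

# Exact: P(sum <= 330) for sum ~ negative binomial(300, 1/2).
r, p = 300, 0.5
pmf, cdf = p ** r, 0.0
for k in range(331):
    cdf += pmf
    pmf *= (r + k) / (k + 1) * (1 - p)   # recursion P(k+1)/P(k) = (r+k)(1-p)/(k+1)

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

approx = Phi(sqrt(30.0) * (11 - 10) / sqrt(20.0))   # CLT approximation
print(f"exact {cdf:.4f}   normal approximation {approx:.4f}")
```

The exact value matches the slide's 0.8916; the approximation agrees with the slide's 0.8888 to about three decimals (the slide likely rounded the z-value when reading a normal table).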
Convergence Concepts Convergence in Distribution
Slutsky’s Theorem
Let X_n → X in distribution and Y_n → a, a constant, in probability. Then

a. Y_n X_n → aX in distribution;

b. X_n + Y_n → X + a in distribution.

Let √n(X̄_n − µ)/σ → N(0, 1), but suppose the value of σ is unknown.

If lim_{n→∞} Var(S_n²) = 0, then S_n² → σ² in probability, i.e., σ/S_n → 1 in probability.

Hence, Slutsky's theorem tells us

√n(X̄_n − µ)/S_n = (σ/S_n) · √n(X̄_n − µ)/σ → N(0, 1).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 57 / 101
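The Slutsky argument can be seen in a small simulation: the studentized mean √n(X̄_n − µ)/S_n should behave like a standard normal even for a skewed parent distribution. The sketch below uses exponential(1) data (a choice of ours, not from the slides) and checks that about 95% of the statistics fall within ±1.96.

```python
import math
import random

rng = random.Random(7)

def studentized_mean(n, lam):
    """sqrt(n)(Xbar - mu)/S_n for an exponential(lam) sample (mu = lam)."""
    xs = [-lam * math.log(1.0 - rng.random()) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (xbar - lam) / math.sqrt(s2)

reps, n = 5000, 200
stats = [studentized_mean(n, lam=1.0) for _ in range(reps)]
coverage = sum(abs(t) <= 1.96 for t in stats) / reps
print(round(coverage, 2))   # close to 0.95 if the limit is N(0,1)
```

Replacing σ by S_n costs nothing in the limit, which is exactly what Slutsky's theorem guarantees.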
Generating a Random Sample
Exponential lifetime
Example
Suppose that a particular electrical component is to be modeled with an exponential(λ) lifetime. The manufacturer is interested in determining the probability that, out of c components, at least t of them will last h hours.

Taking this one step at a time, we have

p_1 = P(component lasts at least h hours) = P(X ≥ h|λ),

and assuming that the components are independent, we can model the outcomes of the c components as Bernoulli trials, so

p_2 = P(at least t components last h hours) = ∑_{k=t}^{c} C(c, k) p_1^k (1 − p_1)^{c−k}.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 58 / 101
Generating a Random Sample
Example Cont’d
Moreover, the exponential model has the advantage that p_1 can be expressed in closed form, i.e.,

p_1 = ∫_{h}^{∞} (1/λ) e^{−x/λ} dx = e^{−h/λ}.

However, if each component were modeled with, say, a gamma distribution, then p_1 may not be expressible in closed form, which would make the calculation of p_2 even more involved.

A simulation approach to such calculations is to generate r.v.s with the desired distribution and then use the Weak Law of Large Numbers to validate the simulation. That is, if Y_i, i = 1, 2, · · · , n, are i.i.d., then a consequence of that theorem is

(1/n) ∑_{i=1}^{n} h(Y_i) → E(h(Y)) in probability.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 59 / 101
Generating a Random Sample
Approximate calculation
The probability p_2 can be calculated using the following steps. For j = 1, · · · , n:

a. Generate X_1, · · · , X_c i.i.d. ∼ exponential(λ);

b. Set Y_j = 1 if at least t of the X_i s are ≥ h; otherwise, set Y_j = 0.

Then, because Y_j ∼ Bernoulli(p_2) and E(Y_j) = p_2,

(1/n) ∑_{j=1}^{n} Y_j → p_2 in probability.

Two major concerns should be emphasized:

We must examine how to generate the r.v.s that we need;

We then use a version of the Law of Large Numbers to validate the simulation approximation.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 60 / 101
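The steps above can be sketched as a short simulation. The parameter values λ = 2, c = 5, t = 3, h = 1 are illustrative choices of ours, not from the slides; since p_1 = e^{−h/λ} is available in closed form here, the Monte Carlo estimate can be compared with the exact binomial sum.

```python
import math
import random

def p2_simulation(lam, c, t, h, n, seed=0):
    """Estimate P(at least t of c exponential(lam) lifetimes exceed h)
    by averaging n Bernoulli indicators Y_j, per the WLLN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # inverse-cdf draws from the scale-lam exponential
        lifetimes = [-lam * math.log(1.0 - rng.random()) for _ in range(c)]
        if sum(x >= h for x in lifetimes) >= t:
            hits += 1
    return hits / n

lam, c, t, h = 2.0, 5, 3, 1.0
p1 = math.exp(-h / lam)
# exact: p_2 = sum_{k=t}^{c} C(c,k) p1^k (1-p1)^(c-k)
exact = sum(math.comb(c, k) * p1 ** k * (1 - p1) ** (c - k)
            for k in range(t, c + 1))
est = p2_simulation(lam, c, t, h, n=20000)
print(round(exact, 3), round(est, 3))
```

With these values the simulated average settles near the exact p_2, illustrating both concerns: generating the needed r.v.s and validating the result through the law of large numbers.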
Generating a Random Sample
Pseudo-random number generation
We start with the assumption that we can generate i.i.d. uniform r.v.s U_1, · · · , U_m. There exist many algorithms for generating pseudo-random numbers that will pass almost all tests for uniformity. Since we are starting from uniform r.v.s, our problem here is really not that of generating the desired r.v.s, but rather of transforming the uniform r.v.s to the desired distribution.
There are two general methods for doing this
Direct model
Indirect model
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 61 / 101
Generating a Random Sample Direct Model
Probability integral transform
A direct method of generating a r.v. is one for which there exists a closed-form function g(u) such that the transformed variable Y = g(U) has the desired distribution when U ∼ uniform(0, 1). As might be recalled, this was already accomplished for continuous r.v.s in the probability integral transform, where any continuous distribution was transformed to the uniform. Hence the inverse transformation solves our problem.

Example

If Y is a continuous r.v. with cdf F_Y, then the r.v. F_Y^{−1}(U), where U ∼ uniform(0, 1), has distribution F_Y. If Y ∼ exponential(λ), then

F_Y^{−1}(U) = −λ log(1 − U)

is an exponential(λ) r.v.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 64 / 101
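A minimal sketch of this inverse transformation: draw uniforms and map them through F_Y^{−1}(u) = −λ log(1 − u) for the scale-λ exponential. The sample size and seed are illustrative.

```python
import math
import random

def exp_inverse_transform(lam, n, seed=1):
    """Sample exponential(lam) (pdf (1/lam) e^{-x/lam}, so E(Y) = lam)
    via Y = -lam * log(1 - U), U ~ uniform(0, 1)."""
    rng = random.Random(seed)
    return [-lam * math.log(1.0 - rng.random()) for _ in range(n)]

ys = exp_inverse_transform(lam=2.0, n=50000)
mean = sum(ys) / len(ys)
print(round(mean, 2))   # should be close to lam = 2
```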
Generating a Random Sample Direct Model
Analysis of probability integral transform
Thus, if we generate U_1, · · · , U_n as i.i.d. uniform r.v.s, the Y_i = −λ log(1 − U_i), i = 1, · · · , n, are i.i.d. exponential(λ) r.v.s. As an example, for n = 10,000 generated u_i we have

(1/n) ∑ u_i = 0.5019 and (1/(n − 1)) ∑ (u_i − ū)² = 0.0842.

According to the WLLN, we know that

Ū → E(U) = 1/2 and S² → Var(U) = 1/12 ≈ 0.0833.

The transformed variables Y_i = −2 log(1 − u_i) have an exponential(2) distribution, and we find that

(1/n) ∑ y_i = 2.0004 and (1/(n − 1)) ∑ (y_i − ȳ)² = 4.0908.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 65 / 101
Generating a Random Sample Direct Model
Analysis of probability integral transform
The relationship between the exponential and other distributions allows the quick generation of many r.v.s. For example,

Y_j = −λ log(U_j) ∼ exponential(λ),

Y = −2 ∑_{j=1}^{ν} log(U_j) ∼ χ²_{2ν},

Y = −β ∑_{j=1}^{a} log(U_j) ∼ gamma(a, β),

Y = ∑_{j=1}^{a} log(U_j) / ∑_{j=1}^{a+b} log(U_j) ∼ beta(a, b).

Unfortunately, there are limits to this transformation. For example, we cannot use it to generate χ² r.v.s with odd degrees of freedom.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 66 / 101
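Two of the transformations above can be sketched directly; the parameter values (shape 3, scale 2 for the gamma, and a = 2, b = 6 for the beta) are illustrative. Note the beta construction reuses the numerator's uniforms inside the denominator, exactly as in the formula.

```python
import math
import random

rng = random.Random(2)

def gamma_from_uniforms(a, beta, rng):
    """gamma(a, beta) for integer shape a: Y = -beta * sum_{j=1}^{a} log U_j."""
    return -beta * sum(math.log(1.0 - rng.random()) for _ in range(a))

def beta_from_uniforms(a, b, rng):
    """beta(a, b) for integer a, b: ratio of the first a log-uniforms
    to all a+b log-uniforms (numerator terms shared with denominator)."""
    num = sum(math.log(1.0 - rng.random()) for _ in range(a))
    den = num + sum(math.log(1.0 - rng.random()) for _ in range(b))
    return num / den

n = 40000
g_mean = sum(gamma_from_uniforms(3, 2.0, rng) for _ in range(n)) / n
b_mean = sum(beta_from_uniforms(2, 6, rng) for _ in range(n)) / n
print(round(g_mean, 2), round(b_mean, 3))  # near a*beta = 6 and a/(a+b) = 0.25
```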
Generating a Random Sample Direct Model
Box-Muller algorithm
F_Y^{−1}(u) = y ⇔ u = ∫_{−∞}^{y} f_Y(t) dt.

Application of this formula to the exponential distribution was particularly handy, as the integral equation has a simple solution. However, in many cases there is no closed-form solution; then other types of transformations or indirect methods may be feasible.

Box-Muller algorithm

Generate U_1 and U_2, two independent uniform(0, 1) r.v.s, and set

R = √(−2 log U_1) and θ = 2πU_2.

Then

X = R cos θ ∼ N(0, 1) and Y = R sin θ ∼ N(0, 1).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 67 / 101
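The Box-Muller algorithm translates almost line for line into code; each pair of uniforms yields two independent standard normals. (We use 1 − U in the log to avoid log 0; this changes nothing in distribution.)

```python
import math
import random

def box_muller(n, seed=3):
    """Generate n i.i.d. N(0,1) variates from pairs of uniforms."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()            # in (0, 1], keeps log finite
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        out.append(r * math.cos(theta))
        out.append(r * math.sin(theta))
    return out[:n]

xs = box_muller(50000)
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean ** 2
print(round(mean, 2), round(var, 2))   # near 0 and 1 for N(0,1)
```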
Generating a Random Sample Direct Model
Generating discrete r.v.s
Generating process

If Y is a discrete r.v. taking on values y_1 < y_2 < · · · < y_k, then we can similarly write

P[F_Y(y_i) < U ≤ F_Y(y_{i+1})] = F_Y(y_{i+1}) − F_Y(y_i) = P(Y = y_{i+1}).

To generate Y ∼ F_Y:

a. Generate U ∼ uniform(0, 1);

b. If F_Y(y_i) < U ≤ F_Y(y_{i+1}), set Y = y_{i+1},

where we define y_0 = −∞ and F_Y(y_0) = 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 68 / 101
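The discrete inverse-cdf rule above can be sketched as follows; the three-point distribution is a hypothetical example of ours.

```python
import random

def discrete_sample(values, probs, rng):
    """Inverse-cdf sampling: return y_{i+1} when F(y_i) < U <= F(y_{i+1})."""
    u = rng.random()
    cdf = 0.0
    for y, p in zip(values, probs):
        cdf += p
        if u <= cdf:
            return y
    return values[-1]   # guard against floating-point round-off in the cdf

rng = random.Random(4)
values, probs = [1, 2, 3], [0.2, 0.5, 0.3]
n = 30000
counts = {y: 0 for y in values}
for _ in range(n):
    counts[discrete_sample(values, probs, rng)] += 1
print({y: round(c / n, 2) for y, c in counts.items()})
```

The empirical frequencies converge to the pmf, again by the WLLN.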
Generating a Random Sample Indirect Model
Beta r.v. generation I
Suppose the goal is to generate Y ∼ beta(a, b). If both a and b are integers, then the direct transformation method can be used. However, suppose a and b are not integers, for example a = 2.7 and b = 6.3.

If (U, V) are independent uniform(0, 1) r.v.s, then the probability of the shaded area (the region below f_Y(v)/c and to the left of y) is

P(V ≤ y, U ≤ (1/c) f_Y(V)) = ∫_{0}^{y} ∫_{0}^{f_Y(v)/c} du dv = (1/c) ∫_{0}^{y} f_Y(v) dv = (1/c) P(Y ≤ y).

Here, if y = 1, we have

P(U ≤ (1/c) f_Y(V)) = 1/c.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 70 / 101
Generating a Random Sample Indirect Model
Rejection sampling
Hence, we have

P(Y ≤ y) = P(V ≤ y, U ≤ (1/c) f_Y(V)) / P(U ≤ (1/c) f_Y(V)) = P(V ≤ y | U ≤ (1/c) f_Y(V)).

This result suggests the following rejection sampling algorithm for generating Y ∼ beta(a, b):

a. Generate (U, V) independent uniform(0, 1);

b. If U < (1/c) f_Y(V), set Y = V; otherwise, return to step (a).

This algorithm generates a beta(a, b) r.v. as long as c ≥ max_y f_Y(y) and, in fact, can be generalized to any bounded density with bounded support.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 71 / 101
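The rejection algorithm for beta(2.7, 6.3) can be sketched directly. We use the optimal c = max_y f_Y(y), evaluated at the beta mode (a − 1)/(a + b − 2) (a standard fact for a, b > 1), and a uniform proposal V.

```python
import math
import random

a, b = 2.7, 6.3
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # beta function B(a,b)

def f_beta(y):
    """beta(a, b) density."""
    return y ** (a - 1) * (1 - y) ** (b - 1) / B

# optimal c = max_y f_Y(y); for a, b > 1 the mode is (a-1)/(a+b-2)
c = f_beta((a - 1) / (a + b - 2))

def rejection_sample(rng):
    """Accept V ~ uniform(0,1) as Y ~ beta(a,b) when U < f_Y(V)/c."""
    while True:
        u, v = rng.random(), rng.random()
        if u < f_beta(v) / c:
            return v

rng = random.Random(5)
ys = [rejection_sample(rng) for _ in range(30000)]
y_mean = sum(ys) / len(ys)
print(round(y_mean, 3))   # should be near a/(a+b) = 0.3
```

On average c (about 2.7 here) uniform pairs are needed per accepted draw, which motivates the efficiency analysis that follows.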
Generating a Random Sample Indirect Model
Analysis of rejection sampling
It should be clear that the optimal choice of c is c = max_y f_Y(y). To see why this is so, note that the algorithm is open-ended in the sense that we do not know how many (U, V) pairs will be needed to get one r.v. Y.

Let N be the number of (U, V) pairs required for one Y. Then, recalling that P(U ≤ (1/c) f_Y(V)) = 1/c, we see that N is a geometric(1/c) r.v. Thus, to generate one Y we expect to need E(N) = c pairs (U, V), and in this sense minimizing c will optimize the algorithm.

Examining the figure, we see that the algorithm is wasteful in the area where U > (1/c) f_Y(V). This is because we are using a uniform r.v. V to get a beta r.v. Y. To improve, we might start with something closer to the beta distribution.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 72 / 101
Generating a Random Sample Indirect Model
Theorem of Accept/Reject sampling
Let Y ∼ f_Y(y) and V ∼ f_V(v), where f_Y and f_V have common support with

M = sup_y f_Y(y)/f_V(y) < ∞.

To generate a r.v. Y ∼ f_Y:

a. Generate U ∼ uniform(0, 1) and V ∼ f_V, independent;

b. If U < (1/M) f_Y(V)/f_V(V), set Y = V; otherwise, return to step (a).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 73 / 101
Generating a Random Sample Indirect Model
Proof
Proof.
The generated r.v. Y has cdf

P(Y ≤ y) = P(V ≤ y | stop) = P(V ≤ y | U < (1/M) f_Y(V)/f_V(V))

         = P(V ≤ y, U < (1/M) f_Y(V)/f_V(V)) / P(U < (1/M) f_Y(V)/f_V(V))

         = [∫_{−∞}^{y} ∫_{0}^{(1/M) f_Y(v)/f_V(v)} du f_V(v) dv] / [∫_{−∞}^{+∞} ∫_{0}^{(1/M) f_Y(v)/f_V(v)} du f_V(v) dv]

         = ∫_{−∞}^{y} f_Y(v) dv.

Hence, the number of trials needed to generate one Y is a geometric(1/M) r.v., and M is the expected number of trials.
Beta r.v. generation II
To generate Y ∼ beta(2.7, 6.3), we can consider the following algorithm:
a. Generate U ∼ uniform(0, 1) and V ∼ beta(2, 6), independent;
b. If U < (1/M) f_Y(V)/f_V(V), set Y = V; otherwise, return to step (a).
This Accept/Reject algorithm will generate the required Y as long as sup_y f_Y(y)/f_V(y) ≤ M < ∞. For the given densities, we have M = 1.67, so the requirement is satisfied.
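The Accept/Reject algorithm above is straightforward to sketch in code. The sketch below is not from the slides; it uses only the Python standard library (`random.betavariate` supplies the beta(2, 6) proposal) and takes M = 1.67 from the slide.

```python
import math
import random

def beta_pdf(x, a, b):
    """Density of a beta(a, b) r.v. on (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1.0 - x) ** (b - 1)

def accept_reject(target_pdf, proposal_draw, proposal_pdf, M, size):
    """Step (a): draw U ~ uniform(0, 1) and V from the proposal, independent.
    Step (b): accept V as a target draw when U < target_pdf(V) / (M * proposal_pdf(V))."""
    out = []
    while len(out) < size:
        v = proposal_draw()
        u = random.random()
        if u < target_pdf(v) / (M * proposal_pdf(v)):
            out.append(v)
    return out

random.seed(0)
# Y ~ beta(2.7, 6.3) via the proposal V ~ beta(2, 6), with M = 1.67 as on the slide
samples = accept_reject(
    target_pdf=lambda x: beta_pdf(x, 2.7, 6.3),
    proposal_draw=lambda: random.betavariate(2, 6),
    proposal_pdf=lambda x: beta_pdf(x, 2, 6),
    M=1.67,
    size=5000,
)
print(sum(samples) / len(samples))  # close to E[Y] = 2.7/9 = 0.3
```

On average each accepted draw costs M = 1.67 proposal pairs, matching the geometric(1/M) trial count from the proof.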
Markov Chain Monte Carlo
The Main idea of MCMC
We cannot sample directly from the target distribution p(x) in the integral

∫ f(x)p(x)dx.

Create a Markov chain whose transition matrix does not depend on the normalization term.
Make sure the chain has a stationary distribution and that it equals the target distribution.
After a sufficient number of iterations, the chain converges to the stationary distribution.
Markov Chain Monte Carlo
Overview
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution, based on constructing a Markov chain that has the desired distribution as its stationary distribution.
The algorithm was proposed in 1953 and is counted among the top-10 most important algorithms of the 20th century.
MCMC works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution π(i).
That is, the Markov chain has stationary distribution π(i) associated with transition probability matrix P.
The Markov chain converges to the stationary distribution π(i) from an arbitrary initial state x_0.
Markov Chain Monte Carlo Markov Chain and Random Walk
Outline
1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Random process
A random process is a collection of r.v.s indexed by some set I, taking values in a set S.
I is the index set, usually time, e.g., Z^+, R, and R^+, etc.
S is the state space, e.g., Z^+, R^n, and {a, b, c}, etc.
We classify random processes according to:
the index set, which can be discrete or continuous;
the state space, which can be finite, countable, or uncountable (continuous).
Markov property
More formally, X(t) is Markovian if the following property holds:

P(X(t_n) = j_n | X(t_{n−1}) = j_{n−1}, ..., X(t_1) = j_1) = P(X(t_n) = j_n | X(t_{n−1}) = j_{n−1})

for all finite sequences of times t_1 < ... < t_n ∈ I and states j_1, ..., j_n ∈ S.
A random process is called a Markov process if, conditional on the current state, its future is independent of its past.
A process X_0, X_1, ... satisfies the Markov property if

P(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n)

for all n and all x_i ∈ Ω.
The term Markov property refers to the memoryless property of a stochastic process.
Discrete time Markov Chain
A stochastic process X_0, X_1, ... in discrete time with discrete state space is a Markov chain if it satisfies the Markov property, i.e.,

P(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n)

for all n and all x_i ∈ Ω.
A Markov chain X_t is said to be time homogeneous if P(X_{s+t} = j | X_s = i) is independent of s. When this holds, putting s = 0 gives

P(X_{s+t} = j | X_s = i) = P(X_t = j | X_0 = i).

If, moreover, P(X_{n+1} = j | X_n = i) = P_{ij} is independent of n, then X is said to be a time-homogeneous Markov chain.
Examples
Ω = {A, B}; π(A, B) = q, π(A, A) = 1 − q, π(B, A) = r, π(B, B) = 1 − r for some 0 < q, r < 1, and µ(A) = 1, µ(B) = 0, so that at time 0 we always start at A.
Ω = Z; π(a, a − 1) = 1/2, π(a, a + 1) = 1/2, and π(a, b) = 0 if b ≠ a ± 1 for every a ∈ Z; µ(0) = 1 and µ(a) = 0 if a ≠ 0, so at time 0 we always start at 0.
A graph G = (V, E) consists of a vertex set V and an edge set E, where the elements of E are unordered pairs of vertices: E ⊆ {{x, y} : x, y ∈ V, x ≠ y}. The degree deg(x) of a vertex x is the number of neighbours of x. In this case, Ω = V and is finite, π(a, b) = 1/deg(a) if b is a neighbour of a, and π(a, b) = 0 otherwise; µ(a) = 1/|V| for every a ∈ V, so at time 0 we start from the uniform distribution on V.
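As a quick sanity check of the first example, the two-state chain can be simulated directly. This sketch is not from the slides; q = 0.3 and r = 0.6 are illustrative values satisfying 0 < q, r < 1.

```python
import random

# Two-state chain on {A, B}: from A move to B with prob q, from B to A with prob r.
q, r = 0.3, 0.6
random.seed(1)

state, visits_A, steps = "A", 0, 200_000  # start at A, as in the example (mu(A) = 1)
for _ in range(steps):
    visits_A += (state == "A")
    if state == "A":
        state = "B" if random.random() < q else "A"
    else:
        state = "A" if random.random() < r else "B"

# Long-run fraction of time in A approaches r / (q + r) = 2/3,
# anticipating the stationary-distribution results later in the lecture.
print(visits_A / steps)
```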
Transition matrix
Definition
Let a Markov chain have P^(t+1)_{x,y} = P[X_{t+1} = y | X_t = x], and let the finite state space be Ω = [N]. This gives us a transition matrix P^(t+1) at time t: an N × N matrix of nonnegative entries in which every row sums to 1, since for all t and all x ∈ Ω,

Σ_y P^(t+1)_{x,y} = Σ_y P[X_{t+1} = y | X_t = x] = 1.

For example,

P^(t+1) = [ 1/2  1/2   0    0
            1/3  2/3   0    0
             0    0   3/4  1/4
             0    0   1/4  3/4 ]

P^(t+1)_{1,2} = P[X_{t+1} = 2 | X_t = 1] = 1/2 and P^(t+1)_{1,3} = P[X_{t+1} = 3 | X_t = 1] = 0;
Σ_{i=1}^{4} P^(t+1)_{1,i} = 1.
State distribution
Definition
Let π^(t) be the state distribution of the chain at time t, i.e., π^(t)_x = P[X_t = x].

For a finite chain, π^(t) is a vector of N nonnegative entries such that Σ_x π^(t)_x = 1. Then it holds that π^(t+1) = π^(t) P^(t+1): applying the law of total probability,

π^(t+1)_y = Σ_x P[X_{t+1} = y | X_t = x] P[X_t = x] = Σ_x π^(t)_x P^(t+1)_{x,y} = (π^(t) P^(t+1))_y.

Let π^(t) = (0.4, 0.6, 0, 0) be a state distribution; then π^(t+1) = (0.4, 0.6, 0, 0).
Let π^(t) = (0, 0, 0.5, 0.5) be a state distribution; then π^(t+1) = (0, 0, 0.5, 0.5).
Let π^(t) = (0.1, 0.9, 0, 0) be a state distribution; then π^(t+1) = (0.35, 0.65, 0, 0).
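The update π^(t+1) = π^(t) P^(t+1) and the three numerical claims above can be checked exactly with the 4-state example matrix P from the transition-matrix slide (a sketch using exact fractions):

```python
from fractions import Fraction as F

# Transition matrix P from the slide's 4-state example (each row sums to 1)
P = [
    [F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(0), F(0), F(3, 4), F(1, 4)],
    [F(0), F(0), F(1, 4), F(3, 4)],
]

def step(pi, P):
    """One application of the law of total probability: (pi P)_y = sum_x pi_x P_{x,y}."""
    n = len(P)
    return [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]

print(step([F(2, 5), F(3, 5), F(0), F(0)], P))    # (0.4, 0.6, 0, 0) is unchanged
print(step([F(0), F(0), F(1, 2), F(1, 2)], P))    # (0, 0, 0.5, 0.5) is unchanged
print(step([F(1, 10), F(9, 10), F(0), F(0)], P))  # (0.1, 0.9, 0, 0) -> (0.35, 0.65, 0, 0)
```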
Stationary distributions
Definition
A stationary distribution of a finite Markov chain with transition matrix P is a probability distribution π such that πP = π.

For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution.

E.g.,

P^20 ≈ [ 0.4  0.6   0    0
         0.4  0.6   0    0
          0    0   0.5  0.5
          0    0   0.5  0.5 ]

The chain could converge to any distribution which is a linear combination of (0.4, 0.6, 0, 0) and (0, 0, 0.5, 0.5). We observe that the original chain P can be broken into two disjoint Markov chains, which have their own stationary distributions. We say that the chain is reducible.
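The claim about P^20 can be verified numerically by repeated matrix multiplication (a sketch, reusing the 4-state example matrix; pure-Python to stay self-contained):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [
    [1/2, 1/2, 0.0, 0.0],
    [1/3, 2/3, 0.0, 0.0],
    [0.0, 0.0, 3/4, 1/4],
    [0.0, 0.0, 1/4, 3/4],
]

Pn = P
for _ in range(19):          # after 19 more multiplications, Pn = P^20
    Pn = matmul(Pn, P)

# Each row rounds to (0.4, 0.6, 0, 0) or (0, 0, 0.5, 0.5)
for row in Pn:
    print([round(x, 4) for x in row])
```

The two blocks mix internally but never communicate, so P^20 has two distinct row patterns: the chain is reducible.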
Irreducibility
Definition
State y is accessible from state x if it is possible for the chain to visit state y when it starts in state x; in other words, P^n(x, y) > 0 for some n. State x communicates with state y if y is accessible from x and x is accessible from y. We say that the Markov chain is irreducible if all pairs of states communicate.

y is accessible from x means that y is connected from x in the transition graph, i.e., there is a directed path from x to y.
x communicates with y means that x and y are strongly connected in the transition graph.
A finite Markov chain is irreducible if and only if its transition graph is strongly connected.
The Markov chain associated with the transition matrix P above is not irreducible.
Aperiodicity
Definition
The period of a state x is the greatest common divisor (gcd) d_x = gcd{n | (P^n)_{x,x} > 0}. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

For example, suppose that the period of state x is d_x = 3. Then, starting from state x, in the chain x, ○, ○, □, ○, ○, □, ..., only the squares can be x.
In the transition graph of a finite Markov chain, (P^n)_{x,x} > 0 is equivalent to x lying on a cycle of length n. The period of a state x is the greatest common divisor of the lengths of the cycles passing through x.
Theorem
1. If the states x and y communicate, then d_x = d_y.
2. We have (P^n)_{x,x} = 0 if n mod d_x ≠ 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 87 / 101
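The definition d_x = gcd{n | (P^n)_{x,x} > 0} can be computed directly by scanning small n. The sketch below uses two illustrative 2-state chains (not from the slides): a deterministic flip, which is periodic, and a lazy chain with self-loops, which is aperiodic.

```python
from math import gcd

def period(P, x, max_n=50):
    """d_x = gcd{ n : (P^n)_{x,x} > 0 }, estimated by scanning n = 1..max_n."""
    n_states = len(P)
    Pn = [row[:] for row in P]          # Pn holds P^n
    d = 0                                # gcd(0, k) == k, so 0 is a neutral start
    for n in range(1, max_n + 1):
        if n > 1:
            Pn = [[sum(Pn[i][k] * P[k][j] for k in range(n_states))
                   for j in range(n_states)] for i in range(n_states)]
        if Pn[x][x] > 0:
            d = gcd(d, n)
    return d

flip = [[0.0, 1.0], [1.0, 0.0]]   # deterministic two-cycle: returns only at even n
lazy = [[0.5, 0.5], [0.5, 0.5]]   # self-loop possible at n = 1: aperiodic
print(period(flip, 0), period(lazy, 0))  # -> 2 1
```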
Markov Chain Monte Carlo Markov Chain and Random Walk
Aperiodicity
Definition
The period of a state x is the greatest common divisor (gcd), such thatdx = gcdn|(Pn)x,x > 0. A state is aperiodic if its period is 1. A Markovchain is aperiodic if all its states are aperiodic.
For example, suppose that the period of state x is dx = 3. Then,starting from state x , chain x ,©,©,,©,©,, · · · , only thesquares are possible to be x .
In the transition graph of a finite Markov chain, (Pn)x,x > 0 isequivalent to that x is on a cycle of length n. Period of a state x isthe greatest common devisor of the lengths of cycles passing x .
Theorem
1. If the states x and y communicate, then dx = dy .2. We have (Pn)x,x = 0 if n mod(dx) 6= 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 87 / 101
Existence of the stationary distribution

Theorem

Let X_0, X_1, ... be an irreducible and aperiodic Markov chain with transition matrix P. Then lim_{n→∞} (P^n)_{ij} exists and is independent of i; denote it lim_{n→∞} (P^n)_{ij} = π(j). We also have

lim_{n→∞} P^n =
| π(1) π(2) ... π(j) ... |
| π(1) π(2) ... π(j) ... |
| ...  ...  ... ...  ... |
| π(1) π(2) ... π(j) ... |
| ...  ...  ... ...  ... |    (1)

π(j) = Σ_{i=1}^∞ π(i) P_{ij}, and Σ_{i=1}^∞ π(i) = 1.

π is the unique non-negative solution of the equation πP = π.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 88 / 101
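The theorem can be checked numerically: iterating a small irreducible, aperiodic transition matrix makes every row of P^n converge to the same vector π. This is a minimal sketch; the 3-state matrix below is hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical irreducible, aperiodic 3-state chain (rows sum to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Iterate P^n: in the limit every row equals the stationary vector pi.
Pn = np.linalg.matrix_power(P, 60)
pi = Pn[0]

assert np.allclose(Pn, pi)        # all rows identical, as in Eq. (1)
assert np.allclose(pi @ P, pi)    # pi solves pi P = pi
assert np.isclose(pi.sum(), 1.0)  # pi is a probability distribution
```

Sixty iterations suffice here because the subdominant eigenvalue of P has modulus well below 1, so the rows converge geometrically.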
Detailed Balance Condition

Theorem

Let X_0, X_1, ... be an aperiodic Markov chain with transition matrix P, and let π be a distribution. If the following condition holds,

π(i) P_{ij} = π(j) P_{ji}, for all i, j    (2)

then π is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.

Proof: Σ_{i=1}^∞ π(i) P_{ij} = Σ_{i=1}^∞ π(j) P_{ji} = π(j) Σ_{i=1}^∞ P_{ji} = π(j), i.e., πP = π.

In general, π(i) P_{ij} ≠ π(j) P_{ji}; that is, π may not be the stationary distribution.

The natural question is how to revise the Markov chain so that π becomes its stationary distribution. For example, we may introduce a function α(i, j) such that π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 89 / 101
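The implication "detailed balance implies stationarity" can be verified on a concrete chain. This is a minimal sketch with a hypothetical 3-state matrix constructed so that Eq. (2) holds for π = (0.2, 0.3, 0.5).

```python
import numpy as np

# Hypothetical reversible 3-state chain: entries chosen so that
# pi(i) P_ij = pi(j) P_ji holds for pi = (0.2, 0.3, 0.5).
pi = np.array([0.2, 0.3, 0.5])
P = np.array([
    [0.70, 0.15, 0.15],
    [0.10, 0.70, 0.20],
    [0.06, 0.12, 0.82],
])

# Detailed balance: the flow matrix pi(i) P_ij is symmetric.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)

# Hence pi is stationary: pi P = pi, exactly as the proof argues.
assert np.allclose(pi @ P, pi)
```

The symmetry check is the whole content of Eq. (2); the final assertion reproduces the one-line proof of the theorem.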
Markov Chain Monte Carlo / MCMC Sampling Algorithm

Outline

1 Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution
4 Generating a Random Sample
    Direct Model
    Indirect Model
5 Markov Chain Monte Carlo
    Markov Chain and Random Walk
    MCMC Sampling Algorithm
    Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 90 / 101
Revising the Markov Chain

Choosing a reasonable parameter

How can we choose α(i, j) such that π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i)?

By symmetry, we may simply choose α(i, j) = π(j) P_{ji} and α(j, i) = π(i) P_{ij}.

Therefore, we have π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i), where Q_{ij} = P_{ij} α(i, j) and Q_{ji} = P_{ji} α(j, i).

The transition matrix:
    Q_{ij} = P_{ij} α(i, j),                          if j ≠ i;
    Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik} (1 − α(i, k)),   otherwise.

Accept a proposed move with probability α(i, j); otherwise stay at the current state.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 91 / 101
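The construction above can be sketched directly: start from an arbitrary proposal chain P, set α(i, j) = π(j) P_{ji}, and assemble Q. The target π and the proposal matrix below are hypothetical.

```python
import numpy as np

# Hypothetical target distribution and proposal chain.
pi = np.array([0.2, 0.3, 0.5])
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

alpha = pi[None, :] * P.T              # alpha[i, j] = pi(j) P_ji
Q = P * alpha                          # off-diagonal: Q_ij = P_ij alpha(i, j)
np.fill_diagonal(Q, 0.0)
Q += np.diag(1.0 - Q.sum(axis=1))      # Q_ii absorbs the rejected mass

# Q satisfies detailed balance w.r.t. pi, hence pi Q = pi.
flows = pi[:, None] * Q
assert np.allclose(flows, flows.T)
assert np.allclose(pi @ Q, pi)
assert np.allclose(Q.sum(axis=1), 1.0)  # Q is still a stochastic matrix
```

Filling the diagonal with 1 minus the off-diagonal row sum is exactly the slide's Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik}(1 − α(i, k)), since both equal the probability of not moving.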
MCMC Sampling Algorithm

Let P(x_2 | x_1) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to N − 1 do
2:     sample u ~ U[0, 1];
3:     sample x ~ P(x | x^(i));
4:     if u < α(x^(i), x) = π(x) P(x^(i) | x)
5:         then x^(i+1) = x;
6:         else reject x, and x^(i+1) = x^(i);
7:     endif
8: endfor
9: output the last N samples;

Observation

Suppose α(i, j) = 0.1 and α(j, i) = 0.2 satisfy the detailed balance condition, i.e.,

π(i) P_{ij} · 0.1 = π(j) P_{ji} · 0.2.

Such small values of α(i, j) result in a high rejection ratio. We can therefore rescale both sides:

π(i) P_{ij} · 0.5 = π(j) P_{ji} · 1.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 92 / 101
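The basic sampler above can be sketched on a small discrete target. Everything here is a hypothetical illustration: a 3-state target π and a symmetric uniform proposal P(x | x_i) = 1/3, under which the acceptance probability α(x^(i), x) = π(x) P(x^(i) | x) reduces to π(x)/3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state target and uniform proposal.
pi = np.array([0.2, 0.3, 0.5])
P = np.full((3, 3), 1.0 / 3.0)

x = 0
counts = np.zeros(3)
n_steps = 200_000
for _ in range(n_steps):
    y = rng.integers(3)          # propose y ~ P(. | x)
    u = rng.random()
    if u < pi[y] * P[y, x]:      # accept with alpha = pi(y) P(y | x)
        x = y                    # accept the move
    counts[x] += 1               # rejected moves repeat the current state

# The empirical distribution of the chain approaches pi.
assert np.allclose(counts / n_steps, pi, atol=0.02)
```

Note how low the acceptance probabilities are (at most π(x)/3); this is precisely the inefficiency that the rescaling in the Observation, and the Metropolis-Hastings ratio on the next slide, are designed to remove.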
Markov Chain Monte Carlo / Metropolis-Hastings Algorithm and Gibbs Sampling

Outline

1 Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution
4 Generating a Random Sample
    Direct Model
    Indirect Model
5 Markov Chain Monte Carlo
    Markov Chain and Random Walk
    MCMC Sampling Algorithm
    Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 93 / 101
Metropolis-Hastings algorithm

Let P(x_2 | x_1) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to max do
2:     sample u ~ U[0, 1];
3:     sample x ~ P(x | x^(i));
4:     if u < min{1, π(x) P(x^(i) | x) / (π(x^(i)) P(x | x^(i)))}
5:         then x^(i+1) = x;
6:         else reject x, and x^(i+1) = x^(i);
7:     endif
8: endfor
9: output the last N samples;

Observation

If we let

α(i, j) = min{1, π(x) P(x^(i) | x) / (π(x^(i)) P(x | x^(i)))},

we obtain a higher acceptance ratio, which further improves the efficiency of the algorithm.

However, for a high-dimensional P, the Metropolis-Hastings algorithm may still be inefficient because α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 94 / 101
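A minimal sketch of the algorithm above, for an assumed standard normal target with a symmetric random-walk proposal x ~ N(x^(i), 1). Because the proposal is symmetric, the proposal densities cancel and the ratio reduces to min{1, π(x)/π(x^(i))}; working with logs avoids underflow.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # log pi(x) up to an additive constant, for pi = N(0, 1).
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(100_000):
    y = x + rng.normal()                        # propose y ~ N(x, 1)
    log_alpha = log_target(y) - log_target(x)   # log of the M-H ratio
    if np.log(rng.random()) < log_alpha:        # accept with min{1, ratio}
        x = y
    samples.append(x)                           # rejections repeat x

samples = np.array(samples[10_000:])            # discard burn-in

# The retained draws match the N(0, 1) target.
assert abs(samples.mean()) < 0.05
assert abs(samples.var() - 1.0) < 0.05
```

The rejected proposals must be recorded as repeats of the current state; silently dropping them would bias the chain away from π.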
Properties of MCMC

There is a trade-off between the mixing rate and the acceptance ratio, where

Acceptance ratio = E[α(x_i, x)];

Mixing rate = the rate at which the chain moves around the distribution.

We can use multiple transition matrices P_i (i.e., proposal distributions) and apply them in turn.

Observation

For example:

Sampling step: x^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0);

Convergence: x^(t) | x^(0) ~ N(0, 4/3 ≈ 1.33) as t → +∞.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 95 / 101
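The convergence example above can be simulated directly. The stationary variance follows from the fixed point of v = 0.25 v + 1, i.e., v = 4/3. This sketch runs many independent copies of the chain, all started at x^(0) = 0 (the number of copies and steps are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(2)

# 50,000 independent chains x^(t+1) | x^(t) ~ N(0.5 x^(t), 1),
# each started from x^(0) = 0 and run for 50 steps.
x = np.zeros(50_000)
for _ in range(50):
    x = 0.5 * x + rng.normal(size=x.shape)

# The marginal of x^(t) approaches N(0, 4/3).
assert abs(x.mean()) < 0.03
assert abs(x.var() - 4.0 / 3.0) < 0.05
```

After t steps the variance is Σ_{k=0}^{t−1} 0.25^k, which is already within 0.25^50 of 4/3 here, so 50 steps are more than enough.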
Intuition

Example: two-dimensional case

Let P(x, y) be a two-dimensional probability distribution, and consider two points A(x_1, y_1) and B(x_1, y_2). We have

P(x_1, y_1) P(y_2 | x_1) = P(x_1) P(y_1 | x_1) P(y_2 | x_1)    (3)
P(x_1, y_2) P(y_1 | x_1) = P(x_1) P(y_2 | x_1) P(y_1 | x_1)    (4)

That is,

P(x_1, y_1) P(y_2 | x_1) = P(x_1, y_2) P(y_1 | x_1)    (5)

i.e.,

P(A) P(y_2 | x_1) = P(B) P(y_1 | x_1)    (6)
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 96 / 101
Intuition Cont'd

Thus p(y | x_1) can be viewed as the transition probability between any two points whose x-coordinate is x_1, and the transition between these two points satisfies the detailed balance condition, i.e., P(A) P(y_2 | x_1) = P(B) P(y_1 | x_1) holds.

Transition matrix

The transition probabilities between two points A and B are given by T(A → B):

T(A → B) =
    p(y_B | x_1),  if x_A = x_B = x_1;
    p(x_B | y_1),  if y_A = y_B = y_1;
    0,             otherwise.

It is easy to confirm that the detailed balance condition holds, i.e., p(A) T(A → B) = p(B) T(B → A).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 97 / 101
Multivariate case

Let P(x_i | x_{−i}) = P(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_n). The transition probabilities are given by T(x → x') = P(x'_i | x_{−i}). Then we have:

T(x → x') p(x) = P(x'_i | x_{−i}) P(x_i | x_{−i}) P(x_{−i})
T(x' → x) p(x') = P(x_i | x'_{−i}) P(x'_i | x'_{−i}) P(x'_{−i})

Note that x'_{−i} = x_{−i}. That is,

T(x → x') p(x) = T(x' → x) p(x').    (7)

Therefore, the detailed balance condition also holds.

Gibbs sampling is feasible whenever it is easy to sample from the conditional probability distributions.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 98 / 101
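Eq. (7) can be checked exhaustively on a tiny discrete example. The 2x2 joint table below is hypothetical; the Gibbs move resamples x_1 from p(x_1 | x_2) while x_2 is held fixed.

```python
import numpy as np

# Hypothetical joint distribution p[x1, x2] over {0,1} x {0,1}.
p = np.array([[0.10, 0.25],
              [0.30, 0.35]])            # entries sum to 1

# Full conditional: cond[x1, x2] = p(x1 | x2).
cond = p / p.sum(axis=0, keepdims=True)

# Check detailed balance for the move that resamples x1 given x2:
# p(x) T(x -> x') = p(x') T(x' -> x) for every pair of states.
for x2 in range(2):
    for a in range(2):                  # current x1
        for b in range(2):              # proposed x1'
            lhs = p[a, x2] * cond[b, x2]
            rhs = p[b, x2] * cond[a, x2]
            assert np.isclose(lhs, rhs)
```

Both sides equal p(a, x_2) p(b, x_2) / p(x_2), which is symmetric in a and b; that is exactly the cancellation used in the derivation of Eq. (7).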
Gibbs sampling algorithm (proposal distribution P(x_i | x_{−i}))

0: initialize x_1, ..., x_n;
1: for τ = 0 to max do
2:     sample x_1^(τ+1) ~ P(x_1 | x_2^(τ), x_3^(τ), ..., x_n^(τ));
3:     ...
4:     sample x_j^(τ+1) ~ P(x_j | x_1^(τ+1), ..., x_{j−1}^(τ+1), x_{j+1}^(τ), ..., x_n^(τ));
5:     ...
6:     sample x_n^(τ+1) ~ P(x_n | x_1^(τ+1), x_2^(τ+1), ..., x_{n−1}^(τ+1));
7: output the last N samples;

Gibbs sampling is a type of random walk through parameter space, and can be considered a Metropolis-Hastings algorithm with a special proposal distribution. At each iteration, we draw from conditional posterior probabilities, which means that the proposed move is always accepted. Hence, if we can draw samples from the conditional distributions, Gibbs sampling can be much more efficient than regular Metropolis-Hastings.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 99 / 101
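The loop above can be sketched for a hypothetical bivariate normal target with zero means, unit variances and correlation ρ = 0.8, whose full conditionals are the textbook ones: x_1 | x_2 ~ N(ρ x_2, 1 − ρ²) and x_2 | x_1 ~ N(ρ x_1, 1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(3)

rho = 0.8
s = np.sqrt(1.0 - rho * rho)            # conditional std dev

x1, x2 = 0.0, 0.0
draws = []
for _ in range(100_000):
    x1 = rho * x2 + s * rng.normal()    # sample x1 | x2 (always accepted)
    x2 = rho * x1 + s * rng.normal()    # sample x2 | x1 (always accepted)
    draws.append((x1, x2))

draws = np.array(draws[10_000:])        # discard burn-in

# The draws reproduce the target's variances and correlation.
assert abs(np.corrcoef(draws.T)[0, 1] - rho) < 0.02
assert abs(draws.var(axis=0) - 1.0).max() < 0.05
```

Note there is no accept/reject step at all: every conditional draw is accepted, which is the α = 1 property claimed on the next slide.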
Properties of Gibbs Sampling

Properties

No need to tune the proposal distribution;

Good trade-off between acceptance and mixing: the acceptance ratio is always 1;

Need to be able to derive the conditional probability distributions.

Acceleration of Gibbs sampling (given p(a, b, c), draw samples of a and c):

Blocked Gibbs:
(1) Draw (a, b) given c;
(2) Draw c given (a, b).

Collapsed Gibbs:
(1) Draw a given c;
(2) Draw c given a.

Marginalize whenever you can.

For R code for MCMC and Gibbs sampling, please refer to https://blog.csdn.net/abcjennifer/article/details/25908495
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 100 / 101
Take-aways

Conclusions

Basic Concepts of Random Samples

Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F

Order Statistics

Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution

Generating a Random Sample
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 101 / 101