Statistical Inference
Lecture 5: Properties of a Random Sample

MING GAO
DASE @ ECNU (for course-related communications)
[email protected]

Apr. 14, 2020



Outline

1. Sampling from the Normal Distribution
   - Properties of the Sample Mean and Variance
   - The Derived Distributions: Student's T and Snedecor's F
2. Order Statistics
3. Convergence Concepts
   - Convergence in Probability
   - Almost Sure Convergence
   - Convergence in Distribution
4. Generating a Random Sample
   - Direct Model
   - Indirect Model
5. Markov Chain Monte Carlo
   - Markov Chain and Random Walk
   - MCMC Sampling Algorithm
   - Metropolis-Hastings Algorithm and Gibbs Sampling
6. Take-aways

Random vector

Definition

The r.v.s $X_1, \dots, X_n$ are called a random sample of size $n$ from the population $f(x)$ if $X_1, \dots, X_n$ are mutually independent r.v.s and the marginal pdf or pmf of each $X_i$ is the same function $f(x)$.

That is, $X_1, \dots, X_n$ are independent and identically distributed (i.i.d.) r.v.s with pdf or pmf $f(x)$.

From the definition, the joint pdf or pmf of $X_1, \dots, X_n$ is given by
$$f(x_1, \dots, x_n) = \prod_{i=1}^n f(x_i).$$

In particular, if the population pdf or pmf is a member of a parametric family with pdf or pmf given by $f(x|\theta)$, then the joint pdf or pmf is
$$f(x_1, \dots, x_n|\theta) = \prod_{i=1}^n f(x_i|\theta).$$


Sample pdf: exponential

Let $X_1, \dots, X_n$ be a random sample from an exponential($\beta$) population. The joint pdf of the sample is
$$f(x_1, \dots, x_n|\beta) = \prod_{i=1}^n f(x_i|\beta) = \frac{1}{\beta^n}\, e^{-(x_1 + \cdots + x_n)/\beta}.$$

With replacement sampling

Suppose a value is chosen from a finite population of $N$ values in such a way that each of the $N$ values is equally likely (prob. $= \frac{1}{N}$) to be chosen. The value chosen at any stage is "replaced" in the population and is available for choice again at the next stage.
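As a quick numerical sanity check (a minimal sketch, not part of the original slides), the exponential joint pdf above can be compared against the product of scipy's marginal densities:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
beta = 2.0                          # exponential scale parameter
x = rng.exponential(beta, size=5)   # a random sample of size n = 5

# joint pdf of an i.i.d. sample = product of the marginal pdfs
joint = np.prod(stats.expon.pdf(x, scale=beta))
# closed form from the slide: (1/beta^n) * exp(-(x_1+...+x_n)/beta)
closed = beta ** (-x.size) * np.exp(-x.sum() / beta)
print(np.isclose(joint, closed))    # True
```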


Without replacement sampling

The first value drawn is recorded as $X_1 = y$, where $y \in \{x_1, x_2, \dots, x_N\}$. Now a second value is chosen from the remaining $N - 1$ values. Then
$$P(X_2 = x) = \sum_{i=1}^{N} P(X_2 = x | X_1 = x_i) P(X_1 = x_i) = (N-1) \cdot \frac{1}{N-1} \cdot \frac{1}{N} = \frac{1}{N}.$$
(For $x_i = x$ the conditional probability is 0; each of the other $N - 1$ values contributes $\frac{1}{N-1} \cdot \frac{1}{N}$.)

Similar arguments can be used to show that each of the $X_i$ has the same marginal distribution, but the $X_i$ are not independent.

The nonzero probability in the conditional distribution of $X_i$ given $X_1, \dots, X_{i-1}$ is $\frac{1}{N-i+1}$, which is close to $\frac{1}{N}$ if $i \le n$ is small compared with $N$.
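A small simulation (an illustrative sketch, not from the slides) makes the point concrete: the second draw without replacement is still uniform over the population, even though the draws are dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10)            # N = 10 distinct values
trials = 100_000

second = np.empty(trials, dtype=int)
for t in range(trials):
    draw = rng.choice(population, size=2, replace=False)
    second[t] = draw[1]               # record X_2

# marginal distribution of X_2: approximately uniform, P(X_2 = x) ~ 1/N
print(np.bincount(second) / trials)   # each entry is close to 0.1
```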


Sampling distribution

Definition

Let $X_1, \dots, X_n$ be a random sample of size $n$ from a population, and let $T(X_1, \dots, X_n)$ be a real-valued or vector-valued function. Then the r.v. or random vector $Y = T(X_1, \dots, X_n)$ is called a statistic. The pdf or pmf of the statistic $Y$ is called the sampling distribution of $Y$.

Sample mean and sample variance

The sample mean and sample variance are the statistics defined by
$$\bar X = \frac{X_1 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
The sample standard deviation is the statistic defined by $S = \sqrt{S^2}$.
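These statistics map directly onto numpy (a small sketch, assuming numpy's conventions): note that np.var uses the $n$ denominator by default, so ddof=1 is needed to match the definition of $S^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=30)   # a sample of size n = 30

xbar = x.mean()                 # sample mean
s2 = x.var(ddof=1)              # sample variance with (n-1) denominator
s = np.sqrt(s2)                 # sample standard deviation

# the same quantities written out from the definitions
print(np.isclose(xbar, x.sum() / x.size))
print(np.isclose(s2, np.sum((x - xbar) ** 2) / (x.size - 1)))
```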


Theorem

Let $x_1, \dots, x_n$ be any numbers and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$. Then

a. $\min_a \sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar x)^2$;

b. $(n-1)s^2 = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n\bar x^2$.

Proof.

$$\sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar x + \bar x - a)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + 2\sum_{i=1}^n (x_i - \bar x)(\bar x - a) + \sum_{i=1}^n (\bar x - a)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - a)^2,$$
since the cross term vanishes: $\sum_{i=1}^n (x_i - \bar x) = 0$. The sum is therefore minimized at $a = \bar x$, which proves part (a); expanding $\sum_{i=1}^n (x_i - \bar x)^2$ gives part (b).
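As a sanity check (a small sketch, not part of the slides), one can verify numerically both that $a = \bar x$ minimizes $\sum_i (x_i - a)^2$ and the identity in part (b):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(size=20)

# part (a): the minimizer of sum (x_i - a)^2 agrees with the sample mean
res = minimize_scalar(lambda a: np.sum((x - a) ** 2))
print(res.x, x.mean())

# part (b): (n-1) s^2 = sum x_i^2 - n xbar^2
s2 = x.var(ddof=1)
print(np.isclose((x.size - 1) * s2, np.sum(x**2) - x.size * x.mean()**2))
```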


Lemma

Let $X_1, \dots, X_n$ be a random sample from a population and let $g(x)$ be a function s.t. $E(g(X_1))$ and $\mathrm{Var}(g(X_1))$ exist. Then

a. $E\big(\sum_{i=1}^n g(X_i)\big) = n \cdot E(g(X_1))$;

b. $\mathrm{Var}\big(\sum_{i=1}^n g(X_i)\big) = n \cdot \mathrm{Var}(g(X_1))$.

Note that for $i \ne j$, by independence,
$$E\big[(g(X_i) - E(g(X_i)))(g(X_j) - E(g(X_j)))\big] = \mathrm{Cov}(g(X_i), g(X_j)) = 0,$$
so the cross terms in the variance of the sum vanish, which gives part (b).


Theorem

Let $X_1, \dots, X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Then

a. $E(\bar X) = \mu$;

b. $\mathrm{Var}(\bar X) = \frac{\sigma^2}{n}$;

c. $E(S^2) = \sigma^2$.

Proof.

$$E(S^2) = E\Big(\frac{1}{n-1}\Big[\sum_{i=1}^n X_i^2 - n\bar X^2\Big]\Big) = \frac{1}{n-1}\big(nE(X_1^2) - nE(\bar X^2)\big) = \frac{1}{n-1}\Big(n(\sigma^2 + \mu^2) - n\Big(\frac{\sigma^2}{n} + \mu^2\Big)\Big) = \sigma^2.$$

Unbiased statistics

Definition

If the following holds:
$$E[T(X_1, X_2, \dots, X_n)] = \theta,$$
then the statistic $T(X_1, X_2, \dots, X_n)$ is an unbiased estimator of the parameter $\theta$. Otherwise, $T(X_1, X_2, \dots, X_n)$ is a biased estimator of $\theta$.

The statistic $\bar X$ is an unbiased estimator of $\mu$, and $S^2$ is an unbiased estimator of $\sigma^2$.

The use of $n - 1$ in the definition of $S^2$ may have seemed unintuitive. If $S^2$ were defined as the usual average of the squared deviations, with $n$ rather than $n - 1$ in the denominator, then $E(S^2)$ would be $\frac{n-1}{n}\sigma^2$ and $S^2$ would not be an unbiased estimator of $\sigma^2$.
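A Monte Carlo check of the bias (an illustrative sketch, not part of the slides): averaged over many samples, the ddof=1 estimator centers on $\sigma^2$, while the $n$-denominator version centers on $\frac{n-1}{n}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # (n-1) denominator
s2_biased = samples.var(axis=1, ddof=0)     # n denominator

print(s2_unbiased.mean())   # ~ sigma^2 = 4.0
print(s2_biased.mean())     # ~ (n-1)/n * sigma^2 = 3.2
```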


Theorem

Let $X_1, \dots, X_n$ be a random sample from a population with mgf $M_X(t)$. Then the mgf of the sample mean is
$$M_{\bar X}(t) = \big[M_X(t/n)\big]^n.$$

Example

Let $X_1, \dots, X_n$ be a random sample from a $N(\mu, \sigma^2)$ population. Then the mgf of the sample mean is
$$M_{\bar X}(t) = \Big[\exp\Big(\mu \frac{t}{n} + \frac{\sigma^2 (t/n)^2}{2}\Big)\Big]^n = \exp\Big(\mu t + \frac{(\sigma^2/n)\, t^2}{2}\Big).$$
Thus, $\bar X$ has a $N(\mu, \frac{\sigma^2}{n})$ distribution.
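The conclusion is easy to check empirically (a small sketch, not from the slides): the standard deviation of simulated sample means shrinks like $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 3.0, 25, 100_000

means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(means.mean())        # ~ mu = 1.0
print(means.std(ddof=1))   # ~ sigma / sqrt(n) = 0.6
```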


Convolution theorem

If $X$ and $Y$ are independent continuous r.v.s with pdfs $f_X(x)$ and $f_Y(y)$, then the pdf of $Z = X + Y$ is
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(w)\, f_Y(z - w)\, dw.$$

Proof.

Let $W = X$. The Jacobian of the transformation from $(X, Y)$ to $(Z, W)$ is 1. Thus, we obtain the joint pdf of $(Z, W)$ as
$$f_{Z,W}(z, w) = f_{X,Y}(w, z - w) = f_X(w)\, f_Y(z - w).$$
Integrating out $w$, we obtain the marginal pdf of $Z$ as given in the conclusion.
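As a numerical sanity check (a hedged sketch, not part of the slides), the convolution integral for two independent standard normals reproduces the $N(0, 2)$ density:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def f_Z(z):
    # convolution integral for Z = X + Y with X, Y ~ N(0, 1) independent
    integrand = lambda w: stats.norm.pdf(w) * stats.norm.pdf(z - w)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

for z in (0.0, 1.0, 2.5):
    print(f_Z(z), stats.norm.pdf(z, scale=np.sqrt(2)))  # the two agree
```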


Sum of Cauchy random variables

Let $U$ and $V$ be independent Cauchy r.v.s, where $U \sim \text{Cauchy}(0, \sigma)$ and $V \sim \text{Cauchy}(0, \tau)$; that is,
$$f_U(u) = \frac{1}{\pi\sigma}\,\frac{1}{1 + (u/\sigma)^2}, \quad \text{and} \quad f_V(v) = \frac{1}{\pi\tau}\,\frac{1}{1 + (v/\tau)^2}.$$

Based on the convolution theorem, the pdf of $Z = U + V$ is given by
$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{\pi\sigma}\,\frac{1}{1 + (w/\sigma)^2}\,\frac{1}{\pi\tau}\,\frac{1}{1 + ((z - w)/\tau)^2}\, dw = \frac{1}{\pi(\sigma + \tau)}\,\frac{1}{1 + (z/(\sigma + \tau))^2}.$$

The sum of two independent Cauchy r.v.s is again Cauchy, with the scale parameters adding.

If $Z_1, \dots, Z_n$ are i.i.d. Cauchy(0, 1) r.v.s, then $\sum_{i=1}^n Z_i$ is Cauchy(0, n) and $\bar Z$ is Cauchy(0, 1).
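This has a striking consequence that is easy to see in simulation (an illustrative sketch, not from the slides): the sample mean of Cauchy draws does not concentrate as $n$ grows, because $\bar Z$ is again Cauchy(0, 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 50_000

for n in (1, 10, 100):
    zbar = stats.cauchy.rvs(size=(reps, n), random_state=rng).mean(axis=1)
    # the interquartile range of Cauchy(0, 1) is 2, regardless of n
    q25, q75 = np.percentile(zbar, [25, 75])
    print(n, q75 - q25)   # ~ 2.0 for every n
```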


Sampling from the Normal Distribution: Properties of the Sample Mean and Variance

Theorem

Let $X_1, \dots, X_n$ be a random sample from a $N(\mu, \sigma^2)$ distribution, and let $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$. Then

a. $\bar X$ and $S^2$ are independent r.v.s;

b. $\bar X \sim N(\mu, \frac{\sigma^2}{n})$;

c. $\frac{(n-1)S^2}{\sigma^2}$ has a $\chi^2(n-1)$ distribution.

Proof.

Part (b) has been established. Next, we prove Parts (a) and (c). Without loss of generality, assume that $\mu = 0$ and $\sigma = 1$.

For Part (a), we will show that $\bar X$ and $S^2$ are functions of independent random vectors. After introducing a lemma, we will prove Part (c).


Part (a) proof Cont'd

Note that we can write $S^2$ as a function of $n - 1$ deviations as follows:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1}\Big((X_1 - \bar X)^2 + \sum_{i=2}^n (X_i - \bar X)^2\Big) = \frac{1}{n-1}\Big(\Big[\sum_{i=2}^n (X_i - \bar X)\Big]^2 + \sum_{i=2}^n (X_i - \bar X)^2\Big),$$
where the last equality uses $X_1 - \bar X = -\sum_{i=2}^n (X_i - \bar X)$, since the deviations sum to zero.

Thus, $S^2$ can be written as a function only of $(X_2 - \bar X, \dots, X_n - \bar X)$. We will now show that these r.v.s are independent of $\bar X$. The joint pdf of the sample $X_1, \dots, X_n$ is given by
$$f(x_1, \dots, x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-(1/2)\sum_{i=1}^n x_i^2}.$$


Part (a) proof Cont'd

Make the transformation
$$y_1 = \bar x, \qquad x_1 = y_1 - \sum_{i=2}^n y_i;$$
$$y_i = x_i - \bar x, \qquad x_i = y_1 + y_i \quad (i = 2, \dots, n).$$
This is a linear transformation whose inverse (expressing the $x_i$ in terms of the $y_i$) has Jacobian $n$. We have
$$f(y_1, \dots, y_n) = \frac{n}{(2\pi)^{n/2}}\, e^{-(1/2)(y_1 - \sum_{i=2}^n y_i)^2}\, e^{-(1/2)\sum_{i=2}^n (y_1 + y_i)^2} = \Big[\frac{n^{1/2}}{(2\pi)^{1/2}}\, e^{-ny_1^2/2}\Big]\Big[\frac{n^{1/2}}{(2\pi)^{(n-1)/2}}\, e^{-(1/2)[\sum_{i=2}^n y_i^2 + (\sum_{i=2}^n y_i)^2]}\Big].$$
Since the joint pdf of $Y_1, \dots, Y_n$ factors, $Y_1$ is independent of $(Y_2, \dots, Y_n)$. That is, $\bar X$ and $S^2$ are independent r.v.s.
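A simulation (a hedged sketch, not from the slides) illustrates the independence: for normal samples the empirical correlation of $\bar X$ and $S^2$ is near zero, while for a skewed population such as the exponential it is clearly positive. (Zero correlation is only a necessary condition, so this is suggestive rather than a proof.)

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 200_000, 10

normal = rng.normal(0.0, 1.0, size=(reps, n))
expo = rng.exponential(1.0, size=(reps, n))

for name, data in [("normal", normal), ("exponential", expo)]:
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)
    print(name, np.corrcoef(xbar, s2)[0, 1])
# normal: ~ 0.0; exponential: clearly positive
```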


Facts about Chi squared r.v.s

We use the notation $\chi^2_p$ to denote a Chi squared r.v. with $p$ degrees of freedom.

a. If $Z$ is a $N(0, 1)$ r.v., then $Z^2 \sim \chi^2_1$; that is, the square of a standard normal r.v. is a chi squared r.v.

b. If $X_1, \dots, X_n$ are independent and $X_i \sim \chi^2_{p_i}$, then $X_1 + \cdots + X_n \sim \chi^2_{p_1 + \cdots + p_n}$; that is, independent chi squared variables add to a chi squared variable, and the degrees of freedom also add.

Proof.

Part (a) was established in Chapter 2. Since a $\chi^2_p$ r.v. is a gamma($\frac{p}{2}$, 2) r.v., applying the earlier example on sums of independent gammas gives part (b).
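Both facts are easy to confirm numerically (a small sketch, not from the slides): sums of squared standard normals match the $\chi^2$ distribution with added degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, p1, p2 = 100_000, 3, 4

x1 = (rng.normal(size=(reps, p1)) ** 2).sum(axis=1)   # ~ chi2(p1) by fact (a)
x2 = (rng.normal(size=(reps, p2)) ** 2).sum(axis=1)   # ~ chi2(p2)

# fact (b): x1 + x2 ~ chi2(p1 + p2); the KS test should not reject
print(stats.kstest(x1 + x2, stats.chi2(p1 + p2).cdf))
```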


Part (c) proof

We employ an induction argument to establish the distribution of $S^2$, using the notation $\bar X_k$ and $S^2_k$ to denote the sample mean and variance based on the first $k$ observations.

It is straightforward to establish that
$$(n-1)S^2_n = (n-2)S^2_{n-1} + \frac{n-1}{n}(X_n - \bar X_{n-1})^2.$$

Now consider $n = 2$; we have $S^2_2 = \frac{1}{2}(X_2 - X_1)^2$. Since the distribution of $\frac{X_2 - X_1}{\sqrt{2}}$ is $N(0, 1)$, fact (a) above shows that $S^2_2 \sim \chi^2_1$.

Proceeding with the induction, we assume that for $n = k$, $(k-1)S^2_k \sim \chi^2_{k-1}$.

For $n = k + 1$, we have
$$kS^2_{k+1} = (k-1)S^2_k + \frac{k}{k+1}(X_{k+1} - \bar X_k)^2.$$


Part (c) proof Cont'd

Given the induction hypothesis, if $\frac{k}{k+1}(X_{k+1} - \bar X_k)^2 \sim \chi^2_1$ and is independent of $S^2_k$, it will follow from fact (b) that $kS^2_{k+1} \sim \chi^2_k$.

The vector $(X_{k+1}, \bar X_k)$ is independent of $S^2_k$, so the second term is independent of the first. Furthermore, $X_{k+1} - \bar X_k \sim N(0, \frac{k+1}{k})$, since
$$\mathrm{Var}(X_{k+1} - \bar X_k) = 1 + \frac{1}{k} = \frac{k+1}{k}.$$
Hence $\sqrt{\frac{k}{k+1}}(X_{k+1} - \bar X_k) \sim N(0, 1)$, and therefore $\frac{k}{k+1}(X_{k+1} - \bar X_k)^2 \sim \chi^2_1$.

Based on the following lemma, we will present another way to establish the theorem.
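The conclusion of part (c) can be checked directly by simulation (an illustrative sketch, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n, sigma = 100_000, 8, 2.0

samples = rng.normal(0.0, sigma, size=(reps, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

# compare with the chi2(n-1) distribution: the KS test should not reject
print(stats.kstest(stat, stats.chi2(n - 1).cdf))
```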


Lemma

Let $X_j \sim N(\mu_j, \sigma_j^2)$, $j = 1, \dots, n$, be independent. For constants $a_{ij}$ and $b_{rj}$ ($j = 1, \dots, n$; $i = 1, \dots, k$; $r = 1, \dots, m$), where $k + m \le n$, define
$$U_i = \sum_{j=1}^n a_{ij} X_j, \quad \text{and} \quad V_r = \sum_{j=1}^n b_{rj} X_j.$$

The r.v.s $U_i$ and $V_r$ are independent if and only if $\mathrm{Cov}(U_i, V_r) = 0$. Furthermore, $\mathrm{Cov}(U_i, V_r) = \sum_{j=1}^n a_{ij} b_{rj} \sigma_j^2$.

The random vectors $(U_1, \dots, U_k)$ and $(V_1, \dots, V_m)$ are independent if and only if $U_i$ is independent of $V_r$ for all pairs $i, r$ ($i = 1, \dots, k$; $r = 1, \dots, m$).
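The covariance formula is simple to verify by simulation (a minimal sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 500_000, 4

sigmas = np.array([1.0, 2.0, 0.5, 1.5])
a = np.array([0.3, -1.0, 0.7, 0.2])    # coefficients of U
b = np.array([1.0, 0.4, -0.6, 0.0])    # coefficients of V

X = rng.normal(0.0, sigmas, size=(reps, n))
U, V = X @ a, X @ b

print(np.cov(U, V)[0, 1])          # empirical covariance
print(np.sum(a * b * sigmas**2))   # Cov(U, V) = sum_j a_j b_j sigma_j^2
```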


Prove the theorem

We can use the above lemma to give an alternative proof of the independence of $\bar X$ and $S^2$ in normal sampling.

Since we can write $S^2$ as a function of the $n - 1$ deviations $(X_2 - \bar X, \dots, X_n - \bar X)$, we must show that these r.v.s are uncorrelated with $\bar X$. As an illustration of the application of the lemma, we write
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i, \qquad X_j - \bar X = \sum_{i=1}^n \Big(\delta_{ij} - \frac{1}{n}\Big) X_i,$$
where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. By the lemma,
$$\mathrm{Cov}(\bar X, X_j - \bar X) = \sum_{i=1}^n \frac{1}{n}\Big(\delta_{ij} - \frac{1}{n}\Big)\sigma^2 = \Big(\frac{1}{n} - \frac{1}{n}\Big)\sigma^2 = 0,$$
so $\bar X$ and $X_j - \bar X$ are independent.


Sampling from the Normal Distribution: The Derived Distributions, Student's T and Snedecor's F


Student's T distribution

If $X_1, \dots, X_n$ are a random sample from a $N(\mu, \sigma^2)$, we know that $\frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Most of the time, however, $\sigma$ is unknown, and we use $S$ in place of $\sigma$. In this case
$$\frac{\bar X - \mu}{S/\sqrt{n}} = \frac{(\bar X - \mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}},$$
which is a $N(0, 1)$ r.v. divided by $\sqrt{\chi^2_{n-1}/(n-1)}$, with numerator and denominator independent.

Definition

Let $X_1, \dots, X_n$ be a random sample from a $N(\mu, \sigma^2)$ population. The quantity $\frac{\bar X - \mu}{S/\sqrt{n}}$ has Student's T distribution with $p = n - 1$ degrees of freedom, with pdf
$$f_T(t) = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)}\,\frac{1}{(p\pi)^{1/2}}\,\frac{1}{(1 + t^2/p)^{(p+1)/2}}, \qquad T \sim t_p.$$


Student's T distribution Cont'd

Notice that if $p = 1$, then $t_p$ becomes the pdf of the Cauchy distribution, which occurs for samples of size 2.

If we start with $U$ and $V$ defined above ($U \sim N(0,1)$ and $V \sim \chi^2_p$, independent), their joint pdf is

$$f_{U,V}(u,v) = \frac{1}{(2\pi)^{1/2}}e^{-u^2/2}\,\frac{1}{\Gamma(p/2)2^{p/2}}v^{(p/2)-1}e^{-v/2}.$$

Now make the transformation

$$t = \frac{u}{\sqrt{v/p}}, \quad\text{and}\quad w = v.$$

The Jacobian of the transformation is $(w/p)^{1/2}$, and the marginal pdf of $T$ is given by

$$f_T(t) = \int_0^{\infty} f_{U,V}\big(t(w/p)^{1/2}, w\big)(w/p)^{1/2}\,dw = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)}\frac{1}{(p\pi)^{1/2}}\frac{1}{(1+t^2/p)^{(p+1)/2}},$$

i.e., $T \sim t_p$.
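The construction can be checked directly by simulating $T = U/\sqrt{V/p}$ (a minimal NumPy/SciPy sketch, not from the slides; $p$, the seed, and the replication count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, reps = 5, 100_000
u = rng.normal(size=reps)               # U ~ N(0, 1)
v = rng.chisquare(df=p, size=reps)      # V ~ chi^2_p, independent of U
t_sim = u / np.sqrt(v / p)              # T = U / sqrt(V/p)

print(stats.kstest(t_sim, stats.t(df=p).cdf))   # agrees with t_p
```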


Student's T distribution Cont'd

Student's t has no mgf because it does not have moments of all orders. In fact, if there are $p$ degrees of freedom, then there are only $p-1$ moments; i.e., a $t_1$ has no mean and a $t_2$ has no variance. Note that $T_1 \sim t_1$ is a Cauchy r.v.; its mean does not exist. Let $T_p$ be a r.v. with a $t_p$ distribution. Then

$$E(T_p) = 0 \text{ if } p > 1, \quad\text{and}\quad Var(T_p) = \frac{p}{p-2} \text{ if } p > 2.$$
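The missing first moment of $t_1$ is easy to see numerically: its running sample mean never settles down (a minimal NumPy sketch, not from the slides; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t1 = rng.standard_t(df=1, size=n)   # t_1 = Cauchy: no mean
t3 = rng.standard_t(df=3, size=n)   # t_3: mean 0, variance 3

running_mean = lambda a: np.cumsum(a) / np.arange(1, n + 1)
for k in (1_000, 10_000, 100_000):
    print(k, running_mean(t1)[k - 1], running_mean(t3)[k - 1])
# The t_3 column settles near 0; the t_1 column keeps jumping around.
```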


F distribution

Let $X_1,\cdots,X_n$ be a random sample from a $N(\mu_X,\sigma_X^2)$, and let $Y_1,\cdots,Y_m$ be a random sample from a $N(\mu_Y,\sigma_Y^2)$.

If we were interested in comparing the variability of the two populations, one quantity of interest would be the ratio $\sigma_X^2/\sigma_Y^2$. Information about this ratio is contained in $S_X^2/S_Y^2$, the ratio of the sample variances.

Definition

Let $X_1,\cdots,X_n$ be a random sample from a $N(\mu_X,\sigma_X^2)$, and let $Y_1,\cdots,Y_m$ be a random sample from a $N(\mu_Y,\sigma_Y^2)$. The random variable

$$F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}$$

has Snedecor's F distribution with $p = n-1$ and $q = m-1$ degrees of freedom, with pdf

$$f_F(x) = \frac{\Gamma(\frac{p+q}{2})}{\Gamma(p/2)\Gamma(q/2)}\Big(\frac{p}{q}\Big)^{p/2}\frac{x^{(p/2)-1}}{(1+(p/q)x)^{(p+q)/2}}, \quad x > 0,$$

written $F \sim F_{n-1,m-1}$.
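A direct simulation of the standardized variance ratio (a minimal NumPy/SciPy sketch, not from the slides; all parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, m, reps = 10, 7, 100_000
sigma_x, sigma_y = 2.0, 0.5
x = rng.normal(0.0, sigma_x, size=(reps, n))
y = rng.normal(5.0, sigma_y, size=(reps, m))
f_stat = (x.var(axis=1, ddof=1) / sigma_x**2) / (y.var(axis=1, ddof=1) / sigma_y**2)

print(stats.kstest(f_stat, stats.f(dfn=n - 1, dfd=m - 1).cdf))   # matches F_{n-1,m-1}
```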


F distribution Cont'd

Properties

A variance ratio may have an F distribution even if the parent populations are not normal; it is enough that their pdfs are symmetric functions.

The F distribution can be obtained as the pdf of $\frac{U/p}{V/q}$, where $U$ and $V$ are independent, $U \sim \chi^2_p$ and $V \sim \chi^2_q$.

If $X \sim F_{p,q}$, then $\frac{(p/q)X}{1+(p/q)X} \sim Beta(p/2, q/2)$.

If $X \sim F_{p,q}$, then $\frac{1}{X} \sim F_{q,p}$.

If $X \sim t_q$, then $X^2 \sim F_{1,q}$.
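Two of these relations are easy to verify by simulation (a minimal NumPy/SciPy sketch, not from the slides; $p$, $q$, the seed, and the replication count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
p, q, reps = 3, 6, 100_000

# X ~ t_q  =>  X^2 ~ F_{1,q}:
t_draws = rng.standard_t(df=q, size=reps)
print(stats.kstest(t_draws**2, stats.f(dfn=1, dfd=q).cdf))

# X ~ F_{p,q}  =>  (p/q)X / (1 + (p/q)X) ~ Beta(p/2, q/2):
f_draws = rng.f(dfnum=p, dfden=q, size=reps)
z = (p / q) * f_draws / (1 + (p / q) * f_draws)
print(stats.kstest(z, stats.beta(a=p / 2, b=q / 2).cdf))
```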


Order statistics

The order statistics of a random sample $X_1,\cdots,X_n$ are the sample values placed in ascending order. They are denoted by $X_{(1)},\cdots,X_{(n)}$.

The order statistics satisfy $X_{(1)} \le \cdots \le X_{(n)}$:

$$X_{(1)} = \min_{1\le i\le n} X_i, \quad X_{(2)} = \text{second smallest } X_i, \quad \cdots, \quad X_{(n)} = \max_{1\le i\le n} X_i.$$

Sample range: $R = X_{(n)} - X_{(1)}$.

Sample median: $M = X_{((n+1)/2)}$ if $n$ is odd; $M = \frac{1}{2}\big(X_{(n/2)} + X_{(n/2+1)}\big)$ if $n$ is even.
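In code, the order statistics are simply the sorted sample (a minimal NumPy sketch, not from the slides; the sample and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(size=9)
x_ord = np.sort(x)                       # order statistics X_(1) <= ... <= X_(n)
n = len(x_ord)

sample_range = x_ord[-1] - x_ord[0]      # R = X_(n) - X_(1)
if n % 2 == 1:
    median = x_ord[(n - 1) // 2]         # X_((n+1)/2), at 0-based index (n-1)/2
else:
    median = 0.5 * (x_ord[n // 2 - 1] + x_ord[n // 2])
print(sample_range, median, np.median(x))   # median agrees with np.median
```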


Distribution of order statistics (discrete case)

Theorem

Let $X_1,X_2,\cdots,X_n$ be a random sample from a discrete distribution with pmf $f_X(x_i) = p_i$, where $x_1 < x_2 < \cdots$ are the possible values of $X$ in ascending order. Define

$$P_0 = 0,\ P_1 = p_1,\ \cdots,\ P_i = p_1 + p_2 + \cdots + p_i.$$

Let $X_{(1)},\cdots,X_{(n)}$ denote the order statistics from the sample. Then

$$P(X_{(j)} \le x_i) = \sum_{k=j}^{n}\binom{n}{k}P_i^k(1-P_i)^{n-k}$$

and

$$P(X_{(j)} = x_i) = \sum_{k=j}^{n}\binom{n}{k}\big[P_i^k(1-P_i)^{n-k} - P_{i-1}^k(1-P_{i-1})^{n-k}\big].$$
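The pmf formula can be checked against simulated die rolls (a minimal NumPy/SciPy sketch, not from the slides; the fair die, $n$, $j$, seed, and replication count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, j, reps = 5, 3, 200_000
x = rng.integers(1, 7, size=(reps, n))       # fair die: x_i = 1..6, p_i = 1/6
xj = np.sort(x, axis=1)[:, j - 1]            # j-th order statistic of each sample

P = np.arange(7) / 6.0                       # P_0, P_1, ..., P_6
cdf = lambda Pi: stats.binom.sf(j - 1, n, Pi)   # P(X_(j) <= x_i) = P(Bin(n, P_i) >= j)
for i in range(1, 7):
    print(i, cdf(P[i]) - cdf(P[i - 1]), np.mean(xj == i))   # theory vs simulation
```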


Proof of the theorem

Fix $i$ and let $Y$ be a r.v. that counts the number of $X_1,X_2,\cdots,X_n$ that are less than or equal to $x_i$. For each of $X_1,X_2,\cdots,X_n$, call the event $\{X_j \le x_i\}$ a "success" and $\{X_j > x_i\}$ a "failure". Then $Y$ is the number of successes in $n$ trials; thus, $Y \sim binomial(n, P_i)$.

The event $\{X_{(j)} \le x_i\}$ is equivalent to the event $\{Y \ge j\}$; that is, at least $j$ of the sample values are less than or equal to $x_i$, i.e.,

$$P(X_{(j)} \le x_i) = P(Y \ge j).$$

For the other case,

$$P(X_{(j)} = x_i) = P(X_{(j)} \le x_i) - P(X_{(j)} \le x_{i-1}).$$


Distribution of order statistics (continuous case)

Theorem

Let $X_{(1)},\cdots,X_{(n)}$ denote the order statistics of a random sample, $X_1,X_2,\cdots,X_n$, from a continuous population with cdf $F_X(x)$ and pdf $f_X(x)$. Then the pdf of $X_{(j)}$ is

$$f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!}f_X(x)[F_X(x)]^{j-1}[1-F_X(x)]^{n-j}.$$

Proof.

We first find the cdf of $X_{(j)}$ and then differentiate it to obtain the pdf. Let $Y$ be a r.v. that counts the number of $X_1,X_2,\cdots,X_n$ that are less than or equal to $x$. Thus, $Y \sim binomial(n, F_X(x))$ (this plays the role of $P_i = F_X(x_i)$ in the above theorem). The event $\{X_{(j)} \le x\}$ is equivalent to the event $\{Y \ge j\}$; that is, at least $j$ of the sample values are less than or equal to $x$,


Proof Cont'd

Proof.

i.e.,

$$F_{X_{(j)}}(x) = P(Y \ge j) = \sum_{k=j}^{n}\binom{n}{k}[F_X(x)]^k[1-F_X(x)]^{n-k},$$

and the pdf of $X_{(j)}$ is

$$f_{X_{(j)}}(x) = \frac{d}{dx}F_{X_{(j)}}(x) = \sum_{k=j}^{n}\binom{n}{k}\Big(k[F_X(x)]^{k-1}[1-F_X(x)]^{n-k}f_X(x) - (n-k)[F_X(x)]^{k}[1-F_X(x)]^{n-k-1}f_X(x)\Big).$$


Proof Cont'd

$$\begin{aligned}
f_{X_{(j)}}(x) ={}& j\binom{n}{j}f_X(x)[F_X(x)]^{j-1}[1-F_X(x)]^{n-j} \\
&+ \sum_{k=j+1}^{n}\binom{n}{k}k[F_X(x)]^{k-1}[1-F_X(x)]^{n-k}f_X(x) \\
&- \sum_{k=j}^{n-1}\binom{n}{k}(n-k)[F_X(x)]^{k}[1-F_X(x)]^{n-k-1}f_X(x) \\
={}& \frac{n!}{(j-1)!(n-j)!}f_X(x)[F_X(x)]^{j-1}[1-F_X(x)]^{n-j} \\
&+ \sum_{k=j}^{n-1}\binom{n}{k+1}(k+1)[F_X(x)]^{k}[1-F_X(x)]^{n-k-1}f_X(x) \\
&- \sum_{k=j}^{n-1}\binom{n}{k}(n-k)[F_X(x)]^{k}[1-F_X(x)]^{n-k-1}f_X(x).
\end{aligned}$$

Since $(k+1)\binom{n}{k+1} = (n-k)\binom{n}{k}$, the last two sums cancel, leaving the claimed pdf.


Uniform order statistic pdf

Example

Let $X_1,X_2,\cdots,X_n$ be i.i.d. uniform(0, 1), so $f_X(x) = 1$ for $x \in (0,1)$ and $F_X(x) = x$ for $x \in (0,1)$. The pdf of the $j$th order statistic is therefore

$$f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!}x^{j-1}(1-x)^{n-j} = \frac{\Gamma(n+1)}{\Gamma(j)\Gamma(n-j+1)}x^{j-1}(1-x)^{(n-j+1)-1}.$$

Thus, the $j$th order statistic from a uniform(0, 1) sample has a $Beta(j, n-j+1)$ distribution. From this we can deduce that

$$E(X_{(j)}) = \frac{j}{n+1} \quad\text{and}\quad Var(X_{(j)}) = \frac{j(n-j+1)}{(n+1)^2(n+2)}.$$
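These Beta moments are quick to confirm by simulation (a minimal NumPy sketch, not from the slides; $n$, $j$, seed, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n, j, reps = 10, 3, 200_000
xj = np.sort(rng.uniform(size=(reps, n)), axis=1)[:, j - 1]   # j-th order statistic

print(xj.mean(), j / (n + 1))                                 # ~ E X_(j)
print(xj.var(), j * (n - j + 1) / ((n + 1) ** 2 * (n + 2)))   # ~ Var X_(j)
```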


Joint distribution of order statistics (continuous case)

Theorem

Let $X_{(1)},\cdots,X_{(n)}$ denote the order statistics of a random sample, $X_1,X_2,\cdots,X_n$, from a continuous population with cdf $F_X(x)$ and pdf $f_X(x)$. Then the joint pdf of $X_{(i)}$ and $X_{(j)}$, $1 \le i < j \le n$, is

$$f_{X_{(i)},X_{(j)}}(u,v) = \frac{n!}{(i-1)!(j-1-i)!(n-j)!}f_X(u)f_X(v)[F_X(u)]^{i-1}[F_X(v)-F_X(u)]^{j-1-i}[1-F_X(v)]^{n-j}$$

for $-\infty < u < v < \infty$.


Joint distribution of order statistics (continuous case)

Theorem

The joint pdf of three or more order statistics can be derived using similar but even more involved arguments. Perhaps the most useful of these pdfs is that of all $n$ order statistics:

$$f_{X_{(1)},\cdots,X_{(n)}}(x_1,\cdots,x_n) = \begin{cases} n!\,f_X(x_1)\cdots f_X(x_n), & x_1 < \cdots < x_n, \\ 0, & \text{otherwise.} \end{cases}$$

This joint pdf and the techniques from the previous chapter can be used to derive marginal and conditional distributions and distributions of other functions of the order statistics.


Convergence in probability

Definition

A sequence of r.v.s, $X_1,X_2,\cdots$, converges in probability to a r.v. $X$ if, for every $\varepsilon > 0$,

$$\lim_{n\to\infty} P(|X_n - X| \ge \varepsilon) = 0, \quad\text{or equivalently}\quad \lim_{n\to\infty} P(|X_n - X| < \varepsilon) = 1;$$

we write $X_n \xrightarrow{p} X$.

The r.v.s $X_1,X_2,\cdots$ here are typically not i.i.d., as they would be in a random sample;

The distribution of $X_n$ changes as the subscript changes;

The distribution of $X_n$ converges to some limiting distribution as the subscript becomes large.


Weak law of large numbers

Theorem

Let $X_1,X_2,\cdots$ be i.i.d. r.v.s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for every $\varepsilon > 0$,

$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| < \varepsilon) = 1;$$

that is, $\bar{X}_n$ converges in probability to $\mu$.

Proof.

Using Chebychev's inequality, for every $\varepsilon > 0$,

$$P(|\bar{X}_n - \mu| < \varepsilon) = 1 - P(|\bar{X}_n - \mu| \ge \varepsilon) = 1 - P((\bar{X}_n - \mu)^2 \ge \varepsilon^2) \ge 1 - \frac{E(\bar{X}_n - \mu)^2}{\varepsilon^2} = 1 - \frac{\sigma^2}{n\varepsilon^2} \to 1.$$
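The shrinking tail probability is visible in simulation (a minimal NumPy sketch, not from the slides; the uniform population, $\varepsilon$, seed, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, eps, reps = 0.5, 0.05, 20_000
for n in (10, 100, 1_000, 10_000):
    xbar = rng.uniform(size=(reps, n)).mean(axis=1)   # uniform(0,1): mu = 1/2
    print(n, np.mean(np.abs(xbar - mu) >= eps))       # P(|Xbar_n - mu| >= eps) -> 0
```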


Consistency of $S^2$

Let $X_1,X_2,\cdots$ be i.i.d. r.v.s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$. If we define

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2,$$

can we prove a WLLN for $S_n^2$?

Using Chebychev's inequality and $E(S_n^2) = \sigma^2$, we have

$$P(|S_n^2 - \sigma^2| \ge \varepsilon) \le \frac{E(S_n^2 - \sigma^2)^2}{\varepsilon^2} = \frac{Var(S_n^2)}{\varepsilon^2}.$$

Thus, a sufficient condition for $S_n^2$ to converge in probability to $\sigma^2$ is that $Var(S_n^2) \to 0$ as $n \to \infty$.
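For a normal population, $Var(S_n^2) = 2\sigma^4/(n-1)$, which does vanish; a quick empirical look (a minimal NumPy sketch, not from the slides; parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)
sigma2, reps = 4.0, 20_000
for n in (10, 100, 1_000):
    s2 = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n)).var(axis=1, ddof=1)
    print(n, s2.var(), 2 * sigma2**2 / (n - 1))   # empirical vs 2*sigma^4/(n-1)
```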


Consistency of S?

Theorem

Suppose $X_1,X_2,\cdots$ converges in probability to a r.v. $X$ and that $h$ is a continuous function. Then $h(X_1), h(X_2),\cdots$ converges in probability to $h(X)$.

Consistency of S

If $S_n^2$ is a consistent estimator of $\sigma^2$, then the sample standard deviation $S_n = \sqrt{S_n^2} = h(S_n^2)$ is a consistent estimator of $\sigma$.

Note that $S_n$ is, in fact, a biased estimator of $\sigma$, but the bias disappears asymptotically.


Almost Sure Convergence

A type of convergence that is stronger than convergence in probability is almost sure convergence.

Definition

A sequence of r.v.s, $X_1,X_2,\cdots$, converges almost surely to a r.v. $X$ if, for every $\varepsilon > 0$,

$$P(\lim_{n\to\infty} |X_n - X| < \varepsilon) = 1;$$

we write $X_n \xrightarrow{a.s.} X$.

Although the two definitions look similar, they are very different statements;

Given a sample space $\Omega$, $X_n(\omega)$ and $X(\omega)$ are all functions defined on $\Omega$. $X_n \xrightarrow{a.s.} X$ indicates that the functions $X_n(\omega)$ converge to $X(\omega)$ for all $\omega \in \Omega$ except perhaps for $\omega \in N$, where $N \subset \Omega$ and $P(N) = 0$.


Almost Sure Convergence

Example

Let the sample space $\Omega$ be the closed interval $[0, 1]$ with the uniform pdf. Define the r.v.s

$$X_n(\omega) = \omega + \omega^n, \quad\text{and}\quad X(\omega) = \omega.$$

For every $\omega \in [0, 1)$, $\omega^n \to 0$ as $n \to \infty$, so $X_n(\omega) \to \omega = X(\omega)$. However, $X_n(1) = 2$ for every $n$, so $X_n(1)$ does not converge to $1 = X(1)$. But since the convergence occurs on the set $[0, 1)$ and $P([0, 1)) = 1$, we conclude that

$$X_n \xrightarrow{a.s.} X.$$
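Numerically (a minimal NumPy sketch, not from the slides; the grid of sample points is an arbitrary choice):

```python
import numpy as np

# X_n(omega) = omega + omega**n -> omega for every omega in [0, 1);
# only the single point omega = 1, of probability zero, misbehaves.
omegas = np.array([0.0, 0.5, 0.9, 0.99, 1.0])
for n in (10, 100, 1_000):
    print(n, omegas + omegas**n)   # last entry stays at 2.0, the rest -> omega
```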


Convergence in probability, not a.s.

Example

Let the sample space $\Omega$ be the closed interval $[0, 1]$ with the uniform pdf. Define the r.v.s

$$X_1(\omega) = \omega + I_{[0,1]}(\omega), \quad X_2(\omega) = \omega + I_{[0,\frac{1}{2}]}(\omega), \quad X_3(\omega) = \omega + I_{[\frac{1}{2},1]}(\omega),$$
$$X_4(\omega) = \omega + I_{[0,\frac{1}{3}]}(\omega), \quad X_5(\omega) = \omega + I_{[\frac{1}{3},\frac{2}{3}]}(\omega), \quad X_6(\omega) = \omega + I_{[\frac{2}{3},1]}(\omega), \quad\cdots$$

and let $X(\omega) = \omega$. Since the indicator intervals shrink, $P(|X_n - X| \ge \varepsilon) \to 0$; thus $X_n \xrightarrow{p} X$.

However, for every $\omega$, the value of $X_n(\omega)$ alternates between the values $\omega$ and $1 + \omega$ infinitely often. Thus, $X_n$ does not converge to $X$ almost surely.
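The block structure of the indicators can be made concrete (a minimal Python sketch, not from the slides; the helper interval() and the sample point are illustrative constructions, not notation from the lecture):

```python
# Block b = 1, 2, 3, ... contributes the b intervals [(k-1)/b, k/b], k = 1..b.
# Interval lengths 1/b -> 0 (convergence in probability), yet each omega
# falls in one interval of every block (so no almost sure convergence).
def interval(n):
    b, start = 1, 1
    while start + b <= n:          # skip past complete blocks
        start += b
        b += 1
    k = n - start + 1              # position of X_n inside block b
    return (k - 1) / b, k / b

omega = 0.37
hits = [n for n in range(1, 200) if interval(n)[0] <= omega <= interval(n)[1]]
print(hits)   # X_n(omega) = 1 + omega at these n, and this never stops
```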


Strong law of large numbers

Theorem

Let $X_1,X_2,\cdots$ be i.i.d. r.v.s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for every $\varepsilon > 0$,

$$P(\lim_{n\to\infty} |\bar{X}_n - \mu| < \varepsilon) = 1, \quad\text{i.e.,}\quad \bar{X}_n \xrightarrow{a.s.} \mu.$$

For both the weak and strong laws of large numbers we have assumed a finite variance. In fact, both laws hold without this assumption; the only moment condition needed is $E|X_i| < \infty$;

Convergence almost surely, being the stronger criterion, implies convergence in probability;

If a sequence converges in probability, it is possible to find a subsequence that converges almost surely.


Convergence in distribution

Definition

A sequence of r.v.s, $X_1,X_2,\cdots$, converges in distribution to a r.v. $X$ if

$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$$

at all points $x$ where $F_X(x)$ is continuous.

Example

If the r.v.s $X_1,X_2,\cdots$ are i.i.d. uniform(0, 1) and $X_{(n)} = \max_{1\le i\le n} X_i$, let us examine whether $X_{(n)}$ converges in distribution.

As $n \to \infty$, we expect $X_{(n)}$ to get close to 1. Since $X_{(n)}$ must necessarily be less than 1, we have for any $\varepsilon > 0$

$$P(|X_{(n)} - 1| \ge \varepsilon) = P(X_{(n)} \ge 1 + \varepsilon) + P(X_{(n)} \le 1 - \varepsilon) = 0 + P(X_{(n)} \le 1 - \varepsilon).$$


Example Cont'd

Next, using the fact that we have an i.i.d. sample, we can write

$$P(X_{(n)} \le 1 - \varepsilon) = P(X_i \le 1 - \varepsilon,\ i = 1,\cdots,n) = (1-\varepsilon)^n,$$

which goes to 0. So we have proved that $X_{(n)}$ converges to 1 in probability. However, if we take $\varepsilon = \frac{t}{n}$, we then have

$$P\Big(X_{(n)} \le 1 - \frac{t}{n}\Big) = \Big(1 - \frac{t}{n}\Big)^n \to e^{-t},$$

which, upon rearranging, yields

$$P(n(1 - X_{(n)}) \le t) \to 1 - e^{-t};$$

that is, the r.v. $n(1 - X_{(n)})$ converges in distribution to an exponential(1) r.v.
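The limit shows up clearly in simulation (a minimal NumPy/SciPy sketch, not from the slides; $n$, seed, and replication count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, reps = 500, 100_000
x_max = rng.uniform(size=(reps, n)).max(axis=1)
z = n * (1.0 - x_max)                      # n(1 - X_(n))

print(stats.kstest(z, stats.expon().cdf))  # close to exponential(1)
```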


Central limit theorem

Let X1, X2, · · · be a sequence of i.i.d. r.v.s whose mgfs exist in a neighborhood of 0 (i.e., M_{X_i}(t) exists for |t| < h, for some positive h). Let E(Xi) = µ and Var(Xi) = σ² > 0. Define

X̄_n = (1/n) ∑_{i=1}^n X_i.

Let Gn(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x, −∞ < x < ∞,

lim_{n→∞} Gn(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.
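To see the theorem in action, the following sketch (Python with NumPy; an added illustration, with the exponential(1) parent chosen arbitrarily so that µ = σ = 1) compares Gn(x) with the standard normal cdf Φ(x) at a few points:

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, reps = 100, 50_000
mu, sigma = 1.0, 1.0  # exponential(1) has mean 1 and standard deviation 1

samples = rng.exponential(scale=1.0, size=(reps, n))
z = sqrt(n) * (samples.mean(axis=1) - mu) / sigma  # sqrt(n)(X-bar - mu)/sigma

phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal cdf
for x in (-1.0, 0.0, 1.0):
    print(f"x={x:+.1f}: G_n(x) ≈ {(z <= x).mean():.4f}, Phi(x) = {phi(x):.4f}")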

The proof of the central limit theorem

Proof.

We will show that, for |t| < h, the mgf of √n(X̄_n − µ)/σ converges to e^{t²/2}, the mgf of a N(0, 1) r.v.

Define Y_i = (X_i − µ)/σ, and let M_Y(t) denote the common mgf of the Y_i's, which exists for |t| < σh. We have

M_{√n(X̄_n−µ)/σ}(t) = M_{∑_{i=1}^n Y_i/√n}(t) = M_{∑_{i=1}^n Y_i}(t/√n) = (M_Y(t/√n))^n.

We now expand M_Y(t/√n) in a Taylor series around 0:

M_Y(t/√n) = ∑_{k=0}^∞ M_Y^{(k)}(0) (t/√n)^k / k!,

Proof Cont’d

where M_Y^{(k)}(0) = (d^k M_Y(t)/dt^k)|_{t=0}. Since the mgfs exist for |t| < σh, the power series expansion is valid if |t| < √n σh.

Using the facts that M_Y^{(0)}(0) = 1, M_Y^{(1)}(0) = 0, and M_Y^{(2)}(0) = 1, we have

M_Y(t/√n) = 1 + (t/√n)²/2! + R_Y(t/√n),

where R_Y is the remainder term in the Taylor expansion,

R_Y(t/√n) = ∑_{k=3}^∞ M_Y^{(k)}(0) (t/√n)^k / k!.

An application of Taylor's Theorem (5.5.21) shows that, for fixed t ≠ 0,

lim_{n→∞} R_Y(t/√n) / (t/√n)² = 0.

Proof Cont’d

Since t is fixed, n R_Y(t/√n) = t² · R_Y(t/√n)/(t/√n)², so we also have

lim_{n→∞} n R_Y(t/√n) = 0,

and this is also true at t = 0 since R_Y(0/√n) = 0. Thus, for any fixed t, we can write

lim_{n→∞} (M_Y(t/√n))^n = lim_{n→∞} [1 + (t/√n)²/2! + R_Y(t/√n)]^n
= lim_{n→∞} [1 + (1/n)(t²/2 + n R_Y(t/√n))]^n
= e^{t²/2},

where the last step uses the standard limit (1 + a_n/n)^n → e^a whenever a_n → a. Note that e^{t²/2} is the mgf of the N(0, 1) distribution.

Strong form of the central limit theorem

Let X1, X2, · · · be a sequence of i.i.d. r.v.s with E(Xi) = µ and 0 < Var(Xi) = σ² < ∞. Define

X̄_n = (1/n) ∑_{i=1}^n X_i.

Let Gn(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x,

lim_{n→∞} Gn(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.

In particular, none of the assumptions about mgfs is needed: the use of characteristic functions replaces them in the proof. This theorem is general enough for almost all statistical purposes, since it holds whenever the parent distribution has finite variance.

Normal approximation to the negative binomial

Let X1, X2, · · · be a random sample from a negative binomial(r, p) distribution. Recall that

E(X) = r(1 − p)/p and Var(X) = r(1 − p)/p²,

and the central limit theorem tells us that

√n(X̄ − r(1 − p)/p) / √(r(1 − p)/p²) → N(0, 1).

For r = 10, p = 1/2, and n = 30, the calculations would be

P(X̄ ≤ 11) = P(∑_{i=1}^{30} X_i ≤ 330) = · · · = 0.8916 (exact),

P(X̄ ≤ 11) = P(√30(X̄ − 10)/√20 ≤ √30(11 − 10)/√20) ≈ 0.8888 (normal approximation).
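Both numbers can be checked directly. The sketch below (Python with SciPy; an added illustration) uses the fact that a sum of n i.i.d. negative binomial(r, p) r.v.s is negative binomial(nr, p) in this "number of failures" parameterization, which is also the one scipy.stats.nbinom uses:

import numpy as np
from scipy.stats import nbinom, norm

r, p, n = 10, 0.5, 30

# Exact: P(X-bar <= 11) = P(sum of the X_i <= 330), sum ~ negative binomial(300, 1/2)
exact = nbinom.cdf(330, n * r, p)

# CLT approximation: Phi(sqrt(30)(11 - 10)/sqrt(20))
z = np.sqrt(n) * (11 - r * (1 - p) / p) / np.sqrt(r * (1 - p) / p**2)
approx = norm.cdf(z)

print(f"exact  {exact:.4f}")   # ≈ 0.8916, as on the slide
print(f"normal {approx:.4f}")  # ≈ 0.8897; the slide's 0.8888 rounds z = 1.2247 to 1.22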

Slutsky’s Theorem

Let Xn → X in distribution and Yn → a, a constant, in probability. Then:

a. Yn Xn → aX in distribution;
b. Xn + Yn → X + a in distribution.

Example: let √n(X̄_n − µ)/σ → N(0, 1), but suppose the value of σ is unknown. If lim_{n→∞} Var(S_n²) = 0, then S_n² → σ² in probability, and hence σ/S_n → 1 in probability. Slutsky's theorem then tells us that

√n(X̄_n − µ)/S_n = (σ/S_n) · (√n(X̄_n − µ)/σ) → N(0, 1).
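The practical payoff is that the studentized mean is asymptotically normal even though S_n is random. A quick check (Python with NumPy; an added illustration using exponential(1) data, so µ = 1):

import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20_000
mu = 1.0

x = rng.exponential(size=(reps, n))
# Studentize with S_n in place of the unknown sigma
t_stat = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

# By Slutsky's theorem, t_stat is approximately N(0, 1)
print("P(T <= 1) ≈", (t_stat <= 1.0).mean(), " (compare Phi(1) ≈ 0.8413)")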

Generating a Random Sample

Exponential lifetime

Example

Suppose that a particular electrical component is to be modeled with an exponential(λ) lifetime. The manufacturer is interested in determining the probability that, out of c components, at least t of them will last h hours.

Taking this one step at a time, we have

p1 = P(component lasts at least h hours) = P(X ≥ h | λ),

and, assuming that the components are independent, we can model the outcomes of the c components as Bernoulli trials, so

p2 = P(at least t components last h hours) = ∑_{k=t}^{c} (c choose k) p1^k (1 − p1)^{c−k}.

Example Cont’d

Moreover, the exponential model has the advantage that p1 can be expressed in closed form:

p1 = ∫_h^∞ (1/λ) e^{−x/λ} dx = e^{−h/λ}.

However, if each component were modeled with, say, a gamma distribution, then p1 may not be expressible in closed form, which would make the calculation of p2 even more involved.

A simulation approach to the calculation of such expressions is to generate r.v.s with the desired distribution and then use the Weak Law of Large Numbers to validate the simulation. That is, if Yi, i = 1, 2, · · · , n, are i.i.d., then a consequence of that theorem is

(1/n) ∑_{i=1}^n h(Yi) → E(h(Y)) in probability.

Approximate calculation

The probability p2 can be calculated using the following steps. For j = 1, · · · , n:

a. Generate X1, · · · , Xc i.i.d. ∼ exponential(λ);
b. Set Yj = 1 if at least t of the Xi's are ≥ h; otherwise, set Yj = 0.

Then, because Yj ∼ Bernoulli(p2) and E(Yj) = p2,

(1/n) ∑_{j=1}^n Yj → p2 in probability.

Two major concerns should be emphasized:

We must examine how to generate the r.v.s that we need;
We then use a version of the Law of Large Numbers to validate the simulation approximation.
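A sketch of the whole procedure (Python with NumPy/SciPy; the values of λ, c, t, and h below are hypothetical, chosen only to make the example run) compares the Monte Carlo estimate of p2 with the exact binomial answer available in the exponential case:

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
lam, c, t, h = 100.0, 20, 15, 50.0  # hypothetical λ, c, t, h
n = 100_000

# Step (a): lifetimes of c components, repeated n times
lifetimes = rng.exponential(scale=lam, size=(n, c))
# Step (b): Y_j = 1 if at least t components last h hours
y = (lifetimes >= h).sum(axis=1) >= t

p1 = np.exp(-h / lam)           # closed form for the exponential model
exact = binom.sf(t - 1, c, p1)  # p2 = P(at least t successes out of c)
print(f"Monte Carlo {y.mean():.4f} vs exact {exact:.4f}")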

Pseudo-random number generation

We start with the assumption that we can generate i.i.d. uniform r.v.s U1, · · · , Um; there exist many algorithms for generating pseudo-random numbers that will pass almost all tests for uniformity. Since we are starting from uniform r.v.s, our problem here is really not that of generating the desired r.v.s, but rather of transforming the uniform r.v.s to the desired distribution.

There are two general methods for doing this:

Direct model
Indirect model

Direct Model

Probability integral transform

A direct model of generating a r.v. is one for which there exists a closed-form function g(u) such that the transformed variable Y = g(U) has the desired distribution when U ∼ uniform(0, 1). As might be recalled, this was already accomplished for continuous r.v.s by the probability integral transform, where any distribution was transformed to the uniform; hence the inverse transformation solves our problem.

Example

If Y is a continuous r.v. with cdf F_Y, then the r.v. F_Y^{−1}(U), where U ∼ uniform(0, 1), has distribution F_Y. If Y ∼ exponential(λ), then

F_Y^{−1}(U) = −λ log(1 − U)

is an exponential(λ) r.v. (Note the minus sign: F_Y(y) = 1 − e^{−y/λ}, so solving u = F_Y(y) gives y = −λ log(1 − u).)

Analysis of probability integral transform

Thus, if we generate U1, · · · , Un as i.i.d. uniform r.v.s, the Yi = −λ log(1 − Ui), i = 1, · · · , n, are i.i.d. exponential(λ) r.v.s. As an example, for n = 10,000 generated values ui we have

(1/n) ∑ ui = 0.5019 and (1/(n − 1)) ∑ (ui − ū)² = 0.0842.

According to the WLLN, we know that

Ū → E(U) = 1/2 and S² → Var(U) = 1/12 = 0.0833.

The transformed variables yi = −2 log(1 − ui) have an exponential(2) distribution, and we find that

(1/n) ∑ yi = 2.0004 and (1/(n − 1)) ∑ (yi − ȳ)² = 4.0908.
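The same experiment is easy to reproduce (Python with NumPy; an added illustration whose draws, and hence output, will differ slightly from the slide's figures):

import numpy as np

rng = np.random.default_rng(4)
n, lam = 10_000, 2.0

u = rng.uniform(size=n)
y = -lam * np.log(1 - u)  # probability integral transform: exponential(2) draws

print("u: mean", u.mean(), " var", u.var(ddof=1))  # ≈ 1/2 and 1/12
print("y: mean", y.mean(), " var", y.var(ddof=1))  # ≈ 2 and 4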

The relationship between the exponential and other distributions allows the quick generation of many r.v.s from i.i.d. uniform(0, 1) r.v.s Uj. For example,

Yj = −λ log(Uj) ∼ exponential(λ),

Y = −2 ∑_{j=1}^{ν} log(Uj) ∼ χ²_{2ν},

Y = −β ∑_{j=1}^{a} log(Uj) ∼ gamma(a, β),

Y = ∑_{j=1}^{a} log(Uj) / ∑_{j=1}^{a+b} log(Uj) ∼ beta(a, b).

Unfortunately, there are limits to this transformation. For example, we cannot use it to generate χ² r.v.s with odd degrees of freedom.
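For integer shape parameters, these identities give samplers in a few lines (Python with NumPy; an added sketch, with the values of a, b, and β chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(5)
n, a, b, beta_scale = 100_000, 3, 4, 2.0  # a and b must be integers here

logs = np.log(rng.uniform(size=(n, a + b)))
gamma_draws = -beta_scale * logs[:, :a].sum(axis=1)      # gamma(a, β)
beta_draws = logs[:, :a].sum(axis=1) / logs.sum(axis=1)  # beta(a, b)

print("gamma mean", gamma_draws.mean(), "vs aβ =", a * beta_scale)
print("beta mean ", beta_draws.mean(), "vs a/(a+b) =", a / (a + b))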

Box-Muller algorithm

Recall that the probability integral transform inverts the cdf:

F_Y^{−1}(u) = y ⇔ u = ∫_{−∞}^{y} f_Y(t) dt.

Application of this formula to the exponential distribution was particularly handy, as the integral equation has a simple solution. In many cases, however, there is no closed-form solution, and other types of transformations, or indirect methods, may be feasible.

Box-Muller algorithm

Generate U1 and U2, two independent uniform(0, 1) r.v.s, and set

R = √(−2 log U1) and θ = 2πU2.

Then

X = R cos θ ∼ N(0, 1) and Y = R sin θ ∼ N(0, 1),

and X and Y are independent.
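A direct implementation (Python with NumPy; an added sketch) confirms marginal normality and that the pair (X, Y) is uncorrelated:

import numpy as np

rng = np.random.default_rng(6)
n = 100_000

u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
r = np.sqrt(-2 * np.log(u1))
theta = 2 * np.pi * u2
x, y = r * np.cos(theta), r * np.sin(theta)

print("means", x.mean(), y.mean())        # ≈ 0
print("vars ", x.var(), y.var())          # ≈ 1
print("corr ", np.corrcoef(x, y)[0, 1])   # ≈ 0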

Generating discrete r.v.s

Generating procedure

If Y is a discrete r.v. taking on values y1 < y2 < · · · < yk, then we can similarly write

P[F_Y(yi) < U ≤ F_Y(yi+1)] = F_Y(yi+1) − F_Y(yi) = P(Y = yi+1).

To generate Y ∼ F_Y:

a. Generate U ∼ uniform(0, 1);
b. If F_Y(yi) < U ≤ F_Y(yi+1), set Y = yi+1,

where we define y0 = −∞ and F_Y(y0) = 0.
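In code, locating the interval (F_Y(yi), F_Y(yi+1)] containing U is a binary search over the cumulative sums (Python with NumPy; the support and pmf below are hypothetical):

import numpy as np

rng = np.random.default_rng(7)
values = np.array([1, 2, 3, 4])        # hypothetical support y1 < ... < yk
pmf = np.array([0.1, 0.2, 0.3, 0.4])
cdf = np.cumsum(pmf)                   # F_Y(y_1), ..., F_Y(y_k)

u = rng.uniform(size=100_000)
# smallest index i with u <= F_Y(y_i), i.e. F_Y(y_{i-1}) < u <= F_Y(y_i)
draws = values[np.searchsorted(cdf, u)]

for v, p in zip(values, pmf):
    print(v, (draws == v).mean(), "vs", p)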

Indirect Model

Beta r.v. generation I

Suppose the goal is to generate Y ∼ beta(a, b). If both a and b are integers, then the direct transformation method above can be used. Suppose, however, that a and b are not integers, for example a = 2.7 and b = 6.3.

If (U, V) are independent uniform(0, 1) r.v.s, then the probability of the shaded area (in the slide's figure: the region with V ≤ y lying below the curve f_Y(v)/c) is

P(V ≤ y, U ≤ (1/c) f_Y(V)) = ∫_0^y ∫_0^{f_Y(v)/c} du dv = (1/c) ∫_0^y f_Y(v) dv = (1/c) P(Y ≤ y).

In particular, taking y = 1, we have

P(U ≤ (1/c) f_Y(V)) = 1/c.

Rejection sampling

Hence, we have

P(Y ≤ y) = P(V ≤ y, U ≤ (1/c) f_Y(V)) / P(U ≤ (1/c) f_Y(V)) = P(V ≤ y | U ≤ (1/c) f_Y(V)).

This result suggests the following rejection sampling algorithm for generating Y ∼ beta(a, b):

a. Generate (U, V) independent uniform(0, 1);
b. If U < (1/c) f_Y(V), set Y = V; otherwise, return to step (a).

This algorithm generates a beta(a, b) r.v. as long as c ≥ max_y f_Y(y) and, in fact, can be generalized to any bounded density with bounded support.
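A sketch of the algorithm for a = 2.7, b = 6.3 (Python with NumPy/SciPy; scipy's beta.pdf stands in for f_Y, and c is taken to be the pdf's maximum, attained at the mode since a, b > 1):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(8)
a, b = 2.7, 6.3

mode = (a - 1) / (a + b - 2)   # f_Y is maximized at its mode when a, b > 1
c = beta.pdf(mode, a, b)       # optimal c = max_y f_Y(y)

draws, trials = [], 0
while len(draws) < 10_000:
    trials += 1
    u, v = rng.uniform(), rng.uniform()
    if u < beta.pdf(v, a, b) / c:  # accept V as a beta(a, b) draw
        draws.append(v)

draws = np.array(draws)
print("sample mean", draws.mean(), "vs a/(a+b) =", a / (a + b))
print("trials per draw", trials / len(draws), "vs c =", c)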

Analysis of rejection sampling

It should be clear that the optimal choice of c is c = max_y f_Y(y). To see why, note that the algorithm is open-ended in the sense that we do not know in advance how many (U, V) pairs will be needed to get one r.v. Y.

Let N be the number of (U, V) pairs required for one Y. Then, recalling that P(U ≤ (1/c) f_Y(V)) = 1/c, we see that N is a geometric(1/c) r.v. Thus, to generate one Y we expect to need E(N) = c pairs (U, V), and in this sense minimizing c optimizes the algorithm.

Examining the figure, we see that the algorithm is wasteful in the region where U > (1/c) f_Y(V). This is because we are using a uniform r.v. V to get a beta r.v. Y; to improve, we might start with something closer to the beta distribution.

Theorem of Accept/Reject sampling

Let Y ∼ fY (y) and V ∼ fV (v), where fY and fV have common support, with

M = sup_y fY (y)/fV (y) < ∞.

To generate a r.v. Y ∼ fY :

a. Generate U ∼ uniform(0, 1) and V ∼ fV , independent;
b. If U < (1/M) fY (V )/fV (V ), set Y = V ; otherwise, return to step (a).
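A minimal Python sketch of the theorem's algorithm (the function and argument names are mine; f_target, f_proposal, and draw_proposal stand for fY , fV , and a sampler for fV ):

import random

def accept_reject(f_target, f_proposal, draw_proposal, M, n):
    """Draw n samples from f_target by Accept/Reject with envelope M * f_proposal.

    Requires sup_y f_target(y) / f_proposal(y) <= M < infinity.
    """
    samples = []
    while len(samples) < n:
        v = draw_proposal()   # V ~ f_V
        u = random.random()   # U ~ uniform(0, 1), independent of V
        # Accept V as a draw from f_Y with probability f_Y(V) / (M f_V(V))
        if u < f_target(v) / (M * f_proposal(v)):
            samples.append(v)
    return samples

On average the while-loop runs M times per accepted sample, matching the geometric(1/M) trial count derived in the proof below.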


Proof

The generated r.v. Y has cdf

P(Y ≤ y) = P(V ≤ y | stop) = P(V ≤ y | U < (1/M) fY (V )/fV (V ))

= P(V ≤ y, U < (1/M) fY (V )/fV (V )) / P(U < (1/M) fY (V )/fV (V ))

= [ ∫_{−∞}^{y} ∫_{0}^{(1/M) fY (v)/fV (v)} du fV (v) dv ] / [ ∫_{−∞}^{+∞} ∫_{0}^{(1/M) fY (v)/fV (v)} du fV (v) dv ]

= ∫_{−∞}^{y} fY (v) dv.

(The inner integral equals (1/M) fY (v)/fV (v), so the numerator is (1/M) ∫_{−∞}^{y} fY (v) dv and the denominator is 1/M.)

Hence, the number of trials needed to generate one Y is a geometric(1/M) r.v., and M is the expected number of trials.


Beta r.v. generation II

To generate Y ∼ beta(2.7, 6.3), we can consider the following algorithm:

a. Generate U ∼ uniform(0, 1) and V ∼ beta(2, 6), independent;
b. If U < (1/M) fY (V )/fV (V ), set Y = V ; otherwise, return to step (a).

This Accept/Reject algorithm will generate the required Y as long as sup_y fY (y)/fV (y) ≤ M < ∞. For the given densities, we have M = 1.67, so the requirement is satisfied.
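As a numerical sketch of this example (using scipy.stats for the two beta densities; the grid-based estimate of M is an assumption of mine, not the slides' derivation):

import numpy as np
from scipy.stats import beta

fY = lambda y: beta.pdf(y, 2.7, 6.3)   # target density
fV = lambda v: beta.pdf(v, 2.0, 6.0)   # proposal density

# Estimate M = sup_y fY(y)/fV(y) on a grid; it comes out to about 1.67
grid = np.linspace(1e-6, 1 - 1e-6, 10_000)
M = np.max(fY(grid) / fV(grid))

rng = np.random.default_rng(0)
samples, trials = [], 0
while len(samples) < 10_000:
    v = rng.beta(2.0, 6.0)
    trials += 1
    if rng.uniform() < fY(v) / (M * fV(v)):
        samples.append(v)

print(M)                      # ~1.67
print(trials / len(samples))  # average trials per accepted draw, ~M
print(np.mean(samples))       # ~2.7/(2.7+6.3) = 0.3, the beta(2.7, 6.3) mean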


Markov Chain Monte Carlo

The Main idea of MCMC

We cannot sample directly from the target distribution p(x) in the integral

∫ f (x) p(x) dx.

Create a Markov chain whose transition matrix does not depend on the normalization term.

Make sure the chain has a stationary distribution, and that it equals the target distribution.

After a sufficient number of iterations, the chain will converge to the stationary distribution.


Overview

Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution.

The algorithm was proposed in 1953 and is ranked among the top-10 most important algorithms of the 20th century.

MCMC works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution π(i).

That is, the Markov chain has stationary distribution π(i) associated with transition probability matrix P.

The Markov chain converges to the stationary distribution π(i) for an arbitrary initial state x0.


Markov Chain and Random Walk

Random process

A random process is a collection of r.v.s indexed by some set I, taking values in a set S.

I is the index set, usually time, e.g., Z+, R, and R+, etc.

S is the state space, e.g., Z+, R^n, and {a, b, c}, etc.

We classify random processes according to:

the index set, which can be discrete or continuous;

the state space, which can be finite, countable, or uncountable (continuous).


Markov property

More formally, X(t) is Markovian if the following property holds:

P(X(tn) = jn | X(tn−1) = jn−1, · · · , X(t1) = j1) = P(X(tn) = jn | X(tn−1) = jn−1)

for every finite sequence of times t1 < · · · < tn ∈ I and of states j1, · · · , jn ∈ S.

A random process is called a Markov process if, conditional on the current state, its future is independent of its past.

A process X0, X1, · · · satisfies the Markov property if

P(Xn+1 = xn+1 | X0 = x0, · · · , Xn = xn) = P(Xn+1 = xn+1 | Xn = xn)

for all n and all xi ∈ Ω.

The term Markov property refers to the memoryless property of a stochastic process.


Discrete time Markov Chain

A stochastic process X0, X1, · · · with discrete time and discrete state space is a Markov chain if it satisfies the Markov property, i.e.,

P(Xn+1 = xn+1 | X0 = x0, · · · , Xn = xn) = P(Xn+1 = xn+1 | Xn = xn)

for all n and all xi ∈ Ω.

A Markov chain Xt is said to be time homogeneous if P(Xs+t = j | Xs = i) is independent of s. When this holds, putting s = 0 gives

P(Xs+t = j | Xs = i) = P(Xt = j | X0 = i).

If, moreover, P(Xn+1 = j | Xn = i) = Pij is independent of n, then X is said to be a time-homogeneous Markov chain.


Examples

Ω = {A, B}; π(A, B) = q, π(A, A) = 1 − q, π(B, A) = r, π(B, B) = 1 − r for some 0 < q, r < 1, and µ(A) = 1, µ(B) = 0, so that at time 0 we always start at A.

Ω = Z; π(a, a − 1) = 1/2, π(a, a + 1) = 1/2, and π(a, b) = 0 if b ≠ a ± 1 for every a ∈ Z; µ(0) = 1 and µ(a) = 0 if a ≠ 0, so at time 0 we always start at 0.

A graph G = (V, E) consists of a vertex set V and an edge set E, where the elements of E are unordered pairs of vertices: E ⊂ {{x, y} : x, y ∈ V, x ≠ y}. The degree deg(x) of a vertex x is the number of neighbours of x. In this case, Ω = V is finite, π(a, b) = 1/deg(a) if b is a neighbour of a, and π(a, b) = 0 otherwise; µ(a) = 1/|V| for every a ∈ V, so at time 0 we start from the uniform distribution on V.
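The third example, simple random walk on a graph, is easy to simulate; for a connected, non-bipartite graph the long-run visit frequencies approach deg(x)/(2|E|). A small sketch (the 4-vertex graph is a toy choice of mine, not from the slides):

import random
from collections import Counter

# A small undirected graph as an adjacency list
G = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
E = sum(len(nbrs) for nbrs in G.values()) // 2   # number of edges

x, visits = 0, Counter()
for _ in range(200_000):
    x = random.choice(G[x])    # move to a uniformly random neighbour
    visits[x] += 1

for v in sorted(G):
    # empirical frequency vs. deg(v)/(2|E|), the walk's stationary distribution
    print(v, visits[v] / 200_000, len(G[v]) / (2 * E))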


Transition matrix

Definition

Let a Markov chain have P(t+1)_{x,y} = P[Xt+1 = y | Xt = x], and let the finite state space be Ω = [N]. This gives us a transition matrix P(t+1) at time t. The transition matrix is an N × N matrix of nonnegative entries such that the sum over each row is 1, since for all t and all x ∈ Ω,

∑_y P(t+1)_{x,y} = ∑_y P[Xt+1 = y | Xt = x] = 1.

For example,

P(t+1) =
  [ 1/2  1/2   0    0  ]
  [ 1/3  2/3   0    0  ]
  [  0    0   3/4  1/4 ]
  [  0    0   1/4  3/4 ]

P(t+1)_{1,2} = P[Xt+1 = 2 | Xt = 1] = 1/2 and P(t+1)_{1,3} = P[Xt+1 = 3 | Xt = 1] = 0;
∑_{i=1}^{4} P(t+1)_{1,i} = 1.
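A quick numpy check of this example (note Python's 0-based indexing, so P[0, 1] is the slide's P_{1,2}):

import numpy as np

# The 4-state transition matrix from the example above
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/3, 2/3, 0,   0  ],
    [0,   0,   3/4, 1/4],
    [0,   0,   1/4, 3/4],
])

print(P.sum(axis=1))   # every row sums to 1: [1. 1. 1. 1.]
print(P[0, 1])         # P_{1,2} = P[X_{t+1} = 2 | X_t = 1] = 0.5
print(P[0, 2])         # P_{1,3} = 0.0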


State distribution

Definition

Let π(t) be the state distribution of the chain at time t, i.e., π(t)_x = P[Xt = x].

For a finite chain, π(t) is a vector of N nonnegative entries such that ∑_x π(t)_x = 1. Then it holds that π(t+1) = π(t) P(t+1). We apply the law of total probability:

π(t+1)_y = ∑_x P[Xt+1 = y | Xt = x] P[Xt = x] = ∑_x π(t)_x P(t+1)_{x,y} = (π(t) P(t+1))_y.

Let π(t) = (0.4, 0.6, 0, 0) be the state distribution; then π(t+1) = (0.4, 0.6, 0, 0).

Let π(t) = (0, 0, 0.5, 0.5) be the state distribution; then π(t+1) = (0, 0, 0.5, 0.5).

Let π(t) = (0.1, 0.9, 0, 0) be the state distribution; then π(t+1) = (0.35, 0.65, 0, 0).
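Continuing with the same matrix P, one step of the update π(t+1) = π(t) P reproduces the three cases above:

import numpy as np

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])

for pi_t in ([0.4, 0.6, 0, 0], [0, 0, 0.5, 0.5], [0.1, 0.9, 0, 0]):
    # one step of the chain: pi(t+1) = pi(t) P  (a row vector times P)
    print(np.array(pi_t) @ P)
# -> [0.4 0.6 0 0], [0 0 0.5 0.5], [0.35 0.65 0 0]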


Stationary distributions

Definition

A stationary distribution of a finite Markov chain with transition matrix P is a probability distribution π such that πP = π.

For some Markov chains, no matter what the initial distribution is, after running the chain for a while the distribution of the chain approaches the stationary distribution.

E.g.,

P^20 ≈
  [ 0.4  0.6   0    0  ]
  [ 0.4  0.6   0    0  ]
  [  0    0   0.5  0.5 ]
  [  0    0   0.5  0.5 ]

The chain can converge to any distribution that is a convex combination of (0.4, 0.6, 0, 0) and (0, 0, 0.5, 0.5). We observe that the original chain P can be broken into two disjoint Markov chains, each with its own stationary distribution. We say that the chain is reducible.
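A quick check of the P^20 claim with numpy:

import numpy as np

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])

# The 20-step transition probabilities: rows of the first block approach
# (0.4, 0.6, 0, 0) and rows of the second block approach (0, 0, 0.5, 0.5)
print(np.round(np.linalg.matrix_power(P, 20), 6))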


Irreducibility

Definition

State y is accessible from state x if it is possible for the chain to visit state y when started in state x; in other words, P^n(x, y) > 0 for some n. State x communicates with state y if y is accessible from x and x is accessible from y. We say that the Markov chain is irreducible if all pairs of states communicate.

y is accessible from x means that y is reachable from x in the transition graph, i.e., there is a directed path from x to y.

x communicates with y means that x and y are strongly connected in the transition graph.

A finite Markov chain is irreducible if and only if its transition graph is strongly connected.

The Markov chain associated with the transition matrix P above is not irreducible.


Aperiodicity

Definition

The period of a state x is the greatest common divisor (gcd) dx = gcd{n | (P^n)_{x,x} > 0}. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

For example, suppose that the period of state x is dx = 3. Then, starting from state x, the chain visits x, ◦, ◦, □, ◦, ◦, □, · · · , and only the squares (every third step) can possibly be x.

In the transition graph of a finite Markov chain, (P^n)_{x,x} > 0 is equivalent to x lying on a cycle of length n. The period of a state x is the greatest common divisor of the lengths of the cycles passing through x.

Theorem

1. If the states x and y communicate, then dx = dy.
2. We have (P^n)_{x,x} = 0 if n mod dx ≠ 0.


Existence of the stationary distribution

Theorem

Let X0, X1, · · · be an irreducible and aperiodic Markov chain with transition matrix P. Then lim_{n→∞} P^n_{ij} exists and is independent of i; denote it lim_{n→∞} P^n_{ij} = π(j). We also have

lim_{n→∞} P^n =
  [ π(1)  π(2)  · · ·  π(j)  · · · ]
  [ π(1)  π(2)  · · ·  π(j)  · · · ]
  [  · · ·  · · ·  · · ·  · · ·    ]
  [ π(1)  π(2)  · · ·  π(j)  · · · ]
  [  · · ·  · · ·  · · ·  · · ·    ]    (1)

π(j) = ∑_{i=1}^{∞} π(i) Pij , and ∑_{i=1}^{∞} π(i) = 1.

π is the unique nonnegative solution of the equation πP = π.
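A minimal numpy sketch illustrating both conclusions of the theorem (the 3-state chain is a toy example of mine, chosen to be irreducible and aperiodic):

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# 1) every row of P^n converges to the same stationary vector pi
print(np.linalg.matrix_power(P, 50)[0])

# 2) pi is the left eigenvector of P for eigenvalue 1, normalized to sum to 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()
print(pi)            # same vector; it solves pi P = pi with sum(pi) = 1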


Detailed Balance Condition

Theorem

Let X0, X1, · · · be an aperiodic Markov chain with transition matrix P and distribution π. If the following condition holds,

π(i) Pij = π(j) Pji , for all i, j    (2)

then π is the stationary distribution of the Markov chain. Equation (2) is called the detailed balance condition.

Proof: ∑_{i=1}^{∞} π(i) Pij = ∑_{i=1}^{∞} π(j) Pji = π(j) ∑_{i=1}^{∞} Pji = π(j) ⇒ πP = π.

In general, π(i) Pij ≠ π(j) Pji ; that is, π may not be the stationary distribution.

The natural question is how to revise the Markov chain so that π becomes its stationary distribution. For example, we introduce a function α(i, j) s.t. π(i) Pij α(i, j) = π(j) Pji α(j, i).
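As a quick numerical check of the theorem (the weighted graph below is a toy choice of mine), a random walk with Pij proportional to symmetric edge weights satisfies detailed balance, and its π is therefore stationary:

import numpy as np

W = np.array([[0, 2, 1],
              [2, 0, 3],
              [1, 3, 0]], dtype=float)   # symmetric edge weights
P = W / W.sum(axis=1, keepdims=True)     # P_ij = w_ij / sum_k w_ik
pi = W.sum(axis=1) / W.sum()             # pi(i) proportional to total weight at i

# detailed balance pi(i) P_ij = pi(j) P_ji holds entrywise...
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))
# ...and therefore pi is stationary: pi P = pi
print(np.allclose(pi @ P, pi))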


MCMC Sampling Algorithm

Revising the Markov Chain

Choosing a reasonable parameter

How do we choose α(i, j) such that π(i) Pij α(i, j) = π(j) Pji α(j, i)?

By symmetry, we may simply choose α(i, j) = π(j) Pji and α(j, i) = π(i) Pij .

Therefore π(i) Pij α(i, j) = π(j) Pji α(j, i), i.e., π(i) Qij = π(j) Qji , where Qij denotes the revised transition probability Pij α(i, j).

The revised transition matrix:

Qij = Pij α(i, j), if j ≠ i;
Qii = Pii + ∑_{k≠i} Pik (1 − α(i, k)), otherwise.

In words: accept a proposed move from i to j with probability α(i, j); otherwise stay at the current state.


MCMC Sampling Algorithm

Let P(x2|x1) be a proposed distribution.0: initialize x (0);1: for i = 0 to N − 1 do2: sample u ∼ U[0, 1];3: sample x ∼ P(x |x (i));4: if u < α(x , x (i)) = π(x)P(x (i)|x),5: then x (i+1) = x ;6: else reject x , and x (i+1) = x (i);7: endif8: endfor9: output Last N samples;

Observation

Suppose α(i, j) = 0.1 and α(j, i) = 0.2 satisfy the detailed balance condition, so that

π(i)P_{ij} · 0.1 = π(j)P_{ji} · 0.2.

Such small values of α result in a high rejection ratio. Scaling both acceptance probabilities by the same factor (here 5) preserves the balance while reducing rejections:

π(i)P_{ij} · 0.5 = π(j)P_{ji} · 1.

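A runnable version of this basic sampler, as a sketch of my own in Python: the standard normal target and the Gaussian random-walk proposal are illustrative choices, not fixed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Unnormalized target density pi(x): a standard normal, for illustration.
    return np.exp(-0.5 * x**2)

def proposal_density(x_to, x_from, step):
    # Density of the Gaussian random-walk proposal P(x_to | x_from).
    return np.exp(-0.5 * ((x_to - x_from) / step) ** 2) / (step * np.sqrt(2 * np.pi))

def basic_mcmc(n_samples, step=1.0, x0=0.0):
    """Accept-reject MCMC with alpha(x^(i), x) = pi(x) * P(x^(i) | x)."""
    x, accepted = x0, 0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        u = rng.uniform()                    # u ~ U[0, 1]
        x_new = x + step * rng.normal()      # x ~ P(. | x^(i))
        alpha = target(x_new) * proposal_density(x, x_new, step)
        if u < alpha:                        # accept with probability alpha
            x, accepted = x_new, accepted + 1
        samples[i] = x                       # on rejection, stay at x^(i)
    return samples, accepted / n_samples

chain, acc = basic_mcmc(10_000)
print(f"acceptance ratio ~ {acc:.2f}")       # low, as the observation warns
```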



Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Metropolis-Hastings algorithm

Let P(x2|x1) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to max do
2:   sample u ~ U[0, 1];
3:   sample x ~ P(x | x^(i));
4:   if u < min{1, [π(x)P(x^(i)|x)] / [π(x^(i))P(x|x^(i))]}
5:     then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7:   endif
8: endfor
9: output last N samples;

Observation

If we let

α(x^(i), x) = min{1, [π(x)P(x^(i)|x)] / [π(x^(i))P(x|x^(i))]},

we get a higher acceptance ratio and further improve the algorithm's efficiency. However, for a high-dimensional P, the Metropolis-Hastings algorithm may still be inefficient because α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?

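The same sketch with the Metropolis-Hastings acceptance probability (again an illustration of mine; with a symmetric random-walk proposal, P(x^(i)|x) = P(x|x^(i)), so the ratio reduces to π(x)/π(x^(i))):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Log of an unnormalized target pi(x): standard normal, for illustration.
    return -0.5 * x**2

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    """Random-walk Metropolis-Hastings: alpha = min(1, pi(x)/pi(x^(i)))."""
    x, accepted = x0, 0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + step * rng.normal()
        log_alpha = min(0.0, log_target(x_new) - log_target(x))
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = x_new, accepted + 1
        samples[i] = x
    return samples, accepted / n_samples

chain, acc = metropolis_hastings(10_000)
print(f"acceptance ratio ~ {acc:.2f}")   # far higher than with the basic sampler
```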


Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Properties of MCMC

There is a trade-off between the mixing rate and the acceptance ratio, where

Acceptance ratio = E[α(x^(i), x)];
Mixing rate = the rate at which the chain moves around the distribution.

We can also use multiple transition matrices P_i (i.e., proposal distributions) and apply them in turn.

Observation

For example:

Sample: x^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0);
Convergence: x^(t) | x^(0) ~ N(0, 1.33) as t → +∞, since the stationary variance σ² solves σ² = 0.5²σ² + 1, i.e., σ² = 4/3 ≈ 1.33.

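A quick simulation of this example (illustrative sketch): iterate x^(t+1) ~ N(0.5 x^(t), 1.0) and check that the long-run variance approaches 4/3 ≈ 1.33:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.zeros(200_000)
for t in range(1, len(x)):
    # One step of the chain: x^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0).
    x[t] = 0.5 * x[t - 1] + rng.normal()

# Discard a burn-in prefix and compare with the stationary variance 4/3.
print(x[10_000:].var())   # ~1.333
```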


Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Intuition

Example: two-dimensional case

Let P(x, y) be a two-dimensional probability distribution, and consider two points A(x1, y1) and B(x1, y2). We have

P(x1, y1)P(y2|x1) = P(x1)P(y1|x1)P(y2|x1)    (3)
P(x1, y2)P(y1|x1) = P(x1)P(y2|x1)P(y1|x1)    (4)

The right-hand sides are equal, so

P(x1, y1)P(y2|x1) = P(x1, y2)P(y1|x1),    (5)

i.e.,

P(A)P(y2|x1) = P(B)P(y1|x1).    (6)

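Equation (6) is easy to verify numerically; the following sketch (my own, with an arbitrary 3×4 discrete joint) checks it for one pair of points:

```python
import numpy as np

rng = np.random.default_rng(0)

P = rng.random((3, 4))
P /= P.sum()                              # joint distribution P(x, y)
P_y_given_x = P / P.sum(axis=1, keepdims=True)

x1, y1, y2 = 0, 1, 3                      # A = (x1, y1), B = (x1, y2)
lhs = P[x1, y1] * P_y_given_x[x1, y2]     # P(A) P(y2 | x1)
rhs = P[x1, y2] * P_y_given_x[x1, y1]     # P(B) P(y1 | x1)
assert np.isclose(lhs, rhs)               # equation (6) holds
```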


Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Intuition Cont’d

If p(y|x1) is regarded as the transition probability between two points whose x-coordinate is x1, then the transition between these two points satisfies the detailed balance condition, i.e., P(A)P(y2|x1) = P(B)P(y1|x1) holds.

Transition matrix

The transition probability between two points A and B is given by T(A → B):

T(A → B) = p(y_B | x1),  if x_A = x_B = x1;
           p(x_B | y1),  if y_A = y_B = y1;
           0,            otherwise.

It is easy to confirm that the detailed balance condition holds, i.e., p(A)T(A → B) = p(B)T(B → A).




Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Multivariate case

For the multivariate case

Let P(x_i | x_{−i}) = P(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_n). The transition probabilities are given by T(x → x′) = P(x′_i | x_{−i}). Then we have

T(x → x′) p(x) = P(x′_i | x_{−i}) P(x_i | x_{−i}) P(x_{−i}),
T(x′ → x) p(x′) = P(x_i | x′_{−i}) P(x′_i | x′_{−i}) P(x′_{−i}).

Note that x′_{−i} = x_{−i}. That is,

T(x → x′) p(x) = T(x′ → x) p(x′).    (7)

Therefore, the detailed balance condition also holds.

Gibbs sampling is feasible if it is easy to sample from the conditional probability distributions.




Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Gibbs sampling algorithm (proposal distribution P(x_i | x_{−i}))

0: initialize x_1, ..., x_n;
1: for τ = 0 to max do
2:   sample x_1^{τ+1} ~ P(x_1 | x_2^τ, x_3^τ, ..., x_n^τ);
3:   ...;
4:   sample x_j^{τ+1} ~ P(x_j | x_1^{τ+1}, ..., x_{j−1}^{τ+1}, x_{j+1}^τ, ..., x_n^τ);
5:   ...;
6:   sample x_n^{τ+1} ~ P(x_n | x_1^{τ+1}, x_2^{τ+1}, ..., x_{n−1}^{τ+1});
7: output last N samples;

Gibbs sampling is a type of random walk through parameter space, and can be considered a Metropolis-Hastings algorithm with a special proposal distribution. At each iteration, we draw from conditional posterior probabilities, which means that the proposed move is always accepted. Hence, if we can draw samples from the conditional distributions, Gibbs sampling can be much more efficient than regular Metropolis-Hastings.

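A runnable sketch of the algorithm for a target whose conditionals are available in closed form; the standard bivariate normal with correlation ρ = 0.8 is an illustrative choice of mine, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(n_samples, rho=0.8):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Both conditionals are known in closed form:
    x1 | x2 ~ N(rho * x2, 1 - rho**2), x2 | x1 ~ N(rho * x1, 1 - rho**2),
    so every proposed move is accepted.
    """
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, sd)     # sample x1 given the current x2
        x2 = rng.normal(rho * x1, sd)     # sample x2 given the updated x1
        samples[t] = x1, x2
    return samples

chain = gibbs_bivariate_normal(20_000)
print(np.corrcoef(chain[5_000:].T)[0, 1])  # ~0.8 after burn-in
```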



Markov Chain Monte Carlo · Metropolis-Hastings Algorithm and Gibbs Sampling

Properties of Gibbs Sampling

Properties

No need to tune the proposal distribution;

Good trade-off between acceptance and mixing: Acceptance ratio is always 1.

Need to be able to derive conditional probability distributions.

Acceleration of Gibbs sampling (given p(a, b, c), draw samples of a and c; a concrete sketch follows at the end of this slide):

Blocked Gibbs:
(1) Draw (a, b) given c;
(2) Draw c given (a, b).

Collapsed Gibbs:
(1) Draw a given c;
(2) Draw c given a.

Marginalize whenever you can.

For R code for MCMC and Gibbs sampling, see https://blog.csdn.net/abcjennifer/article/details/25908495.

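To make the two accelerations concrete, here is a sketch on a toy model of my own choosing, c ~ N(0, 1), b | c ~ N(c, 1), a | b ~ N(b, 1), where both the blocked and the collapsed conditionals can be worked out in closed form (c | b ~ N(b/2, 1/2); marginalizing b gives a | c ~ N(c, 2) and c | a ~ N(a/3, 2/3)):

```python
import numpy as np

rng = np.random.default_rng(0)

def blocked_gibbs(n):
    # Draw the block (a, b) given c, then c given (a, b).
    a, b, c = 0.0, 0.0, 0.0
    out = np.empty((n, 2))
    for t in range(n):
        b = rng.normal(c, 1.0)                     # (a, b) | c: first b | c ...
        a = rng.normal(b, 1.0)                     # ... then a | b
        c = rng.normal(b / 2, np.sqrt(0.5))        # c | (a, b) depends on b only
        out[t] = a, c
    return out

def collapsed_gibbs(n):
    # b is marginalized out; alternate between a | c and c | a.
    a, c = 0.0, 0.0
    out = np.empty((n, 2))
    for t in range(n):
        a = rng.normal(c, np.sqrt(2.0))            # a | c ~ N(c, 2)
        c = rng.normal(a / 3, np.sqrt(2.0 / 3.0))  # c | a ~ N(a/3, 2/3)
        out[t] = a, c
    return out

for name, chain in [("blocked", blocked_gibbs(50_000)),
                    ("collapsed", collapsed_gibbs(50_000))]:
    print(name, chain.mean(axis=0).round(2), chain.var(axis=0).round(2))
# Both agree with the model's marginals: Var(a) = 3, Var(c) = 1.
```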



Take-aways

Conclusions

Basic Concepts of Random Samples

Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F

Order Statistics

Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution

Generating a Random Sample
