Statistical Inference, Lecture 5: Properties of a Random Sample
MING GAO
DaSE @ ECNU (for course-related communications)
Apr. 14, 2020
Outline
1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's t and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 2 / 101
Random vector

Definition
The r.v.s X1, · · · , Xn are called a random sample of size n from the population f(x) if X1, · · · , Xn are mutually independent r.v.s and the marginal pdf or pmf of each Xi is the same function f(x).

That is, X1, · · · , Xn are independent and identically distributed (abbreviated i.i.d.) r.v.s with pdf or pmf f(x).

From the definition, the joint pdf or pmf of X1, · · · , Xn is given by

f(x1, · · · , xn) = ∏_{i=1}^n f(xi).

In particular, if the population pdf or pmf is a member of a parametric family with pdf or pmf f(x|θ), then the joint pdf or pmf is

f(x1, · · · , xn|θ) = ∏_{i=1}^n f(xi|θ).
Sample pdf: exponential

Let X1, · · · , Xn be a random sample from an exponential(β) population. The joint pdf of the sample is

f(x1, · · · , xn|β) = ∏_{i=1}^n f(xi|β) = (1/β^n) e^{−(x1+···+xn)/β}.

With replacement sampling

Suppose a value is chosen from the population in such a way that each of the N values is equally likely (prob. = 1/N) to be chosen. The value chosen at any stage is "replaced" in the population and is available for choice again at the next stage.
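As a quick numerical sanity check (not part of the original slides; assuming NumPy is available), the product of the marginal exponential pdfs can be compared against the closed form above:

```python
import numpy as np

# Minimal sketch: for an exponential(beta) sample, the product of the
# marginal pdfs equals (1/beta^n) * exp(-(x1+...+xn)/beta).
rng = np.random.default_rng(0)
beta = 2.0
x = rng.exponential(beta, size=5)            # a sample x1, ..., x5

marginal = (1.0 / beta) * np.exp(-x / beta)  # f(xi | beta) for each i
joint_as_product = np.prod(marginal)
joint_closed_form = beta ** (-len(x)) * np.exp(-x.sum() / beta)

assert np.isclose(joint_as_product, joint_closed_form)
```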
Without replacement sampling

The first value drawn is recorded as X1 = y, where y ∈ {x1, x2, · · · , xN}. A second value is then chosen from the remaining N − 1 values, so

P(X2 = x) = ∑_{i=1}^N P(X2 = x|X1 = xi) P(X1 = xi) = (N − 1) · (1/(N − 1)) · (1/N) = 1/N.

Similar arguments can be used to show that each of the Xi has the same marginal distribution, although the Xi are not independent.

The nonzero probability in the conditional distribution of Xi given X1, · · · , Xi−1 is 1/(N − i + 1), which is close to 1/N if i ≤ n is small compared with N.
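A small Monte Carlo sketch (assuming NumPy; the population values are illustrative) of the marginal claim, P(X2 = x) ≈ 1/N even though X1 and X2 are dependent:

```python
import numpy as np

# Draw twice WITHOUT replacement from N = 10 values, many times, and
# check that the second draw is still uniform over the population.
rng = np.random.default_rng(1)
population = np.arange(10)
N, trials = len(population), 20_000

second = np.array([rng.choice(population, size=2, replace=False)[1]
                   for _ in range(trials)])
freq = np.bincount(second, minlength=N) / trials

assert np.allclose(freq, 1.0 / N, atol=0.01)  # each value has prob. ≈ 1/N
```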
Sampling distribution

Definition
Let X1, · · · , Xn be a random sample of size n from a population, and let T(X1, · · · , Xn) be a real-valued or vector-valued function. Then the r.v. or random vector Y = T(X1, · · · , Xn) is called a statistic. The pdf or pmf of the statistic Y is called the sampling distribution of Y.

Sample mean and sample variance

The sample mean and sample variance are the statistics defined by

X̄ = (X1 + · · · + Xn)/n = (1/n) ∑_{i=1}^n Xi,
S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².

The sample standard deviation is the statistic defined by S = √S².
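The two statistics above computed directly (assuming NumPy; the data vector is illustrative). Note that `ddof=1` is what gives the n − 1 denominator in S²:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

xbar = x.sum() / n                         # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)     # sample variance, n-1 denominator

assert np.isclose(xbar, np.mean(x))
assert np.isclose(s2, np.var(x, ddof=1))
assert np.isclose(np.sqrt(s2), np.std(x, ddof=1))  # sample std S
```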
Theorem

Theorem
Let x1, · · · , xn be any numbers and x̄ = (1/n) ∑_{i=1}^n xi. Then

a. min_a ∑_{i=1}^n (xi − a)² = ∑_{i=1}^n (xi − x̄)²;
b. (n − 1)s² = ∑_{i=1}^n (xi − x̄)² = ∑_{i=1}^n xi² − n x̄².

Proof.

∑_{i=1}^n (xi − a)² = ∑_{i=1}^n (xi − x̄ + x̄ − a)²
= ∑_{i=1}^n (xi − x̄)² + 2 ∑_{i=1}^n (xi − x̄)(x̄ − a) + ∑_{i=1}^n (x̄ − a)²
= ∑_{i=1}^n (xi − x̄)² + n(x̄ − a)²,

since ∑_{i=1}^n (xi − x̄) = 0 makes the cross term vanish. The sum is therefore minimized at a = x̄, proving part (a); expanding ∑_{i=1}^n (xi − x̄)² gives part (b).
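Both identities can be checked numerically (assuming NumPy; the numbers are illustrative):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 8.0])
n, xbar = len(x), x.mean()

def sse(a):
    # sum of squared deviations about a
    return ((x - a) ** 2).sum()

# Part (a): a = xbar minimizes the sum of squares (checked on a grid).
grid = np.linspace(xbar - 5, xbar + 5, 1001)
assert all(sse(a) >= sse(xbar) - 1e-12 for a in grid)

# Part (b): (n-1)s^2 = sum(xi^2) - n*xbar^2.
s2 = x.var(ddof=1)
assert np.isclose((n - 1) * s2, (x ** 2).sum() - n * xbar ** 2)
```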
Lemma

Lemma
Let X1, · · · , Xn be a random sample from a population and let g(x) be a function s.t. E(g(X1)) and Var(g(X1)) exist. Then

a. E(∑_{i=1}^n g(Xi)) = n · E(g(X1));
b. Var(∑_{i=1}^n g(Xi)) = n · Var(g(X1)).

Note that, for i ≠ j,

E[(g(Xi) − E(g(Xi)))(g(Xj) − E(g(Xj)))] = Cov(g(Xi), g(Xj)) = 0,

so the cross terms in the variance of the sum vanish.
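A Monte Carlo sketch of the lemma (assuming NumPy), with the illustrative choice g(x) = x², for which g(Xi) is chi-squared(1) when Xi ~ N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(reps, n))  # i.i.d. N(0,1) samples
g = samples ** 2                                # g(Xi) = Xi^2 for every draw

sums = g.sum(axis=1)                            # sum_i g(Xi), per replicate
# Part (a): E(sum g(Xi)) = n * E(g(X1)); part (b): same for variances,
# because the cross-covariances vanish by independence.
assert np.isclose(sums.mean(), n * g[:, 0].mean(), rtol=0.05)
assert np.isclose(sums.var(), n * g[:, 0].var(), rtol=0.05)
```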
Theorem

Theorem
Let X1, · · · , Xn be a random sample from a population with mean µ and variance σ² < ∞. Then

a. E(X̄) = µ;
b. Var(X̄) = σ²/n;
c. E(S²) = σ².

Proof.

E(S²) = E( (1/(n − 1)) [∑_{i=1}^n Xi² − n X̄²] ) = (1/(n − 1)) (n E(X1²) − n E(X̄²))
= (1/(n − 1)) ( n(σ² + µ²) − n(σ²/n + µ²) ) = σ².
Unbiased statistics

Definition
If the following holds:

E[T(X1, X2, · · · , Xn)] = θ,

then the statistic T(X1, X2, · · · , Xn) is an unbiased estimator of the parameter θ. Otherwise, T(X1, X2, · · · , Xn) is a biased estimator of θ.

The statistic X̄ is an unbiased estimator of µ, and S² is an unbiased estimator of σ².

The use of n − 1 in the definition of S² may have seemed unintuitive. If S² were defined as the usual average of the squared deviations, with n rather than n − 1 in the denominator, then E(S²) would be ((n − 1)/n) σ² and S² would not be an unbiased estimator of σ².
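A simulation sketch of the bias claim (assuming NumPy): averaging the n-divisor estimator over many samples lands near ((n − 1)/n)σ², while the n − 1 divisor lands near σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma2 = 4, 200_000, 9.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

biased = x.var(axis=1, ddof=0).mean()     # divisor n: biased downward
unbiased = x.var(axis=1, ddof=1).mean()   # divisor n-1: unbiased

assert np.isclose(unbiased, sigma2, rtol=0.02)
assert np.isclose(biased, (n - 1) / n * sigma2, rtol=0.02)
```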
Theorem

Let X1, · · · , Xn be a random sample from a population with mgf M_X(t). Then the mgf of the sample mean is

M_X̄(t) = [M_X(t/n)]^n.

Example
Let X1, · · · , Xn be a random sample from a N(µ, σ²) population. Then the mgf of the sample mean is

M_X̄(t) = [exp( tµ/n + σ²(t/n)²/2 )]^n
= exp[ n( tµ/n + σ²(t/n)²/2 ) ]
= exp[ tµ + (σ²/n)t²/2 ].

Thus, X̄ has a N(µ, σ²/n) distribution.
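A simulation sketch of the conclusion (assuming NumPy; µ, σ, n are illustrative): the sample mean of n normal draws has mean µ and variance σ²/n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 3.0, 2.0, 16, 200_000
# Each row is one sample of size n; average across rows gives Xbar.
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

assert np.isclose(xbar.mean(), mu, atol=0.01)          # E(Xbar) = mu
assert np.isclose(xbar.var(), sigma ** 2 / n, rtol=0.02)  # Var = sigma^2/n
```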
Convolution theorem

If X and Y are independent continuous r.v.s with pdfs f_X(x) and f_Y(y), then the pdf of Z = X + Y is

f_Z(z) = ∫_{−∞}^{∞} f_X(w) f_Y(z − w) dw.

Proof.
Let W = X. The Jacobian of the transformation from (X, Y) to (Z, W) is 1. Thus, we obtain the joint pdf of (Z, W) as

f_{Z,W}(z, w) = f_{X,Y}(w, z − w) = f_X(w) f_Y(z − w).

Integrating out w, we obtain the marginal pdf of Z as given in the conclusion.
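A numerical sketch of the convolution formula (assuming NumPy): for two independent N(0, 1) r.v.s the integral should reproduce the N(0, 2) density, checked here at a single illustrative point z:

```python
import numpy as np

def phi(t):
    # standard normal pdf
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

w = np.linspace(-10.0, 10.0, 200_001)
dw = w[1] - w[0]
z = 1.3

# Riemann sum approximating  f_Z(z) = integral of f_X(w) f_Y(z - w) dw
f_z = (phi(w) * phi(z - w)).sum() * dw
# N(0, 2) density at z, the known distribution of a sum of two N(0, 1)s
target = np.exp(-z ** 2 / 4) / np.sqrt(4 * np.pi)

assert np.isclose(f_z, target, atol=1e-6)
```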
Sum of Cauchy random variables

Let U and V be independent Cauchy r.v.s, where U ∼ Cauchy(0, σ) and V ∼ Cauchy(0, τ); that is,

f_U(u) = (1/(πσ)) · 1/(1 + (u/σ)²), and f_V(v) = (1/(πτ)) · 1/(1 + (v/τ)²).

Based on the convolution theorem, the pdf of Z = U + V is given by

f_Z(z) = ∫_{−∞}^{∞} (1/(πσ)) (1/(1 + (w/σ)²)) · (1/(πτ)) (1/(1 + ((z − w)/τ)²)) dw
= (1/(π(σ + τ))) · 1/(1 + (z/(σ + τ))²).

The sum of two independent Cauchy r.v.s is again a Cauchy, with the scale parameters adding.

If Z1, · · · , Zn are i.i.d. Cauchy(0, 1) r.v.s, then ∑_{i=1}^n Zi is Cauchy(0, n) and Z̄ is Cauchy(0, 1).
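A simulation sketch of the last bullet (assuming NumPy): the sample mean of i.i.d. Cauchy(0, 1) draws does not concentrate; it keeps the spread of a single draw. The interquartile range of Cauchy(0, 1) is 2 (quartiles at ±1), and the sample mean's IQR should match:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 50_000
# Each row is a sample of n Cauchy(0,1) draws; take its mean.
zbar = rng.standard_cauchy(size=(reps, n)).mean(axis=1)

q1, q3 = np.quantile(zbar, [0.25, 0.75])
# Same spread as one draw: Zbar is Cauchy(0, 1), so IQR ≈ 2.
assert np.isclose(q3 - q1, 2.0, atol=0.05)
```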
Sampling from the Normal Distribution: Properties of the Sample Mean and Variance
Theorem

Let X1, · · · , Xn be a random sample from a N(µ, σ²) distribution, and let X̄ = (1/n) ∑_{i=1}^n Xi and S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)². Then

a. X̄ and S² are independent r.v.s;
b. X̄ ∼ N(µ, σ²/n);
c. (n − 1)S²/σ² has a χ²(n − 1) distribution.

Proof.
Part (b) has been established. Next, we prove Parts (a) and (c). Without loss of generality, assume that µ = 0 and σ = 1.
For Part (a), we show that X̄ and S² are functions of independent random vectors.
After introducing a lemma, we prove Part (c).
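A simulation sketch of the theorem (assuming NumPy; parameters are illustrative): X̄ and S² are uncorrelated (they are in fact independent), and (n − 1)S²/σ² has mean n − 1 and variance 2(n − 1), matching a χ²(n − 1) distribution:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 5.0, 2.0, 8, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)
q = (n - 1) * s2 / sigma ** 2          # should be chi-squared(n-1)

assert abs(np.corrcoef(xbar, s2)[0, 1]) < 0.01   # part (a), via correlation
assert np.isclose(q.mean(), n - 1, rtol=0.02)    # E chi2_{n-1} = n-1
assert np.isclose(q.var(), 2 * (n - 1), rtol=0.05)  # Var = 2(n-1)
```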
Part (a) proof Cont'd

Note that we can write S² as a function of the n − 1 deviations X2 − X̄, · · · , Xn − X̄:

S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²
= (1/(n − 1)) ( (X1 − X̄)² + ∑_{i=2}^n (Xi − X̄)² )
= (1/(n − 1)) ( [∑_{i=2}^n (Xi − X̄)]² + ∑_{i=2}^n (Xi − X̄)² ),

where the last step uses X1 − X̄ = −∑_{i=2}^n (Xi − X̄), since the deviations sum to 0. Thus, S² can be written as a function only of (X2 − X̄, · · · , Xn − X̄). We now show that these r.v.s are independent of X̄. The joint pdf of the sample X1, · · · , Xn (with µ = 0, σ = 1) is given by

f(x1, · · · , xn) = (1/(2π)^{n/2}) e^{−(1/2) ∑_{i=1}^n xi²}.
Part (a) proof Cont'd

Make the transformation

y1 = x̄,        x1 = y1 − ∑_{i=2}^n yi,
y2 = x2 − x̄,   x2 = y1 + y2,
· · ·
yn = xn − x̄,   xn = y1 + yn.

This is a linear transformation; the Jacobian of (x1, · · · , xn) with respect to (y1, · · · , yn) is n. We have

f(y1, · · · , yn) = (n/(2π)^{n/2}) e^{−(1/2)(y1 − ∑_{i=2}^n yi)²} e^{−(1/2) ∑_{i=2}^n (y1 + yi)²}
= [ (n/(2π))^{1/2} e^{−n y1²/2} ] [ (n^{1/2}/(2π)^{(n−1)/2}) e^{−(1/2)[∑_{i=2}^n yi² + (∑_{i=2}^n yi)²]} ].

Since the joint pdf of Y1, · · · , Yn factors, Y1 is independent of (Y2, · · · , Yn). That is, X̄ and S² are independent r.v.s.
Facts about chi-squared r.v.s

We use the notation χ²_p to denote a chi-squared r.v. with p degrees of freedom.

a. If Z is a N(0, 1) r.v., then Z² ∼ χ²_1; that is, the square of a standard normal r.v. is a chi-squared r.v.
b. If X1, · · · , Xn are independent and Xi ∼ χ²_{pi}, then X1 + · · · + Xn ∼ χ²_{p1+···+pn}; that is, independent chi-squared variables add to a chi-squared variable, and the degrees of freedom also add.

Proof.
Part (a) was established in Chapter 2. Since a χ²_p r.v. is a gamma(p/2, 2) r.v., part (b) follows from the earlier example on sums of independent gamma r.v.s.
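Both facts can be illustrated by simulation (assuming NumPy): Z² for standard normal Z has the χ²_1 moments (mean 1, variance 2), and a sum of independent chi-squared variables has the moments of a chi-squared with the degrees of freedom added:

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 200_000

# Fact (a): Z^2 behaves like chi-squared(1).
z2 = rng.normal(size=reps) ** 2
assert np.isclose(z2.mean(), 1.0, rtol=0.03)   # E chi2_1 = 1
assert np.isclose(z2.var(), 2.0, rtol=0.05)    # Var chi2_1 = 2

# Fact (b): chi2_3 + chi2_4 behaves like chi2_7 (df add: 3 + 4 = 7).
total = rng.chisquare(3, reps) + rng.chisquare(4, reps)
assert np.isclose(total.mean(), 7.0, rtol=0.02)   # E chi2_7 = 7
assert np.isclose(total.var(), 14.0, rtol=0.05)   # Var chi2_7 = 14
```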
Part (c) proof

We employ an induction argument to establish the distribution of S², using the notation X̄_k and S²_k to denote the sample mean and variance based on the first k observations.

It is straightforward to establish that

(n − 1)S²_n = (n − 2)S²_{n−1} + ((n − 1)/n)(Xn − X̄_{n−1})².

Now consider n = 2: we have S²_2 = (1/2)(X2 − X1)². Since the distribution of (X2 − X1)/√2 is N(0, 1), fact (a) above shows that S²_2 ∼ χ²_1.

Proceeding with the induction, we assume that for n = k, (k − 1)S²_k ∼ χ²_{k−1}.

For n = k + 1, we have the following form:

k S²_{k+1} = (k − 1)S²_k + (k/(k + 1))(X_{k+1} − X̄_k)².
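The recursion driving the induction can be checked numerically (assuming NumPy; the data vector is illustrative):

```python
import numpy as np

# Check: (n-1)*S_n^2 = (n-2)*S_{n-1}^2 + ((n-1)/n)*(X_n - Xbar_{n-1})^2.
x = np.array([2.0, 7.0, 1.0, 5.0, 9.0])
n = len(x)

s2_n = x.var(ddof=1)          # sample variance of all n observations
s2_prev = x[:-1].var(ddof=1)  # sample variance of the first n-1
xbar_prev = x[:-1].mean()     # sample mean of the first n-1

lhs = (n - 1) * s2_n
rhs = (n - 2) * s2_prev + (n - 1) / n * (x[-1] - xbar_prev) ** 2
assert np.isclose(lhs, rhs)
```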
Part (c) proof Cont'd

According to the induction hypothesis, if \frac{k}{k+1}(X_{k+1} - \bar{X}_k)^2 \sim \chi_1^2 and is independent of S_k^2, it will follow from Part (b) that kS_{k+1}^2 \sim \chi_k^2.

The vector (X_{k+1}, \bar{X}_k) is independent of S_k^2. Furthermore, \sqrt{k/(k+1)}(X_{k+1} - \bar{X}_k) \sim N(0, 1), since (taking \sigma^2 = 1 without loss of generality)

Var(X_{k+1} - \bar{X}_k) = \frac{k+1}{k}.

Therefore, \frac{k}{k+1}(X_{k+1} - \bar{X}_k)^2 \sim \chi_1^2.

Based on the following lemma, we will give a second way to establish the theorem.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 20 / 101
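The variance computation above can be checked by simulation (an illustration, not from the slides; k, the seed, and the sample size are arbitrary choices):

```python
import numpy as np

# Monte Carlo check that Var(X_{k+1} - Xbar_k) = (k+1)/k for i.i.d. N(0,1)
# draws (sigma^2 = 1 without loss of generality, as in the proof).
rng = np.random.default_rng(1)
k = 4
x = rng.normal(size=(200_000, k + 1))
d = x[:, k] - x[:, :k].mean(axis=1)   # X_{k+1} - Xbar_k, one value per replicate
emp_var = d.var()                     # should be close to (k+1)/k = 1.25
```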
Lemma

Let X_j \sim N(\mu_j, \sigma_j^2), j = 1, \cdots, n, be independent. For constants a_{ij} and b_{rj} (j = 1, \cdots, n; i = 1, \cdots, k; r = 1, \cdots, m), where k + m \le n, define

U_i = \sum_{j=1}^n a_{ij}X_j, and V_r = \sum_{j=1}^n b_{rj}X_j.

The r.v.s U_i and V_r are independent if and only if Cov(U_i, V_r) = 0. Furthermore, Cov(U_i, V_r) = \sum_{j=1}^n a_{ij}b_{rj}\sigma_j^2.

The random vectors (U_1, \cdots, U_k) and (V_1, \cdots, V_m) are independent if and only if U_i is independent of V_r for all pairs i, r (i = 1, \cdots, k; r = 1, \cdots, m).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 21 / 101
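The covariance formula in the lemma can be checked by simulation (an illustration, not from the slides; the coefficients, variances, and seed are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of Cov(U, V) = sum_j a_j b_j sigma_j^2 for U = sum a_j X_j
# and V = sum b_j X_j built from independent normals X_j.
rng = np.random.default_rng(2)
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5, 1.0, -1.0])
sigma2 = np.array([1.0, 4.0, 9.0])

x = rng.normal(0.0, np.sqrt(sigma2), size=(500_000, 3))
u, v = x @ a, x @ b
emp_cov = np.cov(u, v)[0, 1]
theory = float(np.sum(a * b * sigma2))   # 0.5 - 8.0 - 4.5 = -12.0
```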
Prove the theorem

We can use the above lemma to provide an alternative proof of the independence of \bar{X} and S^2 in normal sampling.

Since we can write S^2 as a function of the n - 1 deviations (X_2 - \bar{X}, \cdots, X_n - \bar{X}), we must show that these r.v.s are uncorrelated with \bar{X}. As an illustration of the application of the lemma, we write

\bar{X} = \sum_{i=1}^n \frac{1}{n} X_i,

X_j - \bar{X} = \sum_{i=1}^n \big(\delta_{ij} - \frac{1}{n}\big) X_i,

where \delta_{ij} = 1 if i = j and \delta_{ij} = 0 otherwise.

It is then easy to show that \bar{X} and X_j - \bar{X} are independent.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 22 / 101
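In coefficient form the check is a one-line computation (an illustration, assuming equal variances; n and j are arbitrary choices):

```python
import numpy as np

# The lemma reduces independence of Xbar and X_j - Xbar to a coefficient
# computation: with a_i = 1/n and b_i = delta_ij - 1/n (equal variances),
# Cov(Xbar, X_j - Xbar) is proportional to sum_i a_i * b_i, which is zero.
n, j = 6, 2
a = np.full(n, 1.0 / n)              # coefficients defining Xbar
b = -np.full(n, 1.0 / n)
b[j] += 1.0                          # coefficients defining X_j - Xbar
cov_coeff = float(np.sum(a * b))     # = 1/n - n*(1/n^2) = 0 for any j
```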
Sampling from the Normal Distribution / The Derived Distributions: Student's T and Snedecor's F

Outline

1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Student's T distribution

If X_1, \cdots, X_n are a random sample from a N(\mu, \sigma^2), we know that \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1). Most of the time, however, \sigma is unknown. We can use S in place of \sigma. In this case,

\frac{\bar{X}-\mu}{S/\sqrt{n}} = \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} \doteq \frac{N(0,1)}{\sqrt{\chi_{n-1}^2/(n-1)}}.

Definition

Let X_1, \cdots, X_n be a random sample from a N(\mu, \sigma^2). The quantity \frac{\bar{X}-\mu}{S/\sqrt{n}} has Student's t distribution with p = n - 1 degrees of freedom, with pdf

f_T(t) = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2}} \Big[\frac{1}{1 + t^2/p}\Big]^{(p+1)/2}, i.e., T \sim t_p.
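The definition can be checked by simulation (an illustration, not from the slides; n, \mu, \sigma, the seed, and the use of scipy are my own choices):

```python
import numpy as np
from scipy import stats

# Simulation: the studentized mean (Xbar - mu)/(S/sqrt(n)) computed from
# normal samples should match Student's t with n-1 degrees of freedom.
rng = np.random.default_rng(3)
n, mu, sigma = 5, 10.0, 2.0
x = rng.normal(mu, sigma, size=(100_000, n))
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))
ks = stats.kstest(t_stat, stats.t(df=n - 1).cdf).statistic  # small if they agree
```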
Student's T distribution Cont'd

Notice that if p = 1, then t_p becomes the pdf of the Cauchy distribution, which occurs for samples of size 2. If we start with U and V defined above (U \sim N(0, 1) and V \sim \chi_p^2, independent), their joint pdf is

f_{U,V}(u, v) = \frac{1}{(2\pi)^{1/2}} e^{-u^2/2} \frac{1}{\Gamma(p/2)2^{p/2}} v^{(p/2)-1} e^{-v/2}.

Now make the transformation

t = \frac{u}{\sqrt{v/p}}, and w = v.

The Jacobian of the transformation is (w/p)^{1/2}, and the marginal pdf of T is given by

f_T(t) = \int_{-\infty}^{\infty} f_{U,V}(t(w/p)^{1/2}, w)(w/p)^{1/2} dw = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2}} \Big[\frac{1}{1 + t^2/p}\Big]^{(p+1)/2},

i.e., T \sim t_p.
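The derived density can be checked numerically (an illustration, not from the slides; p = 5 and the scipy comparison are my own choices):

```python
import numpy as np
from math import gamma, pi, sqrt
from scipy import stats, integrate

# Check that the derived density integrates to 1 and agrees with scipy's
# t pdf, here for p = 5 degrees of freedom.
p = 5

def f_t(t):
    const = gamma((p + 1) / 2) / (gamma(p / 2) * sqrt(p * pi))
    return const * (1.0 + t * t / p) ** (-(p + 1) / 2)

total, _ = integrate.quad(f_t, -np.inf, np.inf)
grid = np.linspace(-4.0, 4.0, 81)
max_gap = float(np.max(np.abs([f_t(t) - stats.t.pdf(t, df=p) for t in grid])))
```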
Student's T distribution Cont'd

Student's t has no mgf because it does not have moments of all orders. In fact, if there are p degrees of freedom, then there are only p - 1 moments; e.g., a t_1 has no mean and a t_2 has no variance. Note that T_1 \sim t_1 is a Cauchy r.v., so its mean does not exist. Let T_p be a r.v. with a t_p distribution. Then

E(T_p) = 0, if p > 1,

Var(T_p) = \frac{p}{p-2}, if p > 2.
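These moment formulas match what scipy reports (an illustration; the choice p = 10 is arbitrary):

```python
from scipy import stats

# The stated moments: E(T_p) = 0 for p > 1 and Var(T_p) = p/(p-2) for p > 2.
m10 = stats.t(df=10).mean()   # 0
v10 = stats.t(df=10).var()    # 10/(10-2) = 1.25
```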
F distribution

Let X_1, \cdots, X_n be a random sample from a N(\mu_X, \sigma_X^2), and let Y_1, \cdots, Y_m be a random sample from a N(\mu_Y, \sigma_Y^2).

If we were interested in comparing the variability of the populations, one quantity of interest would be the ratio \sigma_X^2/\sigma_Y^2. Information about this ratio is contained in S_X^2/S_Y^2, the ratio of sample variances.

Definition

Let X_1, \cdots, X_n be a random sample from a N(\mu_X, \sigma_X^2), and let Y_1, \cdots, Y_m be a random sample from a N(\mu_Y, \sigma_Y^2). The random variable F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} has Snedecor's F distribution with p = n - 1 and q = m - 1 degrees of freedom, with pdf

f_F(x) = \frac{\Gamma(\frac{p+q}{2})}{\Gamma(p/2)\Gamma(q/2)} \Big(\frac{p}{q}\Big)^{p/2} \frac{x^{(p/2)-1}}{(1 + (p/q)x)^{(p+q)/2}}, i.e., F \sim F_{n-1,m-1}.
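The definition can be checked by simulation (an illustration, not from the slides; the sample sizes, variances, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Simulation: (S_X^2/sigma_X^2)/(S_Y^2/sigma_Y^2) from two normal samples
# should follow Snedecor's F with (n-1, m-1) degrees of freedom.
rng = np.random.default_rng(4)
n, m = 8, 12
x = rng.normal(0.0, 3.0, size=(50_000, n))    # sigma_X^2 = 9
y = rng.normal(5.0, 0.5, size=(50_000, m))    # sigma_Y^2 = 0.25
f_stat = (x.var(axis=1, ddof=1) / 9.0) / (y.var(axis=1, ddof=1) / 0.25)
ks = stats.kstest(f_stat, stats.f(dfn=n - 1, dfd=m - 1).cdf).statistic
```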
F distribution Cont'd

Properties

A variance ratio may have an F distribution even if the parent populations are not normal; it is enough that their pdfs are symmetric functions.

The F distribution can be derived as the distribution of \frac{U/p}{V/q}, where U and V are independent, U \sim \chi_p^2 and V \sim \chi_q^2.

If X \sim F_{p,q}, then \frac{(p/q)X}{1+(p/q)X} \sim Beta(p/2, q/2).

If X \sim F_{p,q}, then \frac{1}{X} \sim F_{q,p}.

If X \sim t_q, then X^2 \sim F_{1,q}.
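Two of these properties can be verified numerically via a change of variables on the pdf (an illustration, not from the slides; p, q, and the grid are arbitrary choices):

```python
import numpy as np
from scipy import stats

# (i) if X ~ F(p, q) then 1/X ~ F(q, p): density of 1/X at y is f_X(1/y)/y^2;
# (ii) if T ~ t_q then T^2 ~ F(1, q): density of T^2 at x is t-pdf(sqrt(x))/sqrt(x).
p, q = 4, 7
xs = np.linspace(0.1, 5.0, 50)

gap_recip = np.max(np.abs(stats.f.pdf(1.0 / xs, p, q) / xs**2 - stats.f.pdf(xs, q, p)))
gap_square = np.max(np.abs(stats.t.pdf(np.sqrt(xs), q) / np.sqrt(xs) - stats.f.pdf(xs, 1, q)))
```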
Order Statistics

Order statistics

The order statistics of a random sample X_1, \cdots, X_n are the sample values placed in ascending order. They are denoted by X_{(1)}, \cdots, X_{(n)}, and satisfy X_{(1)} \le \cdots \le X_{(n)}:

X_{(1)} = \min_{1 \le i \le n} X_i,
X_{(2)} = second smallest X_i,
\cdots,
X_{(n)} = \max_{1 \le i \le n} X_i.

Sample range: R = X_{(n)} - X_{(1)}.

Sample median: M = X_{((n+1)/2)} if n is odd, and M = \frac{1}{2}(X_{(n/2)} + X_{(n/2+1)}) if n is even.
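These definitions correspond to a simple sort (an illustration, not from the slides; the sample values are arbitrary):

```python
import numpy as np

# Order statistics, sample range, and sample median via sorting (even n here,
# so the median averages the two middle order statistics).
x = np.array([3.1, 0.7, 2.2, 5.0, 1.4, 2.9])
x_ord = np.sort(x)                    # X_(1) <= ... <= X_(n)
sample_range = x_ord[-1] - x_ord[0]   # 5.0 - 0.7
median = np.median(x)                 # (X_(3) + X_(4)) / 2
```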
Distribution of order statistics (discrete case)

Theorem

Let X_1, X_2, \cdots, X_n be a random sample from a discrete distribution with pmf f_X(x_i) = p_i, where x_1 < x_2 < \cdots are the possible values of X in ascending order. Define

P_0 = 0, P_1 = p_1, \cdots, P_i = p_1 + p_2 + \cdots + p_i.

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics from the sample. Then

P(X_{(j)} \le x_i) = \sum_{k=j}^n \binom{n}{k} P_i^k (1 - P_i)^{n-k}

and

P(X_{(j)} = x_i) = \sum_{k=j}^n \binom{n}{k} [P_i^k (1 - P_i)^{n-k} - P_{i-1}^k (1 - P_{i-1})^{n-k}].
Proof of the theorem

Fix i and let Y be a r.v. that counts the number of X_1, X_2, \cdots, X_n that are less than or equal to x_i. For each of X_1, X_2, \cdots, X_n, call the event {X_j \le x_i} a "success" and {X_j > x_i} a "failure". Then Y is the number of successes in n trials; thus, Y \sim Binomial(n, P_i).

The event {X_{(j)} \le x_i} is equivalent to the event {Y \ge j}; that is, at least j of the sample values are less than or equal to x_i, i.e.,

P(X_{(j)} \le x_i) = P(Y \ge j).

For the other case,

P(X_{(j)} = x_i) = P(X_{(j)} \le x_i) - P(X_{(j)} \le x_{i-1}).
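The theorem can be checked by simulation for a small discrete distribution (an illustration, not from the slides; the support, probabilities, n, j, i, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Monte Carlo check of P(X_(j) <= x_i) = P(Y >= j) with Y ~ Binomial(n, P_i),
# for a distribution on {1, 2, 3} and x_i = 2.
rng = np.random.default_rng(5)
vals = np.array([1, 2, 3])
probs = np.array([0.2, 0.5, 0.3])
n, j = 5, 3
Pi = probs[:2].sum()                           # P_i = P(X <= 2) = 0.7

theory = stats.binom.sf(j - 1, n, Pi)          # P(Y >= j)
samples = rng.choice(vals, p=probs, size=(100_000, n))
emp = np.mean(np.sort(samples, axis=1)[:, j - 1] <= 2)
```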
Distribution of order statistics (continuous case)

Theorem

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics of a random sample, X_1, X_2, \cdots, X_n, from a continuous population with cdf F_X(x) and pdf f_X(x). Then the pdf of X_{(j)} is

f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}.

Proof.

We first find the cdf of X_{(j)} and then differentiate it to obtain the pdf. Let Y be a r.v. that counts the number of X_1, X_2, \cdots, X_n that are less than or equal to x. Thus, Y \sim Binomial(n, F_X(x)) (note that we can write P_i = F_X(x_i) in the above theorem). The event {X_{(j)} \le x} is equivalent to the event {Y \ge j}; that is, at least j of the sample values are less than or equal to x,
Proof Cont'd

Proof.

i.e.,

F_{X_{(j)}}(x) = P(Y \ge j) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1 - F_X(x)]^{n-k},

and the pdf of X_{(j)} is

f_{X_{(j)}}(x) = \frac{d}{dx} F_{X_{(j)}}(x)
= \sum_{k=j}^n \binom{n}{k} \big( k [F_X(x)]^{k-1} [1 - F_X(x)]^{n-k} f_X(x) - (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x) \big)
Proof Cont'd

f_{X_{(j)}}(x) = j \binom{n}{j} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}
+ \sum_{k=j+1}^n \binom{n}{k} k [F_X(x)]^{k-1} [1 - F_X(x)]^{n-k} f_X(x)
- \sum_{k=j}^{n-1} \binom{n}{k} (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x)

= \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}
+ \sum_{k=j}^{n-1} \binom{n}{k+1} (k+1) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x)
- \sum_{k=j}^{n-1} \binom{n}{k} (n-k) [F_X(x)]^k [1 - F_X(x)]^{n-k-1} f_X(x).

Since \binom{n}{k+1}(k+1) = \binom{n}{k}(n-k), the last two sums cancel, leaving the claimed pdf.
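As a numeric check of the resulting pdf (an illustration, not from the slides; the standard normal population and the choices n = 6, j = 2 are my own):

```python
import numpy as np
from math import factorial
from scipy import stats, integrate

# Check that the order-statistic pdf integrates to 1 for a standard normal
# population with n = 6, j = 2.
n, j = 6, 2
c = factorial(n) / (factorial(j - 1) * factorial(n - j))

def f_oj(x):
    F = stats.norm.cdf(x)
    return c * stats.norm.pdf(x) * F ** (j - 1) * (1.0 - F) ** (n - j)

total, _ = integrate.quad(f_oj, -np.inf, np.inf)
```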
Uniform order statistic pdf

Example

Let X_1, X_2, \cdots, X_n be i.i.d. uniform(0, 1), so f_X(x) = 1 for x \in (0, 1) and F_X(x) = x for x \in (0, 1). We obtain that the pdf of the jth order statistic is

f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} x^{j-1} (1-x)^{n-j} = \frac{\Gamma(n+1)}{\Gamma(j)\Gamma(n-j+1)} x^{j-1} (1-x)^{(n-j+1)-1}.

Thus, the jth order statistic from a uniform(0, 1) sample has a Beta(j, n-j+1) distribution. From this we can deduce that

E(X_{(j)}) = \frac{j}{n+1} and Var(X_{(j)}) = \frac{j(n-j+1)}{(n+1)^2(n+2)}.
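The Beta connection can be checked by simulation (an illustration, not from the slides; n, j, the seed, and the sample size are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Simulation: the j-th order statistic of n i.i.d. uniform(0,1) draws should
# be Beta(j, n-j+1), with mean j/(n+1).
rng = np.random.default_rng(6)
n, j = 10, 3
u = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, j - 1]
ks = stats.kstest(u, stats.beta(j, n - j + 1).cdf).statistic
emp_mean = u.mean()                 # should be near j/(n+1) = 3/11
```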
Joint distribution of order statistics (continuous case)

Theorem

Let X_{(1)}, \cdots, X_{(n)} denote the order statistics of a random sample, X_1, X_2, \cdots, X_n, from a continuous population with cdf F_X(x) and pdf f_X(x). Then the joint pdf of X_{(i)} and X_{(j)}, 1 \le i < j \le n, is

f_{X_{(i)},X_{(j)}}(u, v) = \frac{n!}{(i-1)!(j-1-i)!(n-j)!} f_X(u) f_X(v) [F_X(u)]^{i-1} [F_X(v) - F_X(u)]^{j-1-i} [1 - F_X(v)]^{n-j}

for -\infty < u < v < \infty.
Joint distribution of order statistics (continuous case)

Theorem

The joint pdf of three or more order statistics can be derived using similar but even more involved arguments. Perhaps the most useful is the joint pdf of all n order statistics:

f_{X_{(1)},\cdots,X_{(n)}}(x_1, \cdots, x_n) =
  n! f_X(x_1) \cdots f_X(x_n), if x_1 < \cdots < x_n;
  0, otherwise.

This joint pdf and the techniques from the previous chapter can be used to derive marginal and conditional distributions and distributions of other functions of the order statistics.
Convergence Concepts / Convergence in Probability

Outline

1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Convergence in probability

Definition

A sequence of r.v.s, X_1, X_2, \cdots, converges in probability to a r.v. X if, for every \epsilon > 0,

\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0, or \lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1;

we write X_n \xrightarrow{p} X.

The r.v.s X_1, X_2, \cdots are typically not i.i.d. r.v.s, as in a random sample;

the distribution of X_n changes as the subscript changes;

the distribution of X_n converges to some limiting distribution as the subscript becomes large.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 39 / 101
Weak law of large numbers

Theorem

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. Define X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for every ε > 0,

lim_{n→∞} P(|X̄_n − µ| < ε) = 1,

that is, X̄_n converges in probability to µ.

Proof.

Using Chebychev's inequality, for every ε > 0,

P(|X̄_n − µ| < ε) = 1 − P(|X̄_n − µ| ≥ ε) = 1 − P((X̄_n − µ)² ≥ ε²)
                  ≥ 1 − E(X̄_n − µ)²/ε² = 1 − σ²/(nε²) → 1 as n → ∞.
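The WLLN can be checked empirically. The sketch below (all parameters, seeds, and the exponential(1) population are illustrative choices, not from the lecture) estimates P(|X̄_n − µ| ≥ ε) by Monte Carlo for growing n and compares it with the Chebychev bound σ²/(nε²) used in the proof:

```python
import numpy as np

# Monte Carlo check of the WLLN for an exponential(1) population (mu = sigma = 1).
# The seed, epsilon, and sample sizes are illustrative only.
rng = np.random.default_rng(0)
mu, sigma, eps, reps = 1.0, 1.0, 0.1, 2_000

probs, bounds = [], []
for n in (10, 100, 1_000):
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    probs.append(np.mean(np.abs(xbar - mu) >= eps))   # estimate of P(|Xbar_n - mu| >= eps)
    bounds.append(sigma**2 / (n * eps**2))            # Chebychev bound sigma^2 / (n eps^2)
    print(f"n={n:5d}  P(|Xbar_n - mu| >= eps) ~ {probs[-1]:.4f}  Chebychev bound {bounds[-1]:.4f}")
```

The estimated tail probability shrinks toward 0 while always staying below the (loose) Chebychev bound, which is exactly the mechanism of the proof above.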
Consistency of S²

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. If we define

S²_n = (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)²,

can we prove a WLLN for S²_n?
Using Chebychev's inequality, we have

P(|S²_n − σ²| ≥ ε) ≤ E(S²_n − σ²)²/ε² = Var(S²_n)/ε²,

where the last equality holds because E(S²_n) = σ². Thus, a sufficient condition for S²_n to converge in probability to σ² is that Var(S²_n) → 0 as n → ∞.
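A quick simulation makes the sufficient condition concrete. In the sketch below (a normal(0, 1) population with σ² = 1; seed, ε, and sample sizes are illustrative assumptions), Var(S²_n) visibly shrinks with n and so does the tail probability P(|S²_n − σ²| ≥ ε):

```python
import numpy as np

# Simulation sketch: Var(S_n^2) shrinks with n, so S_n^2 -> sigma^2 in probability.
# Normal(0, 1) population (sigma^2 = 1); all parameters are illustrative only.
rng = np.random.default_rng(1)
sigma2, eps, reps = 1.0, 0.2, 5_000

var_s2, tail = [], []
for n in (10, 100, 1_000):
    s2 = rng.normal(size=(reps, n)).var(axis=1, ddof=1)   # S_n^2 uses the 1/(n-1) divisor
    var_s2.append(s2.var())
    tail.append(np.mean(np.abs(s2 - sigma2) >= eps))
    print(f"n={n:5d}  Var(S_n^2) ~ {var_s2[-1]:.5f}  P(|S_n^2 - sigma^2| >= eps) ~ {tail[-1]:.4f}")
```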
Consistency of S?

Theorem

Let X_1, X_2, ··· converge in probability to a r.v. X, and let h be a continuous function. Then h(X_1), h(X_2), ··· converges in probability to h(X).

Consistency of S

If S²_n is a consistent estimator of σ², then the sample standard deviation S_n = √(S²_n) = h(S²_n), with h(t) = √t continuous, is a consistent estimator of σ.
Note that S_n is, in fact, a biased estimator of σ, but the bias disappears asymptotically.
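The last remark can be illustrated numerically: for a normal population with σ = 1, E(S_n) sits below 1 at every finite n but climbs toward it. This is a minimal sketch (seed, sample sizes, and replication count are illustrative assumptions):

```python
import numpy as np

# The continuous mapping h(t) = sqrt(t) carries consistency of S_n^2 over to S_n,
# while S_n stays biased for sigma at every finite n. Illustrative simulation.
rng = np.random.default_rng(2)
reps = 20_000

mean_s = []
for n in (5, 20, 100):
    s = rng.normal(size=(reps, n)).std(axis=1, ddof=1)   # S_n = sqrt(S_n^2), true sigma = 1
    mean_s.append(s.mean())
    print(f"n={n:4d}  E(S_n) ~ {mean_s[-1]:.4f}  (true sigma = 1)")
```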
Convergence Concepts: Almost Sure Convergence

Almost Sure Convergence

A type of convergence that is stronger than convergence in probability is almost sure convergence.

Definition

A sequence of r.v.s X_1, X_2, ··· converges almost surely to a r.v. X if, for every ε > 0,

P(lim_{n→∞} |X_n − X| < ε) = 1; we write X_n →a.s. X.

Although the two definitions look similar, they are very different statements;
Given a sample space Ω, X_n(ω) and X(ω) are all functions defined on Ω. X_n →a.s. X indicates that the functions X_n(ω) converge to X(ω) for all ω ∈ Ω except perhaps for ω ∈ N, where N ⊂ Ω and P(N) = 0.
Almost Sure Convergence

Example

Let the sample space Ω be the closed interval [0, 1] with the uniform pdf. Define r.v.s

X_n(ω) = ω + ωⁿ, and X(ω) = ω.

For every ω ∈ [0, 1), ωⁿ → 0 as n → ∞, so X_n(ω) → ω = X(ω). However, X_n(1) = 2 for every n, so X_n(1) does not converge to 1 = X(1).
But since the convergence occurs on the set [0, 1) and P([0, 1)) = 1, we conclude that

X_n →a.s. X.
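The pointwise behavior in this example is easy to tabulate. The short sketch below (the grid of ω values and of n is an illustrative choice) shows X_n(ω) = ω + ωⁿ settling to X(ω) = ω for every ω < 1, while the single exceptional point ω = 1 stays at 2:

```python
# Pointwise check of X_n(w) = w + w**n against X(w) = w at a few sample points.
# Convergence holds for every w in [0, 1); it fails only at the single point w = 1,
# a set of probability zero under the uniform pdf on [0, 1].
for w in (0.0, 0.5, 0.9, 1.0):
    vals = [w + w**n for n in (1, 10, 100, 1000)]
    print(w, [round(v, 6) for v in vals])
```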
Convergence in probability, not a.s.

Example

Let the sample space Ω be the closed interval [0, 1] with the uniform pdf. Define r.v.s

X_1(ω) = ω + I_[0,1](ω),    X_2(ω) = ω + I_[0,1/2](ω),    X_3(ω) = ω + I_[1/2,1](ω),
X_4(ω) = ω + I_[0,1/3](ω),  X_5(ω) = ω + I_[1/3,2/3](ω),  X_6(ω) = ω + I_[2/3,1](ω), ···

Let X(ω) = ω. Since P(|X_n − X| ≥ ε) equals the length of the n-th indicator interval, which shrinks to 0, X_n →p X.
For every ω, however, the value of X_n(ω) alternates between the values ω and 1 + ω infinitely often. Thus,

X_n does not converge to X almost surely.
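The moving-indicator construction can be enumerated explicitly. In the sketch below, `interval` is a hypothetical helper (not from the lecture) that returns the subinterval attached to X_n: block m splits [0, 1] into m pieces, so the interval length (= the probability that X_n differs from X) shrinks, yet any fixed ω keeps landing inside some interval infinitely often:

```python
# Sketch of the moving-indicator construction: block m = 1, 2, ... splits [0, 1]
# into m subintervals, and the n-th r.v. adds the indicator of one of them.
def interval(n):
    """Subinterval [a, b] attached to X_n (n = 1, 2, ...); hypothetical helper."""
    m, start = 1, 1
    while start + m <= n:     # advance past completed blocks
        start += m
        m += 1
    k = n - start             # position within block m
    return k / m, (k + 1) / m

omega = 0.3  # any fixed sample point
hits = [n for n in range(1, 200) if interval(n)[0] <= omega <= interval(n)[1]]
lengths = [interval(n)[1] - interval(n)[0] for n in range(1, 200)]
print(hits[:6])        # X_n(omega) = omega + 1 at these n -- infinitely often
print(lengths[-1])     # P(|X_n - X| >= eps) = interval length -> 0
```

The two printed facts are exactly the two halves of the example: vanishing exceedance probability (convergence in probability) but no pathwise convergence.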
Strong law of large numbers

Theorem

Let X_1, X_2, ··· be i.i.d. r.v.s with E(X_i) = µ and Var(X_i) = σ² < ∞. Define X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for every ε > 0,

P(lim_{n→∞} |X̄_n − µ| < ε) = 1, i.e., X̄_n →a.s. µ.

For both the weak and strong laws of large numbers we have assumed a finite variance. Both laws in fact hold without this assumption; the only moment condition needed is E|X_i| < ∞;
Almost sure convergence, being the stronger criterion, implies convergence in probability;
If a sequence converges in probability, it is possible to find a subsequence that converges almost surely.
Convergence Concepts: Convergence in Distribution

Convergence in distribution

Definition

A sequence of r.v.s X_1, X_2, ··· converges in distribution to a r.v. X if

lim_{n→∞} F_{X_n}(x) = F_X(x)

at all points x where F_X(x) is continuous.

Example

If the r.v.s X_1, X_2, ··· are i.i.d. uniform(0, 1) and X_(n) = max_{1≤i≤n} X_i, let us examine whether X_(n) converges in distribution.
As n → ∞, we expect X_(n) to get close to 1. Since X_(n) must necessarily be less than 1, we have for any ε > 0

P(|X_(n) − 1| ≥ ε) = P(X_(n) ≥ 1 + ε) + P(X_(n) ≤ 1 − ε) = 0 + P(X_(n) ≤ 1 − ε).
Example Cont'd

Next, using the fact that we have an i.i.d. sample, we can write

P(X_(n) ≤ 1 − ε) = P(X_i ≤ 1 − ε, i = 1, ···, n) = (1 − ε)ⁿ,

which goes to 0. So we have proved that X_(n) converges to 1 in probability. However, if we take ε = t/n, we then have

P(X_(n) ≤ 1 − t/n) = (1 − t/n)ⁿ → e^{−t},

which, upon rearranging, yields

P(n(1 − X_(n)) ≤ t) → 1 − e^{−t};

that is, the r.v. n(1 − X_(n)) converges in distribution to an exponential(1) r.v.
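This limit is easy to see in simulation. The sketch below (sample size, replication count, and seed are illustrative assumptions) compares the empirical cdf of n(1 − X_(n)) with the exponential(1) cdf 1 − e^{−x}:

```python
import numpy as np

# Empirical check that n(1 - X_(n)) for uniform(0, 1) samples is approximately
# exponential(1). Sample size and replication count are illustrative.
rng = np.random.default_rng(3)
n, reps = 200, 20_000
t = n * (1.0 - rng.uniform(size=(reps, n)).max(axis=1))   # n(1 - X_(n))

errs = []
for x in (0.5, 1.0, 2.0):
    emp, theory = np.mean(t <= x), 1.0 - np.exp(-x)
    errs.append(abs(emp - theory))
    print(f"x={x}  empirical cdf {emp:.4f}  1 - e^(-x) = {theory:.4f}")
```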
Central limit theorem

Theorem

Let X_1, X_2, ··· be a sequence of i.i.d. r.v.s whose mgfs exist in a neighborhood of 0 (i.e., M_{X_i}(t) exists for |t| < h, for some positive h). Let E(X_i) = µ and Var(X_i) = σ² > 0. Define

X̄_n = (1/n) Σ_{i=1}^n X_i.

Let G_n(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x, −∞ < x < ∞,

lim_{n→∞} G_n(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.
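Before the proof, the statement itself can be checked numerically: G_n(x) should be close to Φ(x) for moderate n. A minimal sketch, assuming a uniform(0, 1) population (so µ = 1/2, σ² = 1/12) and illustrative n, seed, and replication count; Φ is built from the error function:

```python
import numpy as np
from math import erf, sqrt

# G_n(x) for uniform(0, 1) summands versus the standard normal cdf Phi(x).
rng = np.random.default_rng(4)
n, reps = 200, 20_000
mu, sigma = 0.5, sqrt(1.0 / 12.0)                        # uniform(0, 1) moments
z = (rng.uniform(size=(reps, n)).mean(axis=1) - mu) * sqrt(n) / sigma

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

diffs = []
for x in (-1.0, 0.0, 1.0):
    gn = np.mean(z <= x)                                 # empirical G_n(x)
    diffs.append(abs(gn - Phi(x)))
    print(f"x={x:+.1f}  G_n(x) ~ {gn:.4f}  Phi(x) = {Phi(x):.4f}")
```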
The proof of the central limit theorem

Proof.

We will show that, for |t| < h, the mgf of √n(X̄_n − µ)/σ converges to e^{t²/2}, the mgf of a N(0, 1) r.v.
Define Y_i = (X_i − µ)/σ, and let M_Y(t) denote the common mgf of the Y_i's, which exists for |t| < σh. We have

M_{√n(X̄_n−µ)/σ}(t) = M_{Σ_{i=1}^n Y_i/√n}(t) = M_{Σ_{i=1}^n Y_i}(t/√n) = (M_Y(t/√n))ⁿ.

We now expand M_Y(t/√n) in a Taylor series around 0. We have

M_Y(t/√n) = Σ_{k=0}^∞ M_Y^{(k)}(0) (t/√n)ᵏ / k!,
Proof Cont'd

where M_Y^{(k)}(0) = (dᵏ/dtᵏ) M_Y(t)|_{t=0}. Since the mgf exists for |t| < σh, the power series expansion is valid for |t| < √n σh.
Using the facts that M_Y^{(0)}(0) = 1, M_Y^{(1)}(0) = 0, and M_Y^{(2)}(0) = 1 (the Y_i have mean 0 and variance 1), we have

M_Y(t/√n) = 1 + (t/√n)²/2! + R_Y(t/√n),

where R_Y is the remainder term in the Taylor expansion,

R_Y(t/√n) = Σ_{k=3}^∞ M_Y^{(k)}(0) (t/√n)ᵏ / k!.

An application of Taylor's Theorem (5.5.21) shows that, for fixed t ≠ 0, we have

lim_{n→∞} R_Y(t/√n) / (t/√n)² = 0.
Proof Cont'd

Since t is fixed, we also have

lim_{n→∞} n R_Y(t/√n) = t² lim_{n→∞} R_Y(t/√n) / (t/√n)² = 0,

and this is also true at t = 0 since R_Y(0/√n) = 0. Thus, for any fixed t, we can write

lim_{n→∞} (M_Y(t/√n))ⁿ = lim_{n→∞} [1 + (t/√n)²/2! + R_Y(t/√n)]ⁿ
                        = lim_{n→∞} [1 + (1/n)(t²/2 + n R_Y(t/√n))]ⁿ
                        = e^{t²/2}.

Note that e^{t²/2} is the mgf of the N(0, 1) distribution.
Strong form of the central limit theorem

Theorem

Let X_1, X_2, ··· be a sequence of i.i.d. r.v.s with E(X_i) = µ and 0 < Var(X_i) = σ² < ∞. Define

X̄_n = (1/n) Σ_{i=1}^n X_i.

Let G_n(x) denote the cdf of √n(X̄_n − µ)/σ. Then, for any x,

lim_{n→∞} G_n(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy;

i.e., √n(X̄_n − µ)/σ has a limiting standard normal distribution.

In particular, none of the assumptions about mgfs are needed: the use of characteristic functions can replace them.
This theorem is general enough for almost all statistical purposes; it holds whenever the parent distribution has finite variance.
Normal approximation to the negative binomial

Let X_1, X_2, ··· be a random sample from a negative binomial(r, p) distribution. Recall that

E(X) = r(1 − p)/p, and Var(X) = r(1 − p)/p²,

and the central limit theorem tells us that

√n(X̄ − r(1 − p)/p) / √(r(1 − p)/p²) → N(0, 1).

For r = 10, p = 1/2, and n = 30, the exact and approximate calculations are

P(X̄ ≤ 11) = P(Σ_{i=1}^{30} X_i ≤ 330) = ··· = 0.8916  (exact),

P(X̄ ≤ 11) = P(√30(X̄ − 10)/√20 ≤ √30(11 − 10)/√20) ≈ 0.8888  (CLT approximation).
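Both numbers on the slide can be reproduced. A sketch, assuming the negative binomial pmf P(X = x) = C(r+x−1, x) p^r (1−p)^x for x = 0, 1, 2, ···: the sum of 30 i.i.d. negative binomial(10, 1/2) r.v.s is negative binomial(300, 1/2), so the exact probability is a finite cdf sum, computed here via the pmf recursion P(k+1)/P(k) = (r+k)(1−p)/(k+1); the approximation is Φ(√30/√20):

```python
from math import erf, sqrt

# Exact: P(sum <= 330) for sum ~ negative binomial(300, 1/2).
r, p = 300, 0.5
pmf, cdf = p ** r, 0.0
for k in range(331):
    cdf += pmf
    pmf *= (r + k) / (k + 1) * (1 - p)   # recursion P(k+1)/P(k) = (r+k)(1-p)/(k+1)

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

approx = Phi(sqrt(30.0) * (11 - 10) / sqrt(20.0))   # CLT approximation
print(f"exact {cdf:.4f}   normal approximation {approx:.4f}")
```

The exact value matches the slide's 0.8916; the approximation agrees with the slide's 0.8888 to about three decimals (the slide likely rounded the z-value when reading a normal table).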
Convergence Concepts Convergence in Distribution
Slutsky’s Theorem
Let X_n → X in distribution and Y_n → a, a constant, in probability. Then

a. Y_n X_n → aX in distribution;

b. X_n + Y_n → X + a in distribution.

Let √n(X̄_n − µ)/σ → N(0, 1), but suppose the value of σ is unknown.

If lim_{n→∞} Var(S_n²) = 0, then S_n² → σ² in probability, i.e., σ/S_n → 1 in probability.

Hence, Slutsky's theorem tells us

√n(X̄_n − µ)/S_n = (σ/S_n) · √n(X̄_n − µ)/σ → N(0, 1).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 57 / 101
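The Slutsky argument can be seen in a small simulation: the studentized mean √n(X̄_n − µ)/S_n should behave like a standard normal even for a skewed parent distribution. The sketch below uses exponential(1) data (a choice of ours, not from the slides) and checks that about 95% of the statistics fall within ±1.96.

```python
import math
import random

rng = random.Random(7)

def studentized_mean(n, lam):
    """sqrt(n)(Xbar - mu)/S_n for an exponential(lam) sample (mu = lam)."""
    xs = [-lam * math.log(1.0 - rng.random()) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (xbar - lam) / math.sqrt(s2)

reps, n = 5000, 200
stats = [studentized_mean(n, lam=1.0) for _ in range(reps)]
coverage = sum(abs(t) <= 1.96 for t in stats) / reps
print(round(coverage, 2))   # close to 0.95 if the limit is N(0,1)
```

Replacing σ by S_n costs nothing in the limit, which is exactly what Slutsky's theorem guarantees.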
Generating a Random Sample
Exponential lifetime
Example
Suppose that a particular electrical component is to be modeled with an exponential(λ) lifetime. The manufacturer is interested in determining the probability that, out of c components, at least t of them will last h hours.

Taking this one step at a time, we have

p_1 = P(component lasts at least h hours) = P(X ≥ h|λ),

and assuming that the components are independent, we can model the outcomes of the c components as Bernoulli trials, so

p_2 = P(at least t components last h hours) = ∑_{k=t}^{c} C(c, k) p_1^k (1 − p_1)^{c−k}.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 58 / 101
Generating a Random Sample
Example Cont’d
Moreover, the exponential model has the advantage that p_1 can be expressed in closed form, i.e.,

p_1 = ∫_{h}^{∞} (1/λ) e^{−x/λ} dx = e^{−h/λ}.

However, if each component were modeled with, say, a gamma distribution, then p_1 may not be expressible in closed form, which would make the calculation of p_2 even more involved.

A simulation approach to such calculations is to generate r.v.s with the desired distribution and then use the Weak Law of Large Numbers to validate the simulation. That is, if Y_i, i = 1, 2, · · · , n, are i.i.d., then a consequence of that theorem is

(1/n) ∑_{i=1}^{n} h(Y_i) → E(h(Y)) in probability.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 59 / 101
Generating a Random Sample
Approximate calculation
The probability p_2 can be calculated using the following steps. For j = 1, · · · , n:

a. Generate X_1, · · · , X_c i.i.d. ∼ exponential(λ);

b. Set Y_j = 1 if at least t of the X_i s are ≥ h; otherwise, set Y_j = 0.

Then, because Y_j ∼ Bernoulli(p_2) and E(Y_j) = p_2,

(1/n) ∑_{j=1}^{n} Y_j → p_2 in probability.

Two major concerns should be emphasized:

We must examine how to generate the r.v.s that we need;

We then use a version of the Law of Large Numbers to validate the simulation approximation.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 60 / 101
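The steps above can be sketched as a short simulation. The parameter values λ = 2, c = 5, t = 3, h = 1 are illustrative choices of ours, not from the slides; since p_1 = e^{−h/λ} is available in closed form here, the Monte Carlo estimate can be compared with the exact binomial sum.

```python
import math
import random

def p2_simulation(lam, c, t, h, n, seed=0):
    """Estimate P(at least t of c exponential(lam) lifetimes exceed h)
    by averaging n Bernoulli indicators Y_j, per the WLLN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # inverse-cdf draws from the scale-lam exponential
        lifetimes = [-lam * math.log(1.0 - rng.random()) for _ in range(c)]
        if sum(x >= h for x in lifetimes) >= t:
            hits += 1
    return hits / n

lam, c, t, h = 2.0, 5, 3, 1.0
p1 = math.exp(-h / lam)
# exact: p_2 = sum_{k=t}^{c} C(c,k) p1^k (1-p1)^(c-k)
exact = sum(math.comb(c, k) * p1 ** k * (1 - p1) ** (c - k)
            for k in range(t, c + 1))
est = p2_simulation(lam, c, t, h, n=20000)
print(round(exact, 3), round(est, 3))
```

With these values the simulated average settles near the exact p_2, illustrating both concerns: generating the needed r.v.s and validating the result through the law of large numbers.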
Generating a Random Sample
Pseudo-random number generation
We start with the assumption that we can generate i.i.d. uniform r.v.s U_1, · · · , U_m. There exist many algorithms for generating pseudo-random numbers that will pass almost all tests for uniformity. Since we are starting from uniform r.v.s, our problem here is really not that of generating the desired r.v.s, but rather of transforming the uniform r.v.s to the desired distribution.
There are two general methods for doing this
Direct model
Indirect model
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 61 / 101
Generating a Random Sample Direct Model
Probability integral transform
A direct method of generating a r.v. is one for which there exists a closed-form function g(u) such that the transformed variable Y = g(U) has the desired distribution when U ∼ uniform(0, 1). As might be recalled, this was already accomplished for continuous r.v.s in the probability integral transform, where any continuous distribution was transformed to the uniform. Hence the inverse transformation solves our problem.

Example

If Y is a continuous r.v. with cdf F_Y, then the r.v. F_Y^{−1}(U), where U ∼ uniform(0, 1), has distribution F_Y. If Y ∼ exponential(λ), then

F_Y^{−1}(U) = −λ log(1 − U)

is an exponential(λ) r.v.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 64 / 101
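A minimal sketch of this inverse transformation: draw uniforms and map them through F_Y^{−1}(u) = −λ log(1 − u) for the scale-λ exponential. The sample size and seed are illustrative.

```python
import math
import random

def exp_inverse_transform(lam, n, seed=1):
    """Sample exponential(lam) (pdf (1/lam) e^{-x/lam}, so E(Y) = lam)
    via Y = -lam * log(1 - U), U ~ uniform(0, 1)."""
    rng = random.Random(seed)
    return [-lam * math.log(1.0 - rng.random()) for _ in range(n)]

ys = exp_inverse_transform(lam=2.0, n=50000)
mean = sum(ys) / len(ys)
print(round(mean, 2))   # should be close to lam = 2
```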
Generating a Random Sample Direct Model
Analysis of probability integral transform
Thus, if we generate U_1, · · · , U_n as i.i.d. uniform r.v.s, the Y_i = −λ log(1 − U_i), i = 1, · · · , n, are i.i.d. exponential(λ) r.v.s. As an example, for n = 10,000 generated u_i we have

(1/n) ∑ u_i = 0.5019 and (1/(n − 1)) ∑ (u_i − ū)² = 0.0842.

According to the WLLN, we know that

Ū → E(U) = 1/2 and S² → Var(U) = 1/12 ≈ 0.0833.

The transformed variables Y_i = −2 log(1 − u_i) have an exponential(2) distribution, and we find that

(1/n) ∑ y_i = 2.0004 and (1/(n − 1)) ∑ (y_i − ȳ)² = 4.0908.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 65 / 101
Generating a Random Sample Direct Model
Analysis of probability integral transform
The relationship between the exponential and other distributions allows the quick generation of many r.v.s. For example,

Y_j = −λ log(U_j) ∼ exponential(λ),

Y = −2 ∑_{j=1}^{ν} log(U_j) ∼ χ²_{2ν},

Y = −β ∑_{j=1}^{a} log(U_j) ∼ gamma(a, β),

Y = ∑_{j=1}^{a} log(U_j) / ∑_{j=1}^{a+b} log(U_j) ∼ beta(a, b).

Unfortunately, there are limits to this transformation. For example, we cannot use it to generate χ² r.v.s with odd degrees of freedom.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 66 / 101
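Two of the transformations above can be sketched directly; the parameter values (shape 3, scale 2 for the gamma, and a = 2, b = 6 for the beta) are illustrative. Note the beta construction reuses the numerator's uniforms inside the denominator, exactly as in the formula.

```python
import math
import random

rng = random.Random(2)

def gamma_from_uniforms(a, beta, rng):
    """gamma(a, beta) for integer shape a: Y = -beta * sum_{j=1}^{a} log U_j."""
    return -beta * sum(math.log(1.0 - rng.random()) for _ in range(a))

def beta_from_uniforms(a, b, rng):
    """beta(a, b) for integer a, b: ratio of the first a log-uniforms
    to all a+b log-uniforms (numerator terms shared with denominator)."""
    num = sum(math.log(1.0 - rng.random()) for _ in range(a))
    den = num + sum(math.log(1.0 - rng.random()) for _ in range(b))
    return num / den

n = 40000
g_mean = sum(gamma_from_uniforms(3, 2.0, rng) for _ in range(n)) / n
b_mean = sum(beta_from_uniforms(2, 6, rng) for _ in range(n)) / n
print(round(g_mean, 2), round(b_mean, 3))  # near a*beta = 6 and a/(a+b) = 0.25
```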
Generating a Random Sample Direct Model
Box-Muller algorithm
F_Y^{−1}(u) = y ⇔ u = ∫_{−∞}^{y} f_Y(t) dt.

Application of this formula to the exponential distribution was particularly handy, as the integral equation has a simple solution. However, in many cases there is no closed-form solution; then other types of transformations or indirect methods may be feasible.

Box-Muller algorithm

Generate U_1 and U_2, two independent uniform(0, 1) r.v.s, and set

R = √(−2 log U_1) and θ = 2πU_2.

Then

X = R cos θ ∼ N(0, 1) and Y = R sin θ ∼ N(0, 1).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 67 / 101
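The Box-Muller algorithm translates almost line for line into code; each pair of uniforms yields two independent standard normals. (We use 1 − U in the log to avoid log 0; this changes nothing in distribution.)

```python
import math
import random

def box_muller(n, seed=3):
    """Generate n i.i.d. N(0,1) variates from pairs of uniforms."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()            # in (0, 1], keeps log finite
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        out.append(r * math.cos(theta))
        out.append(r * math.sin(theta))
    return out[:n]

xs = box_muller(50000)
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean ** 2
print(round(mean, 2), round(var, 2))   # near 0 and 1 for N(0,1)
```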
Generating a Random Sample Direct Model
Generating discrete r.v.s
Generating process

If Y is a discrete r.v. taking on values y_1 < y_2 < · · · < y_k, then we can similarly write

P[F_Y(y_i) < U ≤ F_Y(y_{i+1})] = F_Y(y_{i+1}) − F_Y(y_i) = P(Y = y_{i+1}).

To generate Y ∼ F_Y:

a. Generate U ∼ uniform(0, 1);

b. If F_Y(y_i) < U ≤ F_Y(y_{i+1}), set Y = y_{i+1},

where we define y_0 = −∞ and F_Y(y_0) = 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 68 / 101
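The discrete inverse-cdf rule above can be sketched as follows; the three-point distribution is a hypothetical example of ours.

```python
import random

def discrete_sample(values, probs, rng):
    """Inverse-cdf sampling: return y_{i+1} when F(y_i) < U <= F(y_{i+1})."""
    u = rng.random()
    cdf = 0.0
    for y, p in zip(values, probs):
        cdf += p
        if u <= cdf:
            return y
    return values[-1]   # guard against floating-point round-off in the cdf

rng = random.Random(4)
values, probs = [1, 2, 3], [0.2, 0.5, 0.3]
n = 30000
counts = {y: 0 for y in values}
for _ in range(n):
    counts[discrete_sample(values, probs, rng)] += 1
print({y: round(c / n, 2) for y, c in counts.items()})
```

The empirical frequencies converge to the pmf, again by the WLLN.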
Generating a Random Sample Indirect Model
Beta r.v. generation I
Suppose the goal is to generate Y ∼ beta(a, b). If both a and b are integers, then the direct transformation method can be used. However, suppose a and b are not integers, for example a = 2.7 and b = 6.3.

If (U, V) are independent uniform(0, 1) r.v.s, then the probability of the shaded area (the region below f_Y(v)/c and to the left of y) is

P(V ≤ y, U ≤ (1/c) f_Y(V)) = ∫_{0}^{y} ∫_{0}^{f_Y(v)/c} du dv = (1/c) ∫_{0}^{y} f_Y(v) dv = (1/c) P(Y ≤ y).

Here, if y = 1, we have

P(U ≤ (1/c) f_Y(V)) = 1/c.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 70 / 101
Generating a Random Sample Indirect Model
Rejection sampling
Hence, we have

P(Y ≤ y) = P(V ≤ y, U ≤ (1/c) f_Y(V)) / P(U ≤ (1/c) f_Y(V)) = P(V ≤ y | U ≤ (1/c) f_Y(V)).

This result suggests the following rejection sampling algorithm for generating Y ∼ beta(a, b):

a. Generate (U, V) independent uniform(0, 1);

b. If U < (1/c) f_Y(V), set Y = V; otherwise, return to step (a).

This algorithm generates a beta(a, b) r.v. as long as c ≥ max_y f_Y(y) and, in fact, can be generalized to any bounded density with bounded support.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 71 / 101
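The rejection algorithm for beta(2.7, 6.3) can be sketched directly. We use the optimal c = max_y f_Y(y), evaluated at the beta mode (a − 1)/(a + b − 2) (a standard fact for a, b > 1), and a uniform proposal V.

```python
import math
import random

a, b = 2.7, 6.3
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # beta function B(a,b)

def f_beta(y):
    """beta(a, b) density."""
    return y ** (a - 1) * (1 - y) ** (b - 1) / B

# optimal c = max_y f_Y(y); for a, b > 1 the mode is (a-1)/(a+b-2)
c = f_beta((a - 1) / (a + b - 2))

def rejection_sample(rng):
    """Accept V ~ uniform(0,1) as Y ~ beta(a,b) when U < f_Y(V)/c."""
    while True:
        u, v = rng.random(), rng.random()
        if u < f_beta(v) / c:
            return v

rng = random.Random(5)
ys = [rejection_sample(rng) for _ in range(30000)]
y_mean = sum(ys) / len(ys)
print(round(y_mean, 3))   # should be near a/(a+b) = 0.3
```

On average c (about 2.7 here) uniform pairs are needed per accepted draw, which motivates the efficiency analysis that follows.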
Generating a Random Sample Indirect Model
Analysis of rejection sampling
It should be clear that the optimal choice of c is c = max_y f_Y(y). To see why this is so, note that the algorithm is open-ended in the sense that we do not know how many (U, V) pairs will be needed to get one r.v. Y.

Let N be the number of (U, V) pairs required for one Y. Then, recalling that P(U ≤ (1/c) f_Y(V)) = 1/c, we see that N is a geometric(1/c) r.v. Thus, to generate one Y we expect to need E(N) = c pairs (U, V), and in this sense minimizing c will optimize the algorithm.

Examining the figure, we see that the algorithm is wasteful in the area where U > (1/c) f_Y(V). This is because we are using a uniform r.v. V to get a beta r.v. Y. To improve, we might start with something closer to the beta distribution.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 72 / 101
Generating a Random Sample Indirect Model
Theorem of Accept/Reject sampling
Let Y ∼ f_Y(y) and V ∼ f_V(v), where f_Y and f_V have common support with

M = sup_y f_Y(y)/f_V(y) < ∞.

To generate a r.v. Y ∼ f_Y:

a. Generate U ∼ uniform(0, 1) and V ∼ f_V, independent;

b. If U < (1/M) f_Y(V)/f_V(V), set Y = V; otherwise, return to step (a).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 73 / 101
Generating a Random Sample Indirect Model
Proof
Proof.
The generated r.v. Y has cdf

P(Y ≤ y) = P(V ≤ y | stop) = P(V ≤ y | U < (1/M) f_Y(V)/f_V(V))

         = P(V ≤ y, U < (1/M) f_Y(V)/f_V(V)) / P(U < (1/M) f_Y(V)/f_V(V))

         = [∫_{−∞}^{y} ∫_{0}^{(1/M) f_Y(v)/f_V(v)} du f_V(v) dv] / [∫_{−∞}^{+∞} ∫_{0}^{(1/M) f_Y(v)/f_V(v)} du f_V(v) dv]

         = ∫_{−∞}^{y} f_Y(v) dv.

Hence, the number of trials needed to generate one Y is a geometric(1/M) r.v., and M is the expected number of trials.
Beta r.v. generation II
To generate Y ∼ beta(2.7, 6.3), we can consider the following algorithm:
a. Generate U ∼ uniform(0, 1) and V ∼ beta(2, 6), independent;
b. If U < (1/M) f_Y(V)/f_V(V), set Y = V; otherwise, return to step (a).
This Accept/Reject algorithm will generate the required Y as long as sup_y f_Y(y)/f_V(y) ≤ M < ∞. For the given densities, we have M = 1.67, so the requirement is satisfied.
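The Accept/Reject algorithm above is straightforward to sketch in code. The sketch below is not from the slides; it uses only the Python standard library (`random.betavariate` supplies the beta(2, 6) proposal) and takes M = 1.67 from the slide.

```python
import math
import random

def beta_pdf(x, a, b):
    """Density of a beta(a, b) r.v. on (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1.0 - x) ** (b - 1)

def accept_reject(target_pdf, proposal_draw, proposal_pdf, M, size):
    """Step (a): draw U ~ uniform(0, 1) and V from the proposal, independent.
    Step (b): accept V as a target draw when U < target_pdf(V) / (M * proposal_pdf(V))."""
    out = []
    while len(out) < size:
        v = proposal_draw()
        u = random.random()
        if u < target_pdf(v) / (M * proposal_pdf(v)):
            out.append(v)
    return out

random.seed(0)
# Y ~ beta(2.7, 6.3) via the proposal V ~ beta(2, 6), with M = 1.67 as on the slide
samples = accept_reject(
    target_pdf=lambda x: beta_pdf(x, 2.7, 6.3),
    proposal_draw=lambda: random.betavariate(2, 6),
    proposal_pdf=lambda x: beta_pdf(x, 2, 6),
    M=1.67,
    size=5000,
)
print(sum(samples) / len(samples))  # close to E[Y] = 2.7/9 = 0.3
```

On average each accepted draw costs M = 1.67 proposal pairs, matching the geometric(1/M) trial count from the proof.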
Markov Chain Monte Carlo
The Main idea of MCMC
We cannot sample directly from the target distribution p(x) in the integral

∫ f(x)p(x)dx.

Create a Markov chain whose transition matrix does not depend on the normalization term.
Make sure the chain has a stationary distribution and that it equals the target distribution.
After a sufficient number of iterations, the chain converges to the stationary distribution.
Markov Chain Monte Carlo
Overview
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution, based on constructing a Markov chain that has the desired distribution as its stationary distribution.
The algorithm was proposed in 1953 and is counted among the top-10 most important algorithms of the 20th century.
MCMC works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution π(i).
That is, the Markov chain has stationary distribution π(i) associated with transition probability matrix P.
The Markov chain converges to the stationary distribution π(i) from an arbitrary initial state x_0.
Markov Chain Monte Carlo Markov Chain and Random Walk
Outline
1 Sampling from the Normal Distribution
  Properties of the Sample Mean and Variance
  The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
  Convergence in Probability
  Almost Sure Convergence
  Convergence in Distribution
4 Generating a Random Sample
  Direct Model
  Indirect Model
5 Markov Chain Monte Carlo
  Markov Chain and Random Walk
  MCMC Sampling Algorithm
  Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
Random process
A random process is a collection of r.v.s indexed by some set I, taking values in a set S.
I is the index set, usually time, e.g., Z^+, R, and R^+, etc.
S is the state space, e.g., Z^+, R^n, and {a, b, c}, etc.
We classify random processes according to:
the index set, which can be discrete or continuous;
the state space, which can be finite, countable, or uncountable (continuous).
Markov property
More formally, X(t) is Markovian if the following property holds:

P(X(t_n) = j_n | X(t_{n−1}) = j_{n−1}, ..., X(t_1) = j_1) = P(X(t_n) = j_n | X(t_{n−1}) = j_{n−1})

for all finite sequences of times t_1 < ... < t_n ∈ I and states j_1, ..., j_n ∈ S.
A random process is called a Markov process if, conditional on the current state, its future is independent of its past.
A process X_0, X_1, ... satisfies the Markov property if

P(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n)

for all n and all x_i ∈ Ω.
The term Markov property refers to the memoryless property of a stochastic process.
Discrete time Markov Chain
A stochastic process X_0, X_1, ... in discrete time with discrete state space is a Markov chain if it satisfies the Markov property, i.e.,

P(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n)

for all n and all x_i ∈ Ω.
A Markov chain X_t is said to be time homogeneous if P(X_{s+t} = j | X_s = i) is independent of s. When this holds, putting s = 0 gives

P(X_{s+t} = j | X_s = i) = P(X_t = j | X_0 = i).

If, moreover, P(X_{n+1} = j | X_n = i) = P_{ij} is independent of n, then X is said to be a time-homogeneous Markov chain.
Examples
Ω = {A, B}; π(A, B) = q, π(A, A) = 1 − q, π(B, A) = r, π(B, B) = 1 − r for some 0 < q, r < 1, and µ(A) = 1, µ(B) = 0, so that at time 0 we always start at A.
Ω = Z; π(a, a − 1) = 1/2, π(a, a + 1) = 1/2, and π(a, b) = 0 if b ≠ a ± 1 for every a ∈ Z; µ(0) = 1 and µ(a) = 0 if a ≠ 0, so at time 0 we always start at 0.
A graph G = (V, E) consists of a vertex set V and an edge set E, where the elements of E are unordered pairs of vertices: E ⊆ {{x, y} : x, y ∈ V, x ≠ y}. The degree deg(x) of a vertex x is the number of neighbours of x. In this case, Ω = V and is finite, π(a, b) = 1/deg(a) if b is a neighbour of a, and π(a, b) = 0 otherwise; µ(a) = 1/|V| for every a ∈ V, so at time 0 we start from the uniform distribution on V.
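As a quick sanity check of the first example, the two-state chain can be simulated directly. This sketch is not from the slides; q = 0.3 and r = 0.6 are illustrative values satisfying 0 < q, r < 1.

```python
import random

# Two-state chain on {A, B}: from A move to B with prob q, from B to A with prob r.
q, r = 0.3, 0.6
random.seed(1)

state, visits_A, steps = "A", 0, 200_000  # start at A, as in the example (mu(A) = 1)
for _ in range(steps):
    visits_A += (state == "A")
    if state == "A":
        state = "B" if random.random() < q else "A"
    else:
        state = "A" if random.random() < r else "B"

# Long-run fraction of time in A approaches r / (q + r) = 2/3,
# anticipating the stationary-distribution results later in the lecture.
print(visits_A / steps)
```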
Transition matrix
Definition
Let a Markov chain have P^(t+1)_{x,y} = P[X_{t+1} = y | X_t = x], and let the finite state space be Ω = [N]. This gives us a transition matrix P^(t+1) at time t: an N × N matrix of nonnegative entries in which every row sums to 1, since for all t and all x ∈ Ω,

Σ_y P^(t+1)_{x,y} = Σ_y P[X_{t+1} = y | X_t = x] = 1.

For example,

P^(t+1) = [ 1/2  1/2   0    0
            1/3  2/3   0    0
             0    0   3/4  1/4
             0    0   1/4  3/4 ]

P^(t+1)_{1,2} = P[X_{t+1} = 2 | X_t = 1] = 1/2 and P^(t+1)_{1,3} = P[X_{t+1} = 3 | X_t = 1] = 0;
Σ_{i=1}^{4} P^(t+1)_{1,i} = 1.
State distribution
Definition
Let π^(t) be the state distribution of the chain at time t, i.e., π^(t)_x = P[X_t = x].

For a finite chain, π^(t) is a vector of N nonnegative entries such that Σ_x π^(t)_x = 1. Then it holds that π^(t+1) = π^(t) P^(t+1): applying the law of total probability,

π^(t+1)_y = Σ_x P[X_{t+1} = y | X_t = x] P[X_t = x] = Σ_x π^(t)_x P^(t+1)_{x,y} = (π^(t) P^(t+1))_y.

Let π^(t) = (0.4, 0.6, 0, 0) be a state distribution; then π^(t+1) = (0.4, 0.6, 0, 0).
Let π^(t) = (0, 0, 0.5, 0.5) be a state distribution; then π^(t+1) = (0, 0, 0.5, 0.5).
Let π^(t) = (0.1, 0.9, 0, 0) be a state distribution; then π^(t+1) = (0.35, 0.65, 0, 0).
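The update π^(t+1) = π^(t) P^(t+1) and the three numerical claims above can be checked exactly with the 4-state example matrix P from the transition-matrix slide (a sketch using exact fractions):

```python
from fractions import Fraction as F

# Transition matrix P from the slide's 4-state example (each row sums to 1)
P = [
    [F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(0), F(0), F(3, 4), F(1, 4)],
    [F(0), F(0), F(1, 4), F(3, 4)],
]

def step(pi, P):
    """One application of the law of total probability: (pi P)_y = sum_x pi_x P_{x,y}."""
    n = len(P)
    return [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]

print(step([F(2, 5), F(3, 5), F(0), F(0)], P))    # (0.4, 0.6, 0, 0) is unchanged
print(step([F(0), F(0), F(1, 2), F(1, 2)], P))    # (0, 0, 0.5, 0.5) is unchanged
print(step([F(1, 10), F(9, 10), F(0), F(0)], P))  # (0.1, 0.9, 0, 0) -> (0.35, 0.65, 0, 0)
```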
Stationary distributions
Definition
A stationary distribution of a finite Markov chain with transition matrix P is a probability distribution π such that πP = π.

For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution.

E.g.,

P^20 ≈ [ 0.4  0.6   0    0
         0.4  0.6   0    0
          0    0   0.5  0.5
          0    0   0.5  0.5 ]

The chain could converge to any distribution which is a linear combination of (0.4, 0.6, 0, 0) and (0, 0, 0.5, 0.5). We observe that the original chain P can be broken into two disjoint Markov chains, which have their own stationary distributions. We say that the chain is reducible.
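The claim about P^20 can be verified numerically by repeated matrix multiplication (a sketch, reusing the 4-state example matrix; pure-Python to stay self-contained):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [
    [1/2, 1/2, 0.0, 0.0],
    [1/3, 2/3, 0.0, 0.0],
    [0.0, 0.0, 3/4, 1/4],
    [0.0, 0.0, 1/4, 3/4],
]

Pn = P
for _ in range(19):          # after 19 more multiplications, Pn = P^20
    Pn = matmul(Pn, P)

# Each row rounds to (0.4, 0.6, 0, 0) or (0, 0, 0.5, 0.5)
for row in Pn:
    print([round(x, 4) for x in row])
```

The two blocks mix internally but never communicate, so P^20 has two distinct row patterns: the chain is reducible.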
Irreducibility
Definition
State y is accessible from state x if it is possible for the chain to visit state y when it starts in state x; in other words, P^n(x, y) > 0 for some n. State x communicates with state y if y is accessible from x and x is accessible from y. We say that the Markov chain is irreducible if all pairs of states communicate.

y is accessible from x means that y is connected from x in the transition graph, i.e., there is a directed path from x to y.
x communicates with y means that x and y are strongly connected in the transition graph.
A finite Markov chain is irreducible if and only if its transition graph is strongly connected.
The Markov chain associated with the transition matrix P above is not irreducible.
Aperiodicity
Definition
The period of a state x is the greatest common divisor (gcd) d_x = gcd{n | (P^n)_{x,x} > 0}. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

For example, suppose that the period of state x is d_x = 3. Then, starting from state x, in the chain x, ○, ○, □, ○, ○, □, ..., only the squares can be x.
In the transition graph of a finite Markov chain, (P^n)_{x,x} > 0 is equivalent to x lying on a cycle of length n. The period of a state x is the greatest common divisor of the lengths of the cycles passing through x.
Theorem
1. If the states x and y communicate, then d_x = d_y.
2. We have (P^n)_{x,x} = 0 if n mod d_x ≠ 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 87 / 101
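The definition d_x = gcd{n | (P^n)_{x,x} > 0} can be computed directly by scanning small n. The sketch below uses two illustrative 2-state chains (not from the slides): a deterministic flip, which is periodic, and a lazy chain with self-loops, which is aperiodic.

```python
from math import gcd

def period(P, x, max_n=50):
    """d_x = gcd{ n : (P^n)_{x,x} > 0 }, estimated by scanning n = 1..max_n."""
    n_states = len(P)
    Pn = [row[:] for row in P]          # Pn holds P^n
    d = 0                                # gcd(0, k) == k, so 0 is a neutral start
    for n in range(1, max_n + 1):
        if n > 1:
            Pn = [[sum(Pn[i][k] * P[k][j] for k in range(n_states))
                   for j in range(n_states)] for i in range(n_states)]
        if Pn[x][x] > 0:
            d = gcd(d, n)
    return d

flip = [[0.0, 1.0], [1.0, 0.0]]   # deterministic two-cycle: returns only at even n
lazy = [[0.5, 0.5], [0.5, 0.5]]   # self-loop possible at n = 1: aperiodic
print(period(flip, 0), period(lazy, 0))  # -> 2 1
```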
Markov Chain Monte Carlo Markov Chain and Random Walk
Aperiodicity
Definition
The period of a state x is the greatest common divisor (gcd), such thatdx = gcdn|(Pn)x,x > 0. A state is aperiodic if its period is 1. A Markovchain is aperiodic if all its states are aperiodic.
For example, suppose that the period of state x is dx = 3. Then,starting from state x , chain x ,©,©,,©,©,, · · · , only thesquares are possible to be x .
In the transition graph of a finite Markov chain, (Pn)x,x > 0 isequivalent to that x is on a cycle of length n. Period of a state x isthe greatest common devisor of the lengths of cycles passing x .
Theorem
1. If the states x and y communicate, then dx = dy .2. We have (Pn)x,x = 0 if n mod(dx) 6= 0.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 87 / 101
Existence of the stationary distribution

Theorem

Let X_0, X_1, ... be an irreducible and aperiodic Markov chain with transition matrix P. Then lim_{n→∞} (P^n)_{ij} exists and is independent of i; denote it lim_{n→∞} (P^n)_{ij} = π(j). We also have

lim_{n→∞} P^n =
| π(1) π(2) ... π(j) ... |
| π(1) π(2) ... π(j) ... |
| ...  ...  ... ...  ... |
| π(1) π(2) ... π(j) ... |
| ...  ...  ... ...  ... |    (1)

π(j) = Σ_{i=1}^∞ π(i) P_{ij}, and Σ_{i=1}^∞ π(i) = 1.

π is the unique non-negative solution of the equation πP = π.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 88 / 101
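The theorem can be checked numerically: iterating a small irreducible, aperiodic transition matrix makes every row of P^n converge to the same vector π. This is a minimal sketch; the 3-state matrix below is hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical irreducible, aperiodic 3-state chain (rows sum to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Iterate P^n: in the limit every row equals the stationary vector pi.
Pn = np.linalg.matrix_power(P, 60)
pi = Pn[0]

assert np.allclose(Pn, pi)        # all rows identical, as in Eq. (1)
assert np.allclose(pi @ P, pi)    # pi solves pi P = pi
assert np.isclose(pi.sum(), 1.0)  # pi is a probability distribution
```

Sixty iterations suffice here because the subdominant eigenvalue of P has modulus well below 1, so the rows converge geometrically.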
Detailed Balance Condition

Theorem

Let X_0, X_1, ... be an aperiodic Markov chain with transition matrix P, and let π be a distribution. If the following condition holds,

π(i) P_{ij} = π(j) P_{ji}, for all i, j    (2)

then π is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.

Proof: Σ_{i=1}^∞ π(i) P_{ij} = Σ_{i=1}^∞ π(j) P_{ji} = π(j) Σ_{i=1}^∞ P_{ji} = π(j), i.e., πP = π.

In general, π(i) P_{ij} ≠ π(j) P_{ji}; that is, π may not be the stationary distribution.

The natural question is how to revise the Markov chain so that π becomes its stationary distribution. For example, we may introduce a function α(i, j) such that π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 89 / 101
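The implication "detailed balance implies stationarity" can be verified on a concrete chain. This is a minimal sketch with a hypothetical 3-state matrix constructed so that Eq. (2) holds for π = (0.2, 0.3, 0.5).

```python
import numpy as np

# Hypothetical reversible 3-state chain: entries chosen so that
# pi(i) P_ij = pi(j) P_ji holds for pi = (0.2, 0.3, 0.5).
pi = np.array([0.2, 0.3, 0.5])
P = np.array([
    [0.70, 0.15, 0.15],
    [0.10, 0.70, 0.20],
    [0.06, 0.12, 0.82],
])

# Detailed balance: the flow matrix pi(i) P_ij is symmetric.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)

# Hence pi is stationary: pi P = pi, exactly as the proof argues.
assert np.allclose(pi @ P, pi)
```

The symmetry check is the whole content of Eq. (2); the final assertion reproduces the one-line proof of the theorem.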
Markov Chain Monte Carlo / MCMC Sampling Algorithm

Outline

1 Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution
4 Generating a Random Sample
    Direct Model
    Indirect Model
5 Markov Chain Monte Carlo
    Markov Chain and Random Walk
    MCMC Sampling Algorithm
    Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 90 / 101
Revising the Markov Chain

Choosing a reasonable parameter

How can we choose α(i, j) such that π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i)?

By symmetry, we may simply choose α(i, j) = π(j) P_{ji} and α(j, i) = π(i) P_{ij}.

Therefore, we have π(i) P_{ij} α(i, j) = π(j) P_{ji} α(j, i), where Q_{ij} = P_{ij} α(i, j) and Q_{ji} = P_{ji} α(j, i).

The transition matrix:
    Q_{ij} = P_{ij} α(i, j),                          if j ≠ i;
    Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik} (1 − α(i, k)),   otherwise.

Accept a proposed move with probability α(i, j); otherwise stay at the current state.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 91 / 101
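The construction above can be sketched directly: start from an arbitrary proposal chain P, set α(i, j) = π(j) P_{ji}, and assemble Q. The target π and the proposal matrix below are hypothetical.

```python
import numpy as np

# Hypothetical target distribution and proposal chain.
pi = np.array([0.2, 0.3, 0.5])
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

alpha = pi[None, :] * P.T              # alpha[i, j] = pi(j) P_ji
Q = P * alpha                          # off-diagonal: Q_ij = P_ij alpha(i, j)
np.fill_diagonal(Q, 0.0)
Q += np.diag(1.0 - Q.sum(axis=1))      # Q_ii absorbs the rejected mass

# Q satisfies detailed balance w.r.t. pi, hence pi Q = pi.
flows = pi[:, None] * Q
assert np.allclose(flows, flows.T)
assert np.allclose(pi @ Q, pi)
assert np.allclose(Q.sum(axis=1), 1.0)  # Q is still a stochastic matrix
```

Filling the diagonal with 1 minus the off-diagonal row sum is exactly the slide's Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik}(1 − α(i, k)), since both equal the probability of not moving.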
MCMC Sampling Algorithm

Let P(x_2 | x_1) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to N − 1 do
2:     sample u ~ U[0, 1];
3:     sample x ~ P(x | x^(i));
4:     if u < α(x^(i), x) = π(x) P(x^(i) | x)
5:         then x^(i+1) = x;
6:         else reject x, and x^(i+1) = x^(i);
7:     endif
8: endfor
9: output the last N samples;

Observation

Suppose α(i, j) = 0.1 and α(j, i) = 0.2 satisfy the detailed balance condition, i.e.,

π(i) P_{ij} · 0.1 = π(j) P_{ji} · 0.2.

Such small values of α(i, j) result in a high rejection ratio. We can therefore rescale both sides:

π(i) P_{ij} · 0.5 = π(j) P_{ji} · 1.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 92 / 101
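The basic sampler above can be sketched on a small discrete target. Everything here is a hypothetical illustration: a 3-state target π and a symmetric uniform proposal P(x | x_i) = 1/3, under which the acceptance probability α(x^(i), x) = π(x) P(x^(i) | x) reduces to π(x)/3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state target and uniform proposal.
pi = np.array([0.2, 0.3, 0.5])
P = np.full((3, 3), 1.0 / 3.0)

x = 0
counts = np.zeros(3)
n_steps = 200_000
for _ in range(n_steps):
    y = rng.integers(3)          # propose y ~ P(. | x)
    u = rng.random()
    if u < pi[y] * P[y, x]:      # accept with alpha = pi(y) P(y | x)
        x = y                    # accept the move
    counts[x] += 1               # rejected moves repeat the current state

# The empirical distribution of the chain approaches pi.
assert np.allclose(counts / n_steps, pi, atol=0.02)
```

Note how low the acceptance probabilities are (at most π(x)/3); this is precisely the inefficiency that the rescaling in the Observation, and the Metropolis-Hastings ratio on the next slide, are designed to remove.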
Markov Chain Monte Carlo / Metropolis-Hastings Algorithm and Gibbs Sampling

Outline

1 Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F
2 Order Statistics
3 Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution
4 Generating a Random Sample
    Direct Model
    Indirect Model
5 Markov Chain Monte Carlo
    Markov Chain and Random Walk
    MCMC Sampling Algorithm
    Metropolis-Hastings Algorithm and Gibbs Sampling
6 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 93 / 101
Metropolis-Hastings algorithm

Let P(x_2 | x_1) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to max do
2:     sample u ~ U[0, 1];
3:     sample x ~ P(x | x^(i));
4:     if u < min{1, π(x) P(x^(i) | x) / (π(x^(i)) P(x | x^(i)))}
5:         then x^(i+1) = x;
6:         else reject x, and x^(i+1) = x^(i);
7:     endif
8: endfor
9: output the last N samples;

Observation

If we let

α(i, j) = min{1, π(x) P(x^(i) | x) / (π(x^(i)) P(x | x^(i)))},

we obtain a higher acceptance ratio, which further improves the efficiency of the algorithm.

However, for a high-dimensional P, the Metropolis-Hastings algorithm may still be inefficient because α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 94 / 101
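A minimal sketch of the algorithm above, for an assumed standard normal target with a symmetric random-walk proposal x ~ N(x^(i), 1). Because the proposal is symmetric, the proposal densities cancel and the ratio reduces to min{1, π(x)/π(x^(i))}; working with logs avoids underflow.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # log pi(x) up to an additive constant, for pi = N(0, 1).
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(100_000):
    y = x + rng.normal()                        # propose y ~ N(x, 1)
    log_alpha = log_target(y) - log_target(x)   # log of the M-H ratio
    if np.log(rng.random()) < log_alpha:        # accept with min{1, ratio}
        x = y
    samples.append(x)                           # rejections repeat x

samples = np.array(samples[10_000:])            # discard burn-in

# The retained draws match the N(0, 1) target.
assert abs(samples.mean()) < 0.05
assert abs(samples.var() - 1.0) < 0.05
```

The rejected proposals must be recorded as repeats of the current state; silently dropping them would bias the chain away from π.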
Properties of MCMC

There is a trade-off between the mixing rate and the acceptance ratio, where

Acceptance ratio = E[α(x_i, x)];

Mixing rate = the rate at which the chain moves around the distribution.

We can use multiple transition matrices P_i (i.e., proposal distributions) and apply them in turn.

Observation

For example:

Sampling step: x^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0);

Convergence: x^(t) | x^(0) ~ N(0, 4/3 ≈ 1.33) as t → +∞.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 95 / 101
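The convergence example above can be simulated directly. The stationary variance follows from the fixed point of v = 0.25 v + 1, i.e., v = 4/3. This sketch runs many independent copies of the chain, all started at x^(0) = 0 (the number of copies and steps are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(2)

# 50,000 independent chains x^(t+1) | x^(t) ~ N(0.5 x^(t), 1),
# each started from x^(0) = 0 and run for 50 steps.
x = np.zeros(50_000)
for _ in range(50):
    x = 0.5 * x + rng.normal(size=x.shape)

# The marginal of x^(t) approaches N(0, 4/3).
assert abs(x.mean()) < 0.03
assert abs(x.var() - 4.0 / 3.0) < 0.05
```

After t steps the variance is Σ_{k=0}^{t−1} 0.25^k, which is already within 0.25^50 of 4/3 here, so 50 steps are more than enough.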
Intuition

Example: two-dimensional case

Let P(x, y) be a two-dimensional probability distribution, and consider two points A(x_1, y_1) and B(x_1, y_2). We have

P(x_1, y_1) P(y_2 | x_1) = P(x_1) P(y_1 | x_1) P(y_2 | x_1)    (3)
P(x_1, y_2) P(y_1 | x_1) = P(x_1) P(y_2 | x_1) P(y_1 | x_1)    (4)

That is,

P(x_1, y_1) P(y_2 | x_1) = P(x_1, y_2) P(y_1 | x_1)    (5)

i.e.,

P(A) P(y_2 | x_1) = P(B) P(y_1 | x_1)    (6)
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 96 / 101
Intuition Cont'd

Thus p(y | x_1) can be viewed as the transition probability between any two points whose x-coordinate is x_1, and the transition between these two points satisfies the detailed balance condition, i.e., P(A) P(y_2 | x_1) = P(B) P(y_1 | x_1) holds.

Transition matrix

The transition probabilities between two points A and B are given by T(A → B):

T(A → B) =
    p(y_B | x_1),  if x_A = x_B = x_1;
    p(x_B | y_1),  if y_A = y_B = y_1;
    0,             otherwise.

It is easy to confirm that the detailed balance condition holds, i.e., p(A) T(A → B) = p(B) T(B → A).
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 97 / 101
Multivariate case

Let P(x_i | x_{−i}) = P(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_n). The transition probabilities are given by T(x → x') = P(x'_i | x_{−i}). Then we have:

T(x → x') p(x) = P(x'_i | x_{−i}) P(x_i | x_{−i}) P(x_{−i})
T(x' → x) p(x') = P(x_i | x'_{−i}) P(x'_i | x'_{−i}) P(x'_{−i})

Note that x'_{−i} = x_{−i}. That is,

T(x → x') p(x) = T(x' → x) p(x').    (7)

Therefore, the detailed balance condition also holds.

Gibbs sampling is feasible whenever it is easy to sample from the conditional probability distributions.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 98 / 101
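Eq. (7) can be checked exhaustively on a tiny discrete example. The 2x2 joint table below is hypothetical; the Gibbs move resamples x_1 from p(x_1 | x_2) while x_2 is held fixed.

```python
import numpy as np

# Hypothetical joint distribution p[x1, x2] over {0,1} x {0,1}.
p = np.array([[0.10, 0.25],
              [0.30, 0.35]])            # entries sum to 1

# Full conditional: cond[x1, x2] = p(x1 | x2).
cond = p / p.sum(axis=0, keepdims=True)

# Check detailed balance for the move that resamples x1 given x2:
# p(x) T(x -> x') = p(x') T(x' -> x) for every pair of states.
for x2 in range(2):
    for a in range(2):                  # current x1
        for b in range(2):              # proposed x1'
            lhs = p[a, x2] * cond[b, x2]
            rhs = p[b, x2] * cond[a, x2]
            assert np.isclose(lhs, rhs)
```

Both sides equal p(a, x_2) p(b, x_2) / p(x_2), which is symmetric in a and b; that is exactly the cancellation used in the derivation of Eq. (7).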
Gibbs sampling algorithm (proposal distribution P(x_i | x_{−i}))

0: initialize x_1, ..., x_n;
1: for τ = 0 to max do
2:     sample x_1^(τ+1) ~ P(x_1 | x_2^(τ), x_3^(τ), ..., x_n^(τ));
3:     ...
4:     sample x_j^(τ+1) ~ P(x_j | x_1^(τ+1), ..., x_{j−1}^(τ+1), x_{j+1}^(τ), ..., x_n^(τ));
5:     ...
6:     sample x_n^(τ+1) ~ P(x_n | x_1^(τ+1), x_2^(τ+1), ..., x_{n−1}^(τ+1));
7: output the last N samples;

Gibbs sampling is a type of random walk through parameter space, and can be considered a Metropolis-Hastings algorithm with a special proposal distribution. At each iteration, we draw from conditional posterior probabilities, which means that the proposed move is always accepted. Hence, if we can draw samples from the conditional distributions, Gibbs sampling can be much more efficient than regular Metropolis-Hastings.
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 99 / 101
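The loop above can be sketched for a hypothetical bivariate normal target with zero means, unit variances and correlation ρ = 0.8, whose full conditionals are the textbook ones: x_1 | x_2 ~ N(ρ x_2, 1 − ρ²) and x_2 | x_1 ~ N(ρ x_1, 1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(3)

rho = 0.8
s = np.sqrt(1.0 - rho * rho)            # conditional std dev

x1, x2 = 0.0, 0.0
draws = []
for _ in range(100_000):
    x1 = rho * x2 + s * rng.normal()    # sample x1 | x2 (always accepted)
    x2 = rho * x1 + s * rng.normal()    # sample x2 | x1 (always accepted)
    draws.append((x1, x2))

draws = np.array(draws[10_000:])        # discard burn-in

# The draws reproduce the target's variances and correlation.
assert abs(np.corrcoef(draws.T)[0, 1] - rho) < 0.02
assert abs(draws.var(axis=0) - 1.0).max() < 0.05
```

Note there is no accept/reject step at all: every conditional draw is accepted, which is the α = 1 property claimed on the next slide.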
Properties of Gibbs Sampling

Properties

No need to tune the proposal distribution;

Good trade-off between acceptance and mixing: the acceptance ratio is always 1;

Need to be able to derive the conditional probability distributions.

Acceleration of Gibbs sampling (given p(a, b, c), draw samples of a and c):

Blocked Gibbs:
(1) Draw (a, b) given c;
(2) Draw c given (a, b).

Collapsed Gibbs:
(1) Draw a given c;
(2) Draw c given a.

Marginalize whenever you can.

For R code for MCMC and Gibbs sampling, please refer to https://blog.csdn.net/abcjennifer/article/details/25908495
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 100 / 101
Take-aways

Conclusions

Basic Concepts of Random Samples

Sampling from the Normal Distribution
    Properties of the Sample Mean and Variance
    The Derived Distributions: Student's T and Snedecor's F

Order Statistics

Convergence Concepts
    Convergence in Probability
    Almost Sure Convergence
    Convergence in Distribution

Generating a Random Sample
MING GAO (DaSE@ECNU) Statistical Inference Apr. 14, 2020 101 / 101