Basics of Probability Theory - UZH › ... › Probability-Lecture.pdf · 2015-10-07

Basics of Probability Theory

Stefan Bruder

University of Zurich

September 1, 2015

Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 1 / 160

Page 2: Basics of Probability Theory - UZH › ... › Probability-Lecture.pdf · 2015-10-07 · Overview 1 Probability Space 2 Finite or Countably In nite 3 Probability Measures on (R;B(R))

Textbooks:

Jean Jacod and Phillip Protter: Probability Essentials

Albert N. Shiryaev: Probability

Sidney I. Resnick: A Probability Path

Achim Klenke: Probability Theory - A Comprehensive Course

Marc Paollela: Intermediate Probability - A Computational Approach

Marc Paollela: Fundamental Probability - A Computational Approach

Patrick Billingsley Probability and Measure

Olav Kallenberg: Foundations of Modern Probability.


Overview

1 Probability Space

2 Finite or Countably Infinite Ω

3 Probability Measures on (R,B(R))

4 Random Variables

5 Moments

6 Inequalities

7 Moment Generating Functions

8 Transformations of Random Variables

9 Convergence Concepts

10 Law of Large Numbers

11 Central Limit Theorem

12 Delta Method


Probability Space


Definition (Probability space)

A probability space is a triple (Ω,B,P) where

Ω is the sample space

B is the σ-algebra on Ω

P is a probability measure; that is, P is a set function with domain B and range [0, 1] such that

1 P(A) ≥ 0 ∀A ∈ B

2 P is σ-additive: if An ∈ B are pairwise disjoint events (i.e. Al ∩ Ak = ∅ for l ≠ k), then

P(∪∞n=1 An) = ∑∞n=1 P(An)

3 P(Ω) = 1.
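These axioms can be checked mechanically on a small finite example. A minimal sketch in Python, assuming a fair-die sample space (an illustrative choice, not from the slides):

```python
# Atom weights of a probability measure on a small finite sample space
# (a fair die is an illustrative choice, not taken from the slides).
omega = {1, 2, 3, 4, 5, 6}
p_atom = {w: 1 / 6 for w in omega}

def P(A):
    """Set function with domain 2^Omega and range [0, 1]."""
    return sum(p_atom[w] for w in A)

# Axiom 1: non-negativity, e.g. on the event {1, 2}
assert P({1, 2}) >= 0
# Axiom 2 (additivity, here for two disjoint events):
assert abs(P({1, 2} | {5}) - (P({1, 2}) + P({5}))) < 1e-12
# Axiom 3: P(Omega) = 1
assert abs(P(omega) - 1) < 1e-12
```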


The sample space is the set of all possible outcomes. Examples: two successive tosses of a coin, Ω = {hh, tt, ht, th}; the lifetime of a light-bulb, Ω = R+; a toss of two dice, Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

An event is a subset of Ω with the following properties:

Probability is defined for the subset.

It is observable whether the event has occurred or not after the experiment has been completed.

If A, B are events, then the contrary event is interpreted as the complement set Ac, the event "A or B" is interpreted as the union A ∪ B, the event "A and B" is interpreted as the intersection A ∩ B, the sure event is Ω and the impossible event is ∅.

The family of all events is denoted by B. B should have the property that if A, B ∈ B, then Ac ∈ B, A ∩ B ∈ B, A ∪ B ∈ B, Ω ∈ B and ∅ ∈ B.


Definition (σ-algebra)

A σ-algebra on Ω is defined as a nonempty collection B of subsets of Ω such that

1 Ω, ∅ ∈ B.

2 If A ∈ B then Ac ∈ B.

3 If A1,A2, . . . ∈ B then ∪∞i=1Ai ∈ B.

A σ-algebra is closed under complements (2) and countable unions (3).

By De Morgan’s law, a σ-algebra is also closed under countableintersections.

The smallest σ-algebra is {∅, Ω} and the largest σ-algebra is the power set 2Ω.

Intuitively one would always choose B = 2Ω, but it turns out that, depending on Ω, the power set may be too big. Example: Ω = R.


Definition

If C ⊂ 2Ω, the σ-algebra generated by C, written σ(C), is the smallest σ-algebra containing C.

Definition

Suppose Ω = R. Let C = {(a, b] : −∞ ≤ a < b < ∞}. Then the Borel σ-algebra on R is defined by

B(R) = σ(C)

The elements of B(R) are called Borel sets.

There are many equivalent ways to generate B(R). The Borel σ-algebra can be generated with any kind of interval: open, closed, semi-open, finite, and semi-infinite.

Generalization for Rd: let C = {∏di=1 (ai, bi] : −∞ ≤ ai < bi < ∞}; then B(Rd) = σ(C).


Example

Let Ω = {1, 2, 3, 4, 5, 6} and C = {{2, 4}, {6}}. Then,

σ(C) = {∅, {2, 4}, {6}, {1, 3, 5, 6}, {1, 2, 3, 4, 5}, {2, 4, 6}, {1, 3, 5}, Ω}

We can check the three conditions

Ω, ∅ ∈ σ(C) ✓

If A ∈ σ(C) then Ac ∈ σ(C) ✓

If A1, A2, . . . ∈ σ(C) then ∪∞i=1 Ai ∈ σ(C) ✓
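On a finite Ω, σ(C) can be computed by brute force: close C ∪ {∅, Ω} under complements and pairwise unions until nothing changes (closure under intersections then follows by De Morgan's law). A sketch for the example above, with a hypothetical helper `generate_sigma_algebra`:

```python
def generate_sigma_algebra(omega, C):
    """Smallest family containing C that is closed under complement and
    (finite) union; on a finite omega this is exactly sigma(C)."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(c) for c in C}
    changed = True
    while changed:
        changed = False
        for A in list(family):          # close under complements
            if omega - A not in family:
                family.add(omega - A)
                changed = True
        for A in list(family):          # close under pairwise unions
            for B in list(family):
                if A | B not in family:
                    family.add(A | B)
                    changed = True
    return family

sigma = generate_sigma_algebra({1, 2, 3, 4, 5, 6}, [{2, 4}, {6}])
assert len(sigma) == 8                          # the eight sets listed above
assert frozenset({1, 3, 5}) in sigma
assert frozenset({2, 4, 6}) in sigma
```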

Exercise

Show that a σ-algebra is closed under countable intersections, i.e. if A1, A2, . . . ∈ B ⇒ ∩∞i=1 Ai ∈ B.


Exercise

Define the following two generators of the Borel σ-algebra on R:

C() = {(a, b) : −∞ ≤ a < b ≤ ∞}

C(] = {(a, b] : −∞ ≤ a < b < ∞}.

1 Show that σ(C()) ⊂ σ(C(]). Use the fact that

(a, b) = ∪∞n=1 (a, b − 1/n].

2 Show that σ(C(]) ⊂ σ(C()). Use the fact that

(a, b] = ∩∞n=1 (a, b + 1/n).


The definition of a probability measure already has several important implications:

Proposition

1 P(Ac) = 1 − P(A)

2 P(∅) = 0

3 For A, B ∈ B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

4 For A, B ∈ B, if A ⊂ B then P(A) ≤ P(B)

5 For An ∈ B, P(∪∞n=1 An) ≤ ∑∞n=1 P(An)

6 For An ∈ B:

If An ↑ A, then limn→∞ P(An) = P(A)

If An ↓ A, then limn→∞ P(An) = P(A)

(4) is the monotonicity property, (5) is the σ-subadditivity property and (6) is the continuity property of a probability measure.

Proof of (1), (3), and (6a).


Proof of (1).

Ω = A ∪ Ac. Then, by the σ-additivity of P and A ∩ Ac = ∅,

P(Ω) = P(A ∪ Ac) (1)

= P(A) + P(Ac) (2)

Since P(Ω) = 1, it follows that P(Ac) = 1 − P(A).

Proof of (3).

First note that A = A ∩ Ω = A ∩ (B ∪ Bc) = (A ∩ B) ∪ (A ∩ Bc). Thus, P(A) = P(A ∩ B) + P(A ∩ Bc) because (A ∩ B) and (A ∩ Bc) are disjoint; analogously, P(B) = P(A ∩ B) + P(Ac ∩ B).

P(A ∪ B) = P((A ∩ Bc) ∪ (A ∩ B) ∪ (Ac ∩ B)) (3)

= P(A ∩ Bc) + P(A ∩ B) + P(Ac ∩ B) (4)

= P(A) − P(A ∩ B) + P(A ∩ B) + P(B) − P(A ∩ B) (5)

= P(A) + P(B) − P(A ∩ B) (6)


Proof of (6).

Let An be an increasing sequence of events, i.e. An ⊂ An+1 ⊂ . . .; then

limn→∞ An = ∪∞n=1 An := A.

Define B1 = A1, B2 = A2 \ A1, . . . , Bn = An \ An−1. Then (Bn) is a disjoint sequence of events with

An = ∪ni=1 Bi and ∪∞i=1 Bi = ∪∞i=1 Ai = A

P(A) = P(limn→∞ An) (7)

= P(∪∞i=1 Ai) (8)

= P(∪∞i=1 Bi) (9)


Proof of (6) cont’d.

= ∑∞i=1 P(Bi) (10)

= limn→∞ ∑ni=1 P(Bi) (11)

= limn→∞ P(∪ni=1 Bi) (12)

= limn→∞ P(An) (13)

The proof of limn→∞ P(An) = P(A) if An ↓ A is conceptually similar.


Definition (Conditional Probability)

Let A,B ∈ B and P(B) > 0. The conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

Proposition

Given P(B) > 0, the conditional probability is a probability measure on (Ω, B).

Proof.

Define Q(A) = P(A|B) ∀A ∈ B (with B fixed).

Q(Ω) = P(Ω|B) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1


Proof.

Let (An)n≥1 be a sequence of pairwise disjoint events, then

Q(∪∞n=1 An) = P((∪∞n=1 An) ∩ B) / P(B) (14)

= P(∪∞n=1 (An ∩ B)) / P(B) (15)

= ∑∞n=1 P(An ∩ B) / P(B) (16)

= ∑∞n=1 P(An|B) (17)

= ∑∞n=1 Q(An) (18)


Definition (Independence of Events)

Two events A and B are independent if P(A ∩ B) = P(A)P(B)

A (possibly infinite) collection of events (Ai)i∈I is an independent collection if for every finite subset J ⊂ I it holds that P(∩i∈J Ai) = ∏i∈J P(Ai)

Independence of events (Ai)i∈I implies pairwise independence, but the converse is not true! Example: Ω = {1, 2, 3, 4}, P({i}) = 1/4 ∀i ∈ {1, 2, 3, 4}, A = {1, 2}, B = {1, 3}, C = {2, 3}. Then A, B, C are pairwise independent but not independent.

For P(B) > 0, A and B are independent if and only if P(A|B) = P(A).
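The counterexample above is easy to verify directly; a quick check in Python using exact rational arithmetic:

```python
from fractions import Fraction

omega = {1, 2, 3, 4}

def P(E):
    """Uniform probability measure on omega, as exact fractions."""
    return Fraction(len(E), len(omega))

A, B, C = {1, 2}, {1, 3}, {2, 3}

# Pairwise independence holds ...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ... but the collection is not independent: A ∩ B ∩ C = ∅
assert P(A & B & C) != P(A) * P(B) * P(C)  # 0 != 1/8
```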

Definition

A collection of events (En) is called a partition of Ω if En ∈ B, En ∩ Em = ∅ for m ≠ n, P(En) > 0 and ∪nEn = Ω.


Theorem (Law of Total Probability)

Let (En)n≥1 be a finite or countable partition of Ω. Then if A ∈ B,

P(A) = ∑n P(A|En)P(En)

Theorem (Bayes’ Theorem)

Let (En)n≥1 be a finite or countable partition of Ω, and suppose P(A) > 0. Then

P(En|A) = P(A|En)P(En) / ∑m P(A|Em)P(Em)

Famous applications of Bayes’ theorem:

HIV test

Monty Hall problem

. . .
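As an illustration of Bayes' theorem with the two-set partition {E, Ec}, consider a diagnostic-test calculation; the prevalence, sensitivity and false-positive rate below are made-up numbers, not taken from the slides:

```python
# Partition: E = "infected", E^c = "not infected"; A = "test positive".
p_E = 0.001            # P(E): assumed prevalence (illustrative)
p_pos_given_E = 0.99   # P(A|E): assumed sensitivity (illustrative)
p_pos_given_Ec = 0.05  # P(A|E^c): assumed false-positive rate (illustrative)

# Law of total probability: P(A) = sum_n P(A|E_n) P(E_n)
p_pos = p_pos_given_E * p_E + p_pos_given_Ec * (1 - p_E)

# Bayes' theorem: P(E|A) = P(A|E) P(E) / P(A)
p_E_given_pos = p_pos_given_E * p_E / p_pos  # about 0.019

# Despite the accurate test, a positive result is most likely a false alarm,
# because the low prior P(E) dominates the posterior.
```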


Proof of the Law of Total Probability.

Note that A = A ∩ Ω = A ∩ (∪nEn) = ∪n(A ∩ En), where ((A ∩ En))n≥1 is a family of pairwise disjoint sets. Thus,

P(A) = P(∪n(A ∩ En)) = ∑n P(A ∩ En) = ∑n P(A|En)P(En)

The proof of Bayes’ Theorem is straightforward once we have proven thelaw of total probability:

Proof of Bayes’ Theorem.

P(En|A) = P(En ∩ A) / P(A) = P(A|En)P(En) / ∑m P(A|Em)P(Em)


Proposition

Let P1, P2, . . . be probability measures on the same measurable space (Ω, B). Define

P(A) = ∑∞i=1 λiPi(A) ∀A ∈ B, (19)

where λi ≥ 0 and ∑∞i=1 λi = 1. Then P, as defined in (19), is a probability measure on (Ω, B).
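For a finite Ω the mixture construction in (19) is easy to check numerically; a sketch with two illustrative measures on a two-point sample space:

```python
# Two probability measures on Omega = {"h", "t"}, given by atom weights
# (the weights and mixing coefficients are illustrative choices).
P1 = {"h": 0.5, "t": 0.5}
P2 = {"h": 0.9, "t": 0.1}
lam = [0.3, 0.7]  # lambda_i >= 0, summing to 1

# Mixture P(A) = sum_i lambda_i P_i(A), here evaluated on atoms
P_mix = {w: lam[0] * P1[w] + lam[1] * P2[w] for w in P1}

assert all(p >= 0 for p in P_mix.values())       # non-negativity
assert abs(sum(P_mix.values()) - 1) < 1e-12      # P(Omega) = 1
```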

Exercise

Show that P(A ∩ Bc) = P(A)− P(A ∩ B).

Exercise

Suppose that P(C) > 0. Show that P(A ∪ B |C) = P(A |C) + P(B |C) − P(A ∩ B |C).


Exercise

Suppose that P(C) > 0 and A1, . . . , An are all pairwise disjoint. Show that P(∪ni=1 Ai |C) = ∑ni=1 P(Ai |C).

Exercise

Show that if events A and B are independent, then so are A and Bc, and Ac and Bc.

Exercise

Show that the mixture probability measure defined in (19) is a probability measure on (Ω, B).

Finite or Countably Infinite Ω

Assume that Ω is finite or countably infinite, for example Ω = {1, 2, 3, 4, 5, 6} or Ω = N.

In this case it is possible to take the power set as the σ-algebra, i.e. B = 2Ω.

With a finite or countably infinite Ω the construction of a probability measure is straightforward. Intuitively, one simply needs to determine the probability of every element of Ω, that is P({ω}) for all ω ∈ Ω.

If Ω is not countable (for example Ω = [0, 1]), then B = 2Ω is no longer possible. One has to work with a σ-algebra that is "smaller" than 2Ω.

For the general case the construction of a probability measure is much more demanding and requires measure theory.


Theorem

1 A probability measure on the finite or countable set Ω is characterized by its values on the atoms {ω}:

P({ω}) = pω, ω ∈ Ω.

2 Let (pω)ω∈Ω be a family of real numbers indexed by the finite or countable set Ω. Then there exists a unique probability measure P such that P({ω}) = pω if and only if pω ≥ 0 and ∑ω∈Ω pω = 1.

Proof of (1).

Let A ∈ 2Ω; then A = ∪ω∈A {ω} is a finite or countable union of pairwise disjoint singletons. Then,

P(A) = P(∪ω∈A {ω}) = ∑ω∈A P({ω}) = ∑ω∈A pω


Proof of (2).

⇒: If P({ω}) = pω then pω ≥ 0 by definition. Additionally we have

1 = P(Ω) = P(∪ω∈Ω {ω}) = ∑ω∈Ω P({ω}) = ∑ω∈Ω pω.

⇐: If (pω)ω∈Ω satisfies pω ≥ 0 and ∑ω∈Ω pω = 1, define P by P(A) = ∑ω∈A pω. Then,

P(∅) = 0 and P(Ω) = ∑ω∈Ω pω = 1.

Countable additivity is trivial when Ω is finite; when Ω is countable, it holds for pairwise disjoint sets Ai that ∑i∈I ∑ω∈Ai pω = ∑ω∈∪iAi pω.


Example

Let Ω be a finite set and let B = 2Ω. The uniform probability measure is

P(A) = card A / card Ω ∀A ∈ 2Ω.

P(A) ≥ 0 ∀A ∈ 2Ω follows directly from the definition of cardinality

P(Ω) = card Ω / card Ω = 1

Let Ai ∈ 2Ω be disjoint events; then

P(∪∞i=1 Ai) = card(∪∞i=1 Ai) / card Ω = (∑∞i=1 card Ai) / card Ω = ∑∞i=1 (card Ai / card Ω) = ∑∞i=1 P(Ai)

On a given finite set Ω the uniform probability measure is unique.


Example

Let a probability space be given by (N, 2N, P). A probability measure P is defined by its atomistic values (parametrized with λ > 0)

pn = e^(−λ) λ^n / n!, n ∈ N.

pn ≥ 0.

∑n pn = e^(−λ) ∑∞n=0 λ^n / n! = e^(−λ) e^λ = 1.

Example

Let a probability space be given by (N+, 2N+, P). A probability measure P is defined by its atomistic values (parametrized with α ∈ [0, 1))

pn = (1 − α) α^(n−1), n ∈ N+.
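Both families of atom weights (Poisson and geometric) can be checked numerically by truncating the infinite sums at a large N; the parameter values below are illustrative:

```python
from math import exp, factorial

lam, alpha, N = 2.0, 0.75, 100  # illustrative parameter values

# Poisson atoms on N = {0, 1, 2, ...}
poisson = [exp(-lam) * lam**n / factorial(n) for n in range(N)]
# Geometric atoms on N+ = {1, 2, 3, ...}
geometric = [(1 - alpha) * alpha**(n - 1) for n in range(1, N)]

assert all(p >= 0 for p in poisson + geometric)
assert abs(sum(poisson) - 1) < 1e-9    # truncation error is negligible here
assert abs(sum(geometric) - 1) < 1e-9
```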


Exercise

Let a probability space be given by (N+, 2N+, P). Let two probability measures be given by

P1({ω}) = 1 / ((e − 1) ω!) and P2({ω}) = (1/3) (3/4)^ω

Show that both uniquely define probability measures on (N+, 2N+).

Show that both probability measures are σ-additive, with P(N+) = 1 and P(A) ≥ 0 ∀A ∈ 2N+.

Exercise

Show that there is no uniform probability measure on (N, 2N), that is, there is no probability measure P : 2N → [0, 1] such that

P({i}) = P({j}) ∀i, j ∈ N.

Probability Measures on (R,B(R))

Assume that Ω = R and B is the Borel σ-algebra of R. Thus, the probability space is given by (R, B(R), P).

Probability measures on (R,B(R)) are very important for the analysisof random variables.

Definition (Distribution Function)

The distribution function F : R → [0, 1] of P is the function

F(x) = P((−∞, x]), ∀x ∈ R

Proposition

The distribution function F(x) has the following properties:

1 F is monotone non-decreasing

2 F is right-continuous

3 limx→−∞ F(x) = 0 and limx→+∞ F(x) = 1.


Proof of (1).

For x < y it holds that

(−∞, x] ⊂ (−∞, y].

Using the monotonicity property of a probability measure,

F(x) = P((−∞, x]) ≤ P((−∞, y]) = F(y).

Thus, F(x) ≤ F(y) for x < y.


Right-continuity means that if xn ↓ x then limxn↓x F(xn) = F(x).

Proof of (2).

Take a decreasing sequence xn such that xn ↓ x as n → ∞, for example xn = x + 1/n. The sequence of events ((−∞, xn]; n ≥ 1) is then a decreasing sequence. Then,

limxn↓x F(xn) = limn→∞ F(xn) (20)

= limn→∞ P((−∞, xn]) (21)

= P(limn→∞ (−∞, xn]) (22)

= P(∩∞n=1 (−∞, xn]) (23)

= P((−∞, x]) (24)

= F(x) (25)


Proof of (3).

Take an increasing sequence xn such that xn ↑ ∞ as n → ∞. The sequence of events ((−∞, xn]; n ≥ 1) is an increasing sequence of events. Then,

F(∞) = limx→∞ P((−∞, x]) (26)

= limn→∞ P((−∞, xn]) (27)

= P(limn→∞ (−∞, xn]) (28)

= P(∪∞n=1 (−∞, xn]) (29)

= P(R) (30)

= P(Ω) (31)

= 1 (32)


Proof of (3) cont’d.

Take a decreasing sequence xn such that xn ↓ −∞ as n → ∞. The sequence of events ((−∞, xn]; n ≥ 1) is a decreasing sequence of events. Then,

F(−∞) = limx→−∞ P((−∞, x]) (33)

= limn→∞ P((−∞, xn]) (34)

= P(limn→∞ (−∞, xn]) (35)

= P(∩∞n=1 (−∞, xn]) (36)

= P(∅) (37)

= 0 (38)


Starting from the probability measure P we get its distribution function by the definition of F. Thus, P = Q ⇒ FP = FQ.

It can be shown that the converse is also true: FP = FQ ⇒ P = Q. This implies that the complete probability measure P is known if we know its distribution function F, and therefore the probability of any given set A ∈ B(R) can be determined from F.

It can also be shown that any function F : R → [0, 1] which is monotonically non-decreasing, right-continuous, with limx→−∞ F(x) = 0 and limx→+∞ F(x) = 1, is the distribution function of a uniquely determined probability measure on (R, B(R)).

Example: F(x) =

0, if x < 0

(1/8) x^3, if x ∈ [0, 2)

1, if x ≥ 2
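A sketch of this example in Python; it also uses the interval formula P((x, y]) = F(y) − F(x), which is derived later in the deck:

```python
def F(x):
    """The example distribution function: 0 for x < 0, x^3/8 on [0, 2), 1 for x >= 2."""
    if x < 0:
        return 0.0
    if x < 2:
        return x**3 / 8
    return 1.0

# Limit properties are attained here for finite x
assert F(-5) == 0.0 and F(2) == 1.0

# Interval probabilities via P((x, y]) = F(y) - F(x)
assert abs(F(1) - 1 / 8) < 1e-12            # P((-inf, 1])
assert abs((F(2) - F(1)) - 7 / 8) < 1e-12   # P((1, 2])
```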


Example (Dirac probability measure)

Let c ∈ R. A point mass on R satisfies

P(A) =

1, if c ∈ A

0, otherwise

The distribution function of P is given by

FP(x) = P((−∞, x]) =

0, if x < c

1, if x ≥ c

Indicator function: 1A : X → {0, 1}

Previous example: FP(x) = 1[c,∞)(x)


Example (Lebesgue probability measure)

If a, b ∈ R, define m((a, b]) = b − a (the Lebesgue measure on (R, B(R))). Further define (for fixed a, b)

ma,b(A) = m(A ∩ (a, b]) / (b − a) ∀A ∈ B(R).

Then ma,b is a probability measure on (R, B(R)) and its distribution function is given by

Fa,b(x) = ma,b((−∞, x]) =

0, if x < a

(x − a)/(b − a), if a ≤ x < b

1, if b ≤ x.


The distribution function gives the probability of intervals of the form (−∞, x] for all x ∈ R.

How is the probability of intervals of the form (x, y], [x, y], [x, y), (x, y) or {x} for x < y determined?

Proposition

Let F be the distribution function of P on R and let F(x−) denote the left limit of F at x. Then for all x < y,

1 P((x, y]) = F(y) − F(x)

2 P([x, y]) = F(y) − F(x−)

3 P([x, y)) = F(y−) − F(x−)

4 P((x, y)) = F(y−) − F(x)

5 P({x}) = F(x) − F(x−)

Next, we will prove (1) and (2); (3), (4) and (5) are proven in a similar way.


Proof of (1).

P((x, y]) = P(((−∞, x] ∪ (y, +∞))c) (39)

= 1 − P((−∞, x] ∪ (y, +∞)) (40)

= 1 − [P((−∞, x]) + P((y, +∞))] (41)

= 1 − [F(x) + (1 − P((−∞, y]))] (42)

= F(y) − F(x) (43)

Proof of (2).

Define an increasing sequence xn such that xn ↑ x, for example xn = x − 1/n. Then, using (1) yields

P((xn, y]) = F(y) − F(xn).


Proof of (2) cont’d.

First, consider the lhs:

limn→∞ P((xn, y]) = P(limn→∞ (xn, y]) (44)

= P(∩∞n=1 (xn, y]) (45)

= P([x, y]) (46)

The rhs converges to F(y) − F(x−) by the definition of the left limit of F. Thus, P([x, y]) = F(y) − F(x−).

Proposition

Let P be a probability measure on (R, B(R)) and let F be the corresponding distribution function. Then,

P({x}) = 0 ∀x ∈ R ⇔ F is a continuous function.


Proof.

⇒: Suppose P({x}) = 0 ∀x ∈ R. P({x}) = F(x) − F(x−) = 0 implies that F(x) = F(x−), which means that F is left-continuous at x. Since F is a distribution function it is right-continuous, i.e. F(x) = F(x+). Thus, F(x) = F(x−) = F(x+) ∀x ∈ R.

⇐: Suppose F is continuous on R. This implies that F(x) = F(x−) = F(x+) ∀x ∈ R. Thus, P({x}) = F(x) − F(x−) = F(x) − F(x) = 0.

Exercise

Let c, d ∈ R (c < d) and let εc and εd denote point masses at c and d, respectively. Define

P = (1/3) εc + (2/3) εd

Find the distribution function of P.


Exercise

Suppose a distribution function is given by

F(x) = (1/4) 1[0,∞)(x) + (1/2) 1[1,∞)(x) + (1/4) 1[2,∞)(x).

Let P be given by P((−∞, x]) = F(x).

Find the probabilities of the following events:

A = (−1/2, 1/2)

B = (−1/2, 3/2)

C = (−2/3, 5/2)

D = [0, 2)

E = (3, ∞).


Exercise

Suppose a distribution function is given by

F(x) = ∑∞i=1 (1/2^i) 1[1/i,∞)(x).

Let P be given by P((−∞, x]) = F(x).

Find the probabilities of the following events:

A = [1, ∞)

B = [1/10, ∞)

C = {0}

D = [0, 1/2)

E = (−∞, 0)

F = (0, ∞).


Definition (Discrete probability measures)

A discrete probability measure is a probability measure for which thecorresponding distribution function is piecewise constant.

Discrete probability measures are countable convex combinations of point masses.

The measure is concentrated on an at most countable set.

The distribution function increases only with "jumps": ∆F(xk) = F(xk) − F(xk−) = P({xk}).

Example

Let P = (1/9) ε1 + (8/9) ε0. Then the distribution function is given by

F(x) = (8/9) 1[0,∞)(x) + (1/9) 1[1,∞)(x)
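The jump sizes of this F can be recovered numerically as F(xk) − F(xk−); a sketch in which a small ε stands in for the left limit:

```python
def F(x):
    """Distribution function of P = (1/9) eps_1 + (8/9) eps_0."""
    return (8 / 9) * (x >= 0) + (1 / 9) * (x >= 1)

eps = 1e-9  # F(x - eps) approximates the left limit F(x-)
jump_at_0 = F(0) - F(0 - eps)
jump_at_1 = F(1) - F(1 - eps)

assert abs(jump_at_0 - 8 / 9) < 1e-12  # P({0}) = 8/9
assert abs(jump_at_1 - 1 / 9) < 1e-12  # P({1}) = 1/9
assert abs(F(2) - 1) < 1e-9            # F reaches 1 past the last jump
```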


Definition (Absolutely continuous probability measures)

An absolutely continuous measure is a measure for which the corresponding distribution function is of the following form:

F(x) = P((−∞, x]) = ∫x−∞ f(t) dt,

where f(t) is a nonnegative function.

The function f : R → R+ is called the density of the distribution.

Generalization: integral in the sense of Lebesgue instead of Riemann.

Not every distribution function admits a density. Example: point mass.

Every non-negative function f : R → R+ that is Riemann (Lebesgue) integrable and such that ∫∞−∞ f(x) dx = 1 defines a distribution function via the above definition.


Example

Let a probability space be given by (R, B(R), P). An absolutely continuous probability measure is characterized by the following distribution function (λ, k ∈ R++):

FP(x) = P((−∞, x]) = ∫x−∞ (k/λ) (t/λ)^(k−1) e^(−(t/λ)^k) 1[0,∞)(t) dt ∀x ∈ R.

Let B = (1, 10] ∈ B(R). P(B) = P((1, 10]) = FP(10) − FP(1) = 0.367834

Example

Let a probability space be given by (R, B(R), P). An absolutely continuous probability measure is characterized by the following distribution function:

FP(x) = P((−∞, x]) = ∫x−∞ (1/√(2π)) e^(−t²/2) dt ∀x ∈ R

Let B = (1, 10] ∈ B(R). P(B) = P((1, 10]) = FP(10) − FP(1) = 0.158655
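The second example is the standard normal distribution function, which has no closed form but can be written via the error function; this lets us check the quoted value with only the standard library:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# P((1, 10]) = F_P(10) - F_P(1)
p = Phi(10) - Phi(1)
assert abs(p - 0.158655) < 1e-6  # matches the value quoted on the slide
```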


More dimensions: probability space (Rd, B(Rd), P).

Definition

The distribution function F : Rd → [0, 1] of a probability measure on (Rd, B(Rd)) is the function

F(x1, . . . , xd) = P(∏di=1 (−∞, xi]), ∀x = (x1, . . . , xd)′ ∈ Rd

Absolutely continuous measure:

F(x1, . . . , xd) = ∫x1−∞ · · · ∫xd−∞ f(t1, . . . , td) dt1 · · · dtd,

where f(t1, . . . , td) is a non-negative function that integrates to one.

Define ∆aibi F(x1, . . . , xd) = F(x1, . . . , xi−1, bi, xi+1, . . .) − F(x1, . . . , xi−1, ai, xi+1, . . .).

P(∏di=1 (ai, bi]) = ∆a1b1 · · · ∆adbd F(x1, . . . , xd).


Example

An absolutely continuous probability measure on (R², B(R²)) is given by the following distribution function

F(x1, x2) = P(∏_{i=1}^{2} (−∞, xi]) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} 1_{[0,1]}(t1) 1_{[0,1]}(t2) dt1 dt2.

The integral can be simplified to the following expression:

F(x1, x2) =
  0, if x1 ∧ x2 < 0
  x1 x2, if x1, x2 ∈ [0, 1)
  x1, if x1 ∈ [0, 1) ∧ x2 ≥ 1
  x2, if x2 ∈ [0, 1) ∧ x1 ≥ 1
  1, if x1 ∧ x2 ≥ 1,

where x1 ∧ x2 = min{x1, x2}. Let B = (0.25, 0.5] × (0.25, 0.5]. Then

P(B) = F(0.5, 0.5) − F(0.5, 0.25) − F(0.25, 0.5) + F(0.25, 0.25) = 1/16.
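The inclusion–exclusion step P((a1, b1] × (a2, b2]) = F(b1, b2) − F(b1, a2) − F(a1, b2) + F(a1, a2) can be sketched for this uniform-on-[0, 1]² example:

```python
def F(x1, x2):
    """CDF of the uniform distribution on the unit square."""
    u = min(max(x1, 0.0), 1.0)  # clamp each coordinate to [0, 1]
    v = min(max(x2, 0.0), 1.0)
    return u * v

def rect_prob(a1, b1, a2, b2):
    """P((a1, b1] x (a2, b2]) by inclusion-exclusion on F."""
    return F(b1, b2) - F(b1, a2) - F(a1, b2) + F(a1, a2)

print(rect_prob(0.25, 0.5, 0.25, 0.5))  # 0.0625 = 1/16
```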


Exercise

Derive an expression for the distribution function of a uniform distribution on [a, b] ⊂ R.

Exercise

Suppose P is an absolutely continuous measure on (R, B(R)) with density

fP(x) = (1/(2σ)) exp(−|x − µ|/σ)  ∀x ∈ R, µ ∈ R, σ ∈ R++.

Show that ∫_{−∞}^{∞} fP(x) dx = 1.
Derive an expression for FP(x).
Show that FP(x) has the properties of a distribution function.


Random Variables


Definition

Let X : Ω → R be a function. The inverse image of B ⊂ R is the set

X−1(B) = {ω ∈ Ω : X(ω) ∈ B} ⊂ Ω.

Let (Ω, A, P) be a probability space.

Definition

A random variable on (Ω, A, P) is a function X : Ω → R such that

X−1(B) ∈ A  ∀B ∈ B(R).

The measurability condition states that the inverse image is a measurable subset of Ω, i.e. X−1(B) ∈ A. This is essential since probabilities are defined only on A.

A random vector is a function X : Ω → Rd such that each component function Xi : Ω → R is a random variable.


Example

The simplest example of a random variable is the indicator random variable 1A : Ω → R of a set A ∈ A, defined by

1A(ω) = 1 if ω ∈ A, and 0 if ω /∈ A.

Let us check the measurability condition for every B ∈ B(R):

1A−1(B) =
  ∅, if 0 /∈ B ∧ 1 /∈ B
  Ac, if 0 ∈ B ∧ 1 /∈ B
  A, if 0 /∈ B ∧ 1 ∈ B
  Ω, if 0 ∈ B ∧ 1 ∈ B.

Since A ∈ A implies Ac ∈ A, and ∅, Ω ∈ A, the measurability condition is satisfied.


The definition of a random variable states that a function X : Ω → R is a random variable if X is A-measurable, i.e. X−1(B) ∈ A ∀B ∈ B(R).

Given this definition, it seems that in order to establish that a function X : Ω → R is a random variable it is necessary to check the measurability condition for all B ∈ B(R).

Fortunately, there is a simplified criterion.

Proposition

A function X : Ω → R is a random variable iff

X−1((−∞, t]) ∈ A  ∀t ∈ R.


Example

Let (Ω, A, P) be a probability space. Define X : Ω → R by

X(ω) = c  ∀ω ∈ Ω, for some c ∈ R.

Check the measurability condition: for every B ∈ B(R),

X−1(B) = Ω if c ∈ B, and ∅ if c /∈ B.

Check the simplified criterion: for every t ∈ R,

X−1((−∞, t]) = ∅ if t < c, and Ω if t ≥ c.

∅ ∈ A and Ω ∈ A ⇒ X is a random variable.


Let X be a random variable defined on (Ω, A, P) with values in (R, B(R)).

Definition

The distribution of X, PX : B(R) → [0, 1], is defined by

PX(B) = P(X−1(B)) = P({ω : X(ω) ∈ B})  ∀B ∈ B(R).

Definition

The distribution function of X, FX : R → [0, 1], is defined by

FX(t) = PX((−∞, t]) = P(X−1((−∞, t]))  ∀t ∈ R.

Notation: Write P(X ≤ t) instead of P(X−1((−∞, t])) = P({ω ∈ Ω : X(ω) ≤ t}).

FX(t) has the following properties: monotonically non-decreasing, right-continuous, lim_{t→−∞} FX(t) = 0, and lim_{t→∞} FX(t) = 1.


Proposition

PX is a probability measure on (R,B(R)).

Proof.

∀B ∈ B(R), PX(B) = P(X−1(B)) ≥ 0, since X−1(B) ∈ A.

PX(R) = P(X−1(R)) = P(Ω) = 1, since X : Ω → R.

Let Bn ∈ B(R) be pairwise disjoint. Then the sets X−1(Bn) are pairwise disjoint as well, and

PX(∪_{n=1}^{∞} Bn) = P(X−1(∪_{n=1}^{∞} Bn))  (47)
= P(∪_{n=1}^{∞} X−1(Bn))  (48)
= Σ_{n=1}^{∞} P(X−1(Bn))  (49)
= Σ_{n=1}^{∞} PX(Bn)  (50)


Example

Consider the experiment of throwing two dice. The sample space is given by Ω = {(i, j) : 1 ≤ i, j ≤ 6}. Define X : Ω → {2, 3, 4, . . . , 12} by

X((i, j)) = i + j  ∀(i, j) ∈ Ω.

Then, for example,

{X = 4} = X−1({4}) = {(1, 3), (3, 1), (2, 2)} ⊂ Ω.

The induced probability measure is given by

PX({i}) = P(X−1({i})).

For example,

PX(X ∈ {2, 3}) = P({(1, 1), (1, 2), (2, 1)}) = 3/36.
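The induced measure PX can be computed by brute-force enumeration of Ω; a small sketch:

```python
from fractions import Fraction
from collections import Counter

# sample space of two dice, uniform measure on omega
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
counts = Counter(i + j for (i, j) in omega)

def P_X(values):
    """P_X(X in values) = P(X^{-1}(values)) under the uniform measure on omega."""
    return Fraction(sum(counts[v] for v in values), len(omega))

print(P_X({4}))      # 1/12  (= 3/36)
print(P_X({2, 3}))   # 1/12  (= 3/36)
```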


Proposition

Let X : Ω → R and Y : Ω → R be random variables. Then Z = f(X, Y) is also a random variable for the following functions of X and Y:

Z = cX + dY, ∀c, d ∈ R
Z = XY
Z = X/Y, if Y ≠ 0
Z = max{X, Y}
Z = min{X, Y}

Thus, the set of random variables is closed under addition, multiplication, division, maximum, minimum, and scalar multiplication.


Definition

A family (Xi)_{i∈I} of random variables is called identically distributed if

PXi = PXj  ∀i, j ∈ I.

If PX = PY, then this is denoted by X =d Y.

Definition

The support of a real-valued random variable X, denoted by Supp(X), is the smallest closed set C such that PX(C) = 1.

Example

Let X ∼ N(0, 1); then Supp(X) = R.
Let X ∼ U(a, b); then Supp(X) = [a, b].
Let X ∼ Bern(p); then Supp(X) = {0, 1}.


Random d-vectors: X : Ω → Rd.

Definition

The distribution function of X, FX : Rd → [0, 1], is defined by

FX(x) = PX(∏_{i=1}^{d} (−∞, xi]) = P(X−1(∏_{i=1}^{d} (−∞, xi]))  ∀x ∈ Rd.

Note that

X−1(∏_{i=1}^{d} (−∞, xi]) = ∩_{i=1}^{d} Xi−1((−∞, xi]) = ∩_{i=1}^{d} {Xi ≤ xi}.

Absolutely continuous random vector with density fX(x):

FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} fX(t1, . . . , td) dt1 · · · dtd.


Proposition

Let X be a d-dimensional random vector; then the marginal distribution of Xi is given by

FXi(xi) = lim_{xj→∞, j≠i} FX(x1, . . . , xd)  ∀xi ∈ R, ∀i ∈ {1, . . . , d}.

Proposition

Let fX(x) be the density of an absolutely continuous d-dimensional random vector; then the marginal density is given by

fXi(xi) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} fX(x1, . . . , xd) dx1 · · · dxi−1 dxi+1 · · · dxd  ∀i ∈ {1, . . . , d}.


Example

Let X be a 2-dimensional random vector with density fX given by

fX(x1, x2) = (1/π) 1_{[−1,1]}(x1) 1_{[−√(1−x1²), +√(1−x1²)]}(x2),

i.e. X is uniformly distributed on the unit disk. The marginal density of X1 is then given by

fX1(x1) = ∫_{−∞}^{∞} (1/π) 1_{[−√(1−x1²), +√(1−x1²)]}(x2) 1_{[−1,1]}(x1) dx2  (51)
= ∫_{−√(1−x1²)}^{+√(1−x1²)} (1/π) 1_{[−1,1]}(x1) dx2  (52)
= (2/π) √(1−x1²) 1_{[−1,1]}(x1).  (53)
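A quick numerical sanity check of (53), integrating fX over x2 on a grid (a sketch; the midpoint rule, grid size, and tolerance are arbitrary choices):

```python
import math

def f_joint(x1, x2):
    """Uniform density on the unit disk."""
    return 1.0 / math.pi if x1 * x1 + x2 * x2 <= 1.0 else 0.0

def f_marginal(x1, n=20000):
    """Midpoint-rule approximation of the x2-integral of f_joint over [-1, 1]."""
    h = 2.0 / n
    return sum(f_joint(x1, -1.0 + (k + 0.5) * h) for k in range(n)) * h

x1 = 0.3
exact = (2.0 / math.pi) * math.sqrt(1.0 - x1 * x1)
print(abs(f_marginal(x1) - exact) < 1e-3)  # True
```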


Exercise

Let Ω = {1, 2, 3, 4}, A = 2^Ω and P = U{1,2,3,4} (the uniform measure). Define X by

X(ω) = 3 if ω ∈ {1, 2, 3}, and 7 if ω ∈ {4}.

Is X a random variable?
Characterize the induced probability measure PX.

Exercise

Let (X, Y)′ be a random point chosen uniformly from the region

R ≡ {(x, y) : |x| + |y| ≤ 1} ⊂ R².

Derive an expression for fX(x) and fY(y).


Exercise

Give an example to show that X =d Y does not imply X = Y.


Intuition for independence: the information provided by any individual random variable should not affect the behavior of the other random variables in the family.

The definition of independence of random variables is abstract:

Definition

The family (Xi)_{i∈I} of random variables is called independent if the family (σ(Xi))_{i∈I} of induced σ-algebras is independent.

Induced σ-algebra of a random variable X:

σ(X) = {X−1(B) : B ∈ B(R)}.

Two σ-algebras A and B are said to be independent if

P(A ∩ B) = P(A)P(B)  ∀A ∈ A, ∀B ∈ B.

Need for more accessible condition(s).


Proposition

A family (Xi)_{i∈I} of random variables is independent if and only if, for every finite J ⊂ I and every x ∈ R^J,

FJ(x) = ∏_{j∈J} FXj(xj),

where FJ : R^J → [0, 1] is defined by FJ(x) = P(Xj ≤ xj ∀j ∈ J).

This is a general result, since J ranges over arbitrary finite subsets of I. With finitely many random variables, it reduces to the familiar necessary and sufficient condition for independence:

Proposition

The random variables X1, . . . , Xn are independent if and only if

F(X1,...,Xn)(x) = ∏_{i=1}^{n} FXi(xi)  ∀x ∈ Rn.


There are alternative (necessary and sufficient) criteria for absolutely continuous and discrete random variables:

Proposition

The discrete random variables X1, . . . , Xn with countable support S are independent if and only if

P(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^{n} P(Xi = xi)  ∀xi ∈ S, i = 1, . . . , n.

Proposition

Let X be an absolutely continuous, n-dimensional random vector with density fX(x); then X1, . . . , Xn are independent if and only if

fX(x) = ∏_{i=1}^{n} fXi(xi)  ∀x ∈ Rn.


Example

Let X1, . . . , Xn be independent exponentially distributed random variables with parameters λi > 0. The joint distribution function is then given by

F(X1,...,Xn)(x1, . . . , xn) = ∏_{i=1}^{n} FXi(xi) = ∏_{i=1}^{n} (1 − e^{−λi xi}).

Define Z = min{X1, . . . , Xn}; then

FZ(z) = P(Z ≤ z)  (54)
= 1 − P(min{X1, . . . , Xn} > z)  (55)
= 1 − P(Xi > z ∀i = 1, . . . , n)  (56)
= 1 − P(∩_{i=1}^{n} {Xi > z})  (57)
= 1 − P(X1 > z) × · · · × P(Xn > z)  (58)
= 1 − ∏_{i=1}^{n} e^{−λi z} = 1 − e^{−(Σ_{i=1}^{n} λi) z},  (59)

i.e. Z ∼ Exp(Σ_{i=1}^{n} λi).
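A seeded Monte Carlo sketch of this result (sample size, parameters, and tolerance are arbitrary choices): the empirical CDF of Z = min(X1, X2, X3) at a point should be close to 1 − exp(−(λ1 + λ2 + λ3)z).

```python
import math
import random

random.seed(0)
lams = [0.5, 1.0, 1.5]
z = 0.4
n_sim = 200_000

# empirical P(min_i X_i <= z) with independent X_i ~ Exp(lam_i)
hits = sum(
    min(random.expovariate(lam) for lam in lams) <= z
    for _ in range(n_sim)
)
empirical = hits / n_sim
theoretical = 1.0 - math.exp(-sum(lams) * z)  # 1 - e^{-3 * 0.4}
print(abs(empirical - theoretical) < 0.01)  # True (with this seed)
```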


Proposition

1. Let Xi i.i.d. ∼ Bern(p); then Y = Σ_{i=1}^{n} Xi ∼ Bin(n, p).
2. Let Xi ind. ∼ Poisson(λi); then Y = Σ_{i=1}^{n} Xi ∼ Poisson(Σ_{i=1}^{n} λi).

Proposition

1. Let Xi i.i.d. ∼ N(0, 1); then Y = Σ_{i=1}^{n} Xi² ∼ χ²(n).
2. Let Xi ind. ∼ χ²(νi); then Y = Σ_{i=1}^{n} Xi ∼ χ²(Σ_{i=1}^{n} νi).
3. Let X ∼ N(0, 1); then Y = µ + σX ∼ N(µ, σ²).
4. Let Xi i.i.d. ∼ Exp(λ); then Y = Σ_{i=1}^{n} Xi ∼ Gamma(n, λ).
5. Let X ∼ N(0, 1) and Y ∼ χ²(ν) be independent; then T = X/√(Y/ν) ∼ t(ν).
6. Let Xi ind. ∼ N(µi, σi²); then Y = Σ_{i=1}^{n} Xi ∼ N(Σ_{i=1}^{n} µi, Σ_{i=1}^{n} σi²).
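Item 1 of the first proposition can be verified exactly for small n by enumerating all 2ⁿ outcomes (a sketch; n and p are arbitrary test values):

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 4, Fraction(1, 3)

# distribution of Y = X1 + ... + Xn for i.i.d. Bern(p), by enumeration
dist = {k: Fraction(0) for k in range(n + 1)}
for outcome in product([0, 1], repeat=n):
    prob = Fraction(1)
    for x in outcome:
        prob *= p if x == 1 else 1 - p
    dist[sum(outcome)] += prob

# compare with the Bin(n, p) pmf: C(n, k) p^k (1-p)^(n-k)
for k in range(n + 1):
    assert dist[k] == comb(n, k) * p**k * (1 - p) ** (n - k)
print("matches Bin(n, p)")
```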


Exercise

Let X and Y be independent random variables with S = N and

P(X = i) = P(Y = i) = 1/2^i,  i ∈ N.

Find the following probabilities:

P(min(X, Y) ≤ i)
P(X = Y)
P(Y > X)

Exercise

Let (X, Y)′ be an absolutely continuous, bivariate random vector with fX,Y(x, y) given by

fX,Y(x, y) = e^{−y} 1_{[0,y]}(x) 1_{[0,∞)}(y).

Are X and Y independent?


Moments


For a simple random variable X = Σ_{i=1}^{n} ai 1Ai on (Ω, A, P), the expectation is defined as

E[X] = ∫ X dP = Σ_{i=1}^{n} ai P(Ai).

For a non-negative random variable X on (Ω, A, P), the expectation is defined as

E[X] = ∫ X dP = lim_{n→∞} ∫ ξn dP,

where (ξn) is a non-decreasing approximating sequence of simple random variables.

The expectation of an arbitrary random variable X on (Ω, A, P) is then defined via the decomposition of X into two non-negative random variables X+ and X−.


Definition

Let X : Ω → R be a random variable.

The positive part of X is defined by X+ = max{X, 0}.
The negative part of X is defined by X− = −min{X, 0}.

X+ and X− are non-negative random variables.
X = X+ − X−
|X| = X+ + X−
If X ≥ 0, then X− = 0 and X = X+.


Definition (Expectation)

The expectation of a random variable X, E[X], is said to exist, or to be defined, if at least one of E[X+] and E[X−] is finite:

min(E[X+], E[X−]) < ∞.

In this case we define

E[X] = E[X+] − E[X−].

Definition (Finite Expectation)

The expectation of X is said to be finite if E[X+] < ∞ and E[X−] < ∞.


Thus,

E[X] =
  E[X+] − E[X−], if E[X+] < ∞, E[X−] < ∞
  ∞, if E[X+] = ∞, E[X−] < ∞
  −∞, if E[X+] < ∞, E[X−] = ∞
  undefined, if E[X+] = ∞, E[X−] = ∞.

Proposition

E[X] is finite ⇔ E[|X|] < ∞.

Proof.

Follows from |X| = X+ + X− and E[X] = E[X+] − E[X−].


Definition

Let L1(Ω, A, P) be the set of all random variables X with E[|X|] < ∞:

L1(Ω, A, P) = {X : Ω → R : E[|X|] < ∞}.

Proposition (Properties of expectation)

Let X, Y ∈ L1 and c ∈ R.
1. If X = c almost surely, then E[X] = c.
2. E[cX] = cE[X].
3. E[X + Y] = E[X] + E[Y].
4. If X ≤ Y, then E[X] ≤ E[Y].
5. If X ≥ 0, then E[X] ≥ 0.

(3) generalizes to the sum of n random variables (proof by induction).


We need computational formulas for the expectation of a random variable. For discrete and absolutely continuous random variables the expectation, provided it exists, can be calculated using the familiar formulas.

Proposition

Let X be a discrete random variable taking values xi; then

E[X] = Σ_i xi P(X = xi).

Let X be an absolutely continuous random variable with density fX(x); then

E[X] = ∫_{−∞}^{∞} x fX(x) dx.


Example

Let X be an absolutely continuous random variable with density function

fX(x) = x^{−2} 1_{(1,∞)}(x).

Then for the negative part we find

E[X−] = E[−min{X, 0}] = E[0] = 0 < ∞,

and for the positive part we find

E[X+] = E[max{X, 0}] = ∫_{1}^{∞} x · x^{−2} dx = ∫_{1}^{∞} x^{−1} dx = ∞.

Thus, min{E[X−], E[X+]} < ∞

⇒ E[X] exists and E[X] = ∞.


Example

Let X be an absolutely continuous random variable with density function

fX(x) = (1/2)|x|^{−2} 1_{(1,∞)}(|x|).

Then for the negative part we find

E[X−] = E[−min{X, 0}] = (1/2)∫_{−∞}^{−1} (−x) x^{−2} dx = (1/2)∫_{1}^{∞} u^{−1} du = (1/2)[ln u]_{1}^{∞} = ∞,

and for the positive part we find

E[X+] = E[max{X, 0}] = (1/2)∫_{1}^{∞} x · x^{−2} dx = (1/2)[ln x]_{1}^{∞} = ∞.

Thus E[X+] = E[X−] = ∞, and E[X] does not exist.


Example

Let X ∼ U[a, b] with b > a. Then,

E[X] = ∫_{−∞}^{∞} x fX(x) dx  (60)
= ∫_{−∞}^{∞} (x/(b − a)) 1_{[a,b]}(x) dx  (61)
= ∫_{a}^{b} x/(b − a) dx  (62)
= (a + b)/2.  (63)
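A quick numerical check of (63) with a midpoint-rule integral (a sketch; the interval and grid size are arbitrary test values):

```python
def uniform_mean(a, b, n=10_000):
    """Midpoint-rule approximation of E[X] = integral of x/(b-a) over [a, b]."""
    h = (b - a) / n
    return sum((a + (k + 0.5) * h) * h / (b - a) for k in range(n))

a, b = 2.0, 5.0
print(abs(uniform_mean(a, b) - (a + b) / 2) < 1e-9)  # True
```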

Exercise

Let X be log-normally distributed with parameters µ and σ². Derive an expression for E[X].


Exercise

Let X ∼ Cauchy(0, 1). The density function is then given by

fX(x) = 1/(π(1 + x²))  ∀x ∈ R.

Does E[X] exist?

Exercise

Let X ∼ Pareto(α) with α ∈ (0, 1]. The density function is given by

fX(x) = (α/x^{α+1}) 1_{[1,∞)}(x).

Does E[X] exist?


Functions of random variables are often of great interest, and therefore so are their expectations. For the special cases of discrete and absolutely continuous random variables there are the following computational formulas.

Proposition

Suppose X is a discrete random variable taking values xi. If g(X) ∈ L1 or if g is non-negative, then

E[g(X)] = Σ_i g(xi) P(X = xi).

Proposition

Suppose X is an absolutely continuous random variable with density fX. If g(X) ∈ L1 or if g is non-negative, then

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.


Definition

Let Ln(Ω, A, P) be the set of all random variables X with E[|X|^n] < ∞:

Ln(Ω, A, P) = {X : Ω → R : E[|X|^n] < ∞}.

Proposition

If X ∈ Ln, then E[|X|^k] < ∞ for k = 1, 2, . . . , n.

Proof.

Write |X|^n = (|X|^k)^{n/k} for k = 1, 2, . . . , n. Note that the function f : R+ → R+ defined by f(x) = x^{r/s} is convex on R+ for r > s. Then by Jensen's inequality,

+∞ > E[|X|^n] = E[(|X|^k)^{n/k}] ≥ (E[|X|^k])^{n/k}.

⇒ E[|X|^k] < +∞ for k = 1, 2, . . . , n ⇒ E[X^k] < +∞ for k = 1, 2, . . . , n.


Definition

If X ∈ Lp, then the r-th moment of X is given by E[X^r] for r = 1, 2, . . . , p.

Definition

If X ∈ Lp, then the r-th absolute moment of X is given by E[|X|^r] for r = 1, 2, . . . , p.

Definition

If X ∈ Lp, then the r-th central moment of X is given by E[(X − E[X])^r] for r = 1, 2, . . . , p.


Example

Let X follow a Beta distribution with density function

fX(x) = (1/B(r, s)) x^{r−1}(1 − x)^{s−1} 1_{[0,1]}(x).

Then,

E[X^k] = ∫_{0}^{1} (1/B(r, s)) x^{k+r−1}(1 − x)^{s−1} dx.

Using the definition of the Beta function yields

E[X^k] = B(r + k, s)/B(r, s) = Γ(r + k)Γ(r + s)/(Γ(r + s + k)Γ(r)).

For example:

E[X] = rΓ(r)Γ(r + s)/((r + s)Γ(r + s)Γ(r)) = r/(r + s).
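The moment formula can be checked numerically via the gamma function in the standard library (a sketch; r, s, and k are arbitrary test values):

```python
import math

def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_moment(r, s, k):
    """E[X^k] = B(r + k, s) / B(r, s) for X ~ Beta(r, s)."""
    return beta_fn(r + k, s) / beta_fn(r, s)

r, s = 2.0, 3.0
print(abs(beta_moment(r, s, 1) - r / (r + s)) < 1e-12)  # True: E[X] = r/(r+s)
```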


Definition (Variance)

Let X ∈ L2. The variance of X is defined by

σX² = Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²,

and the standard deviation of X is given by

σX = SD(X) = +√Var(X).

Proposition (Properties of Variance)

Var(X) ≥ 0.
Var(aX + b) = a²Var(X) ∀a, b ∈ R.
Var(X) = 0 ⇔ X = E[X] almost surely, i.e. P(X = E[X]) = 1.


Proposition

If X and Y are in L2, then XY ∈ L1.

Proof.

Note that (|X| − |Y|)² = X² + Y² − 2|XY| ≥ 0. Rearranging yields 2|XY| ≤ X² + Y², hence |XY| ≤ X² + Y². Then by the inequality-preserving property of the expectation,

E[|XY|] ≤ E[X²] + E[Y²].

Thus, E[|XY|] < ∞.

Definition (Covariance)

Let X, Y ∈ L2. The covariance of X and Y is defined by

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].


Proposition (Properties of Covariance)

Let X, Y, Z, T ∈ L2 and a, b, c, d ∈ R.
1. Cov(X, X) = Var(X)
2. Cov(X, Y) = Cov(Y, X)
3. Cov(aX + b, Y) = aCov(X, Y)
4. Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y)
5. Cov(aX + bZ, cY + dT) = acCov(X, Y) + adCov(X, T) + bcCov(Z, Y) + bdCov(Z, T)

Proof of (4).

Cov(X + Z, Y) = E[(X + Z − E[X + Z])(Y − E[Y])]  (64)
= E[XY] − E[X]E[Y] + E[ZY] − E[Z]E[Y]  (65)
= Cov(X, Y) + Cov(Z, Y).  (66)


Proposition

Let X, Y ∈ L1 be independent. Then, E[XY] = E[X]E[Y].

Proposition

If X, Y ∈ L2 are independent, then Cov(X, Y) = 0.

Proof.

Follows directly from the fact that independence implies E[XY] = E[X]E[Y].

The converse is false in general: Cov(X, Y) = 0 does not imply that X and Y are independent.

Special case: if X and Y are jointly normally distributed, then X and Y are independent iff Cov(X, Y) = 0.


Definition (Correlation)

Let X, Y be two real-valued random variables in L2. The correlation of X and Y is defined by

ρX,Y = Cov(X, Y)/(√Var(X) √Var(Y)).

Proposition (Properties of Correlation)

Let X, Y ∈ L2 and a, b, c, d ∈ R.
−1 ≤ ρX,Y ≤ 1.
|ρX,Y| = 1 ⇔ P(Y = a + bX) = 1 for some a, b ∈ R with b ≠ 0.
ρ_{aX+b, cY+d} = sgn(ac) ρX,Y for a, c ≠ 0.


Exercise

Provide an example to show that E[XY] = E[X]E[Y] does not imply independence of X and Y.

Exercise

If a ∈ R \ {0} and b ∈ R, show that

ρX,aX+b = a/|a|.

Exercise

Let X, Y ∈ L2 and let

Z = (1/σY)Y − (ρX,Y/σX)X.

Show that σZ² = 1 − ρX,Y².


The moments of Rⁿ-valued random variables are defined componentwise. Thus, we can simply adapt the concepts from the univariate case.

Definition

Let X = (X1, . . . , Xn) be an Rⁿ-valued random variable. Provided that Xi ∈ L1 for each i, the first moment is defined by

E[X] = (E[X1], . . . , E[Xn])′.

Definition

Let X = (X1, . . . , Xn) be an Rⁿ-valued random variable. Provided that Xi ∈ L2 for each i, the covariance matrix of X, ΣX, is the n × n matrix with entries

σi,j = Cov(Xi, Xj).


Proposition (Properties of Covariance Matrices)

1. Var(Xi) = σi,i
2. σi,j = σj,i (symmetric)
3. ∀a ∈ Rⁿ, a′ΣXa ≥ 0 (positive semidefinite)
4. Let A ∈ R^{m×n}. Then ΣAX = AΣXA′.

Proof of (3).

a′ΣXa = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj σi,j  (67)
= Var(Σ_{i=1}^{n} ai Xi) ≥ 0  ∀a ∈ Rⁿ.  (68)
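Property 4 (ΣAX = AΣXA′) also holds exactly for sample covariance matrices, which gives a quick numerical check (a sketch without external libraries; the data matrix and A are arbitrary test values):

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def cov(data):
    """Sample covariance matrix (divisor n) of rows = observations."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
             for j in range(d)] for i in range(d)]

X = [[1.0, 2.0], [2.0, 0.5], [4.0, 3.0], [0.0, 1.0]]  # 4 observations in R^2
A = [[1.0, -1.0], [2.0, 3.0]]                          # linear map R^2 -> R^2

# transform each observation by A, then compare Cov(AX) with A Cov(X) A'
AX = [[sum(A[i][k] * row[k] for k in range(2)) for i in range(2)] for row in X]
lhs = cov(AX)
rhs = matmul(matmul(A, cov(X)), transpose(A))
print(all(abs(lhs[i][j] - rhs[i][j]) < 1e-9 for i in range(2) for j in range(2)))  # True
```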


Definition

Let X = (X1, . . . , Xn) be an Rⁿ-valued random variable. Provided that Xi ∈ L2, the correlation matrix of X, ΞX, is the n × n matrix with entries

Ξi,j = Cov(Xi, Xj)/(√Var(Xi) √Var(Xj)).

Exercise

Let (X, Y)′ be an absolutely continuous, bivariate random vector with density given by

f(X,Y)(x, y) = (1/π) 1_{[−1,1]}(x) 1_{[−√(1−x²), +√(1−x²)]}(y).

Derive Cov(X, Y).
Are X and Y independent?


Exercise

Show that |ρX,Y| ≤ 1. Use the fact that |E[XY]| ≤ +√(E[X²]E[Y²]).

Exercise

Let Xi ∈ L2. Show that Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj).



Proposition (Modulus inequality)

If X ∈ L^1, then

|E[X]| ≤ E[|X|]

Proof.

|E[X]| = |E[X⁺] − E[X⁻]| ≤ E[X⁺] + E[X⁻] = E[|X|]

Proposition (Jensen’s inequality)

Let X ∈ L^1 and let f : R → R be a convex (concave) function such that f(X) ∈ L^1. Then,

E[f(X)] ≥ f(E[X]), if f is convex,

E[f(X)] ≤ f(E[X]), if f is concave.

If f is strictly convex (concave), the inequality is strict unless X is almost surely constant.
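A quick numerical illustration of Jensen's inequality (a sketch with arbitrary choices: U[0,1] draws, the convex function x² and the concave function log x):

```python
import random, math

random.seed(1)
xs = [random.uniform(0, 1) for _ in range(100000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)

# f(x) = x^2 is strictly convex: E[f(X)] >= f(E[X])
assert mean_sq >= mean ** 2

# f(x) = log(x) is concave on (0, inf): E[log X] <= log E[X]
mean_log = sum(math.log(x) for x in xs) / len(xs)
assert mean_log <= math.log(mean)
```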


Proposition (Cauchy-Schwarz inequality)

Let X, Y ∈ L^2. Then,

|E[XY]| ≤ +√(E[X²] E[Y²])

Proof.

For every t ∈ R it holds that

0 ≤ E[(tX + Y)²] = t²E[X²] + 2tE[XY] + E[Y²].

As a quadratic in t this expression is nonnegative, so it has at most one real root. First consider the case where E[(tX + Y)²] = 0 for some t: then Y = −tX almost surely, and

|E[XY]| = |E[−tX²]| = |t| E[X²],
√(E[X²] E[Y²]) = √(E[X²] E[t²X²]) = |t| E[X²],

so the inequality holds with equality.


Proof Cont'd.

Next, assume Y ≠ −tX for every t. This implies that E[(tX + Y)²] > 0 for all t, so the quadratic cannot have a real root. Thus its discriminant must be negative:

4[(E[XY])² − E[X²]E[Y²]] < 0.

It follows that

(E[XY])² − E[X²]E[Y²] < 0 ⇔ |E[XY]| < +√(E[X²] E[Y²]).

Combining both cases yields the Cauchy–Schwarz inequality.

Another version of the Cauchy–Schwarz inequality: E[|XY|] ≤ +√(E[X²] E[Y²]).

Derivation: apply the previously proven inequality to the random variables |X| and |Y|.


Proposition (Markov’s Inequality)

Let X ∈ L^r, then for a ∈ R₊₊

P(|X| ≥ a) ≤ E[|X|^r] / a^r.

Most common special case of Markov's Inequality:

Proposition (Markov's Inequality)

Let X be a nonnegative random variable in L^1, then for a ∈ R₊₊

P(X ≥ a) ≤ E[X] / a.

Proposition (Chebyshev's Inequality)

Let X ∈ L^2, then for a ∈ R₊₊

P(|X − E[X]| ≥ a) ≤ E[(X − E[X])²] / a²


Proof.

Define Z = (X − E[X])² and apply Markov's inequality:

P(|X − E[X]| ≥ a) = P(Z ≥ a²) ≤ E[Z]/a² = E[(X − E[X])²]/a²

Alternative Proof.

Define the event A = {|X − E[X]| ≥ a} and the indicator random variable I = 1_A. Then it holds that |X − E[X]|²/a² ≥ I, and by the inequality-preserving property of the expectation

E[(X − E[X])²]/a² ≥ E[1_A] = P(|X − E[X]| ≥ a).


Proposition (Minkowski’s Inequality)

Let X, Y ∈ L^p, then for p > 1

(E[|X + Y|^p])^{1/p} ≤ (E[|X|^p])^{1/p} + (E[|Y|^p])^{1/p}

Proposition (Hölder's Inequality)

Let X ∈ L^p and Y ∈ L^q with p, q > 1 and 1/p + 1/q = 1. Then,

E[|XY|] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q}

Proposition (Lyapunov's Inequality)

Let X ∈ L^s, then

(E[|X|^r])^{1/r} ≤ (E[|X|^s])^{1/s}, 1 ≤ r ≤ s.
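Both Hölder's and Lyapunov's inequalities hold exactly for empirical moments as well. A sketch (the distributions and exponents are arbitrary choices):

```python
import random

random.seed(4)
xs = [abs(random.gauss(0, 1)) for _ in range(100000)]
ys = [random.expovariate(2.0) for _ in range(100000)]

n = len(xs)
p, q = 3.0, 1.5                     # conjugate exponents: 1/3 + 2/3 = 1
lhs = sum(x * y for x, y in zip(xs, ys)) / n
rhs = (sum(x ** p for x in xs) / n) ** (1 / p) * \
      (sum(y ** q for y in ys) / n) ** (1 / q)
assert lhs <= rhs                   # Hoelder's inequality

# Lyapunov: L^r norms are nondecreasing in r (here r = 1 <= s = 2)
l1 = sum(xs) / n
l2 = (sum(x * x for x in xs) / n) ** 0.5
assert l1 <= l2
```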


Exercise

Let X ∈ L2. Show that Var(X ) = 0 ⇒ P(X = E[X ]) = 1.

Exercise

Show that even the following more general version of Markov’s inequalityholds:Let X be a real random variable and let f : [0,∞)→ [0,∞) be monotoneincreasing Then for any a > 0 with f (a) > 0,

P(|X | ≥ a) ≤ E[f (|X |)]

f (a).

Hints:

R = X ≥ a ∪ X < af (X ) = f (X )1R(X )

Moment Generating Functions

Definition (Moment Generating Function)

Let X be a real-valued random variable. Its moment generating function M_X : R → R is defined by

M_X(t) = E[e^{tX}]

Example

Let Z ∼ U[0,1].

M_Z(t) = E[e^{tZ}] = ∫₀¹ e^{tz} dz = [e^{tz}/t]₀¹ = (e^t − 1)/t, t ≠ 0

Let Y ∼ DU(θ) with p.m.f. f_Y(y) = θ^{−1} 1_{{1,...,θ}}(y).

M_Y(t) = E[e^{tY}] = ∑_{i=1}^θ (1/θ) e^{ti}
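The U[0,1] formula can be checked against a Monte Carlo estimate of E[e^{tZ}] (a sketch; the evaluation point t is arbitrary):

```python
import random, math

random.seed(5)
zs = [random.random() for _ in range(200000)]   # Z ~ U[0,1]

t = 1.3                                         # arbitrary evaluation point
mgf_emp = sum(math.exp(t * z) for z in zs) / len(zs)
mgf_formula = (math.exp(t) - 1) / t
assert abs(mgf_emp - mgf_formula) < 0.02
```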


Definition

The moment generating function is said to exist if it is finite on an open neighbourhood of zero, i.e., if there is an h ∈ R₊₊ such that ∀t ∈ (−h, h), M(t) < +∞.

Proposition

If the moment generating function exists in an open interval containingzero, it uniquely determines the probability distribution.

Proposition

If M_X(t) exists, then

all positive moments are finite: ∀r ∈ R₊₊, E[|X|^r] < +∞

E[X^j] = M_X^{(j)}(t)|_{t=0}


Why the expectation of e^{tX}?

The power series of e^{tX} is given by ∑_{n=0}^∞ (t^n/n!) X^n.

By the linearity of the expectation:

E[e^{tX}] = E[∑_{n=0}^∞ (t^n/n!) X^n] = ∑_{n=0}^∞ (t^n/n!) E[X^n] ∀|t| < h.

It can be shown that termwise differentiation is valid; then

M_X^{(j)}(t) = ∑_{i=j}^∞ (t^{i−j}/(i−j)!) E[X^i] = E[X^j ∑_{n=0}^∞ (tX)^n/n!] = E[X^j e^{tX}].

It follows that M_X^{(j)}(t)|_{t=0} = E[X^j].
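The moment-extraction property can be illustrated numerically by differentiating the U[0,1] m.g.f. at zero with finite differences (a sketch; math.expm1 keeps (e^t − 1)/t accurate near t = 0):

```python
import math

def mgf_u01(t):
    # M(t) = (e^t - 1)/t for U[0,1], with the removable singularity filled in
    return 1.0 if t == 0 else math.expm1(t) / t

h = 1e-3
# Central differences approximate M'(0) = E[Z] and M''(0) = E[Z^2]
m1 = (mgf_u01(h) - mgf_u01(-h)) / (2 * h)
m2 = (mgf_u01(h) - 2 * mgf_u01(0) + mgf_u01(-h)) / h ** 2

assert abs(m1 - 0.5) < 1e-6        # E[Z]   = 1/2
assert abs(m2 - 1 / 3) < 1e-6      # E[Z^2] = 1/3
```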


Example

Let X ∼ Gamma(α, λ) with α, λ ∈ R₊₊.

M_X(t) = ∫₀^{+∞} e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫₀^{+∞} x^{α−1} e^{x(t−λ)} dx

The integral converges for t < λ:

M_X(t) = (λ^α/Γ(α)) · Γ(α)/(λ − t)^α = (λ/(λ − t))^α ∀t ∈ (−∞, λ)

Thus, M_X(t) < ∞ in particular for all t ∈ (−λ, λ), so the m.g.f. exists.

E[X] = M′_X(0) = α/λ

E[X²] = M″_X(0) = α(α + 1)/λ²
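The two Gamma moments can be checked by simulation. Note that Python's random.gammavariate takes a shape and a scale parameter, so the scale is 1/λ for the rate parameterization used here (the parameter values below are arbitrary):

```python
import random

random.seed(6)
alpha, lam = 2.5, 1.5              # arbitrary shape and rate
# random.gammavariate(shape, scale); scale = 1/rate
xs = [random.gammavariate(alpha, 1 / lam) for _ in range(200000)]

n = len(xs)
m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n

assert abs(m1 - alpha / lam) < 0.02                     # E[X]  = α/λ
assert abs(m2 - alpha * (alpha + 1) / lam ** 2) < 0.1   # E[X²] = α(α+1)/λ²
```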


Example

Let Y ∼ U[0,1].

M_Y(t) = { (e^t − 1)/t, t ≠ 0
         { 1,           t = 0

M_Y(t) is continuous at zero; by l'Hôpital's rule

lim_{t→0} (e^t − 1)/t = lim_{t→0} e^t/1 = 1 = M_Y(0).

For t ≠ 0, M′_Y(t) = (te^t − e^t + 1)/t², and again by l'Hôpital's rule

M′_Y(0) = lim_{t→0} (te^t − e^t + 1)/t² = lim_{t→0} te^t/(2t) = lim_{t→0} e^t/2 = 1/2.

Thus, E[Y] = 1/2.


Proposition

Let X be a real-valued random variable with m.g.f. M_X(t) and define Y = μ + σX; then M_Y(t) = e^{μt} M_X(σt).

Let X_1, ..., X_n be independent random variables with m.g.f.s M_{X_i}(t) and define Z = ∑_{i=1}^n X_i; then M_Z(t) = M_{X_1}(t) ··· M_{X_n}(t) on the common interval where all m.g.f.s exist.

Proof.

M_Y(t) = E[e^{tY}] (69)
       = E[e^{t(μ+σX)}] (70)
       = E[e^{tμ} e^{σtX}] (71)
       = e^{tμ} E[e^{σtX}] (72)
       = e^{tμ} M_X(σt) (73)


Proof.

M_Z(t) = E[e^{tZ}] (74)
       = E[e^{t(∑_{i=1}^n X_i)}] (75)
       = E[e^{tX_1} ··· e^{tX_n}] (76)
       = E[e^{tX_1}] ··· E[e^{tX_n}]  (by independence) (77)
       = M_{X_1}(t) ··· M_{X_n}(t) (78)

This property is very helpful for determining the distribution of sums of independent random variables.
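Both m.g.f. rules can be verified by simulation (a sketch with U[0,1] variables; t, μ and σ are arbitrary choices):

```python
import random, math

random.seed(7)
N = 200000
x1 = [random.random() for _ in range(N)]
x2 = [random.random() for _ in range(N)]   # independent of x1

def mgf_u01(t):                            # exact m.g.f. of U[0,1]
    return 1.0 if t == 0 else math.expm1(t) / t

t = 0.7
# M_Z(t) for Z = X1 + X2, estimated from samples, vs. the product of m.g.f.s
mz_emp = sum(math.exp(t * (a + b)) for a, b in zip(x1, x2)) / N
assert abs(mz_emp - mgf_u01(t) ** 2) < 0.02

# M_Y(t) for Y = mu + sigma*X vs. e^{mu t} M_X(sigma t)
mu, sigma = 2.0, 3.0                       # arbitrary constants
my_emp = sum(math.exp(t * (mu + sigma * a)) for a in x1) / N
assert abs(my_emp - math.exp(mu * t) * mgf_u01(sigma * t)) < 0.3
```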


Example

Show that if X_i i.i.d. ∼ N(0, 1), then ∑_{i=1}^n X_i² ∼ χ²(n). First derive the m.g.f. of X²:

E[e^{tX²}] = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²(1/2 − t)} dx = ∫_{−∞}^{∞} (1/√(2π)) e^{−(x²/2)(1−2t)} dx.

Make the change of variable u = x√(1 − 2t); then dx = du/√(1 − 2t):

E[e^{tX²}] = (1/√(1 − 2t)) ∫_{−∞}^{∞} (1/√(2π)) e^{−u²/2} du = 1/√(1 − 2t).

Thus, E[e^{tX²}] < ∞ for t < 1/2. By the product rule for m.g.f.s of independent random variables,

M_{∑_{i=1}^n X_i²}(t) = (1/√(1 − 2t))^n = M_Z(t) for Z ∼ χ²(n).
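A simulation sketch comparing the empirical m.g.f. of ∑X_i² with (1 − 2t)^{−n/2} (n, the number of replications N, and the point t < 1/2 are arbitrary choices):

```python
import random, math

random.seed(8)
n, N, t = 3, 200000, 0.2        # t < 1/2 so the m.g.f. is finite
sums = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(N)]

mgf_emp = sum(math.exp(t * s) for s in sums) / N
mgf_chi = (1 - 2 * t) ** (-n / 2)          # m.g.f. of chi-square(n)
assert abs(mgf_emp - mgf_chi) < 0.05
```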


Exercise

Let X_i ind∼ Poisson(λ_i) and define Y = ∑_{i=1}^n X_i.

Derive an expression for M_{X_1}(t).

Show that Y ∼ Poisson(∑_{i=1}^n λ_i).

Exercise

Let X follow an exponential distribution with density

f_X(x) = λ e^{−λx} 1_{[0,∞)}(x).

Derive Var(X) via the moment generating function.

Exercise

Let X_i ind∼ N(μ_i, σ_i²) and define S_n = ∑_{i=1}^n X_i and D = X_1 − X_2.

Show that S_n ∼ N(∑_{i=1}^n μ_i, ∑_{i=1}^n σ_i²).

Show that if σ_1² = σ_2², then S_2 and D are independent.

Transformations of Random Variables

Let X be an absolutely continuous random variable with density f_X and suppose that Y = g(X). Is it possible to express the density of Y in terms of f_X?

Proposition

Let X have a continuous density function f_X. Let g : D ⊆ R → R be a C¹ function and strictly monotone. Then Y = g(X) has the density

f_Y(y) = f_X(g⁻¹(y)) |d/dy g⁻¹(y)|

If F_Y(y) is differentiable, then f_Y(y) = d/dy F_Y(y).

A modified version holds for piecewise strictly monotone and piecewise C¹ functions.


Proof.

Suppose g is increasing and let h(y) = g⁻¹(y). Then,

F_Y(y) = P(g(X) ≤ y) (79)
       = P(h(g(X)) ≤ h(y)) (80)
       = P(X ≤ h(y)) (81)
       = ∫_{−∞}^{h(y)} f_X(t) dt (82)

h(y) is differentiable, and therefore, by applying Leibniz's rule,

d/dy F_Y(y) = d/dy ∫_{−∞}^{h(y)} f_X(t) dt = f_X(h(y)) h′(y).

If g is decreasing,

d/dy F_Y(y) = f_X(h(y)) (−h′(y)).


Proof cont'd.

Thus, summarizing both cases,

f_Y(y) = f_X(g⁻¹(y)) |d/dy g⁻¹(y)|

Proposition

Let X have a continuous density f_X. Let g : R → R be piecewise strictly monotone and piecewise continuously differentiable: that is, there exist intervals I_1, I_2, ..., I_n which partition R such that g is strictly monotone and continuously differentiable on the interior of each I_i. For each i, g : I_i → R is invertible on g(I_i). Let Y = g(X) and let h_i be the corresponding inverse function. Then the density f_Y of Y exists and is given by

f_Y(y) = ∑_{i=1}^n f_X(h_i(y)) |d/dy h_i(y)| 1_{g(I_i)}(y).


Example

Let X ∼ N(0, 1) and let Y = X². Define I_1 = [0,∞) and I_2 = (−∞, 0). Then g is injective and strictly monotone on I_1 and I_2. The inverse functions h_1 and h_2 are given by

h_1(y) = √y and h_2(y) = −√y.

Thus,

f_Y(y) = f_X(h_1(y)) |h′_1(y)| 1_{(0,∞)}(y) + f_X(h_2(y)) |h′_2(y)| 1_{(0,∞)}(y) (83)
       = (1/√(2π)) e^{−y/2} (1/(2√y)) 1_{(0,∞)}(y) + (1/√(2π)) e^{−y/2} (1/(2√y)) 1_{(0,∞)}(y) (84)
       = (1/√(2π)) (1/√y) e^{−y/2} 1_{(0,∞)}(y) (85)

⇒ Y ∼ χ²(1).


Example (Cont’d)

Alternative derivation:

F_Y(y) = P(Y ≤ y) (86)
       = P(X² ≤ y) (87)
       = P(−√y ≤ X ≤ √y) (88)
       = F_X(√y) − F_X(−√y) (89)
       = ∫_{−∞}^{√y} (1/√(2π)) e^{−x²/2} dx − ∫_{−∞}^{−√y} (1/√(2π)) e^{−x²/2} dx. (90)

Thus, for y > 0,

f_Y(y) = d/dy F_Y(y) = d/dy F_X(√y) − d/dy F_X(−√y)
       = f_X(√y) (1/(2√y)) + f_X(−√y) (1/(2√y))
       = (1/√(2π)) (1/√y) e^{−y/2} 1_{(0,∞)}(y).

⇒ Y ∼ χ²(1).
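As a numerical check that the derived f_Y is a genuine density, a midpoint-rule integration (a sketch; the cutoff 50 and the step size are arbitrary, and the singularity at 0 is integrable):

```python
import math

def f_y(y):
    # Density of Y = X^2 for X ~ N(0,1): chi-square with 1 degree of freedom
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

# Midpoint rule on (0, 50]; evaluating at interval midpoints avoids
# the singularity at y = 0 and the tail beyond 50 is negligible.
h = 1e-4
total = sum(f_y((k + 0.5) * h) * h for k in range(int(50 / h)))
assert abs(total - 1.0) < 0.01
```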


Proposition

Let X be an Rⁿ-valued absolutely continuous random variable with density f_X. Let g : Rⁿ → Rⁿ be a C¹ function and injective with non-vanishing Jacobian. Then Y = g(X) has density

f_Y(y) = f_X(g⁻¹(y)) |det J_{g⁻¹}(y)|, if y is in the range of g, and 0 otherwise.

Proposition

Let S ⊆ Rⁿ be partitioned into disjoint subsets S_0, S_1, ..., S_m such that ∪_{i=0}^m S_i = S, S_0 has Lebesgue measure zero, and for each i = 1, ..., m, g_i : S_i → Rⁿ is injective and C¹ with non-vanishing Jacobian. Let Y = g(X); then

f_Y(y) = ∑_{i=1}^m f_X(g_i⁻¹(y)) |det J_{g_i⁻¹}(y)|.


Exercise

Let X ∼ U[0,1] and let Y = −(1/λ) ln X with λ > 0. Derive the density function of Y.

Exercise

Let X ∼ U[−1,1]. Derive the density for Y = X^k for k ∈ N \ {0}.

Exercise

Let X be an absolutely continuous and positive random variable with density f_X. Define Y = 1/(X + 1). Find the density of Y.

Convergence Concepts

Let X be a random variable and let (X_n)_{n≥1} be a sequence of random variables defined on the same probability space (Ω, B, P).

Definition (Almost Sure Convergence)

A sequence of random variables (X_n)_{n≥1} converges almost surely to a limiting random variable X if

P({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1

and is denoted by X_n →a.s. X.

Equivalent definition:

Definition

A sequence of random variables (X_n)_{n≥1} converges almost surely to a limiting random variable X if

P({ω ∈ Ω : lim_{n→∞} X_n(ω) ≠ X(ω)}) = 0


Definition (Convergence in Probability)

A sequence of random variables (X_n)_{n≥1} converges in probability to a limiting random variable X if for any ε > 0 we have

lim_{n→∞} P({ω ∈ Ω : |X_n(ω) − X(ω)| > ε}) = 0

and is denoted by X_n →p X.

More common notation: lim_{n→∞} P(|X_n − X| > ε) = 0.

Thus, X_n →p X states that the probability that X_n and X are more than a prescribed ε > 0 apart converges to zero as n → ∞.

Statistics: an estimator β̂_n is consistent for β if β̂_n →p β.


Application:

Theorem (WLLN)

Let (X_n)_{n≥1} be a sequence of i.i.d. random variables in L² with μ = E[X_1] and σ² = Var(X_1). Define X̄_n = (1/n) ∑_{i=1}^n X_i. Then,

X̄_n →p μ

Proof.

P(|X̄_n − E[X̄_n]| ≥ ε) = P(|X̄_n − μ| ≥ ε) (91)
                        ≤ Var(X̄_n)/ε²  (by Chebyshev's inequality) (92)
                        = Var(X_1)/(nε²) (93)

Hence, lim_{n→∞} P(|X̄_n − μ| ≥ ε) = 0 ⇒ X̄_n →p μ.
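A one-line illustration of the WLLN with U[0,1] draws (an arbitrary choice; a single realization per sample size):

```python
import random

random.seed(9)
mu = 0.5                         # mean of U[0,1]

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# Deviations of the sample mean from mu shrink as n grows
devs = {n: abs(sample_mean(n) - mu) for n in (100, 10000, 1000000)}
assert devs[1000000] < 0.01
```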


Let X be a random variable, let (X_n)_{n≥1} be a sequence of random variables, which are not necessarily defined on the same probability space, and let C_{F_X} ⊂ R denote the set of all points at which F_X is continuous.

Definition (Convergence in Distribution)

The sequence (X_n)_{n≥1} converges to X in distribution if

lim_{n→∞} F_{X_n}(t) = F_X(t) ∀t ∈ C_{F_X}

and is denoted by X_n →d X.

Convergence in distribution is the weakest of these forms of convergence.

The notion of convergence in distribution only requires the convergence of the distribution functions. This is the reason why X and every X_n can be defined on a different probability space.


Example

Let X be a point mass at 0 (i.e. X = 0 almost surely) and let the distribution function of X_n be given by

F_{X_n}(x) = e^{nx}/(1 + e^{nx}) ∀x ∈ R.

Clearly, the distribution function of X is given by

F_X(x) = { 0, if x < 0
         { 1, if x ≥ 0,

and the limiting distribution function is given by

lim_{n→∞} F_{X_n}(x) = F*(x) = { 0,   if x < 0
                               { 1/2, if x = 0
                               { 1,   if x > 0.


Example (cont’d)

lim_{x→0+} F*(x) = 1 ≠ 1/2 = F*(0). Thus, the limiting function F* is not right-continuous and therefore not a distribution function.

The definition of convergence in distribution requires only

lim_{n→∞} F_{X_n}(t) = F_X(t) ∀t ∈ C_{F_X}.

Here C_{F_X} = R \ {0}, and

lim_{n→∞} F_{X_n}(x) = F_X(x) ∀x ∈ C_{F_X}

⇒ X_n →d X.
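Evaluating F_{X_n} for large n shows the limiting behaviour directly (a sketch):

```python
import math

def F_n(n, x):
    # Logistic-type distribution function that sharpens around 0 as n grows
    return math.exp(n * x) / (1 + math.exp(n * x))

# At continuity points of F_X (x != 0) the limit matches F_X
assert abs(F_n(1000, -0.1) - 0.0) < 1e-6
assert abs(F_n(1000, 0.1) - 1.0) < 1e-6
# At the discontinuity x = 0 the limit is 1/2, not F_X(0) = 1
assert F_n(1000, 0.0) == 0.5
```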


Convergence in distribution can also be established via moment generating functions.

Proposition

Let (X_n)_{n≥1} be a sequence of random variables such that M_{X_n}(t) exists for |t| < h, h ∈ R₊₊, and all n ∈ N. If X is a random variable such that M_X(t) exists for |t| ≤ h_1 < h, then

lim_{n→∞} M_{X_n}(t) = M_X(t) for |t| < h_1 ⇒ X_n →d X.

Proposition (Scheffé's Lemma)

Let (X_n)_{n≥1} be a sequence of absolutely continuous random variables with corresponding sequence of density functions (f_{X_n}(x))_{n≥1} and let X be an absolutely continuous random variable with density f_X(x). Then,

f_{X_n}(x) → f_X(x) for (Lebesgue) almost every x ⇒ X_n →d X.


Exercise

Let (Ω, B, P) be a probability space, where Ω = [0, 1], B = B([0, 1]) and P = U[0,1]. Define the sequence of random variables X_n : Ω → R by

X_n(ω) = ω + ω^n ∀n ∈ N

and the random variable

X(ω) = ω.

Does it hold that X_n →a.s. X, i.e. P({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1?


Proposition

Let (X_n)_{n≥1} be a sequence of random variables; then

X_n →a.s. X ⇒ X_n →p X ⇒ X_n →d X.

Proposition

Let (X_n)_{n≥1} and (Y_n)_{n≥1} be sequences of random variables and c ∈ R; then

If X_n →a.s. X and Y_n →a.s. Y, then X_n + Y_n →a.s. X + Y.

If X_n →p X and Y_n →p Y, then X_n + Y_n →p X + Y.

If X_n →d X and Y_n →d c, then X_n + Y_n →d X + c.

If X_n →a.s. X and Y_n →a.s. Y, then X_n Y_n →a.s. XY.

If X_n →p X and Y_n →p Y, then X_n Y_n →p XY.

If X_n →d X and Y_n →d c, then X_n Y_n →d cX.


Proposition

If for some c ∈ R it holds that X = c almost surely, i.e. P(X = c) = 1, then X_n →d X ⇒ X_n →p X.

Proof.

Fix ε > 0. Then,

P(|X_n − c| > ε) = P({X_n > c + ε} ∪ {X_n < c − ε}) (94)
                = P(X_n > c + ε) + P(X_n < c − ε) (95)
                ≤ (1 − F_{X_n}(c + ε)) + F_{X_n}(c − ε) (96)

For any ε > 0, (c ± ε) ∈ C_{F_X}; therefore

lim_{n→∞} [(1 − F_{X_n}(c + ε)) + F_{X_n}(c − ε)] = (1 − 1) + 0 = 0.

Thus, lim_{n→∞} P(|X_n − c| > ε) = 0.


Let X be an R^d-valued random variable and let (X_n)_{n≥1} be a sequence of random variables defined on the same probability space (Ω, B, P).

Definition (Almost Sure Convergence)

A sequence of R^d-valued random variables (X_n)_{n≥1} converges almost surely to a limiting random variable X if

P({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1

and is denoted by X_n →a.s. X.

Almost sure convergence can be established either componentwise or directly for the vector sequence:

X_n →a.s. X ⇔ X_{n,i} →a.s. X_i ∀i ∈ {1, ..., d}


Definition (Convergence in Probability)

A sequence of R^d-valued random variables (X_n)_{n≥1} converges in probability to a limiting random variable X if for any ε > 0 we have

lim_{n→∞} P({ω ∈ Ω : ||X_n(ω) − X(ω)|| > ε}) = 0

and is denoted by X_n →p X.

Convergence in probability can likewise be established either componentwise or via the vector sequence:

X_n →p X ⇔ X_{n,i} →p X_i ∀i ∈ {1, ..., d}


The extension of the definition of convergence in distribution to R^d-valued random variables is immediate:

lim_{n→∞} F_{X_n}(t) = F_X(t) ∀t ∈ C_{F_X}

Multivariate distribution functions are, however, often hard to work with directly.

Unfortunately, componentwise convergence in distribution does not imply convergence in distribution of the vector sequence. The converse is true:

X_n →d X ⇒ X_{n,i} →d X_i ∀i ∈ {1, ..., d}

Proposition (Cramér–Wold device)

Let (X_n)_{n≥1} be a sequence of R^d-valued random variables; then

X_n →d X ⇔ λ′X_n →d λ′X ∀λ ∈ R^d


Almost sure convergence, convergence in probability and convergence in distribution are preserved under continuous transformations.

Theorem (Continuous Mapping Theorem)

Let g : D ⊆ R^d → R^r be a continuous function that does not depend on n. Then

If X_n →a.s. X, then g(X_n) →a.s. g(X)

If X_n →p X, then g(X_n) →p g(X)

If X_n →d X, then g(X_n) →d g(X)

Theorem (Slutsky's theorem)

Let X_n →d X, Y_n →p c for some constant vector c, and A_n →p A for some constant matrix A ∈ R^{k×r}. Then

A_n X_n + Y_n →d AX + c.


Exercise

Show that √n(β̂_n − β) →d N(0, Ω) implies that β̂_n →p β.

Exercise

Provide an example of two sequences of random variables (X_n)_{n≥1} and (Y_n)_{n≥1} and two random variables X and Y such that X_n →d X and Y_n →d Y, but (X_n, Y_n)′ does not converge in distribution to (X, Y).

Exercise

Let X_n ∼ U[−1/n, 1/n] and let X = 0 almost surely. Is it true that X_n →p X?

Law of Large Numbers

There are many different laws of large numbers, depending on the assumptions imposed on the random variables.

We have already seen a weak law of large numbers, which included the strong assumptions that the sequence of random variables is i.i.d. and in L². However, the proof of the WLLN still works if we assume that the random variables are only uncorrelated instead of independent. Thus, the theorem can be rewritten in the following way:

Theorem (WLLN)

Let (X_n)_{n≥1} be a sequence of identically distributed and uncorrelated (i.e. Cov(X_i, X_j) = 0 ∀j ≠ i) random variables in L² with μ = E[X_1]. Define X̄_n = (1/n) ∑_{i=1}^n X_i. Then,

X̄_n →p μ


Proof.

P(|X̄_n − E[X̄_n]| ≥ ε) ≤ Var(X̄_n)/ε² (97)
                       = Var((1/n) ∑_{i=1}^n X_i)/ε² (98)
                       = ∑_{i=1}^n ∑_{j=1}^n Cov(X_i, X_j)/(n²ε²) (99)
                       = ∑_{i=1}^n Var(X_i)/(n²ε²)  (by uncorrelatedness) (100)
                       = n Var(X_1)/(n²ε²) (101)
                       = Var(X_1)/(nε²) (102)

Hence, lim_{n→∞} P(|X̄_n − μ| ≥ ε) = 0 ⇒ X̄_n →p μ.


There is a weak law of large numbers which drops the assumption of a finite second moment. This is known as Khintchin's WLLN:

Theorem (Khintchin's WLLN)

Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with E[|X_1|] < ∞. Define X̄_n = (1/n) ∑_{i=1}^n X_i. Then,

X̄_n →p E[X_1]

Example

Let X_i i.i.d. ∼ t(2). Then E[X_i] = 0 and σ²_{X_i} = ∞. The L² version of the WLLN does not apply because the second moment fails to be finite. However, Khintchin's WLLN does apply, and we can conclude that X̄_n →p E[X_i] = 0.


[Figure: running sample average (y-axis, −6 to 6) of i.i.d. t(2) draws plotted against n (0 to 10000), illustrating the convergence X̄_n →p 0.]

The strong law of large numbers refers to almost sure convergence.

Kolmogorov's SLLN requires the same assumptions as Khintchin's WLLN, but establishes almost sure convergence.

Theorem (Kolmogorov's SLLN)

Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with E[|X_1|] < ∞. Define X̄_n = (1/n) ∑_{i=1}^n X_i. Then,

X̄_n →a.s. E[X_1]

Almost sure convergence and convergence in probability of R^d-valued random variables can be established componentwise. Thus, the laws of large numbers extend easily to R^d-valued random variables.


The fact that (measurable) functions of independent random variables are again independent implies the following proposition.

Proposition

Let (Xn)n≥1 be a sequence of i.i.d. random variables in L^k and define the sequence (Yn)n≥1 := (X_n^k)n≥1. Then,

(1/n) ∑_{i=1}^n X_i^k →p E[X_1^k].

Example

Let Xi ~i.i.d. N(0, 4). Then the first four moments are E[X1] = 0, E[X1²] = 4, E[X1³] = 0 and E[X1⁴] = 3 · 4² = 48.
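The proposition can be checked numerically: applying the LLN to the i.i.d. sequence (X_i^k), the sample k-th moments of N(0, 4) draws should approach 0, 4, 0 and 48. A minimal sketch (standard library only; sample size and seed are arbitrary choices):

```python
import random
import statistics

random.seed(0)
n = 100_000
xs = [random.gauss(0.0, 2.0) for _ in range(n)]  # N(0, 4): standard deviation 2

# LLN applied to (X_i^k): the sample k-th moment converges to E[X_1^k]
m = {k: statistics.fmean(x ** k for x in xs) for k in (1, 2, 3, 4)}
print(m)  # m[1], m[2], m[3], m[4] close to 0, 4, 0, 48
```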


[Figure: four panels (“First Moment” to “Fourth Moment”) showing running sample moments of i.i.d. N(0, 4) draws for n up to 10,000, settling near 0, 4, 0 and 48.]

Central Limit Theorem


Theorem (CLT for i.i.d. random variables)

Let (Xn)n≥1 be an i.i.d. sequence of random variables in L² with µ = E[X1] and σ² = Var(X1), and define X̄n = (1/n) ∑_{i=1}^n Xi. Then,

√n (X̄n − µ) / σ →d N(0, 1),

or, in terms of sums,

(∑_{i=1}^n Xi − nµ) / (σ√n) →d N(0, 1).

There are many different CLTs, depending on the underlying assumptions.

Asymptotic approximation: X̄n ~a N(µ, σ²/n).

Proof via the characteristic function and Lévy’s Continuity Theorem.
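The theorem is easy to illustrate by simulation. The sketch below (our own setup, not from the slides) standardizes means of Uniform(0, 1) draws, for which µ = 1/2 and σ² = 1/12, and checks that about 68.3% of the standardized values land in [−1, 1], as they would for a standard normal variable.

```python
import math
import random

random.seed(1)

def standardized_mean(n, draw, mu, sigma):
    # sqrt(n) * (Xbar_n - mu) / sigma, approximately N(0, 1) by the CLT
    xbar = sum(draw() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

reps, n = 20_000, 50
sigma = math.sqrt(1 / 12)  # Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
zs = [standardized_mean(n, random.random, 0.5, sigma) for _ in range(reps)]
inside = sum(abs(z) <= 1 for z in zs) / reps
print(inside)  # near P(|Z| <= 1) ≈ 0.683
```

Even at n = 50 the uniform case is very close to normal; heavier-tailed distributions in L² also obey the CLT but need larger n for the approximation to look good.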


The CLT extends to the multivariate case:

Theorem

Let (Xn)n≥1 be an i.i.d. sequence of Rd-valued random variables with mean vector µ = (E[X1], . . . , E[Xd])′ and covariance matrix ΣX = (σ_{i,j})1≤i,j≤d. Define X̄n = (1/n) ∑_{i=1}^n Xi. Then,

√n (X̄n − µ) →d N(0_{d×1}, ΣX).

Sidenote: There is no requirement for ΣX to be invertible. Thus, the limiting normal random variable may fail to have a density, because a multivariate normal distribution admits a density only if ΣX is invertible.


Example

Let (Xi)i≥1 be an i.i.d. sequence of random variables with P(Xi = 1) = p and P(Xi = 0) = 1 − p. Define Sn = ∑_{i=1}^n Xi. Then Sn ~ Bin(n, p). We have E[Xi] = p and Var(Xi) = p(1 − p). By the central limit theorem,

(Sn − np) / √(np(1 − p)) →d N(0, 1),

and

Sn ~a N(np, np(1 − p)).
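A simulation check of this normal approximation (the values n = 400 and p = 0.3 are arbitrary illustration choices): the fraction of Bin(n, p) draws within one asymptotic standard deviation of np should be close to P(|Z| ≤ 1) ≈ 0.68, with discreteness nudging it up slightly.

```python
import math
import random

random.seed(7)
n, p = 400, 0.3
mu = n * p                        # 120
sd = math.sqrt(n * p * (1 - p))   # about 9.17

reps = 10_000
hits = 0
for _ in range(reps):
    s = sum(random.random() < p for _ in range(n))  # one Bin(n, p) draw
    hits += abs(s - mu) <= sd
frac = hits / reps
print(frac)  # roughly 0.68 to 0.70
```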

Delta Method


Theorem (Delta method)

Let (Xn)n≥1 be a sequence of Rd-valued random variables such that for some γ > 0,

n^γ (Xn − ψ) →d X.

Further, let f : Rd → Rr be a C¹ function with J(ψ) denoting the Jacobian matrix of f evaluated at ψ. Then,

n^γ (f(Xn) − f(ψ)) →d J(ψ) X.

Corollary

Let (Xn)n≥1 be a sequence of Rd-valued random variables such that √n (Xn − ψ) →d N(0, Ω). Then,

√n (f(Xn) − f(ψ)) →d N(0, J(ψ) Ω J(ψ)′).


The Delta method can be proven by starting either with the mean value theorem or with a Taylor series expansion.

Theorem (Mean Value Theorem)

Let h : Rd → Rk be a C¹ function. Then h(x) = h(x0) + J(x̄)(x − x0), where x̄ is between x and x0.

Proof.

By the MVT there exists a Yn ∈ Rd between Xn and ψ (elementwise) such that

f(Xn) = f(ψ) + J(Yn)(Xn − ψ).

Rearranging and multiplying both sides by n^γ yields

n^γ (f(Xn) − f(ψ)) = J(Yn) n^γ (Xn − ψ).


Proof.

The fact that n^γ (Xn − ψ) →d X implies that Xn →p ψ, and since Yn is between Xn and ψ, i.e. ||Yn − ψ|| ≤ ||Xn − ψ||, it follows that

Yn →p ψ.

f has continuous first derivatives, thus by the continuous mapping theorem

J(Yn) →p J(ψ).

Finally, applying Slutsky’s theorem yields the desired result:

n^γ (f(Xn) − f(ψ)) = J(Yn) n^γ (Xn − ψ) →d J(ψ) X.


Example (Asymptotic distribution of (X̄n)²)

Let (Xi)i≥1 be a sequence of i.i.d. random variables with µ = E[X1] ≠ 0 and σ² = Var(X1) < ∞. The CLT states that

√n (X̄n − µ) →d N(0, σ²).

The Delta method with f(x) = x², so that J(µ) = 2µ, yields

√n ((X̄n)² − µ²) →d J(µ) Z, Z ~ N(0, σ²).

Finally, the asymptotic distribution is given by

√n ((X̄n)² − µ²) →d N(0, 4µ²σ²).

Thus,

(X̄n)² ~a N(µ², 4µ²σ²/n).
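This limit can be verified by simulation; the values µ = 2, σ = 1.5 and n = 400 below are arbitrary illustration choices. The empirical standard deviation of the delta-method statistic √n ((X̄n)² − µ²) should be close to √(4µ²σ²) = 2|µ|σ = 6.

```python
import math
import random
import statistics

random.seed(3)
mu, sigma, n = 2.0, 1.5, 400
reps = 5_000

ts = []
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    ts.append(math.sqrt(n) * (xbar ** 2 - mu ** 2))  # delta-method statistic

sd_emp = statistics.stdev(ts)
sd_theory = 2 * abs(mu) * sigma  # sqrt(4 mu^2 sigma^2) = 6.0 here
print(sd_emp, sd_theory)
```

Note the requirement µ ≠ 0: at µ = 0 the Jacobian J(µ) = 2µ vanishes, the limiting variance is zero, and the first-order Delta method gives a degenerate limit.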


Exercise

Assume Xi ~ tν for all i ∈ {1, . . . , n} and ν > 4. The MoM estimator of ν from an i.i.d. sample of size n is given by

ν̂MoM = 2 · ((1/n) ∑_{i=1}^n X_i²) / ((1/n) ∑_{i=1}^n X_i² − 1).

Find a standard error for ν̂MoM.

Exercise

Let (Xi)i≥1 be an i.i.d. sequence of real-valued random variables in L² with µ = E[X1] and σ² = Var(X1). Define s² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)². Show that

√n (X̄n − µ) / s →d N(0, 1).

Exercise

Prove the corollary to the Delta method.


Thank you for your attention!
