INTRODUCTION TO PROBABILITY
ARIEL YADIN
Course: 201.1.8001 Fall 2013-2014
Lecture notes updated: December 24, 2014
Contents
Lectures 1–24.
Lecture 1
1.1. Example: Bertrand's Paradox
We begin with an example [this is known as Bertrand's paradox, after Joseph Louis François Bertrand (1822–1900)].
Question 1.1. Consider a circle of radius 1, and an equilateral triangle inscribed in the circle, say ABC. (The side length of such a triangle is √3.) Let M be a randomly chosen chord in the circle.
What is the probability that the length of M is greater than the length of a side of the triangle (i.e. √3)?
Solution 1. How to choose a random chord M?
One way is to choose a random angle, let r be the radius at that angle, and let M be the unique chord perpendicular to r. Let x be the intersection point of M with r. (See Figure 1, left.)
Because of symmetry, we can rotate the triangle so that the chord AB is perpendicular to r. Since the side AB intersects the radius r at distance 1/2 from the center 0, M is longer than AB if and only if x is at distance at most 1/2 from 0.
r has length 1, so the probability that x is at distance at most 1/2 is 1/2. □
Solution 2. Another way: Choose two random points x, y on the circle, and let M be
the chord connecting them. (See Figure 1, right.)
Because of symmetry, we can rotate the triangle so that x coincides with the vertex
A of the triangle. So we can see that y falls in the arc BC on the circle if and only if
M is longer than a side of the triangle.
Figure 1. How to choose a random chord.
The probability of this is the length of the arc BC over 2π. That is, 1/3 (the arc BC
is one-third of the circle). □
Solution 3. A different way to choose a random chord M : Choose a random point x in
the circle, and let r be the radius through x. Then choose M to be the chord through
x perpendicular to r. (See Figure 1, left.)
Again we can rotate the triangle so that r is perpendicular to the chord AB.
Then, M will be longer than AB if and only if x lands inside the triangle; that is if
and only if the distance of x to 0 is at most 1/2.
Since the area of a disc of radius 1/2 is 1/4 of the disc of radius 1, this happens with
probability 1/4. □
How did we reach this paradox?
The original question was ill-posed: we did not define precisely what a random chord is.
The different solutions come from different ways of choosing a random chord; these are not the same.
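Although a simulation is not part of these notes, the three chord-sampling schemes above are easy to compare numerically. The following Python sketch (an illustration; the function names are ours) estimates the probability that a random chord is longer than √3 under each scheme; the three estimates approach 1/2, 1/3 and 1/4 respectively.

```python
import math
import random

def chord_length_angle():
    # Solution 1: pick the chord's distance from the center uniformly in [0, 1].
    x = random.random()
    return 2 * math.sqrt(1 - x * x)

def chord_length_endpoints():
    # Solution 2: pick two uniform points on the circle and connect them.
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_length_midpoint():
    # Solution 3: pick a uniform point in the disc as the chord's midpoint.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        d2 = x * x + y * y
        if d2 <= 1:
            return 2 * math.sqrt(1 - d2)

def estimate(sampler, n=100_000):
    side = math.sqrt(3)  # side length of the inscribed equilateral triangle
    return sum(sampler() > side for _ in range(n)) / n

random.seed(0)
for sampler in (chord_length_angle, chord_length_endpoints, chord_length_midpoint):
    print(sampler.__name__, round(estimate(sampler), 2))
```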
We now turn to precisely defining what we mean by “random”, “probability”, etc.
1.2. Sample spaces
When we do some experiment, we first want to collect all possible outcomes of the
experiment. In mathematics, a collection of objects is just a set.
The set of all possible outcomes of an experiment is called a sample space. Usually
we denote the sample space by Ω, and its elements by ω.
Example 1.2.
• A coin toss. Ω = {H, T}. Actually, it is the set of heads and tails, but H, T are easier to write.
• Tossing a coin twice. Ω = {H, T}^2. What about tossing a coin k times?
• Throwing two dice. Ω is the set of pairs of die faces; it is probably easier to use {1, 2, 3, 4, 5, 6}^2.
• The lifetime of a person. Ω = R_+. What if we only count the years? What if we want years and months?
• Bows and arrows: shooting an arrow into a target of radius 1/2 meters. Ω = {(x, y) : x^2 + y^2 ≤ 1/4}. What about Ω = {(r, θ) : r ∈ [0, 1/2], θ ∈ [0, 2π)}? What if the arrow tip has a thickness of 1 cm, so we do not care about the point up to radius 1 cm? What about missing the target altogether?
• A random real-valued continuously differentiable function on [0, 1]. Ω = C^1([0, 1]).
• Noga tosses a coin. If it comes out heads she goes to the probability lecture, and either takes notes or sleeps throughout. If the coin comes out tails, she goes running, and she records the distance she ran in meters and the time it took. Ω = ({H} × {notes, sleep}) ∪ ({T} × N × R_+). ♦
1.3. Events
Suppose we toss a die. So the sample space is, say, Ω = {1, 2, 3, 4, 5, 6}. We want to encode the outcome "the number on the die is even".
This can be done by collecting together all the possible outcomes relevant to this event; that is, a sub-collection of outcomes, or a subset of Ω. The event "the number on the die is even" corresponds to the subset {2, 4, 6} ⊂ Ω.
What do we want to require from events? We want the following properties:
• "Everything" is an event; that is, Ω is an event.
• If we can ask whether an event occurred, then we can also ask whether it did not occur; that is, if A is an event, then so is A^c.
• If we can ask whether each of many events has occurred, then we can also ask whether at least one of them has occurred; that is, if (A_n)_{n∈N} is a sequence of events, then so is ⋃_n A_n.
A word on notation: A^c = Ω \ A. One needs to be careful in which space we are taking the complement. Some books write Ā for A^c.
Thus, if we want to say what events are, we have: if F is the collection of events on a sample space Ω, then F has the following properties:
• The elements of F are subsets of Ω (i.e. F ⊂ P(Ω)).
• Ω ∈ F.
• If A ∈ F then A^c ∈ F.
• If (A_n)_{n∈N} is a sequence of elements of F, then ⋃_n A_n ∈ F.
Definition 1.3. F with the properties above is called a σ-algebra (or σ-field).
[Explain the name: algebra, σ-algebra.]
Example 1.4. Let Ω be any set (sample space). Then,
• F = {∅, Ω}
• G = 2^Ω = P(Ω)
are both σ-algebras on Ω. ♦
When Ω is a countable sample space, we will always take the full σ-algebra 2^Ω. (We will worry about the uncountable case in the future.)
To sum up: For a countable sample space Ω, any subset A ⊂ Ω is an event.
Example 1.5.
• The experiment is tossing three coins. The event A is "second toss was heads". Ω = {H, T}^3, A = {(x, H, y) : x, y ∈ {H, T}}.
• The experiment is "how many minutes passed until a goal is scored in the Manchester derby". The event A is "the first goal was scored after minute 23 and before minute 45". Ω = {0, 1, . . . , 90} (maybe extra time?), A = {23, 24, . . . , 44}. ♦
Example 1.6. We are given a computer code of 4 letters. A_i is the event that the i-th letter is 'L'.
What are the following events?
• A_1 ∩ A_2 ∩ A_3 ∩ A_4
• A_1 ∪ A_2 ∪ A_3 ∪ A_4
• A_1^c ∩ A_2^c ∩ A_3^c ∩ A_4^c
• A_1^c ∪ A_2^c ∪ A_3^c ∪ A_4^c
• A_3 ∩ A_4
• A_1 ∩ A_2^c
What are the following events in terms of the A_i's?
• There are at least 3 L's
• There is exactly one L
• There are no two L's in a row
♦
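Such event algebra can be made concrete by brute-force enumeration. The sketch below (ours, for illustration, with a reduced two-letter alphabet {'L', 'X'}) builds each A_i as a set of outcomes and expresses "at least 3 L's" and "exactly one L" through intersections, unions and complements of the A_i.

```python
from itertools import product, combinations

alphabet = "LX"  # reduced alphabet for illustration; any alphabet works the same way
omega = set(product(alphabet, repeat=4))               # sample space: 4-letter codes
A = {i: {w for w in omega if w[i] == "L"} for i in range(4)}

def complement(event):
    # Complement is always taken inside the sample space omega.
    return omega - event

assert complement(A[0]) == {w for w in omega if w[0] != "L"}

# "At least 3 L's": union, over all 3-element index sets, of triple intersections.
at_least_3 = set().union(*(A[i] & A[j] & A[k]
                           for i, j, k in combinations(range(4), 3)))
assert at_least_3 == {w for w in omega if w.count("L") >= 3}

# "Exactly one L": A_i minus the other three events, unioned over i.
exactly_one = set().union(
    *(A[i] - (A[(i + 1) % 4] | A[(i + 2) % 4] | A[(i + 3) % 4]) for i in range(4)))
assert exactly_one == {w for w in omega if w.count("L") == 1}
```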
Example 1.7. Every day products on a production line are chosen and labeled ‘-’ if
damaged or ‘+’ if good. What are the complements of the following events?
• A = at least two products are damaged
• B = at most 3 products are damaged
• C = there are at least 3 more good products than damaged products
• D = there are no damaged products
• E = most of the products are damaged
To which of the above events does the string − − − + + − + + − − belong?
What are the events A ∩ B, A ∪ B, C ∩ E, B ∩ D, B ∪ D? ♦
Lecture 2
2.1. Probability
Remark 2.1. Two sets A, B are called disjoint if A ∩ B = ∅. For a sequence (A_n)_{n∈N}, we say that (A_n) are mutually disjoint if any two are disjoint; i.e. for any n ≠ m, A_n ∩ A_m = ∅.
What is probability? It is the assignment of a value to each event, between 0 and
1, saying how likely this event is. That is, if F is the collection of all events on some
sample space, then probability is a function from the events to [0, 1]; i.e. P : F → [0, 1].
Of course, not every such assignment is good, and we would like some consistency.
What would we like of P? For one thing, we would like that the likelihood of everything is 100%; that is, P(Ω) = 1. Also, we would like that if A, B ∈ F are two events that are disjoint, i.e. A ∩ B = ∅, then the likelihood that one of A, B occurs is the sum of their likelihoods; i.e. P(A ∪ B) = P(A) + P(B).
This leads to the following definition:
[Andrey Kolmogorov (1903–1987)]
Definition 2.2 (Probability measure). Let Ω be a sample space, and F a σ-algebra on
Ω. A probability measure on (Ω,F) is a function P : F → [0, 1] such that
• P(Ω) = 1.
• If (A_n) is a sequence of mutually disjoint events in F, then
P(⋃_n A_n) = Σ_n P(A_n).
(Ω,F ,P) is called a probability space.
Example 2.3.
• A fair coin toss: Ω = {H, T}, F = 2^Ω = {∅, {H}, {T}, Ω}. P(∅) = 0, P({H}) = P({T}) = 1/2, P(Ω) = 1.
• Unfair coin toss: P(∅) = 0, P({H}) = 3/4, P({T}) = 1/4, P(Ω) = 1.
• Fair die: Ω = {1, 2, . . . , 6}, F = 2^Ω. For A ⊂ Ω let P(A) = |A|/6.
• Let Ω be any countable set. Let F = 2^Ω. Let p : Ω → [0, 1] be a function such that Σ_{ω∈Ω} p(ω) = 1. Then, for any A ⊂ Ω define
P(A) = Σ_{ω∈A} p(ω).
P is a probability measure on (Ω, F).
• Let γ = Σ_{n≥1} n^{-2}. For A ⊂ {1, 2, . . .}, define
P(A) = (1/γ) · Σ_{a∈A} a^{-2}.
P is a probability measure.
♦
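The last item can be checked numerically. The sketch below (ours, for illustration) truncates the normalizing constant γ = Σ_{n≥1} n^{-2} = π²/6 and evaluates P on a few events.

```python
import math

# Truncated normalizing constant gamma = sum_{n>=1} n^{-2} = pi^2 / 6.
N = 10**6
gamma = sum(n ** -2 for n in range(1, N + 1))

def prob(A):
    """P(A) = (1/gamma) * sum_{a in A} a^{-2}, for a finite event A."""
    return sum(a ** -2 for a in A) / gamma

print(abs(gamma - math.pi ** 2 / 6) < 1e-5)  # truncation is close to pi^2/6
print(round(prob({1}), 3))                   # P({1}) = 6/pi^2 ≈ 0.608
```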
Some properties of probability measures:
Proposition 2.4. Let (Ω, F, P) be a probability space. Then,
(1) ∅ ∈ F and P(∅) = 0.
(2) If A_1, . . . , A_n are a finite number of mutually disjoint events, then
A := ⋃_{j=1}^n A_j ∈ F and P(A) = Σ_{j=1}^n P(A_j).
(3) For any event A ∈ F, P(A^c) = 1 − P(A).
(4) If A ⊂ B are events, then P(B \ A) = P(B) − P(A).
(5) If A ⊂ B are events, then P(A) ≤ P(B).
(6) (Inclusion-exclusion principle.) For events A, B ∈ F,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Let A, B, (A_n)_{n∈N} be events in F.
(1) ∅ = Ω^c ∈ F. Set A_j = ∅ for all j ∈ N. This is a sequence of mutually disjoint events. Since ∅ = ⋃_j A_j, we have that P(∅) = Σ_j P(∅), implying that P(∅) = 0.
(2) For j > n let A_j = ∅. Then (A_j)_{j∈N} is a sequence of mutually disjoint events, so
P(⋃_{j=1}^n A_j) = P(⋃_j A_j) = Σ_j P(A_j) = Σ_{j=1}^n P(A_j).
(3) A ∪ A^c = Ω and A ∩ A^c = ∅. So the additivity of P on disjoint sets gives that P(A) + P(A^c) = P(Ω) = 1.
(4) If A ⊂ B, then B = (B \ A) ∪ A, and (B \ A) ∩ A = ∅. Thus, P(B) = P(B \ A) + P(A).
(5) Since A ⊂ B, we have that P(B) − P(A) = P(B \ A) ≥ 0.
(6) Let C = A ∩ B. Note that A ∪ B = (A \ C) ∪ (B \ C) ∪ C, where all sets in the union are mutually disjoint. Since C ⊂ A and C ⊂ B, we get
P(A ∪ B) = P(A \ C) + P(B \ C) + P(C) = P(A) − P(C) + P(B) − P(C) + P(C) = P(A) + P(B) − P(C).
□
[George Boole (1815–1864)]
Proposition 2.5 (Boole's inequality). Let (Ω, F, P) be a probability space. If (A_n)_{n∈N} are events, then
P(⋃_{n∈N} A_n) ≤ Σ_{n∈N} P(A_n).
Proof. Let B_0 = ∅, and for every n ≥ 1 let
B_n = ⋃_{j=1}^n A_j and C_n = A_n \ B_{n−1}.
Claim 1. For any n > m ≥ 1, we have that C_n ∩ C_m = ∅.
Proof. Since m ≤ n − 1, we have that A_m ⊂ B_{n−1}. Since C_m ⊂ A_m,
C_n ∩ C_m ⊂ (A_n \ B_{n−1}) ∩ B_{n−1} = ∅.
Claim 2. ⋃_n A_n ⊂ ⋃_n C_n.
Proof. Let ω ∈ ⋃_n A_n. So there exists n such that ω ∈ A_n. Let m be the smallest integer such that ω ∈ A_m. Then ω ∈ A_m, and ω ∉ A_k for all 1 ≤ k < m. In other words: ω ∈ A_m and ω ∉ ⋃_{k=1}^{m−1} A_k = B_{m−1}. Or: ω ∈ A_m \ B_{m−1} = C_m. Thus, there exists some m ≥ 1 such that ω ∈ C_m; i.e. ω ∈ ⋃_n C_n.
We conclude that (C_n)_{n≥1} is a collection of events that are pairwise disjoint, and ⋃_n A_n ⊂ ⋃_n C_n. Using (5) from Proposition 2.4, and C_n ⊂ A_n,
P(⋃_n A_n) ≤ P(⋃_n C_n) = Σ_n P(C_n) ≤ Σ_n P(A_n).
□
Example 2.6 (Monty Hall paradox). In a famous game show, the contestant is given three boxes; two of them contain nothing, but one contains the keys to a new car. All placements of the keys are equally likely. The contestant chooses a box. Then the host reveals one of the other two boxes that does not contain anything, and the contestant is given a choice whether to switch her choice or not.
What should she do?
Let A = {an empty box is chosen in the first choice} and B = {the keys are in the other (unopened) box, so she should switch}. One checks that
B ∩ A^c = ∅ and A ∩ B^c = ∅,
so A = B. Since all key placements are equally likely, the probability the keys are in the chosen box is 1/3. That is, P(A^c) = 1/3, and so P(B) = P(A) = 2/3. ♦
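A quick Monte Carlo check (ours, not part of the original notes) confirms that switching wins with probability about 2/3, while staying wins with probability about 1/3.

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the win probability under a fixed switch/stay strategy."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)       # box with the keys
        choice = random.randrange(3)    # contestant's first choice
        # Host opens an empty box that is not the contestant's choice.
        opened = next(b for b in range(3) if b != choice and b != car)
        if switch:
            choice = next(b for b in range(3) if b != choice and b != opened)
        wins += (choice == car)
    return wins / trials

random.seed(0)
print(round(monty_hall(switch=True), 2), round(monty_hall(switch=False), 2))
```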
2.2. Discrete Probability Spaces
Discrete probability spaces are those for which the sample space is countable. We have already seen that in this case we can take all subsets to be events, so F = 2^Ω. We have also implicitly seen that due to additivity on disjoint sets, the probability measure P is completely determined by its values on singletons.
That is, let (Ω, 2^Ω, P) be a probability space where Ω is countable. If we denote p(ω) = P({ω}) for all ω ∈ Ω, then for any event A ⊂ Ω we have that A = ⋃_{ω∈A} {ω}, and this is a countable disjoint union. Thus,
P(A) = Σ_{ω∈A} P({ω}) = Σ_{ω∈A} p(ω).
Exercise 2.7. Let Ω = {1, 2, . . .}, and define a probability measure on (Ω, 2^Ω) by
(1) P({ω}) = 1/((e − 1) ω!).
(2) P′({ω}) = (1/3) · (3/4)^ω.
(Extend using additivity on disjoint unions.) Show that both uniquely define probability measures.
Solution. We need to show that P(Ω) = 1, and that P is additive on disjoint unions. Indeed, since Σ_{ω=1}^∞ 1/ω! = e − 1,
P(Ω) = Σ_{ω=1}^∞ P({ω}) = Σ_{ω=1}^∞ 1/((e − 1) ω!) = 1.
Similarly, summing the geometric series,
P′(Ω) = (1/3) · Σ_{ω=1}^∞ (3/4)^ω = (1/3) · (3/4)/(1 − 3/4) = 1.
Now let (A_n) be a sequence of mutually disjoint events and let A = ⋃_n A_n. Using the fact that any subset is the disjoint union of the singletons composing it, we have that ω ∈ A if and only if there exists a unique n(ω) such that ω ∈ A_{n(ω)}. Thus,
P(A) = Σ_{ω∈A} P({ω}) = Σ_n Σ_{ω∈A_n} P({ω}) = Σ_n P(A_n).
The same holds for P′. □
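Both normalizations can be checked numerically by truncating the infinite sums (an illustrative check, not part of the original notes):

```python
import math

# Truncated sums of the two densities on {1, 2, ...}; both should be (close to) 1.
p_total = sum(1 / ((math.e - 1) * math.factorial(w)) for w in range(1, 40))
q_total = sum((1 / 3) * (3 / 4) ** w for w in range(1, 400))
print(abs(p_total - 1) < 1e-9, abs(q_total - 1) < 1e-9)
```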
The next proposition generalizes the above examples, and characterizes all discrete
probability spaces.
Proposition 2.8. Let Ω be a countable set. Let p : Ω → [0, 1] be such that Σ_{ω∈Ω} p(ω) = 1. Then, there exists a unique probability measure P on (Ω, 2^Ω) such that P({ω}) = p(ω) for all ω ∈ Ω. (p as above is sometimes called the density of P.)
Proof. Let A ⊂ Ω. Define
P(A) = Σ_{ω∈A} p(ω).
We have by the assumption on p that P(Ω) = Σ_{ω∈Ω} p(ω) = 1. Also, if (A_n) is a sequence of events that are mutually disjoint, and A = ⋃_n A_n is their union, then, for any ω ∈ A there exists n such that ω ∈ A_n. Moreover, this n is unique, since A_n ∩ A_m = ∅ for all m ≠ n, so ω ∉ A_m for all m ≠ n. So
P(A) = Σ_{ω∈A} p(ω) = Σ_n Σ_{ω∈A_n} p(ω) = Σ_n P(A_n).
This shows that P is a probability measure on (Ω, 2^Ω).
For uniqueness, let Q : 2^Ω → [0, 1] be a probability measure such that Q({ω}) = p(ω) for all ω ∈ Ω. Then, for any A ⊂ Ω,
Q(A) = Σ_{ω∈A} Q({ω}) = Σ_{ω∈A} p(ω) = P(A).
□
Example 2.9.
(1) A simple but important example is that of finite sample spaces with equally likely outcomes. Let Ω be a finite set, and suppose that all outcomes are equally likely; that is, P({ω}) = 1/|Ω| for all ω ∈ Ω. Then, for any event A ⊂ Ω we have that P(A) = |A|/|Ω|. It is simple to verify that this is a probability measure. It is known as the uniform measure on Ω.
(2) We throw two fair dice. What is the probability the sum is 7? What is the probability the sum is 6?
Solution. The sample space here is Ω = {1, 2, . . . , 6}^2 = {(i, j) : 1 ≤ i, j ≤ 6}. Since we assume the dice are fair, all outcomes are equally likely, and so the probability measure is the uniform measure on Ω.
Now, the event that the sum is 7 is
A = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}.
So P(A) = |A|/|Ω| = 6/36 = 1/6.
The event that the sum is 6 is
B = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
So P(B) = 5/36.
(3) There are 15 balls in a jar, 7 black balls and 8 white balls. When removing a ball from the jar, any ball is equally likely to be removed. Shir puts her hand in the jar and takes out two balls, one after the other. What is the probability that Shir removes one black ball and one white ball?
Solution. First, we can think of the different balls as being represented by the numbers 1, 2, . . . , 15, say with 1, . . . , 7 black and 8, . . . , 15 white. So the sample space is Ω = {(i, j) : 1 ≤ i, j ≤ 15, i ≠ j}. Note that the size of Ω is |Ω| = 15 · 14. Since it is equally likely to remove any ball, the probability measure is the uniform measure on Ω.
Let A be the event that one ball is black and one is white. How can we compute P(A) = |A|/|Ω|? We can split A into two disjoint events, and use additivity of probability measures:
Let B be the event that the first ball is white and the second ball is black, and let B′ be the event that the first ball is black and the second ball is white. It is immediate that B and B′ are disjoint. Also, A = B ∪ B′. Thus, P(A) = P(B) + P(B′), and we only need to compute the sizes of B, B′.
Note that the number of pairs (i, j) such that ball i is black and ball j is white is 7 · 8. So |B′| = 7 · 8. Similarly, |B| = 8 · 7. All in all,
P(A) = P(B) + P(B′) = (7 · 8)/(15 · 14) + (8 · 7)/(15 · 14) = 8/15.
(4) A deck of 52 cards is shuffled, so that any order is equally likely. The top 10 cards are distributed among 2 players, 5 to each one. What is the probability that at least one player has a royal flush (10-J-Q-K-A of the same suit)?
Solution. Each player receives a subset of the cards of size 5, so the sample space is
Ω = {(S_1, S_2) : S_1, S_2 ⊂ C, |S_1| = |S_2| = 5, S_1 ∩ S_2 = ∅},
where C is the set of all 52 cards. Writing C(n, k) = n!/(k!(n − k)!) for the binomial coefficient, there are C(52, 5) ways to choose S_1 and C(47, 5) ways to choose S_2, so |Ω| = C(52, 5) · C(47, 5). Also, every choice is equally likely, so P is the uniform measure.
Let A_i be the event that player i has a royal flush (i = 1, 2). For s ∈ {♠, ♦, ♥, ♣}, let B(i, s) be the event that player i has a royal flush of the suit s. So A_i = ⊎_s B(i, s). B(i, s) is the event that S_i is one specific set of 5 cards, so |B(i, s)| = C(47, 5) for any choice of i, s. Thus,
P(A_i) = Σ_s |B(i, s)|/|Ω| = 4/C(52, 5).
Now we use the inclusion-exclusion principle:
P(A) = P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2).
The event A_1 ∩ A_2 is the event that both players have a royal flush, so |A_1 ∩ A_2| = 4 · 3, since there are 4 options for the first player's suit, and then 3 for the second. Altogether,
P(A) = 4/C(52, 5) + 4/C(52, 5) − (4 · 3)/(C(52, 5) · C(47, 5)) = (8/C(52, 5)) · (1 − 3/(2 · C(47, 5))).
♦
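The counting arguments in items (2)–(4) can be verified by direct computation (an illustrative check, ours, not part of the original notes):

```python
from itertools import product
from math import comb

# (2) Two fair dice: count outcomes with a given sum.
omega = list(product(range(1, 7), repeat=2))
assert sum(i + j == 7 for i, j in omega) / len(omega) == 6 / 36
assert sum(i + j == 6 for i, j in omega) / len(omega) == 5 / 36

# (3) 7 black and 8 white balls, two removed in order: one of each color.
pairs = [(i, j) for i in range(1, 16) for j in range(1, 16) if i != j]
black = set(range(1, 8))                      # balls 1..7 are black
mixed = sum((i in black) != (j in black) for i, j in pairs)
assert mixed / len(pairs) == 8 / 15

# (4) Probability that at least one of the two 5-card hands is a royal flush.
p = 8 / comb(52, 5) * (1 - 3 / (2 * comb(47, 5)))
print(p)  # ≈ 3.08e-06
```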
Exercise 2.10. Prove that there is no uniform probability measure on N; that is, there is no probability measure on N such that P({i}) = P({j}) for all i, j.
Solution. Assume such a probability measure exists, and let p = P({j}) be the common value. By the defining properties of probability measures,
1 = P(N) = Σ_{j=0}^∞ P({j}) = Σ_{j=0}^∞ p,
which is ∞ if p > 0, and 0 if p = 0. A contradiction. □
Exercise 2.11. A deck of 52 cards is ordered randomly, all orders equally likely. What is the probability that the 14th card is an ace? What is the probability that the first ace is at the 14th card?
Solution. The sample space is the set of all possible orderings of the set C of 52 cards. So Ω is the set of all 1-1 functions f : {1, . . . , 52} → C, where f(1) is the first card, f(2) the second, and so on. So |Ω| = 52!. The measure is the uniform one.
Let A be the event that the 14th card is an ace, and let B be the event that the first ace is at the 14th card. Denote by D ⊂ C the set of the four aces.
A is the event that f(14) is an ace: A = {f ∈ Ω : f(14) ∈ D}. There are 4 choices for the ace at position 14, and 51! orderings of the remaining cards, so |A| = 4 · 51!. Thus, P(A) = 4/52 = 1/13.
B is the event that f(14) is an ace, but also f(j) is not an ace for all j < 14: B = {f ∈ Ω : f(14) ∈ D, ∀ j < 14, f(j) ∉ D}. The first 13 positions can be filled with non-aces in 48 · 47 · · · 36 ways, the 14th position with an ace in 4 ways, and the remaining 38 positions in 38! ways. So |B| = 48 · 47 · · · 36 · 4 · 38! = 4 · 48! · 38 · 37 · 36, and
P(B) = (4 · 38 · 37 · 36)/(52 · 51 · 50 · 49) ≈ 0.0312. □
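Both answers can be recomputed directly (an illustrative check, not part of the original notes):

```python
from math import factorial

# P(A): the 14th card is an ace.
p_a = 4 * factorial(51) / factorial(52)
assert p_a == 4 / 52 == 1 / 13

# P(B): the first ace is exactly the 14th card.
p_b = (4 * 38 * 37 * 36) / (52 * 51 * 50 * 49)
print(round(p_b, 4))  # → 0.0312
```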
Lecture 3
3.1. Some set theory
Recall the inclusion-exclusion principle:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
This can be demonstrated by the Venn diagram in Figure 2.
[John Venn (1834–1923)]
Figure 2. The inclusion-exclusion principle.
This diagram also illustrates the intuitive meaning of A ∪ B; namely, A ∪ B is the
event that one of A or B occurs. Similarly, A∩B is the event that both A and B occur.
Let (A_n)_{n=1}^∞ be a sequence of events in some probability space. We have
⋃_{n≥k} A_n = {ω ∈ Ω : ω ∈ A_n for at least one n ≥ k} = {ω ∈ Ω : there exists n ≥ k such that ω ∈ A_n}.
So ⋂_{k=1}^∞ ⋃_{n≥k} A_n is the set of ω ∈ Ω such that for any k, there exists n ≥ k such that ω ∈ A_n; i.e.
⋂_{k=1}^∞ ⋃_{n≥k} A_n = {ω ∈ Ω : ω ∈ A_n for infinitely many n}.
Definition 3.1. Define the set lim sup A_n as
lim sup A_n := ⋂_{k=1}^∞ ⋃_{n≥k} A_n.
The intuitive meaning of lim sup_n A_n is that infinitely many of the A_n occur.
Similarly, we have that
⋃_{k=1}^∞ ⋂_{n≥k} A_n = {ω ∈ Ω : there exists n_0 such that for all n ≥ n_0, ω ∈ A_n} = {ω ∈ Ω : ω ∈ A_n for all large enough n}.
Definition 3.2. Define
lim inf A_n := ⋃_{k=1}^∞ ⋂_{n≥k} A_n.
lim inf An means that An will occur from some large enough n onward; that is, even-
tually all An occur.
It is easy to see that
lim inf A_n ⊆ lim sup A_n.
This also fits the intuition, as if all An occur eventually, then they occur infinitely many
times.
Definition 3.3. If (An)∞n=1 is a sequence of events such that lim inf An = lim supAn
(as sets) then we define
limAn := lim inf An = lim supAn.
Example 3.4. Consider Ω = N, and the sequence
A_n = {n^j : j = 0, 1, 2, . . .}.
If m < n and m ∈ A_n, then it must be that m = n^0 = 1. Thus, ⋂_{n≥k} A_n = {1} and
lim inf_n A_n = ⋃_{k=1}^∞ ⋂_{n≥k} A_n = {1}.
Also, if m < k ≤ n and m ∈ A_n, then again m = 1, so ⋃_{n≥k} A_n does not contain any 1 < m < k; i.e. ⋃_{n≥k} A_n ⊂ {1, k, k + 1, . . .}. Hence,
lim sup A_n = ⋂_{k=1}^∞ ⋃_{n≥k} A_n = {1}.
Thus, the limit exists and lim_n A_n = {1}. ♦
Definition 3.5. A sequence of events is called increasing if An ⊂ An+1 for all n, and
decreasing if An+1 ⊂ An for all n.
Proposition 3.6. Let (A_n) be a sequence of events. Then,
(lim inf A_n)^c = lim sup A_n^c.
Moreover, if (A_n) is increasing (resp. decreasing) then lim A_n = ⋃_n A_n (resp. lim A_n = ⋂_n A_n).
Proof. The first assertion is de Morgan.
For the second assertion, note that if (A_n) is increasing, then
⋃_{n≥k} A_n = ⋃_{n≥1} A_n and ⋂_{n≥k} A_n = A_k.
So
lim sup A_n = ⋂_k ⋃_{n≥k} A_n = ⋂_k ⋃_n A_n = ⋃_n A_n,
lim inf A_n = ⋃_k ⋂_{n≥k} A_n = ⋃_k A_k.
If (A_n) is decreasing then (A_n^c) is increasing. So
lim sup A_n = (lim inf A_n^c)^c = (⋃_n A_n^c)^c = (lim sup A_n^c)^c = lim inf A_n,
and lim A_n = (⋃_n A_n^c)^c = ⋂_n A_n. □
Exercise 3.7. Show that if F is a σ-algebra, and (An) is a sequence of events in F ,
then lim inf A_n ∈ F and lim sup A_n ∈ F.
3.2. Continuity of Probability Measures
The goal of this section is to prove
Theorem 3.8 (Continuity of Probability). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events such that lim A_n exists. Then,
P(lim A_n) = lim_{n→∞} P(A_n).
We start with a restricted version:
Proposition 3.9. Let (Ω, F, P) be a probability space. Let (A_n) be an increasing (resp. decreasing) sequence of events. Then,
P(lim A_n) = lim_{n→∞} P(A_n).
Proof. We start with the increasing case. Let A = ⋃_n A_n = lim A_n. Define A_0 = ∅, and B_k = A_k \ A_{k−1} for all k ≥ 1. So (B_k) are mutually disjoint and
⊎_{k=1}^n B_k = ⋃_{k=1}^n A_k = A_n,
so ⊎_n B_n = A. Thus,
P(lim A_n) = P(⊎_n B_n) = Σ_n P(B_n) = lim_n Σ_{k=1}^n P(B_k) = lim_n P(⊎_{k=1}^n B_k) = lim_n P(A_n).
The decreasing case follows from noting that if (A_n) is decreasing, then (A_n^c) is increasing, so
P(lim A_n) = P(⋂_n A_n) = 1 − P(⋃_n A_n^c) = 1 − lim_n P(A_n^c) = lim_n P(A_n).
□
[Pierre Fatou (1878–1929)]
Lemma 3.10 (Fatou's Lemma). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events (that may not have a limit). Then,
P(lim inf A_n) ≤ lim inf_n P(A_n) and lim sup_n P(A_n) ≤ P(lim sup A_n).
Proof. For all k let B_k = ⋂_{n≥k} A_n. So lim inf_n A_n = ⋃_k B_k. Note that (B_k) is an increasing sequence of events (B_k = B_{k+1} ∩ A_k). Also, for any n ≥ k, B_k ⊂ A_n, so P(B_k) ≤ inf_{n≥k} P(A_n). Thus,
P(lim inf A_n) = P(⋃_k B_k) = lim_k P(B_k) ≤ lim_k inf_{n≥k} P(A_n) = lim inf_n P(A_n).
For the lim sup,
lim sup_n P(A_n) = lim sup_n (1 − P(A_n^c)) = 1 − lim inf_n P(A_n^c) ≤ 1 − P(lim inf A_n^c) = 1 − P((lim sup A_n)^c) = P(lim sup A_n).
□
Fatou’s Lemma immediately proves Theorem 3.8.
Proof of Theorem 3.8. Just note that
lim sup_n P(A_n) ≤ P(lim sup A_n) = P(lim A_n) = P(lim inf A_n) ≤ lim inf_n P(A_n),
so equality holds, and the limit exists. □
As a consequence we get:
[Émile Borel (1871–1956), Francesco Cantelli (1875–1966)]
Lemma 3.11 (First Borel–Cantelli Lemma). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events. If Σ_n P(A_n) < ∞, then
P(A_n occurs for infinitely many n) = P(lim sup A_n) = 0.
Proof. Let B_k = ⋃_{n≥k} A_n. So (B_k) is decreasing, and so the decreasing sequence P(B_k) converges to P(lim sup A_n). Thus, for all k,
P(lim sup A_n) ≤ P(B_k) ≤ Σ_{n≥k} P(A_n).
Since the right hand side converges to 0 as k → ∞ by the assumption that the series is convergent, we get P(lim sup A_n) ≤ 0, hence P(lim sup A_n) = 0. □
Example 3.12. We have a bunch of bacteria in a petri dish. Every second, the bacteria give off offspring randomly, and then the parents die out. Suppose that for any n, the probability that there are no bacteria left by time n is 1 − exp(−f(n)).
What is the probability that the bacteria eventually die out if:
• f(n) = log n.
• f(n) = (n^2 − 7)/(2n^2 + 3n + 5).
Let A_n be the event that the bacteria die out by time n. So P(A_n) = 1 − e^{−f(n)} for n ≥ 1.
Note that the event that the bacteria eventually die out is the event that there exists n such that A_n occurs; i.e. the event ⋃_n A_n. Since (A_n)_n is an increasing sequence, we have that P(⋃_n A_n) = lim_n P(A_n).
In the first case this is 1. In the second case this is
lim_{n→∞} 1 − exp(−(n^2 − 7)/(2n^2 + 3n + 5)) = 1 − e^{−1/2}.
♦
Example 3.13. What if we take the previous example, and the information is that the probability that generation n has offspring is at most exp(−2 log n)?
Then, if A_n is the event that the n-th generation has offspring, we have that P(A_n) ≤ n^{−2}. Since Σ_n P(A_n) < ∞, Borel–Cantelli tells us that
P(A_n occurs for infinitely many n) = P(lim sup A_n) = 0.
That is,
P(∃ k : ∀ n ≥ k : A_n^c) = P(lim inf A_n^c) = 1.
So almost surely there exists k such that all generations n ≥ k do not have offspring, implying that the bacteria die out with probability 1. ♦
Lecture 4
4.1. Conditional Probabilities
We start with a simple observation.
Proposition 4.1. Let (Ω, F, P) be a probability space. Let F ∈ F be an event such that P(F) > 0. Define
F|_F = {A ∩ F : A ∈ F},
and define P_F : F|_F → [0, 1] by P_F(B) = P(B)/P(F) for all B ∈ F|_F, and Q : F → [0, 1] by Q(A) = P(A ∩ F)/P(F) for all A ∈ F. Then, F|_F is a σ-algebra on F, P_F is a probability measure on (F, F|_F), and Q is a probability measure on (Ω, F).
Notation. The probability measure Q above is usually denoted P(·|F).
Proof. The σ-algebra part just follows from F = F ∩ F and ⋃_n (F ∩ A_n) = F ∩ ⋃_n A_n.
P_F is a probability measure since P_F(F) = P(F)/P(F) = 1, and if (B_n) is a sequence of mutually disjoint events in F|_F, then
P_F(⋃_n B_n) = P(⋃_n B_n)/P(F) = (1/P(F)) Σ_n P(B_n) = Σ_n P_F(B_n).
Q is a probability measure since Q(Ω) = P(Ω ∩ F)/P(F) = 1, and if (A_n) is a sequence of mutually disjoint events in F, then
Q(⋃_n A_n) = (1/P(F)) Σ_n P(A_n ∩ F) = Σ_n Q(A_n). □
What is the intuitive meaning of the above measure Q? What we did is restrict all events to their intersection with F, and look at them only inside F, normalized to have total probability 1. This can be thought of as the probability of an event, given that we know that the outcome is in F.
Definition 4.2 (Conditional Probability). Let (Ω, F, P) be a probability space, and let F ∈ F be an event such that P(F) > 0.
For any event A ∈ F, the quantity P(A ∩ F)/P(F) is called the conditional probability of A given F.
The probability measure P(·|F) := P(· ∩ F)/P(F) is called the conditional probability given F.
!!! One must be careful with conditional probabilities: intuition here is often misleading.
Example 4.3. An urn contains 10 white balls, 5 yellow balls and 10 black balls. Shir takes a ball out of the urn, all balls equally likely.
• What is the probability that the ball removed is yellow?
• What is the probability the ball removed is yellow, given it is not black?
Solution. A = {the ball is yellow}. B = {the ball is not black}.
P(A) = 5/25 = 1/5. P(B) = 15/25 = 3/5. P(A ∩ B) = P(A) = 1/5. So P(A|B) = (1/5)/(3/5) = 1/3. ♦
Example 4.4. Noga has the heart of an artist and the mind of a mathematician. Given that she takes a course in probability, she will pass with probability 1/3. Given that she takes a course in contemporary art, she will pass with probability 1/2. She chooses which course to take using a fair coin toss: heads for probability and tails for art.
What is the probability that Noga takes probability and passes?
What is the probability she passes, whichever course she takes?
Solution. The sample space here is
Ω = {H, T} × {pass, fail}.
The events we are interested in are A = {Noga passes the course} = {H, T} × {pass}, and B = {Noga takes probability} = {H} × {pass, fail}, so B^c = {Noga takes art} = {T} × {pass, fail}. The event that Noga takes and passes probability is A ∩ B.
The information in the question tells us that P(A|B) = 1/3 and P(A|B^c) = 1/2. Thus,
P(A ∩ B) = P(A|B) P(B) = 1/6,
and
P(A) = P(A ∩ B) + P(A ∩ B^c) = 1/6 + 1/4 = 5/12. ♦
Example 4.5. An urn contains 8 black balls and 4 white balls. Two balls are removed one after the other, any ball being equally likely.
What is the probability that the second ball is black, given that the first is black?
What is the probability that the first is black, given that the second is black?
Solution. A = {first ball is black}, B = {second ball is black}. We know that P(A) = 8/12, and that P(A ∩ B) = (8 · 7)/(12 · 11). So (as is intuitive) P(B|A) = 7/11. However, maybe somewhat less intuitive is P(A|B). Note that P(B ∩ A^c) = (4 · 8)/(12 · 11). So
P(B) = P(B ∩ A) + P(B ∩ A^c) = (8 · (7 + 4))/(12 · 11) = 8/12.
Thus, P(A|B) = P(A ∩ B)/P(B) = 7/11. ♦
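The symmetry P(B|A) = P(A|B) = 7/11 can be checked by exact enumeration of the ordered draws (an illustrative check, not part of the original notes):

```python
from fractions import Fraction
from itertools import permutations

balls = ["B"] * 8 + ["W"] * 4                 # 8 black, 4 white
draws = list(permutations(range(12), 2))      # ordered draws of two distinct balls

A = {d for d in draws if balls[d[0]] == "B"}  # first ball black
B = {d for d in draws if balls[d[1]] == "B"}  # second ball black

def cond(E, F):
    # P(E|F) under the uniform measure on `draws`.
    return Fraction(len(E & F), len(F))

print(cond(B, A), cond(A, B))  # → 7/11 7/11
```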
Exercise 4.6. Prove that for all events A_1, . . . , A_n such that P(A_1 ∩ · · · ∩ A_n) > 0,
P(A_1 ∩ · · · ∩ A_n) = P(A_n|A_1 ∩ · · · ∩ A_{n−1}) · P(A_{n−1}|A_1 ∩ · · · ∩ A_{n−2}) · · · P(A_2|A_1) · P(A_1)
= P(A_1) · Π_{j=1}^{n−1} P(A_{j+1}|A_1 ∩ · · · ∩ A_j).
Solution. By induction. The base is simple. The induction step follows from
P(A_1 ∩ · · · ∩ A_{n+1}) = P(A_{n+1}|A_1 ∩ · · · ∩ A_n) · P(A_1 ∩ · · · ∩ A_n). □
4.2. Bayes' Rule
Let (Ω, F, P) be a probability space. Let (A_n)_n be a partition of Ω; that is, (A_n) are mutually disjoint and ⋃_n A_n = Ω. Assume further that P(A_n) > 0 for all n. Let A, B be events of positive probability. Then, since (B ∩ A_n)_n are mutually disjoint, by additivity we have
P(B) = Σ_n P(B ∩ A_n) = Σ_n P(B|A_n) P(A_n).
This is called the law of total probability.
Figure 3. The law of total probability. In this case Ω = A_1 ⊎ · · · ⊎ A_{12}, and
P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_5)P(A_5) + P(B|A_6)P(A_6) + P(B|A_7)P(A_7) + P(B|A_9)P(A_9) + P(B|A_{10})P(A_{10}).
[Thomas Bayes (1701–1761)]
Another observation is Bayes' rule:
P(B|A) = P(A|B) · P(B)/P(A).
Combining the two, we have that
P(A_n|B) = P(B|A_n) P(A_n) / Σ_m P(B|A_m) P(A_m).
Exercise 4.7. A new machine is invented that determines if a student will become a millionaire in the next ten years or not. Given that a student will become a millionaire in ten years, the machine succeeds in predicting this 95% of the time. Given that the student will not become a millionaire in the next ten years, the machine predicts (wrongly) that he will become a millionaire in 1% of the cases (this is known as a false positive). Only 0.5% of the students will become millionaires in the next ten years. Given that Zuckerberg has been predicted to become a millionaire by the machine, what is the probability that Zuckerberg will actually become one?
Solution. A = {Zuckerberg will become a millionaire in the next ten years}. B = {Zuckerberg is predicted to become a millionaire by the machine}. The information is:
P(A) = 0.005, P(B|A) = 0.95, P(B|A^c) = 0.01.
So,
P(A|B) = P(B|A) P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c)) = (0.95 · 0.005)/(0.95 · 0.005 + 0.01 · 0.995) = 95/(95 + 199) ≈ 0.323 < 1/3.
Not such a great machine after all... □
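The arithmetic can be recomputed directly from the law of total probability and Bayes' rule (an illustrative check, not part of the original notes):

```python
# Bayes' rule for the millionaire-machine example.
p_a = 0.005          # P(A): student becomes a millionaire
p_b_given_a = 0.95   # P(B|A): true positive rate
p_b_given_ac = 0.01  # P(B|A^c): false positive rate

p_b = p_b_given_a * p_a + p_b_given_ac * (1 - p_a)   # law of total probability
p_a_given_b = p_b_given_a * p_a / p_b                # Bayes' rule
print(round(p_a_given_b, 3))  # → 0.323
```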
George Polya
(1887–1985)
Exercise 4.8 (Polya’s Urn). An urn contains one black ball and one white ball. At
every time step, a ball is chosen randomly from the urn, and returned with another new
ball of the same color. (So at time t there are a total of t+ 2 balls.) Calculate
p(t, b) = P( there are b black balls at time t ).
Solution. First let's see how p(t, b) develops when we change t and b by 1.
If at time t there are b black balls out of a total of t + 2, then at time t + 1 there are b + 1 black balls with probability b/(t + 2), and b black balls with probability 1 − b/(t + 2). Thus,
p(t + 1, b) = (b − 1)/(t + 2) · p(t, b − 1) + (1 − b/(t + 2)) · p(t, b).
Also, the initial condition is p(0, 1) = 1. Let q(t, b) = (t + 1)! · p(t, b). Then, q(0, 1) = 1 and for t > 0,
q(t, b) = (b − 1) q(t − 1, b − 1) + (t + 1 − b) q(t − 1, b).
Check that q(t, b) = t! solves this. So p(t, b) = 1/(t + 1). ut
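The recursion and the closed form can be checked mechanically. A short sketch in Python, iterating the recursion with exact arithmetic (the horizon T = 20 is an arbitrary choice):

```python
# Iterate p(t+1, b) = (b-1)/(t+2) * p(t, b-1) + (1 - b/(t+2)) * p(t, b),
# starting from p(0, 1) = 1, and check the closed form p(t, b) = 1/(t+1).
from fractions import Fraction

T = 20
p = {(0, 1): Fraction(1)}
for t in range(T):
    for b in range(1, t + 3):  # at time t+1 there are 1..t+2 black balls
        p[(t + 1, b)] = (Fraction(b - 1, t + 2) * p.get((t, b - 1), Fraction(0))
                         + (1 - Fraction(b, t + 2)) * p.get((t, b), Fraction(0)))

for t in range(T + 1):
    for b in range(1, t + 2):
        assert p[(t, b)] == Fraction(1, t + 1)  # uniform over 1..t+1
print("p(t, b) = 1/(t + 1) verified up to t =", T)
```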
Example 4.9. The probability a family has n children is pn, where Σ_{n=0}^∞ pn = 1. Given that a family has n children, all possibilities for the sexes of these children are equally likely. What is the probability that a family has one child, given that the family has no girls?
Solution. The natural sample space to take here is
Ω = {(s1, s2, . . . , sn) : sj ∈ {boy, girl}, n ≥ 1} ∪ {no children}.
Let An = {n children}, for n ≥ 0, and B = {no girls}. The information in the question is:
For all n ≥ 1, P({(s1, . . . , sn)} | An) = 2^{−n},
for any combination of sexes s1, . . . , sn ∈ {boy, girl}. Thus, P(B|An) = 2^{−n} for all n ≥ 1. Also, P(B|A0) = 1 = 2^{−0}. The law of total probability gives
P(B) = Σ_n P(B|An) P(An) = Σ_{n=0}^∞ 2^{−n} pn.
Using Bayes' rule,
P(A1|B) = P(B|A1) · P(A1) / P(B) = (1/2 · p1) / Σ_{n=0}^∞ 2^{−n} pn.
ut
Example 4.10. There are 6 machines M1, . . . ,M6 in the basement of the math depart-
ment. When pressing the red button on machine Mj , the machine outputs a random
number in N, with the probability that the number is k being j/(j + 1) · (j + 1)^{−k}.
A bored mathematician has access to these machines. She tosses a die, and uses
machine Mj if the outcome of the die is j, and writes down the resulting number.
What is the probability the resulting number is greater than 0?
What is the probability the outcome of the die is 1 given that the resulting number is greater than 3?
Solution. Let Aj = {the die shows j}, j = 1, . . . , 6, Bk = {the resulting number is greater than k}, and Ck = {the resulting number is k}. So (Ck) are mutually disjoint, and Bk = ⋃_{n>k} Cn.
The information is:
P(Bk|Aj) = Σ_{n>k} P(Cn|Aj) = Σ_{n>k} j/(j + 1) · (j + 1)^{−n} = j/(j + 1) · (j + 1)^{−(k+1)} · Σ_{n=0}^∞ (j + 1)^{−n} = (j + 1)^{−(k+1)}.
Also, P(Aj) = 1/6 for all j.
By the law of total probability,
P(Bk) = (1/6) Σ_{j=1}^6 (j + 1)^{−(k+1)} = (1/6) (2^{−(k+1)} + 3^{−(k+1)} + · · · + 7^{−(k+1)}).
So P(B0) = (1/6)(1/2 + 1/3 + · · · + 1/7).
Now, by Bayes,
P(A1|B3) = P(B3|A1) · P(A1) / P(B3) = 2^{−4} / (2^{−4} + 3^{−4} + · · · + 7^{−4}).
ut
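These quantities are easy to evaluate exactly; a sketch in Python with exact rational arithmetic:

```python
# P(B_k) = (1/6) * sum_{j=1}^{6} (j+1)^-(k+1)  (law of total probability),
# P(A_1|B_3) = P(B_3|A_1) P(A_1) / P(B_3)      (Bayes' rule).
from fractions import Fraction

def p_Bk(k):
    return Fraction(1, 6) * sum(Fraction(1, (j + 1) ** (k + 1))
                                for j in range(1, 7))

p_B0 = p_Bk(0)                                      # (1/6)(1/2 + ... + 1/7)
p_A1_given_B3 = Fraction(1, 2 ** 4) * Fraction(1, 6) / p_Bk(3)
print(p_B0, float(p_A1_given_B3))
```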
4.3. The Erdős–Ko–Rado Theorem
Consider n elements, say {0, 1, . . . , n − 1} without loss of generality. We want to “collect” subsets of size k ≤ n/2 so that any two of these subsets intersect one another. We want to collect as many such subsets as possible.
One strategy is as follows: Suppose w.l.o.g. A = {0, 1, . . . , k − 1} is our first set. Any other set that we collect must intersect it, and we want to give ourselves as much freedom as possible. So let's take all subsets of {0, 1, . . . , n − 1} of size k that contain 0. These all must intersect pairwise.
How many sets did we collect? (n−1 choose k−1), because we are actually just choosing k − 1 elements out of {1, . . . , n − 1} for every such subset.
Is this the best possible? The Erdős–Ko–Rado theorem tells us that it is.
Theorem 4.11 (Erdős–Ko–Rado). Let n ≥ 2k. Let A ⊂ 2^{{0,...,n−1}} be a collection of subsets such that for all A, B ∈ A: |A| = k and A ∩ B ≠ ∅. Then, there are at most (n−1 choose k−1) sets in A; that is, |A| ≤ (n−1 choose k−1).
This elementary proof is by Katona.
Proof. Let A be as in the theorem.
Let us choose a uniform element σ from Sn, the set of all permutations on {0, 1, . . . , n − 1}, and independently a uniform index I from {0, 1, . . . , n − 1}. Set A = {σ(I), σ(I + 1), . . . , σ(I + k − 1)}, where addition is modulo n. That is, P[σ = s, I = i] = 1/(n! · n).
For any given set B = {b0, . . . , b_{k−1}} ⊂ {0, . . . , n − 1} with |B| = k, by the law of total probability,
P[A = B, I = j] = Σ_{π ∈ Sn} P[σ = π, I = j, {π(j), π(j + 1), . . . , π(j + k − 1)} = B]
= 1/(n! · n) · #{π ∈ Sn : {π(j), π(j + 1), . . . , π(j + k − 1)} = B}.
This last set is of size k!(n − k)!, because there are k choices for π(j), then k − 1 choices for π(j + 1), etc., until only 1 choice for π(j + k − 1); and then n − k choices for π(j + k), n − k − 1 choices for π(j + k + 1), and so on for all of j + k, . . . , j + n − 1. Thus,
P[A = B] = Σ_{j=0}^{n−1} P[A = B, I = j] = Σ_{j=0}^{n−1} (n choose k)^{−1} · (1/n) = (n choose k)^{−1}.
We have found an alternative description of choosing a subset of {0, . . . , n − 1} of size exactly k uniformly.
Now the second part: For π ∈ Sn and an integer s, consider the subsets Xπ,s := {π(s), π(s + 1), . . . , π(s + k − 1)}, where as usual addition is modulo n.
Note that for any m, Xπ,m ∩ Xπ,m+k = ∅ (here we use the assumption that 2k ≤ n). Indeed, if x ∈ Xπ,m ∩ Xπ,m+k, then there exist 0 ≤ i, j ≤ k − 1 such that π(m + j) = x = π(m + k + i). Since π is a permutation, this implies that j = k + i modulo n, and this is impossible because 0 ≤ i, j ≤ k − 1 and k ≤ n/2.
Now, for given π, s, the only subsets Xπ,m that can intersect Xπ,s are (Xπ,m : m = s − k + 1, . . . , s + k − 1). Leaving out the case m = s, these can be written as k − 1 pairs: (Xπ,s−m, Xπ,s−m+k), m = 1, . . . , k − 1. For each pair, since Xπ,s−m ∩ Xπ,s−m+k = ∅, at most one of the pair can be in the intersecting family A.
Since all subsets in A must intersect one another, if Xπ,s ∈ A for some s, then every j with Xπ,j ∈ A satisfies Xπ,s ∩ Xπ,j ≠ ∅, hence lies among s − k + 1, . . . , s + k − 1, and each of the k − 1 pairs contributes at most one member of A. Thus,
Σ_{j=0}^{n−1} 1{Xπ,j ∈ A} ≤ 1{Xπ,s ∈ A} + Σ_{m=1}^{k−1} (1{Xπ,s−m ∈ A} + 1{Xπ,s−m+k ∈ A}) ≤ 1 + (k − 1) = k.
(If no Xπ,s is in A, the sum is 0.) We conclude that for any π ∈ Sn,
#{0 ≤ j ≤ n − 1 : Xπ,j ∈ A} ≤ k.
Now, if σ = π and A ∈ A, then it must be that I = j for some j such that Xπ,j ∈ A. So, by Boole's inequality,
P[A ∈ A, σ = π] ≤ P[σ = π, I ∈ {0 ≤ j ≤ n − 1 : Xπ,j ∈ A}] ≤ Σ_{j=0}^{n−1} 1{Xπ,j ∈ A} · P[σ = π, I = j] ≤ k/(n! · n).
Summing over all π ∈ Sn, by the law of total probability,
P[A ∈ A] ≤ k/n.
We now combine this bound with the fact that A is uniform over all k-subsets of {0, . . . , n − 1}. Thus,
|A| / (n choose k) = P[A ∈ A] ≤ k/n.
This is exactly |A| ≤ (n choose k) · k/n = (n−1 choose k−1). ut
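For small parameters the theorem can be confirmed by exhaustive search. A brute-force sketch in Python (a branch-and-bound over intersecting families; feasible only for tiny n, k):

```python
# Find the size of a largest intersecting family of k-subsets of {0,...,n-1}
# and compare with the Erdos-Ko-Rado bound C(n-1, k-1).
from itertools import combinations
from math import comb

def max_intersecting_family(n, k):
    sets_ = [frozenset(c) for c in combinations(range(n), k)]

    def extend(size, candidates, best):
        if size + len(candidates) <= best:   # cannot beat the current best
            return best
        best = max(best, size)
        for i, s in enumerate(candidates):
            # keep only later candidates that intersect s
            rest = [t for t in candidates[i + 1:] if s & t]
            best = extend(size + 1, rest, best)
        return best

    return extend(0, sets_, 0)

for n, k in [(4, 2), (5, 2), (6, 2), (6, 3)]:
    assert max_intersecting_family(n, k) == comb(n - 1, k - 1)
print("EKR bound attained for all tested (n, k)")
```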
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 5
5.1. Independence
Until now, we have only used the measure theoretic properties of probability mea-
sures. The main difference between probability and measure theory is the notion of
independence.
This is perhaps the most important definition in probability.
Definition 5.1. Let A,B be events in a probability space (Ω,F ,P). We say that A and
B are independent if
P(A ∩B) = P(A)P(B).
If (An)n is a collection of events, we say that (An) are mutually independent if for any
finite number of events in the collection, An1 , An2 , . . . , Ank we have that
P(An1 ∩An2 ∩ · · · ∩Ank) = P(An1) · P(An2) · · ·P(Ank).
Example 5.2. A card is chosen randomly out of a deck of 52, all cards equally likely.
A is the event that the card is an ace. B is the event that the card is a spade. C is the
event that the card is a jack of diamonds or a 2 of hearts. D is the event that the card
is an even number.
Then, A, B are independent, since
P(A ∩ B) = 1/52 = (4/52) · (13/52) = P(A) · P(B).
C, D are not independent, since
P(C ∩ D) = 1/52 ≠ P(C) · P(D) = (2/52) · (20/52).
Example 5.3. Two dice are tossed, all outcomes equally likely. A = {the first die is 4}, B = {the sum of the dice is 6}, C = {the sum of the dice is 7}.
A, B are not independent:
P(A ∩ B) = 1/36 ≠ P(A) · P(B) = (1/6) · (5/36).
However, A, C are independent:
P(A ∩ C) = 1/36 = (1/6) · (6/36) = P(A) · P(C).
Proposition 5.4. Let (Ω,F ,P) be a probability space.
(1) Any event A is independent of Ω and of ∅.
(2) If A, B are independent, then Ac, B are independent.
(3) If A,B are independent and P(B) > 0 then P(A|B) = P(A).
Proof. If A,B are independent,
P(B ∩Ac) = P(B)− P(B ∩A) = P(B)(1− P(A)) = P(B)P(Ac).
If also P(B) > 0, then P(A|B) = P(A)P(B)/P(B) = P(A). ut
Exercise 5.5. Prove that if A,B,C are mutually independent, then
(1) A and B ∩ C are independent.
(2) A and B \ C are independent.
Solution.
P(A ∩B ∩ C) = P(A)P(B)P(C) = P(A)P(B ∩ C).
This proves the first assertion. For the second assertion, since B \C = B ∩Cc and since
A ∩B is independent of C, then also A ∩B and Cc are independent. So,
P(A ∩ (B \ C)) = P(A ∩B ∩ Cc) = P(A ∩B)P(Cc) = P(A)P(B)(1− P(C)).
Finally,
P(B \ C) = P(B \ (B ∩ C)) = P(B)− P(B ∩ C) = P(B)(1− P(C)),
since B,C are independent. ut
Exercise 5.6. Show that if B is an event with P[B] = 0 then for any event A, A and
B are independent.
Solution. 0 ≤ P[B ∩A] ≤ P[B] = 0 and P[B] · P[A] = 0. ut
Example 5.7. Consider the uniform probability measure on Ω = {0, 1}^2. Let
A = {(x, y) ∈ Ω : x = 1}, B = {(x, y) ∈ Ω : y = 1},
and
C = {(x, y) ∈ Ω : x ≠ y}.
Any two of these events are independent; indeed,
P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4,
P(A) = P(B) = P(C) = 1/2
(because C = {(0, 1), (1, 0)}). However, it is not the case that (A, B, C) are mutually independent, since
P(A ∩ B ∩ C) = 0 ≠ 1/8 = P(A) · P(B) · P(C).
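Pairwise independence without mutual independence is easy to verify by enumerating Ω; a small sketch in Python:

```python
# Uniform measure on {0,1}^2 with A = {x = 1}, B = {y = 1}, C = {x != y}:
# any two are independent, but the triple is not mutually independent.
from itertools import product

omega = list(product([0, 1], repeat=2))
def P(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == 1
B = lambda w: w[1] == 1
C = lambda w: w[0] != w[1]

for E, F in [(A, B), (A, C), (B, C)]:
    assert P(lambda w: E(w) and F(w)) == P(E) * P(F)   # pairwise independent

triple = P(lambda w: A(w) and B(w) and C(w))
assert triple == 0 and P(A) * P(B) * P(C) == 1 / 8      # not mutually indep.
print("pairwise independent but not mutually independent")
```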
5.2. An example using independence: Riemann zeta function and Euler’s
formula
Some number theory via probability:
Bernhard Riemann
(1826-1866)
Let 1 < s ∈ R be a parameter. Let
ζ(s) = Σ_{n=1}^∞ n^{−s}.
Let X be a random variable with range {1, 2, . . .} and distribution given by
P(X = n) = (1/ζ(s)) · n^{−s}.
Let Dm = mZ = the set of positive integers divisible by m. So
PX(Dm) = Σ_n (mn)^{−s} / Σ_n n^{−s} = m^{−s}.
Now, let p1, p2, . . . , pk be k different prime numbers. Then,
Dp1 ∩Dp2 ∩ · · · ∩Dpk = Dp1p2···pk .
So
PX(Dp1 ∩Dp2 ∩ · · · ∩Dpk) = (p1p2 · · · pk)−s = PX(Dp1) · PX(Dp2) · · ·PX(Dpk).
Since this is true for any finite number of primes, we have that the collection (Dp)p∈P
are mutually independent, where P is the set of primes.
Since the only positive integer that is not divisible by any prime is 1, we have that {1} = ⋂_{p∈P} Dp^c, and so
P(X = 1) = PX({1}) = Π_{p∈P} (1 − PX(Dp)) = Π_{p∈P} (1 − p^{−s}).
Since P(X = 1) = 1/ζ(s), we conclude Euler's formula: for any s > 1,
Σ_{n=1}^∞ n^{−s} = Π_{p∈P} (1 − p^{−s})^{−1}.
Leonhard Euler
(1707–1783)
We can also let s → 1 from above, and we get
0 = lim_{s→1} ζ(s)^{−1} = Π_{p∈P} (1 − p^{−1}).
Taking logs,
Σ_{p∈P} log(1 − p^{−1}) = −∞.
Since log(1 − x) ≥ −3x/2 for 0 < x ≤ 1/2, and 1/p ≤ 1/2 for every prime p, we have that
−∞ ≥ −(3/2) Σ_{p∈P} p^{−1},
and so
Σ_{p∈P} 1/p = ∞.
This is a non-trivial result.
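Euler's formula can be sanity-checked numerically. A sketch in Python for s = 2 (the truncation limits are arbitrary; both sides should be near ζ(2) = π²/6):

```python
# Compare a truncation of zeta(s) with a truncated Euler product over primes.
def primes_up_to(n):
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i:: i] = [False] * len(sieve[i * i:: i])
    return [i for i, is_p in enumerate(sieve) if is_p]

s = 2.0
zeta = sum(n ** -s for n in range(1, 1_000_000))
product = 1.0
for p in primes_up_to(1000):
    product *= 1.0 / (1.0 - p ** -s)
print(zeta, product)  # both approximately 1.645
```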
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 6
6.1. Uncountable Sample Spaces
Up to now we have dealt with the case of countable sample spaces, but there is no
reason to restrict to those. For example, what if we wish to experiment by throwing a
dart at a target? The collection of possible outcomes is the collection of points in the
target, which is uncountable.
The basic theory to define such spaces is measure theory.
Consider the sample space Ω = [0, 1]. Why not take a probability measure on all
subsets of [0, 1]?
Well, for example, we would like to consider the experiment of sampling a number from
[0, 1] such that it is uniform over the interval; that is, such that for any 0 ≤ a < b ≤ 1
the probability the number is in (a, b) is b− a.
How can we do this?
The following proposition shows that this cannot be done using all subsets of [0, 1]
(!).
Proposition 6.1. Let Ω = [0, 1]. Let P be a probability measure on (Ω, F), where F is some σ-algebra on Ω, such that for any x ∈ [0, 1] and A ∈ F, P(A) = P(A + x), where A + x = {a + x (mod 1) : a ∈ A}. Then, there exists a subset S ⊂ Ω such that S ∉ F.
Proof. Define the equivalence relation x ∼ y iff x − y ∈ Q. So we can choose a representative for each different equivalence class (using the axiom of choice!). Let R be the set of these representatives.
Define Rq = R + q = {r + q (mod 1) : r ∈ R}, for all q ∈ Q ∩ [0, 1). For q ≠ p ∈ Q ∩ [0, 1) we have that Rq ∩ Rp = ∅, since if x ∈ Rp ∩ Rq then x = r + q = r′ + p (mod 1) for r, r′ ∈ R, and this implies that r ∼ r′ with r ≠ r′, a contradiction. Thus, (Rq)_{q∈Q∩[0,1)} is a countable collection of mutually disjoint events whose union is [0, 1). If R ∈ F, translation invariance would give 1 = P([0, 1)) = P(⋃q Rq) = Σq P(Rq) = Σq P(R), which is impossible: the last sum is 0 if P(R) = 0, and ∞ if P(R) > 0. ut
Thus, we have to develop a theory of sets which we allow to be events, so that
everything is well defined etc.
6.2. σ-algebras Revisited
Recall the definition of a σ-algebra:
Definition 6.2. Let Ω be a sample space. A collection of subsets F ⊂ 2Ω is called a
σ-algebra if
• Ω ∈ F .
• If A ∈ F then Ac ∈ F .
• If (An) is a collection of events in F then⋃nAn ∈ F .
These were the most basic properties we wanted of events: Everything can occur, if
something can occur, it can also not occur, and if there are many possible events, we
can ask if there is one of them that has occured.
Exercise 6.3. Show that F ⊂ 2Ω is a σ-algebra if and only if it has the following
properties.
• ∅ ∈ F .
• If A ∈ F then Ac ∈ F .
• If (An) is a collection of events in F then⋂nAn ∈ F .
Solution. Use de-Morgan. ut
Suppose we have a collection of subsets of Ω. We want the smallest σ-algebra con-
taining these subsets.
Proposition 6.4. Let Ω be a sample space, and let K ⊂ 2Ω be a collection of subsets.
There exists a unique σ−algebra on Ω, denoted by σ(K), such that
• K ⊂ σ(K).
• If F is a σ-algebra such that K ⊂ F, then σ(K) ⊂ F.
So σ(K) is the smallest σ-algebra containing K. Moreover, σ(K) = K if and only if K is itself a σ-algebra.
Proof. Let
Γ(K) = {F : F is a σ-algebra such that K ⊂ F},
and define σ(K) := ⋂ Γ(K). (Γ(K) is non-empty, since 2^Ω ∈ Γ(K).) Of course the second property holds. σ(K) is a σ-algebra: Ω ∈ F for all F ∈ Γ(K), so Ω ∈ σ(K); if A ∈ σ(K), then A ∈ F for all F ∈ Γ(K), so Ac ∈ F for all F ∈ Γ(K), and thus Ac ∈ σ(K). Also, if (An) is a collection of events in σ(K), then An ∈ F for all n and all F ∈ Γ(K). Thus, ⋃n An ∈ F for every F ∈ Γ(K), and so ⋃n An ∈ σ(K).
As for uniqueness, if G is another σ-algebra with the above properties, then σ(K) ⊂ G because K ⊂ G, and similarly G ⊂ σ(K) because K ⊂ σ(K). ut
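On a finite sample space, σ(K) can be computed by literally closing K under complements and finite unions until nothing new appears; for finite Ω this already yields a σ-algebra. A sketch in Python (the sample space and generators are made up for illustration):

```python
# Close a collection of subsets of a finite Omega under complement and union.
def generated_sigma_algebra(omega, generators):
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in generators}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # closed: this is sigma(K)
            return F
        F |= new

omega = {1, 2, 3, 4}
# Example 6.6: sigma({A}) = {emptyset, A, A^c, Omega}
F1 = generated_sigma_algebra(omega, [{1}])
assert F1 == {frozenset(), frozenset({1}), frozenset({2, 3, 4}), frozenset(omega)}
# two generators: atoms {1}, {2}, {3, 4} give 2^3 = 8 events
F2 = generated_sigma_algebra(omega, [{1}, {2}])
print(len(F2))  # prints 8
```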
Definition 6.5. σ(K) as above is called the σ-algebra generated by K.
σ(K) is intuitively all the information carried by the events in K.
Example 6.6. Let Ω be a sample space and A ⊂ Ω. Then, σ({A}) = {∅, A, Ac, Ω}.
Definition 6.7. Let (Ω,F ,P) be a probability space, and let G ⊂ F be a sub-σ-algebra
of F . We say that an event A ∈ F is independent of G if for any B ∈ G, we have that
A,B are independent.
Let (Bn)n be a collection of events. We say that A is independent of (Bn)n if A is independent of the σ-algebra σ((Bn)n).
Example 6.8. If A, B are independent, then we have already seen that A, Bc are also independent. Since A, ∅ are always independent, and also A, Ω are always independent, we have that A is independent of σ({B}) = {∅, B, Bc, Ω}.
6.3. Borel σ-algebra
Let E be a topological space. We can consider the σ-algebra generated by all open sets. This is called the Borel σ-algebra on E, and an event in this σ-algebra is called a Borel set.
A special case is the case E = R^d, and more specifically E = R. In this case we denote the Borel σ-algebra by B = B(R), and it can be shown that B = σ({(a, b) : a < b ∈ R}); that is, B is generated by all intervals.
Exercise 6.9. Show that
B(R) = σ({(a, b] : a < b ∈ R}) = σ({[a, b) : a < b ∈ R}) = σ({[a, b] : a < b ∈ R}).
Solution. Let a < b ∈ R. Since (a, b) = ⋃n (a, b − 1/n], we get that (a, b) ∈ σ({(a, b] : a < b ∈ R}). Since this holds for all a < b ∈ R, we have that σ({(a, b) : a < b ∈ R}) ⊂ σ({(a, b] : a < b ∈ R}). Conversely, (a, b] = ⋂n (a, b + 1/n), which gives the reverse inclusion.
The other cases are similar, so all the mentioned σ-algebras are equal. ut
6.4. Lebesgue Measure
The following basic theorem, due to Lebesgue, is shown in measure theory.
Henri Lebesgue
(1875–1941)
Theorem 6.10 (Lebesgue). Let Ω = [0, 1], and let B be the Borel σ-algebra on Ω. Let
F : [0, 1]→ [0, 1] be a right-continuous, non-decreasing function such that F (1) = 1 and
F (0) = 0. Then, there exists a unique probability measure, denoted P = dF , on (Ω,B)
such that for any 0 ≤ a < b ≤ 1, P((a, b]) = F (b)− F (a).
We will not go into the proof of this theorem, but it gives us many new probability
measures on non-discrete sample spaces that we can define.
Example 6.11. Take F (x) = x in the above theorem. So, the resulting probability
measure, sometimes denoted P = dx, has the property that for any interval 0 ≤ a <
b ≤ 1, P((a, b]) = b − a. One can also check that this measure is translation invariant,
i.e. for any x ∈ [0, 1],
P(x + (a, b] (mod 1)) = P({x + y (mod 1) : a < y ≤ b}) = b − a.
This is also sometimes called the uniform measure on [0, 1].
Exercise 6.12. Show that for the uniform measure on [0, 1], for any a ∈ [0, 1], dx({a}) = 0.
Deduce that for all 0 ≤ a ≤ b ≤ 1,
dx((a, b)) = dx([a, b]) = dx([a, b)) = dx((a, b]) = b − a.
Solution. Let a ∈ [0, 1]. For all n, let An = (a − 1/n, a] ∩ [0, 1]. So P(An) ≤ 1/n. Since (An) is a decreasing sequence of events, and since {a} = ⋂n An, we have by the continuity of probability measures,
P({a}) = lim P(An) = 0.
We can now note that, since the union is disjoint,
P([a, b]) = P((a, b) ∪ {a} ∪ {b}) = P((a, b)) + P({a}) + P({b}) = P((a, b)).
Similarly for the other types of intervals. ut
Example 6.13. Let α < β ∈ R. Let Ω = [α, β]. For a set A ∈ B([0, 1]) let
(β − α)A + α = {(β − α)a + α : a ∈ A}.
Define F = {(β − α)A + α : A ∈ B([0, 1])}. We have that (β − α)[0, 1] + α = Ω, ((β − α)A + α)c = (β − α)Ac + α, and
⋃n ((β − α)An + α) = (β − α)(⋃n An) + α.
So F is a σ-algebra on Ω. Also, if dx is the uniform measure on [0, 1], then
P((β − α)A + α) := dx(A)
defines a probability measure on (Ω, F), because A ∩ B = ∅ if and only if ((β − α)A + α) ∩ ((β − α)B + α) = ∅.
In this case, note that for any α ≤ a < b ≤ β,
P((a, b]) = P((β − α)((a − α)/(β − α), (b − α)/(β − α)] + α) = dx(((a − α)/(β − α), (b − α)/(β − α)]) = (b − a)/(β − α).
Example 6.14. Let Ω be a sample space, and let f : Ω → [0, 1] be a 1-1 function. Let F = {f^{−1}(B) : B ∈ B([0, 1])}. This is a σ-algebra on Ω, since f^{−1}([0, 1]) = Ω, (f^{−1}(B))c = f^{−1}(Bc), and
⋃n f^{−1}(Bn) = f^{−1}(⋃n Bn).
For A ∈ F, since B = f(f^{−1}(B)), we have that f(A) ∈ B, and we can define P(A) = dx(f(A)). This results in a probability measure on (Ω, F), since P(Ω) = 1 and if (An) are mutually disjoint events then so are (f(An)) (because f is 1-1), and
P(⋃n An) = dx(f(⋃n An)) = dx(⋃n f(An)) = Σn dx(f(An)) = Σn P(An).
Example 6.15. A dart is thrown to a target of radius 1 meter. Points are awarded
according to the distance to the center of the target. The probability of being between
distance a and distance b is b2 − a2.
What is the probability of hitting the inner circle of radius 1/2? What about radius 1/4?
Solution. Let F : [0, 1] → [0, 1] be F (x) = x2. Then, dF is a probability measure
on ([0, 1],B) such that dF ((a, b)) = F (b) − F (a) = b2 − a2. Thus, we can model the
dart throwing experiment by the probability space ([0, 1],B, dF ), where the outcome
ω ∈ [0, 1] is the distance to the target.
Now, the probability of hitting the inner circle of radius 1/2 is dF ([0, 1/2]) = (1/2)2 =
1/4.
Similarly the probability of hitting the circle of radius 1/4 is 1/16. ut
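The measure dF with F(x) = x² can also be sampled: if U is uniform on [0, 1], then F^{−1}(U) = √U has distribution dF (inverse-transform sampling). A quick empirical sketch in Python:

```python
# Sample the dart distance via the inverse transform: F(x) = x^2, so
# F^{-1}(u) = sqrt(u), and P(distance <= 1/2) should be near F(1/2) = 1/4.
import random

random.seed(0)
samples = [random.random() ** 0.5 for _ in range(100_000)]
freq = sum(1 for x in samples if x <= 0.5) / len(samples)
print(freq)  # close to 0.25
```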
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 7
7.1. Random Variables
Up until now we dealt with abstract spaces (such as dice, cards, etc.). In order to
say things about objects, we prefer to map them to numbers (or vectors) so that we can
perform calculations.
For example, maybe some biological experiment is being carried out in the lab, but in the end, the biologist may only be interested in the number of bacteria in the dish, or other such measurements.
Measurements of experiments are carried out via random variables.
Suppose we have an experiment described by a probability space (Ω,F ,P). A mea-
surement is just a function taking outcomes to numbers; i.e. a function X : Ω → R.
The natural way to define a probability space on R would now be to take the natural σ-algebra on R, i.e. the Borel σ-algebra B, and then define the measure of a Borel set B to be the probability that the outcome ω is mapped into B:
∀ B ∈ B: PX(B) := P({ω : X(ω) ∈ B}) = P(X^{−1}(B)).
But wait! What if the set X−1(B) is not in F? Then P is not defined on that set.
This is a technical detail, but one that needs to be addressed, otherwise we will run into
paradoxes as in the case of the unmeasurable sets on [0, 1].
Definition 7.1. Let (Ω, F), (Ω′, F′) be measurable spaces. A function X : Ω → Ω′ is
called a measurable map from (Ω,F) to (Ω′,F ′), if for all A′ ∈ F ′, X−1(A′) ∈ F .
We usually denote this by X : (Ω,F) → (Ω′,F ′) to stress the dependence on F and
F ′.
Exercise 7.2. Show that if X : (Ω, F) → (Ω′, F′) and Y : (Ω′, F′) → (Ω′′, F′′) are measurable functions, then Y ∘ X : Ω → Ω′′ is measurable with respect to F, F′′.
To determine if a function X : (Ω,F)→ (Ω′,F ′) is measurable, it is enough to check
for a family of sets generating F ′:
Proposition 7.3. Let (Ω, F), (Ω′, F′) be two measurable spaces, such that F′ = σ(Π′) for some Π′ ⊂ 2^{Ω′}. Then, X : Ω → Ω′ is measurable if and only if for any A′ ∈ Π′, X^{−1}(A′) ∈ F.
Proof. The “only if” part is clear. For the “if” part, consider
G′ = {A′ ⊂ Ω′ : X^{−1}(A′) ∈ F}.
We are given that Π′ ⊂ G′. If we show that G′ is a σ-algebra, then F′ = σ(Π′) ⊂ G′, so for any A′ ∈ F′ we have that X^{−1}(A′) ∈ F, and X is measurable.
So we only need to show that G′ is a σ-algebra. Indeed, since X^{−1}(∅) = ∅ ∈ F, we have that ∅ ∈ G′. If A′ ∈ G′ then X^{−1}(A′) ∈ F, so also X^{−1}(Ω′ \ A′) = Ω \ X^{−1}(A′) ∈ F, and thus Ω′ \ A′ ∈ G′. So G′ is closed under complements.
If (A′n)n is a sequence in G′, then for every n we have that X^{−1}(A′n) ∈ F, and so X^{−1}(⋃n A′n) = ⋃n X^{−1}(A′n) ∈ F. Thus, ⋃n A′n ∈ G′, and G′ is closed under countable unions. ut
Corollary 7.4. If Ω,Ω′ are topological spaces, and B(Ω),B(Ω′) are the Borel σ-algebras,
then any continuous map f : Ω→ Ω′ is measurable.
Proof. f is continuous implies that for any open set O′ ⊂ Ω′, f−1(O′) is an open set in
Ω. Thus, since B(Ω′) = σ(O′), where O′ is the set of open sets in Ω′, we have that for
any O′ ∈ O′, f−1(O′) ∈ B(Ω). So f is measurable. ut
Definition 7.5. A (real valued) random variable on a probability space (Ω,F ,P)
is a measurable function X : (Ω,F)→ (R,B).
The function PX(B) = P(X−1(B)) is a probability measure on (R,B), and is called
the distribution of X.
Example 7.6. Let (Ω, F, P) be a probability space. Let A ∈ F be an event. This is not a random variable, but there is a natural random variable connected to A: Define a function
IA(ω) = 1 if ω ∈ A, and IA(ω) = 0 if ω ∉ A.
Note that for any B ∈ B: if 1 ∉ B and 0 ∉ B, then IA^{−1}(B) = ∅; if 1 ∈ B and 0 ∉ B, then IA^{−1}(B) = A; if 1 ∉ B and 0 ∈ B, then IA^{−1}(B) = Ac; if 1 ∈ B and 0 ∈ B, then IA^{−1}(B) = Ω. So IA is measurable.
IA is known as the indicator of the event A. Indicators are always measurable, and they are the simplest kind of random variables, taking only the two values 0, 1. We denote the indicator of an event A by 1A.
Exercise 7.7. Let (Ω,F ,P) be a probability space. Prove that if X : (Ω,F) → R is a
measurable function then indeed PX(B) = P(X−1(B)) is a probability measure on (R,B).
Proof. PX(R) = P(X^{−1}(R)) = P(Ω) = 1. If (Bn) is a sequence of mutually disjoint events, then (X^{−1}(Bn)) are also mutually disjoint, and
PX(⋃n Bn) = P(⋃n X^{−1}(Bn)) = Σn P(X^{−1}(Bn)) = Σn PX(Bn).
ut
Notation. To simplify the notation, we will omit the ω's; e.g. instead of writing P({ω : X(ω) ∈ A}) we write P(X ∈ A).
Example 7.8. Three balls are removed from an urn containing 20 balls numbered 1 to 20. All possibilities are equally likely. What is the probability that at least one has a number 17 or higher?
The sample space here is Ω = {S ⊂ {1, 2, . . . , 20} : |S| = 3}, and the measure is the uniform measure on this set. We define the random variable X : Ω → R by X({i, j, k}) = max{i, j, k}.
Why is X a measurable function? For any Borel set B ∈ B, we have that X^{−1}(B) is a subset of Ω, and the σ-algebra here is 2^Ω.
What is the probability that there exists a ball numbered 17 or higher? This is
P({ω ∈ Ω : X(ω) ≥ 17}) = P(X ≥ 17) = 1 − P(X < 17).
Since the number of S ∈ Ω with max S < 17 is (16 choose 3), we have that P(X < 17) = (16 · 15 · 14)/(20 · 19 · 18) = (4 · 7)/(3 · 19) = 28/57.
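The count is small enough to check by enumerating the whole sample space; a sketch in Python:

```python
# P(max of 3 balls from {1,...,20} >= 17) = 1 - C(16,3)/C(20,3).
from itertools import combinations
from math import comb

p_formula = 1 - comb(16, 3) / comb(20, 3)
count = sum(1 for S in combinations(range(1, 21), 3) if max(S) >= 17)
p_enumerated = count / comb(20, 3)
print(p_formula)  # 29/57, approximately 0.5088
```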
Example 7.9. Two dice are thrown. So Ω = {(i, j) : 1 ≤ i, j ≤ 6}, and P is the uniform measure on this set. Define the random variable X(i, j) = i + j. Note that 2 ≤ X(i, j) ≤ 12 for any i, j, so
P(X ∈ [2, 12] ∩ N) = 1,
and X is a discrete random variable.
What is the probability that the sum is 7?
P(X = 7) = 6/36 = 1/6.
7.2. Distribution Function
Proposition 7.10. Let (Ω, F) be a measurable space, and let X : Ω → R. Then X is measurable if and only if
{X ≤ r} = X^{−1}((−∞, r]) ∈ F for all r ∈ R.
Proof. The “only if” part is clear.
For the “if” part: Since {X > r} = {X ≤ r}c ∈ F, we have that for any a < b ∈ R,
X^{−1}((a, b]) = X^{−1}((−∞, b] \ (−∞, a]) = {a < X ≤ b} = {X ≤ b} ∩ {X > a} ∈ F.
Let
G = {B ∈ B : X^{−1}(B) ∈ F}.
Since X^{−1}(R) ∈ F, X^{−1}(Bc) = (X^{−1}(B))c and X^{−1}(⋃n Bn) = ⋃n X^{−1}(Bn), we get that G is a σ-algebra.
From the above, (a, b] ∈ G for all a < b ∈ R. Thus, B = σ({(a, b] : a < b ∈ R}) ⊂ G. So for any B ∈ B, we have that B ∈ G, and so X^{−1}(B) ∈ F. ut
Definition 7.11. Let (Ω,F ,P) be a probability space, and let X be a random variable.
The (cumulative) distribution function of X is the function
FX : R→ [0, 1] FX(t) = P(X ≤ t).
Example 7.12. X is the random variable that is the sum of two dice. FX(t) = ?
FX(t) = 0 for t < 2, and FX(t) = 1 for t ≥ 12. FX(t) = 1/36 for t ∈ [2, 3). FX(t) = 1/36 + 2/36 = 3/36 for t ∈ [3, 4). In general, FX(t) = Σ_{j=2}^k P(X = j) = k(k − 1)/72 for t ∈ [k, k + 1) and 2 ≤ k ≤ 7.
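The values of FX are quick to tabulate by enumerating the 36 outcomes; a sketch in Python with exact fractions:

```python
# CDF of the sum of two fair dice, from first principles.
from fractions import Fraction
from itertools import product

sums = [i + j for i, j in product(range(1, 7), repeat=2)]
def F(t):
    return Fraction(sum(1 for x in sums if x <= t), 36)

assert F(1.9) == 0 and F(12) == 1
assert F(2) == Fraction(1, 36)
assert F(3) == Fraction(3, 36)   # P(X=2) + P(X=3)
# for integer k with 2 <= k <= 7, F(k) = k(k-1)/72
assert all(F(k) == Fraction(k * (k - 1), 72) for k in range(2, 8))
print("two-dice CDF verified")
```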
Proposition 7.13. Let X be a random variable, and FX its distribution function. Then,
• FX is non-decreasing and right-continuous.
• FX(t) tends to 0 as t→ −∞.
• FX(t) tends to 1 as t→∞.
Proof.
Since for a ≤ b we have that (−∞, a] ⊂ (−∞, b], we get
FX(a) = PX((−∞, a]) ≤ PX((−∞, b]) = FX(b).
For right continuity, let hn → 0 monotonically from the right; then ((−∞, a + hn])n is a decreasing sequence of events with ⋂n (−∞, a + hn] = (−∞, a], so
FX(a) = PX(⋂n (−∞, a + hn]) = limn PX((−∞, a + hn]) = limn FX(a + hn).
The limits at ∞ and −∞ are treated similarly. If an → ∞ then limn (−∞, an] = (−∞, ∞). So by the continuity of probability measures, PX((−∞, an]) → PX(R) = 1. Similarly for an → −∞. ut
Exercise 7.14. Give an example of a random variable X whose distribution function
FX is not left-continuous.
Solution. Let (Ω, F, P) be a probability space, and let A ∈ F. Let X = 1A. What is FX?
FX(t) = 0 for t < 0; FX(t) = P(Ac) = 1 − P(A) for 0 ≤ t < 1; FX(t) = 1 for t ≥ 1.
So if 0 < P(A) < 1, then
lim_{s→0−} FX(s) = 0 and FX(0) = 1 − P(A) > 0,
lim_{s→1−} FX(s) = 1 − P(A) and FX(1) = 1 > 1 − P(A).
So both 0 and 1 are points where FX is not left-continuous. ut
Of course, one sees that the right-continuity stems from the fact that we define FX(t) = P(X ≤ t). Had we defined FX(t) = P(X ≥ t), it would have been left-continuous. This is not important; however, we stick to the classical distribution function.
Exercise 7.15. Let X be a random variable with distribution FX. Show that for any a < b ∈ R,
• P(a < X ≤ b) = FX(b) − FX(a).
• P(a < X < b) = FX(b−) − FX(a).
• P(X = x) = FX(x) − FX(x−). Deduce that FX is continuous at x if and only if P(X = x) = 0.
Solution.
• Since (a, b] = (−∞, b] \ (−∞, a], and (−∞, a] ⊂ (−∞, b],
FX(b) − FX(a) = PX((−∞, b]) − PX((−∞, a]) = PX((a, b]) = P(a < X ≤ b).
• The events (a, b − 1/n] satisfy limn (a, b − 1/n] = (a, b), and by continuity of probability,
P(a < X < b) = PX(limn (a, b − 1/n]) = limn PX((a, b − 1/n]) = FX(b−) − FX(a).
• Similarly, {x} = limn (x − 1/n, x], so
P(X = x) = PX({x}) = FX(x) − limn FX(x − 1/n) = FX(x) − FX(x−).
So, FX is left-continuous at x if and only if FX(x−) = FX(x), which is if and only if P(X = x) = 0. Since FX is always right-continuous, this is if and only if FX is continuous at x.
ut
Since segments of the form (a, b], (a, b), (−∞, a] and (−∞, a) are enough to generate
all sets in B(R) it can be shown that the distribution function uniquely determines the
probability measure of all Borel sets, and thus it determines the random variable X.
In the same way as Lebesgue measure is shown to exist, it is shown in measure
theory that there is a 1-1 correspondence between distribution functions and probability
measures on (R,B(R)). That is:
Thomas Stieltjes
(1856–1894)
Theorem 7.16 (Lebesgue-Stieltjes). Every function F : R → [0, 1] that is right-
continuous everywhere, non-decreasing and limt→∞ F (t) = 1, limt→−∞ F (t) = 0 gives
rise to a unique probability measure PF on (R,B(R)) satisfying PF ((a, b]) = F (b)−F (a)
for all a < b ∈ R. (That is, for such F there is always a random variable X such that
F = FX .)
Conversely, we have already seen that if X is a random variable, then PX((a, b]) =
FX(b)− FX(a) for all a < b ∈ R, and FX has the properties mentioned above.
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 8
8.1. Discrete Distributions
Definition 8.1. A random variable X is called discrete if there exists a countable set
R ⊂ R such that P(X ∈ R) = 1.
If X is such a random variable, and we specify P(X = r) for all r ∈ R, then the distribution of X can be calculated by
FX(t) = P(X ≤ t) = Σ_{R ∋ r ≤ t} P(X = r).
The set {r ∈ R : P(X = r) > 0} is sometimes called the range of X, and the function fX(x) = P(X = x) is called the density of X (sometimes: probability density function). Note that
FX(t) = Σ_{R ∋ x ≤ t} fX(x).
8.1.1. Bernoulli Distribution. Let 0 < p < 1. X has the Bernoulli-p distribution, denoted X ∼ Ber(p), if:
FX(t) = 0 for t < 0; FX(t) = 1 − p for 0 ≤ t < 1; FX(t) = 1 for 1 ≤ t.
That is, P(X ∈ {0, 1}) = 1, and P(X = 0) = 1 − p, P(X = 1) = p.
Jacob Bernoulli
(1654–1705)
The Bernoulli distribution models a coin toss of a biased coin - probability p to fall
head (or 1) and 1− p to fall tails (or 0).
8.1.2. Binomial Distribution. Suppose we have a biased coin with probability p to
fall heads (or 1). What if we toss that coin n times in a row, all tosses mutually
independent?
The sample space would be Ω = {0, 1}^n. Because of independence, the measure would be
P({ω}) = p^{number of ones in ω} · (1 − p)^{number of zeros in ω}.
So if X : Ω → R is the number of ones, then, since for 0 ≤ k ≤ n there are (n choose k) different ω ∈ Ω with exactly k ones,
fX(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}, for 0 ≤ k ≤ n.
Such an X is said to have the Binomial-(n, p) distribution, denoted X ∼ Bin(n, p).
What is FX in this case?
FX(t) = Σ_{k ≤ t, k = 0, 1, . . . , n} (n choose k) p^k (1 − p)^{n−k}.
So FX(t) = 0 for t < 0 and FX(t) = 1 for t ≥ n.
Example 8.2. A mathematician sits at the bar. For the next hour, every 5 minutes he
orders a beer with probability 2/3 and drinks it. With the remaining probability of 1/3
he does nothing for those 5 minutes. All orders of beer are mutually independent.
What is the probability that he drinks no more than one beer, so he can still drive
home (legally)? 454
Solution. If X is the number of beers drunk, then X ∼ Bin(12, 2/3), since every beer is drunk with probability 2/3 independently, in 12 = 60/5 trials. So
P(X ≤ 1) = P(X = 0) + P(X = 1) = (12 choose 0)(2/3)^0 (1/3)^{12} + (12 choose 1)(2/3)^1 (1/3)^{11} = (1/3)^{12} + 12 · 2 · (1/3)^{12} = 25/3^{12} < 1/20,000.
ut
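The arithmetic checks out exactly; a sketch in Python:

```python
# P(X <= 1) for X ~ Bin(12, 2/3), exactly.
from fractions import Fraction
from math import comb

p = Fraction(2, 3)
def pmf(k):
    return comb(12, k) * p ** k * (1 - p) ** (12 - k)

prob = pmf(0) + pmf(1)
print(float(prob))  # about 4.7e-05
```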
8.1.3. Geometric Distribution.
Example 8.3. A student is watching television. Every minute she tosses a biased coin: if it comes up heads she goes to study for the probability exam; if tails, she continues watching TV. All coin tosses are mutually independent.
What is the probability she will continue watching TV forever?
What is the probability she watches for exactly k minutes and then starts studying?
The sample space here is Ω = N ∪ {∞}. Let T be the number of minutes it takes her to go study. The event that T = k is the event that the first k − 1 coins are tails, and the k-th coin is heads, where all coins are independent. So
P(T = k) = (1 − p)^{k−1} p.
Note that
P(T = ∞) = 1 − P(T < ∞) = 1 − Σ_{k=1}^∞ P(T = k) = 1 − p Σ_{k=1}^∞ (1 − p)^{k−1} = 0.
So the range of T is N.
The distribution of T above is called the Geometric-p distribution; that is, X has the Geometric-p distribution, denoted X ∼ Geo(p), if
fX(k) = P(X = k) = (1 − p)^{k−1} p for k ∈ {1, 2, . . .}, and 0 otherwise.
What is FX? For t ≥ 0,
FX(t) = Σ_{1 ≤ k ≤ t, k ∈ N} (1 − p)^{k−1} p = 1 − (1 − p)^{⌊t⌋}.
So the Geometric distribution is the number of trials of independent identical experiments until the first success.
Example 8.4. An urn contains b black balls and w white balls. We randomly remove a ball, all balls equally likely, and then return that ball.
Let T be the number of trials until we see a black ball. What is the distribution of T?
Well, every time we try, there is independently a probability of p := b/(b + w) to remove a black ball. So the distribution of T is Geometric-p.
Proposition 8.5 (Memoryless property for geometric distribution). Let X be a Geometric-
p random variable. For any k > m ≥ 1,
P(X = k|X > m) = P(X = k −m).
Proof. By definition, since {X = k} ⊂ {X > m} and P(X > m) = (1 − p)^m,
P(X = k|X > m) = P(X = k, X > m)/P(X > m) = (1 − p)^{k−1} p / (1 − p)^m = (1 − p)^{k−m−1} p = P(X = k − m).
ut
That is, if we are given that we haven't succeeded up to time m, the probability that the first success comes after another j = k − m steps is the same as the probability of first succeeding after j steps to begin with.
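The memoryless property is a finite computation for any fixed p; a sketch in Python with exact arithmetic (p = 1/3 is an arbitrary choice):

```python
# Check P(X = k | X > m) = P(X = k - m) for X ~ Geo(p).
from fractions import Fraction

p = Fraction(1, 3)
def pmf(k):                 # P(X = k) = (1-p)^(k-1) p
    return (1 - p) ** (k - 1) * p

def tail(m):                # P(X > m) = (1-p)^m
    return (1 - p) ** m

for m in range(1, 6):
    for k in range(m + 1, m + 8):
        # {X = k} is contained in {X > m}, so P(X = k, X > m) = P(X = k)
        assert pmf(k) / tail(m) == pmf(k - m)
print("memoryless property verified")
```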
8.1.4. Poisson Distribution. A random variable X has the Poisson-λ distribution (0 < λ ∈ R) if the range is N and
fX(k) = P(X = k) = e^{−λ} · λ^k / k!.
Thus,
FX(t) = Σ_{k=0}^{⌊t⌋} e^{−λ} · λ^k / k!.
Simeon Poisson
(1781–1840)
The importance of the Poisson distribution is in that it seems to occur naturally in
many practical situations (the number of phone calls arriving at a call center per minute,
the number of goals per game in football, the number of mutations in a given stretch of
DNA after a certain amount of radiation, the number of requests sent to a printer from
a network...).
The reason this may happen is that the Poisson is the limit of Binomial distributions when the number of trials goes to infinity: Suppose Xn ∼ Bin(n, λ/n). Then,
P(Xn = k) = C(n, k) · (λ/n)^k · (1 − λ/n)^{n−k}
= (λ^k / k!) · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n) · (1 − λ/n)^{n−k} → e^{−λ} λ^k / k!.
So Poisson is like doing infinitely many trials of an experiment that has small proba-
bility of success, and asking how many successes are observed.
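This convergence is easy to see numerically. A small Python sketch (an illustration, not from the notes) compares the Bin(n, λ/n) density to the Poisson-λ density at a fixed k:

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 3.0, 4
# The Bin(n, lam/n) density at k approaches the Poisson-lam density as n grows.
diffs = [abs(binom_pmf(n, k, lam / n) - poisson_pmf(k, lam)) for n in (10, 100, 10000)]
assert diffs[0] > diffs[1] > diffs[2]
assert diffs[2] < 1e-3
```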
Example 8.6. Suppose the number of people at the checkout counter at the supermar-
ket has Poisson distribution. Checkout A has Poisson-5 and Checkout B has Poisson-2.
Both counters are independent.
Noga arrives at the counters and chooses the shorter line (and chooses A if they are
tied). What is the probability she chooses B?
What is the probability both checkouts are empty?
Solution. Let X be the number of people at A, and Y the number at B. We want the
probability of X > Y .
The information we have is that
P(Y = k, X = m) = P(Y = k) P(X = m),
for all k, m ∈ N.
So
P(Y < X) = ∑_{m=0}^∞ P(X = m) ∑_{k<m} P(Y = k) = ∑_{m=1}^∞ e^{−5} (5^m/m!) ∑_{k=0}^{m−1} e^{−2} (2^k/k!).
The probability that both checkouts are empty is
P(X = 0, Y = 0) = P(X = 0) P(Y = 0) = e^{−5} e^{−2} = e^{−7}.  □
8.1.5. Hypergeometric. Suppose we have an urn with N ≥ 1 balls, 0 ≤ m ≤ N are
black and the rest are white. If we choose 1 ≤ n ≤ N balls from the urn randomly and
uniformly, what is the number of black balls we get?
The space here is uniform measure on
Ω = {ω ⊂ {1, . . . , N} : |ω| = n}.
We ask for the random variable X(ω) := |ω ∩ {1, . . . , m}|.
A simple calculation reveals that for any 0 ≤ k ≤ min{n, m},
P[X = k] = C(m, k) · C(N − m, n − k) / C(N, n).
(We interpret C(a, b) for b > a as 0, so actually, max{0, m + n − N} ≤ k ≤ min{m, n}.)
Such a random variable X is said to have hypergeometric distribution with param-
eters N,m, n, and we write X ∼ H(N,m, n).
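A short Python sketch (illustrative only) of the hypergeometric density; note that Python's `math.comb` already returns 0 when the lower index exceeds the upper one, matching the convention above. The parameters below are hypothetical:

```python
import math

def hypergeom_pmf(k, N, m, n):
    """P[X = k] for X ~ H(N, m, n).

    math.comb(a, b) returns 0 when b > a, matching the convention
    for binomial coefficients used in the notes."""
    return math.comb(m, k) * math.comb(N - m, n - k) / math.comb(N, n)

# Hypothetical parameters: an urn with 12 balls, 5 black, draw 4.
N, m, n = 12, 5, 4
total = sum(hypergeom_pmf(k, N, m, n) for k in range(0, min(m, n) + 1))
assert abs(total - 1) < 1e-12  # Vandermonde's identity
```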
Example 8.7. Chagai chooses 5 cards from a deck, all possibilities equal. How much
more likely is it for him to choose 3 aces than it is to choose 4 aces? How much more
likely is it to choose 4 diamonds than it is to choose 5 diamonds?
Let X = the number of aces chosen. The size of the deck is N = 52, the number of
cards chosen is n = 5 and the number of “special” cards is m = 4. So
P[X = 3] / P[X = 4] = [C(m, 3) · C(N − m, n − 3)] / [C(m, 4) · C(N − m, n − 4)] = [4 · (N − m − n + 4)] / [(m − 3)(n − 3)] = (4 · 47)/(1 · 2) = 94.
For Y = the number of diamonds chosen, we get N = 52, n = 5, m = 13. So
P[Y = 4] / P[Y = 5] = [C(m, 4) · C(N − m, n − 4)] / [C(m, 5) · C(N − m, n − 5)] = [5 · (N − m − n + 5)] / [(m − 4)(n − 4)] = (5 · 39)/(9 · 1) = 65/3 = 21 + 2/3.
8.1.6. Negative Binomial. We say that X ∼ NB(m, p), or X has negative binomial
distribution (or Pascal distribution) if X is the number of trials until the m-th success,
with each success with probability p.
Blaise Pascal (1623–1662)
The event {X = k} is the event that there are m − 1 successes in the first k − 1 trials,
and then one success in the k-th trial. So
P[X = k] = C(k − 1, m − 1) p^m (1 − p)^{(k−1)−(m−1)} = C(k − 1, m − 1) p^m (1 − p)^{k−m},
for k ≥ m.
Example 8.8. Let X ∼ NB(1, p). Then X ∼ Geo(p). Indeed, for any k ≥ 1,
P[X = k] = C(k − 1, 0) p (1 − p)^{k−1} = (1 − p)^{k−1} p.
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 9
9.1. Continuous Random Variables
Definition 9.1. Let (Ω,F ,P) be a probability space. A random variable X : Ω→ R is
called continuous if the distribution function FX is continuous at all points.
Remark 9.2. Recall that we showed that for all x ∈ R,
P(X = x) = lim_{n→∞} PX((x − 1/n, x]) = FX(x) − lim_{n→∞} FX(x − 1/n) = FX(x) − FX(x−).
Thus, if FX is continuous at x, then P(X = x) = 0.
So, if X is a continuous random variable then P(X = x) = 0 for all x ∈ R.
This is very different than in the discrete case!
Continuous random variables can be very pathological, and we need measure theory
to deal with some aspects regarding these. We restrict ourselves for this course to a
“nicer” family of random variables.
Definition 9.3. Let (Ω,F ,P) be a probability space. A random variable X : Ω → R
is called absolutely continuous if there exists an integrable non-negative function
fX : R→ R+ such that for all x ∈ R,
P(X ≤ x) = FX(x) = ∫_{−∞}^x fX(t) dt.
Such a function fX is called the probability density function (or just density function,
or PDF) of X.
Remark 9.4. Recall that the fundamental theorem of calculus says that if fX is continuous, then FX is differentiable and F′X = fX.
Also, a result of calculus tells us that the function x ↦ ∫_{−∞}^x f(t) dt is a continuous
function of x. So any absolutely continuous random variable is also continuous.
Under the carpet: We don’t know how to prove that an integrable function fX :
R → R+ uniquely defines a probability measure PX on (R,B), since we would need
some measure theory for this (actually we could use the Lebesgue–Stieltjes Theorem).
However, given that such a measure is uniquely defined, we have the following properties:
Proposition 9.5. Let X be an absolutely continuous random variable with PDF fX .
(1) By definition,
PX((a, b]) = FX(b) − FX(a) = ∫_{−∞}^b fX(t) dt − ∫_{−∞}^a fX(t) dt = ∫_a^b fX(t) dt.
(2) We don’t always know which sets we can integrate over, but if the integral exists
we have
PX(A) = ∫_A fX(t) dt.
For example, if A = ⊎_{j=1}^n (aj, bj] this holds.
(3) Continuity of probability gives us that PX({a}) = 0 for all a (because X is a
continuous random variable, so FX is continuous at all points). Thus,
PX((a, b]) = PX([a, b)) = PX((a, b)) = PX([a, b]).
(4) We always have ∫_R fX(t) dt = PX(R) = 1.
Proof. Just calculus of the Riemann integral, and continuity of probability measures.
□
Example 9.6. Let X be a random variable with PDF
fX(t) = C(2t − t^2) for 0 ≤ t ≤ 2, and 0 otherwise.
What is C? What is the probability that X > 1?
To compute C we use the fact that ∫_R fX(t) dt = 1. So
1 = ∫_{−∞}^∞ fX(t) dt = C ∫_0^2 (2t − t^2) dt = C (t^2 − t^3/3) |_0^2 = C (4 − 8/3) = C · 4/3.
So C = 3/4.
The probability that X > 1 is
PX((1, ∞)) = 1 − PX((−∞, 1]) = 1 − ∫_0^1 C(2t − t^2) dt = 1 − (3/4) · (2/3) = 1/2.
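Both computations can be confirmed numerically; the following Python sketch (not part of the notes) integrates the density with a simple midpoint rule:

```python
# Numerically confirm C = 3/4 and P(X > 1) = 1/2 for f_X(t) = C(2t - t^2) on [0, 2].
def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

C = 1 / integrate(lambda t: 2 * t - t * t, 0, 2)          # normalization constant
p_gt_1 = integrate(lambda t: C * (2 * t - t * t), 1, 2)   # P(X > 1)
assert abs(C - 0.75) < 1e-6
assert abs(p_gt_1 - 0.5) < 1e-6
```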
9.1.1. Uniform Distribution. An absolutely continuous random variable X has the
uniform distribution on [a, b], denoted X ∼ U([a, b]), if X has PDF
fX(t) = 1/(b − a) for a ≤ t ≤ b, and 0 otherwise.
Indeed, ∫_R fX(t) dt = 1.
Example 9.7. A bus arrives at the station every 15 minutes, starting at 5:00. A
passenger arrives between 7:00 and 7:30, uniformly distributed. What is the probability
that he does not wait more than 5 minutes for a bus?
If we take X = the time the passenger arrives, then X has uniform distribution on
[0, 30]. Also, the event that the passenger waits at most 5 minutes is the disjoint union
{X = 0} ⊎ {10 ≤ X ≤ 15} ⊎ {25 ≤ X ≤ 30}.
Thus the probability is
P(X = 0) + P(10 ≤ X ≤ 15) + P(25 ≤ X ≤ 30) = 0 + 5/30 + 5/30 = 1/3.
The uniform distribution is very symmetric in the following sense:
Exercise 9.8. Let X ∼ U([a, b]). Show that Y = X + c has the uniform distribution on
[a+ c, b+ c].
Solution. We need to show that
FY (x) = P({ω : Y (ω) ≤ x}) = ∫_{−∞}^x 1/(b + c − (a + c)) · 1_{[a+c, b+c]}(t) dt.
Indeed, since {ω : Y (ω) ≤ x} = {ω : X(ω) ≤ x − c}, we have that
FY (x) = FX(x − c) = ∫_{−∞}^{x−c} 1/(b − a) · 1_{[a,b]}(t) dt = 1/(b + c − (a + c)) · ∫_{−∞}^x 1_{[a+c, b+c]}(u) du.  □
Exercise 9.9. What is the distribution function of a U([a, b]) random variable?
Proof. For a ≤ x ≤ b we have
FX(x) = 1/(b − a) ∫_a^x dt = (x − a)/(b − a).
For x < a, FX(x) = 0, and for x > b, FX(x) = 1.  □
9.1.2. Exponential Distribution. Let 0 < λ ∈ R. An absolutely continuous random
variable X is said to have exponential distribution of parameter λ, denoted X ∼ Exp(λ), if it has PDF
fX(t) = 1_{t≥0} · λ e^{−λt}.
What is the distribution function in this case?
FX(x) = λ ∫_0^x e^{−λt} dt = (−e^{−λt}) |_0^x = 1 − e^{−λx},
for x ≥ 0. For x < 0, FX(x) = 0.
Example 9.10. The duration of the wait in line at Kupat Cholim is exponentially
distributed with parameter 1/5. What is the probability that the wait is longer than 10
minutes?
Suppose we are given that the wait is longer than 20 minutes. What is the probability
given this information that the wait is longer than 30 minutes?
Let X = waiting time. So X ∼ Exp(1/5).
P(X > 10) = ∫_{10}^∞ (1/5) e^{−t/5} dt = (−e^{−t/5}) |_{10}^∞ = e^{−2}.
Now, using the definition of conditional probability,
P(X > 30 | X > 20) = P(X > 30) / P(X > 20).
Since
P(X > x) = (−e^{−t/5}) |_x^∞ = e^{−x/5},
we get that
P(X > 30 | X > 20) = e^{−6} / e^{−4} = e^{−2}.
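A minimal Python sketch (illustrative, using the survival function P(X > x) = e^{−λx}) confirming both answers and the memoryless identity:

```python
import math

lam = 1 / 5  # waiting time X ~ Exp(1/5), as in the example

def survival(x):
    """P(X > x) = e^{-lam * x} for x >= 0."""
    return math.exp(-lam * x)

assert abs(survival(10) - math.exp(-2)) < 1e-12      # P(X > 10) = e^{-2}
cond = survival(30) / survival(20)                   # P(X > 30 | X > 20)
assert abs(cond - math.exp(-2)) < 1e-12
assert abs(cond - survival(10)) < 1e-12              # memorylessness
```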
Is it a coincidence that we got the same answer? No. Recall that the geometric
distribution has the memoryless property; that is, a geometric random variable X satisfies
P(X > m + n | X > m) = P(X > n).
Similarly, the exponential distribution has the memoryless property:
Proposition 9.11. Let X ∼ Exp(λ). Then for any x, y ∈ R+,
P(X > x+ y|X > x) = P(X > y).
That is, the knowledge of waiting for some time does not give information about how
much more we have to wait.
Proof. First,
P(X > t) = 1 − FX(t) = e^{−λt}.
So
P(X > x + y | X > x) = P(X > x + y) / P(X > x) = e^{−λ(x+y)} / e^{−λx} = e^{−λy} = P(X > y).  □
However, it turns out that the exponential distribution is the only continuous distribution with the memoryless property:
Proposition 9.12. Let X be a continuous random variable, such that for all x, y ∈ R+,
P(X > x+ y|X > x) = P(X > y).
Then, X ∼ Exp(λ) for some λ > 0.
Proof. X is continuous, so FX is a continuous function. Let g(t) = 1 − FX(t) = P(X > t). So g is a continuous function as well.
Now, for any x, y ∈ R+,
g(x + y) = g(x) g(y).
Let λ ≥ 0 be such that g(1) = e^{−λ} (recall that g : R → [0, 1]). For any 0 < n ∈ N we
have that
g(n) = g(n − 1) g(1) = · · · = g(1)^n.
Since g(n) → 0 we have that g(1) < 1 and λ > 0.
For any 0 < m ∈ N we have that
g(1) = g(1/m) g(1 − 1/m) = · · · = g(1/m)^m,
so g(1/m) = g(1)^{1/m}. In a similar way, we get that for any rational number n/m,
g(n/m) = g(1)^{n/m} = e^{−λ n/m}.
Since g is continuous and coincides with t ↦ e^{−λt} on all of Q, we get that g(t) = e^{−λt} for all t.  □

9.1.3. Normal Distribution.
Some calculus. Let us look at the following integral:
I = ∫_{−∞}^∞ e^{−t²} dt.
How can we compute this value?
We use two-dimensional calculus to compute it: Note that
I² = ∫_{R²} e^{−(x²+y²)} dx dy.
Setting x = r cos θ and y = r sin θ (or r = √(x² + y²) and θ = arctan(y/x)), the Jacobian matrix is
J = [ ∂x/∂r = cos θ, ∂x/∂θ = −r sin θ ; ∂y/∂r = sin θ, ∂y/∂θ = r cos θ ],
so det(J) = r. Thus,
I² = ∫_0^{2π} ∫_0^∞ e^{−r²} r dr dθ.
Substituting s = r², so that ds = 2r dr, we have that
I² = ∫_0^{2π} ∫_0^∞ (1/2) e^{−s} ds dθ = π.
So I = √π.
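The value I = √π can be confirmed numerically; a Python sketch (not from the notes), truncating the integral to [−10, 10] where the tail is negligible:

```python
import math

def integrate(f, a, b, steps=200000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# e^{-t^2} is negligible outside [-10, 10], so truncating there is harmless.
I = integrate(lambda t: math.exp(-t * t), -10.0, 10.0)
assert abs(I - math.sqrt(math.pi)) < 1e-6
```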
Now, we can use another trick to compute
I = ∫_{−∞}^∞ exp(−(t − µ)² / (2σ²)) dt.
Letting x = (t − µ)/(√2 σ), we have that dt = √2 σ dx. Thus,
I = √2 σ ∫_{−∞}^∞ e^{−x²} dx = √(2π) σ.
We are ready to define the normal distribution:
Carl Friedrich Gauss
(1777–1855)
Let µ ∈ R and σ ∈ R+. An absolutely continuous random variable has normal-(µ, σ)
distribution, denoted X ∼ N(µ, σ), if it has PDF
fX(t) = (1/√(2πσ²)) exp(−(t − µ)² / (2σ²)).
A normal-(0, 1) random variable is called a standard normal random variable.
Example 9.13. The distribution of the lifetime of people in Oceania is distributed as
a normal-(75, 8) random variable. What is the probability a person lives more than 100
years? Show that it is less than e^{−ξ²/2} · (1/√(2π)), where ξ = 25/8.
George Orwell
(1903–1950)
If X = lifetime, then X ∼ N(75, 8). So
P(X > 100) = ∫_{100}^∞ (1/(√(2π) · 8)) e^{−(t−75)²/(2·8²)} dt.
Taking s = (t − 75)/8, we have that dt = 8 ds, so
P(X > 100) = ∫_ξ^∞ (1/√(2π)) e^{−s²/2} ds.
Since on (ξ, ∞) we have that 1 < s (as ξ > 1), we get that
P(X > 100) < ∫_ξ^∞ (1/√(2π)) s e^{−s²/2} ds.
Substituting u = s²/2, so du = s ds, we have that
P(X > 100) < ∫_{ξ²/2}^∞ (1/√(2π)) e^{−u} du = e^{−ξ²/2} · (1/√(2π)).
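A Python sketch (illustrative only) checking the bound against the exact tail, which is available through the complementary error function:

```python
import math

xi = 25 / 8
# Exact tail of a standard normal: P(Z > xi) = erfc(xi / sqrt(2)) / 2.
tail = math.erfc(xi / math.sqrt(2)) / 2
bound = math.exp(-xi ** 2 / 2) / math.sqrt(2 * math.pi)
assert 0 < tail < bound  # the bound from the example indeed dominates the tail
```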
A nice property of the normal distribution is the following:
Exercise 9.14. Let X ∼ N(µ, σ) and let a > 0. Define Y (ω) = aX(ω) + b. Show that Y ∼ N(aµ + b, aσ).
Solution. We need to prove that for any y ∈ R,
P(Y ≤ y) = ∫_{−∞}^y (1/(√(2π) · aσ)) exp(−(t − aµ − b)² / (2a²σ²)) dt.
Since
{ω : Y (ω) ≤ y} = {ω : X(ω) ≤ (y − b)/a},
we have that
P(Y ≤ y) = P(X ≤ (y − b)/a) = ∫_{−∞}^{(y−b)/a} (1/(√(2π) · σ)) exp(−(t − µ)² / (2σ²)) dt.
Substituting t = (s − b)/a (so that t − µ = (s − aµ − b)/a), we have that dt = (1/a) ds and
P(Y ≤ y) = ∫_{−∞}^y (1/(√(2π) · aσ)) exp(−(s − aµ − b)² / (2a²σ²)) ds.  □
Corollary 9.15. If X is a normal-(µ, σ) random variable, then (X − µ)/σ is a standard
normal random variable.
Example 9.16. Let X ∼ N(µ, σ). Show that FX(t) > 0 for all t.
Indeed,
FX(t) = ∫_{−∞}^t fX(s) ds ≥ ∫_{t−1}^t fX(s) ds ≥ inf_{s∈[t−1,t]} fX(s).
Since fX is increasing up to µ and decreasing after µ, the infimum on
[t − 1, t] is attained at one of the endpoints. So
FX(t) ≥ min{fX(t − 1), fX(t)} > 0.
Example 9.17. Let X ∼ N(0, 1). Show that X2 is absolutely continuous. Is X2
normally distributed?
First, to show that X² is absolutely continuous we need to show that
FX²(t) = ∫_{−∞}^t f(s) ds
for some function f : R → R+. Indeed, if t < 0 then {X² ≤ t} = ∅, so FX²(t) = 0.
If t ≥ 0,
FX²(t) = P[X² ≤ t] = P[−√t ≤ X ≤ √t] = ∫_{−√t}^{√t} (1/√(2π)) e^{−s²/2} ds.
Since the integrand is even (symmetric around 0), we have that
FX²(t) = 2 ∫_0^{√t} (1/√(2π)) e^{−s²/2} ds = ∫_0^t (1/√(2πx)) e^{−x/2} dx,
where we have used the change of variables x = s², so ds = dx/(2√x).
Now, if we define
f(s) = (1/√(2πs)) e^{−s/2} if s > 0, and 0 if s ≤ 0,
then we get from all the above that
FX²(t) = ∫_{−∞}^t f(s) ds.
So X² is absolutely continuous with density fX² = f.
Is X² normal? No. This can be seen since FX²(0) = 0, but FN(0) > 0 for any
N ∼ N(µ, σ).
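The identity FX²(t) = ∫_0^t f(s) ds can be checked numerically; a Python sketch (not part of the notes; the midpoint rule handles the integrable singularity of f at 0 only roughly, hence the loose tolerance):

```python
import math

def phi(s):
    """Standard normal density."""
    return math.exp(-s * s / 2) / math.sqrt(2 * math.pi)

def f(s):
    """Candidate density of X^2 from the example."""
    return math.exp(-s / 2) / math.sqrt(2 * math.pi * s) if s > 0 else 0.0

def integrate(g, a, b, steps=200000):
    """Midpoint rule; only rough near the singularity of f at 0."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

t = 2.0
lhs = integrate(phi, -math.sqrt(t), math.sqrt(t))  # P(-sqrt(t) <= X <= sqrt(t))
rhs = integrate(f, 0.0, t)                         # integral of the candidate density
assert abs(lhs - rhs) < 1e-2
```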
Lecture 10
10.1. Independence Revisited
10.1.1. Some reminders. Let (Ω,F ,P) be a probability space. Given a collection of
subsets K ⊂ F , recall that the σ-algebra generated by K is
σ(K) = ⋂ {G : G is a σ-algebra, K ⊂ G},
and this σ-algebra is the smallest σ-algebra containing K. σ(K) can be thought of as all
possible information that can be generated by the sets in K.
Recall that a collection of events (An)n are mutually independent if for any finite
number of these events A1, . . . , An we have that
P(A1 ∩ · · · ∩An) = P(A1) · · ·P(An).
We can also define the independence of families of events:
Definition 10.1. Let (Ω,F ,P) be a probability space. Let (Kn)n be a collection of
families of events in F . Then, (Kn)n are mutually independent if for any finite
number of families from this collection, K1, . . . ,Kn, and any events A1 ∈ K1, A2 ∈ K2, . . . , An ∈ Kn, we have that
P(A1 ∩ · · · ∩ An) = P(A1) · · ·P(An).
For example, we would like to think that (An) are mutually independent, if all the
information from each event in the sequence is independent from the information from
the other events in the sequence. This is the content of the next proposition.
Proposition 10.2. Let (An) be a sequence of events on a probability space (Ω,F ,P).
Then, (An) are mutually independent if and only if the σ-algebras (σ(An))n are mutually
independent.
It is not difficult to prove this by induction:
Proof. By induction on n we show that (σ(A1), . . . , σ(An), An+1 , . . .) are mutually
independent.
The base n = 0 is just the assumption.
So assume that (σ(A1), . . . , σ(An−1), An , An+1 , . . .) are mutually independent.
Let n1 < n2 < . . . < nk < nk+1 < . . . < nm be a finite number of indices such that
nk < n < nk+1. Let Bj ∈ σ(Anj ) for j = 1, . . . , k, and Bj = Anj for j = k + 1, . . . ,m.
Let B ∈ σ(An). If B = ∅ then P(B1 ∩ · · · ∩ Bm ∩ B) = 0 = P(B1) · · ·P(Bm) · P(B). If
B = Ω then P(B1 ∩ · · · ∩ Bm ∩ B) = P(B1 ∩ · · · ∩ Bm) = P(B1) · · ·P(Bm), by the induction
hypothesis. If B = An then this also holds by the induction hypothesis. So we only
have to deal with the case B = Acn. In this case, for X = B1 ∩ · · · ∩ Bm,
P(X ∩ B) = P(X \ (X ∩ An)) = P(X) − P(X ∩ An) = P(X)(1 − P(An)) = P(X)P(B),
where we have used the induction hypothesis to say that X and An are independent.
Since this holds for any choice of a finite number of events, we have the induction
step.  □
However, we will take a more winding road to get stronger results (Corollary 10.8).
10.2. π-systems and Independence
Definition 10.3. Let K be a family of subsets of Ω.
• We say that K is a π-system if ∅ ∈ K and K is closed under intersections; that
is, for all A,B ∈ K, A ∩B ∈ K.
• We say that K is a Dynkin system (or λ-system) if K is closed under set complements and countable disjoint unions; that is, for any A ∈ K and any sequence
(An) in K of pairwise disjoint subsets, we have that Ac ∈ K and ⊎_n An ∈ K.
The main goal now is to show that probability measures are uniquely defined once
they are defined on a π-system. This is the content of Theorem 10.7.
Proposition 10.4. If F is Dynkin system on Ω and F is also a π-system on Ω, then
F is a σ−algebra.
Proof. Since F is a Dynkin system, it is closed under complements, and Ω = ∅c ∈ F .
Let (An) be a sequence of subsets in F . Set B1 = A1, C1 = ∅, and for n > 1,
Cn = ⋃_{j=1}^{n−1} Aj and Bn = An \ Cn.
Since F is a π-system and closed under complements, Cn = (⋂_{j=1}^{n−1} Acj)^c ∈ F , and so
Bn = An ∩ Ccn ∈ F . Since (Bn) is a sequence of pairwise disjoint sets, and since F is a
Dynkin system,
⋃_n An = ⊎_n Bn ∈ F .  □
Proposition 10.5. If (Dα)α is a collection of Dynkin systems (not necessarily countable), then D = ⋂_α Dα is a Dynkin system.
Proof. Since ∅ ∈ Dα for all α, we have that ∅ ∈ D.
If A ∈ D, then A ∈ Dα for all α. So Ac ∈ Dα for all α, and thus Ac ∈ D.
If (An)n is a countable sequence of pairwise disjoint sets in D, then An ∈ Dα for all
α and all n. Thus, for any α, ⊎_n An ∈ Dα. So ⊎_n An ∈ D.  □
Eugene Dynkin (1924–)
Lemma 10.6 (Dynkin’s Lemma). If a Dynkin system D contains a π-system K, then
σ(K) ⊂ D.
Proof. Let
F = ⋂ {D′ : D′ is a Dynkin system containing K}.
So F is a Dynkin system and K ⊂ F ⊂ D. We will show that F is a σ-algebra, so
σ(K) ⊂ F ⊂ D.
Suppose we know that F is closed under intersections (which is Claim 3 below). Since
∅ ∈ K ⊂ F , we will then have that F is a π-system. Being both a Dynkin system and a
π−system, F is a σ−algebra.
Thus, to show that F is a σ−algebra, it suffices to show that F is closed under
intersections.
Note that F is closed under complements (because all Dynkin systems are).
Claim 1. If A ⊂ B are subsets in F , then B \A ∈ F .
Proof. If A, B ∈ F , then since F is a Dynkin system, also Bc ∈ F . Since A ⊂ B, we
have that A, Bc are disjoint, so A ⊎ Bc ∈ F and so B \ A = Ac ∩ B = (A ⊎ Bc)c ∈ F .  □
Claim 2. For any K ∈ K, if A ∈ F then A ∩K ∈ F .
Proof. Let E = {A : A ∩ K ∈ F}. Let A ∈ E and (An) be a sequence of pairwise disjoint subsets in E.
Since K ∈ F and A ∩K ∈ F , by Claim 1 we have that Ac ∩K = K \ (A ∩K) ∈ F .
So Ac ∈ E.
Since (An ∩K)n is a sequence of pairwise disjoint subsets in F , we get that
(⊎_n An) ∩ K = ⊎_n (An ∩ K) ∈ F .
So we conclude that E is a Dynkin system. Since K is closed under intersections, E
contains K. Thus, by definition F ⊂ E.
So for any A ∈ F we have that A ∈ E, and A ∩ K ∈ F .  □
Claim 3. For any B ∈ F , if A ∈ F then A ∩B ∈ F .
Proof. Let E = {A : A ∩ B ∈ F}. Let A ∈ E and (An) be a sequence of pairwise disjoint subsets in E.
Since B ∈ F and A∩B ∈ F , by Claim 1 we have that Ac ∩B = B \ (A∩B) ∈ F . So
Ac ∈ E.
Since (An ∩B)n is a sequence of pairwise disjoint subsets in F , we get that
(⊎_n An) ∩ B = ⊎_n (An ∩ B) ∈ F .
So we conclude that E is a Dynkin system. By Claim 2, K is contained in E. So by
definition, F ⊂ E.  □
Since F is closed under intersections, this completes the proof.  □
The next theorem tells us that a probability measure on (Ω,F) is determined by its
values on a π-system generating F .
Theorem 10.7 (Uniqueness of Extension). Let K be a π−system on Ω, and let F =
σ(K) be the σ−algebra generated by K. Let P,Q be two probability measures on (Ω,F),
such that for all A ∈ K, P (A) = Q(A). Then, P (B) = Q(B) for any B ∈ F .
Proof. Let D = {A ∈ F : P (A) = Q(A)}. So K ⊂ D. We will show that D is a Dynkin
system, and since it contains K, by Dynkin’s Lemma it must contain F = σ(K).
If A ∈ D, then P (Ac) = 1− P (A) = 1−Q(A) = Q(Ac), so Ac ∈ D.
Let (An) be a sequence of pairwise disjoint sets in D. Then,
P(⊎_n An) = ∑_n P (An) = ∑_n Q(An) = Q(⊎_n An).
So ⊎_n An ∈ D.  □
Corollary 10.8. Let (Ω,F ,P) be a probability space. Let (Πn)n be a sequence of π-
systems, and let Fn = σ(Πn). Then, (Πn)n are mutually independent if and only if
(Fn)n are mutually independent.
Proof. We will prove by induction on n that for any n ≥ 0, the collection (F1,F2, . . . ,Fn,Πn+1,Πn+2, . . . , )
are mutually independent.
For n = 0 this is the assumption. For n ≥ 1, let n1 < n2 < . . . < nk < nk+1 < . . . <
nm be a finite number of indices such that nk < n < nk+1. Let Aj ∈ Fnj , j = 1, . . . , k,
and Aj ∈ Πnj , j = k + 1, . . . ,m.
For any A ∈ Fn, if P(A1 ∩ · · · ∩Am) = 0 then A is independent of A1 ∩ · · · ∩Am. So
assume that P(A1 ∩ · · · ∩Am) > 0.
For any A ∈ Fn define the probability measure
Q(A) := P(A | A1 ∩ · · · ∩ Am).
By induction, the collection (F1, . . . ,Fn−1,Πn,Πn+1, . . .) are mutually independent, so
Q(A) = P(A) for any A ∈ Πn. Since Πn is a π-system generating Fn, we have by Theorem 10.7 that Q(A) = P(A) for any A ∈ Fn. Since this holds for any choice of a finite
number of events A1, . . . , Am, we get that the collection (F1, . . . ,Fn−1,Fn,Πn+1, . . .)
are mutually independent.  □
Corollary 10.9. Let (Ω,F ,P) be a probability space, and let (An)n be a sequence of
mutually independent events. Then, the σ-algebras (σ(An))n are mutually independent.
10.3. Second Borel Cantelli Lemma
We conclude this lecture with
Lemma 10.10 (Second Borel–Cantelli Lemma). Let (An) be a sequence of mutually
independent events. If ∑_n P(An) = ∞ then P(lim sup_n An) = P(An i.o.) = 1.
Proof. It suffices to show that P(lim inf Acn) = 0.
For fixed m > n, note that by independence
P(⋂_{k=n}^m Ack) = ∏_{k=n}^m (1 − P(Ak)).
Since lim_m ⋂_{k=n}^m Ack = ⋂_{k≥n} Ack, we have that
P(⋂_{k≥n} Ack) = lim_{m→∞} ∏_{k=n}^m (1 − P(Ak)) = ∏_{k≥n} (1 − P(Ak)) ≤ exp(−∑_{k≥n} P(Ak)),
where we have used 1 − x ≤ e^{−x}.
Since ∑_n P(An) = ∞, we have that ∑_{k≥n} P(Ak) = ∞ for all fixed n. Thus,
P(⋂_{k≥n} Ack) = 0 for all fixed n, and
P(lim inf Acn) = P(⋃_n ⋂_{k≥n} Ack) = P(lim_n ⋂_{k≥n} Ack) = lim_{n→∞} P(⋂_{k≥n} Ack) = 0.  □
Example 10.11. A biased coin is tossed infinitely many times, all tosses mutually
independent. What is the probability that the sequence 01010 appears infinitely many
times?
Let p be the probability the coin’s outcome is 1, with 0 < p < 1. Let an be the outcome of the n-th toss.
Let An be the event that an = 0, an+1 = 1, an+2 = 0, an+3 = 1, an+4 = 0. Since the tosses
are mutually independent, the sequence of events (A5n+1)n≥0 is mutually
independent. Since P(A5n+1) = p²(1 − p)³ > 0, we have that ∑_{n=0}^∞ P(A5n+1) = ∞.
So the second Borel–Cantelli Lemma tells us that P(A5n+1 i.o.) = 1. Since {A5n+1 i.o.} implies
{An i.o.}, we are done.
Lecture 11
11.1. Independent Random Variables
Let X : (Ω,F) → (R,B) be a random variable. Recall that in the definition of
a random variable we require that X is a measurable function; i.e. for any Borel set
B ∈ B we have X−1(B) ∈ F . We want to define a σ-algebra that is all the possible
information that can be infered from X.
Definition 11.1. Let (Ω,F ,P) be a probability space and let (Xα)α∈I : Ω → R be a
collection of random variables. The σ-algebra generated by (Xα)α∈I is defined as
σ(Xα : α ∈ I) := σ(X−1α (B) : α ∈ I,B ∈ B
).
Note that for a random variable X : Ω → R, the collectionX−1(B) : B ∈ B
is a
σ-algebra, so σ(X) =X−1(B) : B ∈ B
.
We now define independent random variables.
Definition 11.2. Let (Xn)n, (Yn)n be collections of random variables on some proba-
bility space (Ω,F ,P). We say that (Xn)n are mutually independent if the σ-algebras
(σ(Xn))n are mutually independent.
We say that (Xn)n are independent of (Yn)n if the σ-algebras σ((Xn)n) and σ((Yn)n)
are independent.
An important property (that is a consequence of the π-system argument above):
Proposition 11.3. Let (Xn)n be a collection of random variables on (Ω,F ,P). (Xn) are
mutually independent, if for any finite number of random variables from the collection,
X1, . . . , Xn, and any real numbers a1, a2, . . . , an, we have that
P[X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an] = P[X1 ≤ a1] · P[X2 ≤ a2] · · ·P[Xn ≤ an].
Proof. Let X1, . . . , Xn be a finite number of random variables.
Define two probability measures on the Borel sets of Rn: let P1(B) = P((X1, . . . , Xn) ∈ B) for B ∈ Bn, and let P2 = PX1 × · · · × PXn be the product measure, so that
P2(B1 × · · · × Bn) = P(X1 ∈ B1) · · ·P(Xn ∈ Bn).
Since these two measures are identical on the π-system
{(−∞, a1] × · · · × (−∞, an] : a1, a2, . . . , an ∈ R},
and since this π-system generates all Borel sets on Rn, we get that P1 = P2 on all Borel
sets of Rn.
Thus, if Aj ∈ σ(Xj), j = 1, . . . , n, then for all j, Aj = X−1j (Bj) for some Borel set Bj on R, so
P(A1 ∩ · · · ∩ An) = P((X1, . . . , Xn) ∈ B1 × · · · × Bn) = P(X1 ∈ B1) · · ·P(Xn ∈ Bn) = P(A1) · · ·P(An).  □
Example 11.4. We toss two dice. Y is the number on the first die and X is the sum
of both dice.
Here the probability space is the uniform measure on all pairs of dice results, so
Ω = {1, 2, . . . , 6}².
What is σ(X)?
σ(X) = {{(x, y) ∈ Ω : x + y ∈ A} : A ⊂ {2, . . . , 12}}.
(We have to know that all subsets of {2, . . . , 12} are Borel sets. This follows from the
fact that {r} = ⋂_n (r − 1/n, r].) On the other hand,
σ(Y ) = {{(x, y) ∈ Ω : x ∈ A} : A ⊂ {1, . . . , 6}}.
Now note
P[X = 7, Y = 3] = 1/36 = P[X = 7] · P[Y = 3].
This is mainly due to P[X = 7] = 1/6. Similarly, for any y ∈ {1, . . . , 6}, we have that
the events {X = 7} and {Y = y} are independent.
Are X and Y independent random variables? NO!
For example,
P[X = 6, Y = 3] = 1/36 ≠ (5/36) · (1/6) = P[X = 6] · P[Y = 3].
For independence we need all information from X to be independent of the information
from Y !
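Since Ω is finite, these claims can be verified by brute-force enumeration; a Python sketch (illustrative only):

```python
from itertools import product
from fractions import Fraction

# Enumerate the uniform space Omega = {1,...,6}^2; Y = first die, X = the sum.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return sum(Fraction(1, 36) for w in omega if event(w))

X = lambda w: w[0] + w[1]
Y = lambda w: w[0]

# {X = 7} is independent of {Y = 3} ...
assert prob(lambda w: X(w) == 7 and Y(w) == 3) == prob(lambda w: X(w) == 7) * prob(lambda w: Y(w) == 3)
# ... but {X = 6} is not:
assert prob(lambda w: X(w) == 6 and Y(w) == 3) != prob(lambda w: X(w) == 6) * prob(lambda w: Y(w) == 3)
```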
Example 11.5. Let (Xn)n be independent random variables such that Xn ∼ Exp(1)
for all n. Then, for any α > 0,
P[Xn > α log n] = e^{−α log n} = n^{−α}.
Note that in this case,
∑_{n≥1} P[Xn > α log n] = ∞ if α ≤ 1, and < ∞ if α > 1.
For every n the event {Xn > α log n} ∈ σ(Xn). So the sequence ({Xn > α log n})n is
independent. By the Borel–Cantelli Lemma (both parts),
P[Xn > α log n i.o.] = 1 if α ≤ 1, and 0 if α > 1.
Lecture 12
12.1. Joint Distributions
In the same way that we constructed the Borel sets on R, we can consider the σ-
algebra of subsets of Rd, generated by sets of the form (a1, b1]× (a2, b2]× · · · × (ad, bd].
This is the Borel σ-algebra on Rd, usually denoted by Bd or B(Rd).
Suppose now that we have d random variables X1, . . . , Xd on some probability space
(Ω,F ,P). Then, we can define a function X : Ω → Rd by X(ω) = (X1(ω), X2(ω), . . . , Xd(ω)).
Suppose we are told that this function is measurable from (Ω,F) to (Rd,Bd). In this
case X is called an Rd-valued random variable, and we say that X1, . . . , Xd have a
joint distribution on (Ω,F ,P).
Proposition 12.1. X1, . . . , Xd are random variables if and only if X = (X1, . . . , Xd)
is an Rd-valued random variable.
Proof. Note that given a measurable function X : (Ω,F) → (Rd,Bd) we can always
write X = (X1, . . . , Xd), and each Xj is a random variable, since for any B ∈ B,
X−1j (B) = X−1(R× · · · ×B × · · · × R) ∈ F .
Now suppose X1, . . . , Xd are random variables on (Ω,F ,P). We need to show that
X−1(B) ∈ F for all B ∈ Bd. Since Bd = σ({(a1, b1] × (a2, b2] × · · · × (ad, bd] : aj < bj , j = 1, . . . , d}), by Proposition 7.3 it suffices to prove that
X−1((a1, b1] × (a2, b2] × · · · × (ad, bd]) ∈ F
for all aj < bj , j = 1, 2, . . . , d. This follows from the fact that
X−1((a1, b1] × (a2, b2] × · · · × (ad, bd]) = ⋂_{j=1}^d X−1j ((aj , bj ]) ∈ F .  □
The probability measure PX on (Rd,Bd) defined by PX(B) = P(X ∈ B) = P(X−1(B))
is called the distribution of X. We can also define the joint distribution function
of X1, . . . , Xd:
F(X1,...,Xd)(t1, . . . , td) = FX(~t) = P(X1 ≤ t1, X2 ≤ t2, . . . , Xd ≤ td).
A restatement of Proposition 11.3 is:
Proposition 12.2. Let (X1, . . . , Xd) be d random variables on a probability space (Ω,F ,P)
with a joint distribution. Then, (X1, . . . , Xd) are mutually independent if and only if
∀ (t1, . . . , td) ∈ Rd FX(t1, . . . , td) = FX1(t1) · · ·FXd(td).
Example 12.3. An urn contains 3 red balls, 4 white balls and 5 blue balls. We remove
three balls from the urn, all balls equally likely. Let X be the number of red balls
removed and Y the number of white balls removed.
What is the joint distribution of (X,Y )?
A natural probability space here is the uniform measure on Ω = {S ⊂ {1, . . . , 12} : |S| = 3}.
Suppose f : {1, . . . , 12} → {red, white, blue} is the function assigning a color to each
ball in the urn.
So
(X, Y )(S) = (#{s ∈ S : f(s) = red}, #{s ∈ S : f(s) = white}).
So
P(X,Y )((0, 0)) = P({S ∈ Ω : f(s) = blue ∀ s ∈ S}) = C(5, 3) / C(12, 3).
For any S ∈ Ω, we can write S = Sr ⊎ Sw ⊎ Sb where Sr = {s ∈ S : f(s) = red} and
similarly for white and blue. So |S| = |Sr| + |Sw| + |Sb|.
If (X, Y )(S) = (0, 1) then |Sr| = 0, |Sw| = 1 and |Sb| = 2. So
P(X,Y )((0, 1)) = [C(3, 0) · C(4, 1) · C(5, 2)] / C(12, 3).
Similarly, if (X, Y )(S) = (1, 1) then |Sr| = 1, |Sw| = 1 and |Sb| = 1. So
P(X,Y )((1, 1)) = [C(3, 1) · C(4, 1) · C(5, 1)] / C(12, 3).
Generally, for integers x, y ≥ 0 such that x + y ≤ 3,
P(X,Y )((x, y)) = [C(3, x) · C(4, y) · C(5, 3 − x − y)] / C(12, 3).
Of course F(X,Y ) can be calculated by summing these.
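A Python sketch (not part of the notes) computing this joint density and checking that it sums to 1, as Vandermonde's identity guarantees:

```python
import math
from fractions import Fraction

def joint(x, y):
    """P_{(X,Y)}((x, y)) for the 3-red / 4-white / 5-blue urn, 3 balls drawn."""
    if x < 0 or y < 0 or x + y > 3:
        return Fraction(0)
    return Fraction(math.comb(3, x) * math.comb(4, y) * math.comb(5, 3 - x - y),
                    math.comb(12, 3))

assert joint(0, 0) == Fraction(math.comb(5, 3), math.comb(12, 3))
assert sum(joint(x, y) for x in range(4) for y in range(4)) == 1
```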
Example 12.4. Two fair coins are tossed, their outcomes independent. X ∈ {0, 1} is the outcome of the first coin. Y ∈ {0, 1} is the outcome of the second coin. Z = X + Y (mod 2). Show that X, Z are independent.
First, P(X = 0) = P(X = 1) = 1/2. Also,
P(Z = 0) = P(X = Y = 0) + P(X = Y = 1) = 1/4 + 1/4 = 1/2.
Similarly, P(Z = 1) = 1/2.
Now,
P(X = 0, Z = 0) = P(X = 0, Y = 0) = 1/4 = P(X = 0) · P(Z = 0).
Similarly for other values.
As in the case of 1-dimensional random variables, we would like to define the infor-
mation a random variable carries as some σ-algebra. Thus, for a d-dimensional random
variable X we define
σ(X) := {X−1(B) : B ∈ Bd}.
This is a σ-algebra.
Recall that if X = (X1, . . . , Xd) then
σ(X1, . . . , Xd) = σ({X−1j (B) : B ∈ B, 1 ≤ j ≤ d})
= σ({X−1(R × · · · × R × B × R × · · · × R) : B ∈ B}).
The following proposition shows that these notations agree.
Proposition 12.5. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space. Then, σ(X) = σ(X1, . . . , Xd).
Proof. Define G = {B ∈ Bd : X−1(B) ∈ σ(X1, . . . , Xd)}. Then G is a σ-algebra,
and the set of rectangles K = {(a1, b1] × · · · × (ad, bd] : aj < bj ∈ R} is contained
in G. Since Bd is generated by K, we get that Bd ⊂ G, and so σ(X) ⊂ σ(X1, . . . , Xd).
On the other hand, if B ∈ B then X−1j (B) = X−1(R × · · · × B × · · · × R) ∈ σ(X). So
σ(X1, . . . , Xd) ⊂ σ(X) and they are equal.  □
12.2. Marginals
Lemma 12.6. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space (Ω,F ,P). Let Y = (X1, . . . , Xd−1) and let Z = (X2, . . . , Xd) (which
are (d − 1)-dimensional). Then, for any ~t ∈ Rd−1,
FY (~t) = lim_{t→∞} FX(~t, t) and FZ(~t) = lim_{t→∞} FX(t, ~t).
Proof. Fix ~t = (t1, . . . , td−1) ∈ Rd−1. Let A = {X1 ≤ t1, . . . , Xd−1 ≤ td−1}.
Let sn ↗ ∞, and set An = {Xd ≤ sn}. Since (sn) is an increasing sequence, we have
that An ⊂ An+1, so (An)n is an increasing sequence, and thus (A ∩ An)n is also an
increasing sequence. Note that
lim_n An = ⋃_n An = {Xd < ∞} = Ω,
so lim_n (A ∩ An) = A. Thus, using continuity of probability,
FY (~t) = P[A] = lim_n P[A ∩ An] = lim_{n→∞} FX(~t, sn).
Since this holds for any sequence sn ↗ ∞, we have the first assertion.
The proof of the second assertion is almost identical, switching the roles of the first
and last coordinates.  □
Corollary 12.7. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space (Ω,F ,P). Then, for any t ∈ R,
FXj (t) = lim_{ti→∞, i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td).
FXj above is called the marginal distribution function of Xj .
Lecture 13
13.1. Discrete Joint Distributions
Recall that a random variable is discrete if there exists a countable set R such that
P(X ∈ R) = 1. R is the range of X, and the function fX(r) = P(X = r) is the density
of X.
Suppose that X,Y are discrete random variables, with ranges RX , RY and densities
fX , fY . Note that R := RX ∪RY is also a range for both X and Y .
Now, we can define the joint density of X and Y by
fX,Y (x, y) = P(X = x, Y = y) = P((X,Y ) = (x, y)).
Note that
P((X, Y ) ∉ R²) = P({X ∉ R} ∪ {Y ∉ R}) ≤ P(X ∉ R) + P(Y ∉ R) = 0.
So P((X, Y ) ∈ R²) = 1, and we can think of (X, Y ) as an R²-valued discrete random
variable.
This of course can be generalized:
If X1, . . . , Xn are discrete random variables, then (X1, . . . , Xn) is an Rn-valued random
variable, supported on some countable set in Rn, and the density of (X1, . . . , Xn) is
defined to be
f(X1,...,Xn)(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn).
Exercise 13.1. Let X1, . . . , Xd be random variables. Show that X = (X1, . . . , Xd) is a
jointly discrete random variable if and only if X1, . . . , Xd are each discrete.
Assume that X1, . . . , Xd are discrete random variables. Show that X1, . . . , Xd are
mutually independent if and only if
for all (t1, . . . , td) ∈ Rd: f(X1,...,Xd)(t1, . . . , td) = fX1(t1) · fX2(t2) · · · fXd(td).
13.2. Discrete Marginal Densities
Proposition 13.2. If X = (X1, . . . , Xd) is a discrete joint random variable with range
Rd and density fX , then the density of Xj is given by
fXj(x) = Σ_{xi∈R : i≠j} fX(x1, . . . , xj−1, x, xj+1, . . . , xd).
Proof. This follows from
FX(t1, . . . , td) = Σ_{R∋rj≤tj : j=1,...,d} fX(r1, . . . , rd).
So
lim_{ti→∞ : i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td) = Σ_{R∋rj≤t} Σ_{ri∈R : i≠j} fX(r1, . . . , rd).
Thus,
fXj(r) = FXj(r) − FXj(r−)
= Σ_{R∋rj≤r} Σ_{ri∈R : i≠j} fX(r1, . . . , rd) − Σ_{R∋rj<r} Σ_{ri∈R : i≠j} fX(r1, . . . , rd)
= Σ_{ri∈R : i≠j} fX(r1, . . . , rj−1, r, rj+1, . . . , rd). ut
ut
The above density is called the marginal density of Xj .
Example 13.3. Noga and Shir come home from school and decide whether to do homework
or watch TV.
Let X = 1 if Noga does homework and X = 0 if she watches TV. Let Y = 1 if Shir
does homework and Y = 0 if she watches TV.
We are given that
P((X,Y) = (1, 1)) = 1/4, P((X,Y) = (0, 0)) = 1/2, P((X,Y) = (1, 0)) = 3/16.
What are the marginal densities of X and Y ?
Who is more likely to watch TV?
Note that since the density must sum to 1, P((X,Y) = (0, 1)) = 1/16.
Well, let’s calculate the marginal densities:
fX(1) = P((X,Y) = (1, 0)) + P((X,Y) = (1, 1)) = 7/16.
So it must be that fX(0) = 9/16 (why?). Similarly,
fY(1) = P((X,Y) = (0, 1)) + P((X,Y) = (1, 1)) = 5/16.
So fY(0) = 11/16.
We see that Shir is more likely to watch TV.
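The marginal computation is just summing the joint density table over the other coordinate; a minimal sketch in code (the joint table is the one from this example):

```python
# joint density of (X, Y) from the example, with P((X,Y) = (0,1)) = 1/16
joint = {(1, 1): 4 / 16, (0, 0): 8 / 16, (1, 0): 3 / 16, (0, 1): 1 / 16}

def marginal(joint, axis):
    """Marginal density: sum the joint density over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

f_X = marginal(joint, 0)  # {1: 7/16, 0: 9/16}
f_Y = marginal(joint, 1)  # {1: 5/16, 0: 11/16}
```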
13.3. Conditional Distributions (Discrete)
Example 13.4. After mating season sea turtles land on the beach and lay their eggs.
After hatching, the baby turtles must make it back to the sea on their own, however,
there are many dangers and predators awaiting.
Each female lays a Poisson-λ number of eggs. After hatching, each baby turtle
has probability p of making it back to the sea, all turtles mutually independent.
What is the probability that exactly k turtles make it back to the sea?
How do we solve this? We would like to use conditional probabilities: conditioned
on there being n eggs, the number of surviving turtles is Binomial-(n, p).
So we want to define distributions conditioned on the results of other distributions.
Definition 13.5. Let X,Y be discrete random variables defined on some probability
space (Ω,F ,P). For y such that fY (y) > 0 define the random variable X|Y = y as the
random variable whose density is
fX|Y (x|y) = fX|Y=y(x) = P(X = x|Y = y).
Note that the correct way to do this would be to define a different density fX|y(·)
for every y. This is captured in the notation fX|Y (·|y).
Example 13.6. Let the density of X,Y be given by:
X \ Y 0 1
0 0.4 0.1
1 0.2 0.3
So
P(Y = 1) = 0.4 and P(Y = 0) = 0.6.
Thus,
fX|Y (1|1) = P(X = 1|Y = 1) = 0.3/0.4 = 3/4,
fX|Y (0|1) = P(X = 0|Y = 1) = 0.1/0.4 = 1/4,
fX|Y (1|0) = 0.2/0.6 = 1/3,
fX|Y (0|0) = 0.4/0.6 = 2/3.
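Definition 13.5 in code: the conditional density is a row of the joint table divided by the marginal; a sketch using the table of this example:

```python
# joint density of (X, Y) from Example 13.6
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def conditional_density(joint, y):
    """f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y), defined when f_Y(y) > 0."""
    f_y = sum(p for (x, y2), p in joint.items() if y2 == y)
    return {x: p / f_y for (x, y2), p in joint.items() if y2 == y}

given_1 = conditional_density(joint, 1)  # {0: 1/4, 1: 3/4}
given_0 = conditional_density(joint, 0)  # {0: 2/3, 1: 1/3}
```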
Let’s return to solve the example above:
Solution of Example 13.4. Let X be the number of eggs, and let Y be the number of
turtles that survive.
What we are given is that X ∼ Poi(λ) and that Y |X = n ∼ Bin(n, p).
That is, for all k ≤ n,
P(Y = k|X = n) = \binom{n}{k} p^k (1 − p)^{n−k}.
Thus,
P(Y = k) = Σ_{n=0}^∞ P(Y = k|X = n) P(X = n) = Σ_{n=k}^∞ \binom{n}{k} p^k (1 − p)^{n−k} · e^{−λ} λ^n/n!
= (p^k λ^k/k!) · e^{−λ} · Σ_{n=k}^∞ (λ(1 − p))^{n−k}/(n − k)!
= ((λp)^k/k!) · e^{−λ} e^{λ(1−p)} = e^{−λp} · (λp)^k/k!.
That is, the number of surviving turtles has the distribution of a Poisson-λp random
variable. ut
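This "Poisson thinning" conclusion is easy to check by simulation; a minimal sketch (the values of λ, p, the seed, and the crude sampler are arbitrary choices, not part of the notes):

```python
import math
import random

random.seed(0)
lam, p, trials = 3.0, 0.4, 200_000

def poisson(lam):
    """Sample Poi(lam) by multiplying uniforms until the product drops below e^{-lam}."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

survivors = []
for _ in range(trials):
    n = poisson(lam)                                              # number of eggs
    survivors.append(sum(random.random() < p for _ in range(n)))  # surviving turtles

# empirical density of the number of survivors; should match Poi(lam * p)
empirical = [survivors.count(k) / trials for k in range(6)]
theoretical = [math.exp(-lam * p) * (lam * p) ** k / math.factorial(k) for k in range(6)]
```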
Exercise 13.7. Let X,Y be discrete random variables. Show that
fX,Y (x, y) = fX|Y (x|y)fY (y),
for all y such that fY (y) > 0.
13.4. Sums of Independent Random Variables (Discrete Case)
Suppose that X,Y are random variables on some probability space. One can define
the random variable Z(ω) = X(ω) + Y (ω), or in short Z = X + Y . (This is always a
random variable, because Z = φ(X,Y ), where φ(x, y) = x+y, and since φ is continuous
it is measurable).
Suppose X,Y are independent discrete random variables, with densities fX , fY .
First, from independence, we have that
P(X = x, Y = y) = P(X = x)P(Y = y),
for all x, y. Thus, f(X,Y )(x, y) = fX(x)fY (y).
Now, what is the density of X + Y ? Well,
fX+Y(z) = P(X + Y = z) = P(⊎_y {X + Y = z, Y = y}) = Σ_y P(X + Y = z, Y = y)
= Σ_y P(X = z − y) P(Y = y) = Σ_y fX(z − y) fY(y).
This leads to the following definition:
Definition 13.8. Let f, g be two real valued functions supported on a countable set
R ⊂ R. The (discrete) convolution of f, g, denoted f ∗ g, is the real valued function
f ∗ g : R→ R,
(f ∗ g)(z) = Σ_{y∈R∪(z−R)} f(z − y) g(y).
Note that f ∗ g = g ∗ f by change of variables.
We conclude with the observation:
Proposition 13.9. Let X,Y be two independent discrete random variables on some
probability space. Let Z = X + Y . Then, fZ = fX ∗ fY .
Example 13.10. Consider X1, . . . , Xn mutually independent random variables where
Xj ∼ Poi(λj) for some λ1, . . . , λn > 0.
Let Z = X1 + · · ·+Xn. What is the distribution of Z?
Well, since fX1(x) = 0 for x < 0,
fX1+X2(z) = (fX1 ∗ fX2)(z) = Σ_{x=0}^∞ fX1(z − x) fX2(x)
= Σ_{x=0}^z e^{−λ1} λ1^{z−x}/(z − x)! · e^{−λ2} λ2^x/x!
= e^{−(λ1+λ2)} · (1/z!) Σ_{x=0}^z \binom{z}{x} λ1^{z−x} λ2^x
= e^{−(λ1+λ2)} · (λ1 + λ2)^z/z!.
This is the density of the Poisson-(λ1 + λ2) distribution!
So X1 +X2 ∼ Poi(λ1 + λ2). Continuing inductively, we conclude that Z ∼ Poi(λ1 +
λ2 + · · ·+ λn).
That is: the sum of independent Poisson random variables is again a Poisson random
variable.
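Proposition 13.9 can be checked numerically; a sketch with a naive O(n²) discrete convolution (the rates 1.5 and 2.5 are arbitrary, and the supports are truncated at 40, which is harmless for small k):

```python
import math

def poisson_density(lam, n_max):
    """Poi(lam) density on {0, ..., n_max}."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n_max + 1)]

def convolve(f, g):
    """Discrete convolution (f * g)(z) = sum_y f(z - y) g(y), supports on 0..len-1."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

f1, f2 = poisson_density(1.5, 40), poisson_density(2.5, 40)
fz = convolve(f1, f2)                  # density of X1 + X2
f_sum = poisson_density(4.0, 40)       # Poi(1.5 + 2.5) density
# fz[k] should agree with f_sum[k] for k well below the truncation
```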
Example 13.11. Let X ∼ Poi(α) and Y ∼ Poi(β). Assume X,Y are independent.
What is the distribution of X|X + Y = n?
Let's calculate explicitly: Recall that X + Y ∼ Poi(α + β). For k ≤ n,
P(X = k|X + Y = n) = P(X = k) P(Y = n − k) / P(X + Y = n)
= e^{−α} α^k/k! · e^{−β} β^{n−k}/(n − k)! · (e^{−(α+β)} (α + β)^n/n!)^{−1}
= \binom{n}{k} α^k β^{n−k}/(α + β)^n = \binom{n}{k} (α/(α + β))^k (1 − α/(α + β))^{n−k}.
So X conditioned on X + Y = n has the Binomial-(n, α/(α + β)) distribution.
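A quick numerical check of this identity (the choices α = 2, β = 3, n = 7 are arbitrary):

```python
import math

def pois(lam, k):
    """Poi(lam) density at k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

alpha, beta, n = 2.0, 3.0, 7
# conditional density P(X = k | X + Y = n) computed from the Poisson densities
cond = [pois(alpha, k) * pois(beta, n - k) / pois(alpha + beta, n) for k in range(n + 1)]
# Binomial(n, alpha/(alpha+beta)) density
q = alpha / (alpha + beta)
binom = [math.comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(n + 1)]
```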
Lecture 14
14.1. Continuous Joint Distributions
Definition 14.1. Let X1, . . . , Xd be random variables on some probability space (Ω,F ,P).
We say that X1, . . . , Xd have an absolutely continuous joint distribution (or that
X = (X1, . . . , Xd) is an Rd-valued absolutely continuous random variable) if there exists
a non-negative integrable function fX = fX1,...,Xd : Rd → R+, such that for all
~t = (t1, . . . , td),
FX(~t) = ∫_{−∞}^{t1} · · · ∫_{−∞}^{td} fX(~s) d~s.
Marginal distributions are as in the discrete case: If X = (X1, . . . , Xd) is a joint
random variable then the marginal distribution of Xj is
FXj(t) = lim_{tk→∞ : k≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td).
Specifically, if X is absolutely continuous, then
fXj(t) = ∫ · · · ∫ fX(t1, . . . , tj−1, t, tj+1, . . . , td) dt1 · · · dtj−1 dtj+1 · · · dtd.
!!! It is not true that if X1, . . . , Xd are absolutely continuous, then necessarily
X = (X1, . . . , Xd) is absolutely continuous as an Rd-valued random variable.
Example 14.2. If X = Y are the same absolutely continuous random variable, then
(X,Y) is not absolutely continuous: if it were, then for the Borel set A = {(x, y) ∈ R2 : x = y},
1 = ∫∫ fX,Y(x, y) dx dy = ∫∫_A fX,Y(x, y) dx dy = 0,
since fX,Y(x, y) = 0 if (x, y) ∉ A and since A has measure 0 in the plane.
A more elementary way to see this (within the scope of these notes): Suppose fX,Y is the
density of a pair (X,Y) such that X = Y is an absolutely continuous random variable.
Let gs(x) = ∫_{−∞}^s fX,Y(x, y) dy. Then,
FX,Y(∞, s) = P[X ≤ ∞, Y ≤ s] = P[X ≤ s, Y ≤ s] = FX,Y(s, s).
So
∫_{−∞}^∞ gs(x) dx = FX,Y(∞, s) = FX,Y(s, s) = ∫_{−∞}^s gs(x) dx,
which implies that
∫_s^∞ gs(x) dx = 0.
Thus, gs(x) = 0 for all (actually a.e.) x ≥ s (since gs is a non-negative function). Now,
if x ≥ s then
0 = gs(x) = ∫_{−∞}^s fX,Y(x, y) dy,
so fX,Y(x, y) = 0 for all (actually a.e.) x ≥ s, y ≤ s. Since this holds for all s, we have
that fX,Y = 0 for all (or a.e.) x ≥ y. Exchanging the roles of X,Y we get that fX,Y = 0
(a.e.), which is not a density function, a contradiction!
However, we have the following criterion for independence:
Proposition 14.3. X1, . . . , Xn are mutually independent absolutely continuous random
variables if and only if X = (X1, . . . , Xn) is an Rn-valued absolutely continuous random
variable with density fX(t1, . . . , tn) = fX1(t1) · · · fXn(tn).
Proof. The “if” part follows from Proposition 11.3.
For the “only if” part: Let f(t1, . . . , tn) = fX1(t1) · · · fXn(tn). We need to show that
P[X1 ≤ t1, . . . , Xn ≤ tn] = ∫_{−∞}^{t1} · · · ∫_{−∞}^{tn} f(s1, . . . , sn) ds1 · · · dsn.
Indeed, by independence the left-hand side equals ∏_{j=1}^n P[Xj ≤ tj] = ∏_{j=1}^n ∫_{−∞}^{tj} fXj(sj) dsj,
which is exactly the right-hand side. ut
14.2. Conditional Distributions (Continuous)
Recall that if X,Y are discrete random variables, then
fX|Y (x|y) = P[X = x|Y = y].
For continuous random variables this is not defined, since P[Y = y] = 0. However, we
still can define the following:
Let X,Y be absolutely continuous random variables. Let y ∈ R be such that fY (y) >
0. Define a function fX|Y (·|y) : R→ R+ by
fX|Y(x|y) := f(X,Y)(x, y) / fY(y).
Then, for every y such that fY(y) > 0, this function is a density; indeed, recall that
fY(y) = ∫ f(X,Y)(x, y) dx, so
∫_{−∞}^∞ fX|Y(x|y) dx = ∫ f(X,Y)(x, y)/fY(y) dx = 1.
Thus, fX|Y (·|y) defines an absolutely continuous random variable, denoted X|Y = y,
and called X conditioned on Y = y. fX|Y (·|y) is the conditional density of X
conditioned on Y = y. We have the identity
f(X,Y )(x, y) = fX|Y (x|y) · fY (y).
For any Borel set B ∈ B, we define
P[X ∈ B|Y = y] := P[(X|Y = y) ∈ B] = ∫_B fX|Y(x|y) dx.
In many cases, the information is given as conditional densities, rather than joint
densities.
Example 14.4. Let X,Y have joint density
fX,Y(x, y) = (1/y) e^{−x/y} e^{−y} for x, y ≥ 0, and fX,Y(x, y) = 0 otherwise.
Let's compute the density of X|Y: First, the marginal density of Y is
fY(y) = ∫_0^∞ (1/y) e^{−x/y} e^{−y} dx = e^{−y},
for y ≥ 0 and 0 otherwise. For x, y ≥ 0,
fX|Y(x|y) = (1/y) e^{−x/y} e^{−y} / e^{−y} = (1/y) e^{−x/y},
and fX|Y(x|y) = 0 for x < 0. So X|Y = y ∼ Exp(1/y).
Thus, for example,
P[X > 1|Y = y] = ∫_1^∞ fX|Y(x|y) dx = e^{−1/y}.
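The conditional probability at the end can be checked by numeric integration; a sketch (the integration cutoff and step count are arbitrary choices; the tail beyond the cutoff is negligible):

```python
import math

def cond_density(x, y):
    """Conditional density f_{X|Y}(x|y) = (1/y) e^{-x/y} for x >= 0, from this example."""
    return math.exp(-x / y) / y if x >= 0 else 0.0

def prob_greater(a, y, upper=100.0, steps=200_000):
    """Midpoint Riemann sum approximating P[X > a | Y = y] over [a, upper]."""
    h = (upper - a) / steps
    return h * sum(cond_density(a + (i + 0.5) * h, y) for i in range(steps))

# analytically P[X > 1 | Y = y] = e^{-1/y}, e.g. prob_greater(1.0, 2.0) ≈ e^{-1/2}
```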
14.3. Sums (Continuous)
Let X,Y be independent absolutely continuous random variables. Let Z = X + Y .
What is FZ?
P[Z ≤ z] = P[X + Y ≤ z] = ∫∫_{x+y≤z} fX,Y(x, y) dx dy
= ∫_{−∞}^∞ ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
= ∫_{−∞}^∞ FX(z − y) fY(y) dy = (FX ∗ fY)(z).
Now, note that (substituting x = s − y in the inner integral)
FZ(z) = ∫_{−∞}^∞ ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
= ∫_{−∞}^z ∫_{−∞}^∞ fX(s − y) fY(y) dy ds = ∫_{−∞}^z (fX ∗ fY)(s) ds.
Thus, Z = X + Y is an absolutely continuous random variable with density fX ∗ fY .
Example 14.5. What is the distribution of X + Y, for independent X,Y, when:
• X ∼ N(0, 1) and Y ∼ N(0, 1)?
• X ∼ Exp(λ) and Y ∼ Exp(α), with α ≠ λ?
In the normal case we have that ((x − y)² + y²)/2 = x²/4 + (y − x/2)². So,
fX+Y(x) = (1/(2π)) e^{−x²/4} · ∫_{−∞}^∞ e^{−(y−x/2)²} dy = (1/(2π)) e^{−x²/4} · √π = (1/(√(2π) · √2)) e^{−x²/4}.
So X + Y ∼ N(0, √2).
In the exponential case, for x > 0,
fX+Y(x) = (fX ∗ fY)(x) = αλ e^{−λx} ∫_{−∞}^∞ 1{[0,∞)}(x − y) 1{[0,∞)}(y) e^{−(α−λ)y} dy
= (αλ/(α − λ)) e^{−λx} (1 − e^{−(α−λ)x}) = (αλ/(α − λ)) (e^{−λx} − e^{−αx}).
Since P[X + Y < 0] = 0, this is the density of X + Y.
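A Monte Carlo check of the exponential-case density (the rates λ = 1, α = 2 and the seed are arbitrary; we compare the empirical frequency of {Z ≤ 1} with the integral of the claimed density over [0, 1]):

```python
import math
import random

random.seed(1)
lam, alpha, n = 1.0, 2.0, 400_000
samples = [random.expovariate(lam) + random.expovariate(alpha) for _ in range(n)]

# empirical P[Z <= 1]
emp = sum(s <= 1.0 for s in samples) / n

# P[Z <= 1] from the claimed density (alpha*lam/(alpha-lam)) (e^{-lam x} - e^{-alpha x}):
theo = alpha * lam / (alpha - lam) * ((1 - math.exp(-lam)) / lam
                                      - (1 - math.exp(-alpha)) / alpha)
```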
Lecture 15
In this lecture we want to define the average value of a random variable X. If X is
discrete, it would be intuitive to define the average as Σ_x P[X = x] · x. If X is absolutely
continuous, perhaps one would analogously define the average as ∫ x fX(x) dx. But what
about the general case?
We will follow Lebesgue integration theory for this task.
15.1. Preliminaries
Proposition 15.1. Let X = (X1, . . . , Xn) and Y = (Y1, . . . , Ym) be random variables
on some probability space (Ω,F ,P). Then, X is independent of Y if and only if for
any measurable functions g : Rn → R and h : Rm → R we have that g(X), h(Y ) are
independent.
Proof. For the “only if” direction: Let g, h be measurable functions as in the
proposition. Let A1 ∈ σ(g(X)), A2 ∈ σ(h(Y)). Note that there exist Borel sets B1, B2
such that A1 = X^{−1}g^{−1}(B1) and A2 = Y^{−1}h^{−1}(B2). Since g is measurable, g^{−1}(B1) ∈ Bn,
and similarly h^{−1}(B2) ∈ Bm. Thus, A1 ∈ σ(X) and A2 ∈ σ(Y), so A1, A2 are
independent. Since this holds for all A1, A2, we have that g(X), h(Y) are independent.
For the “if” direction: Let A1 ∈ σ(X), A2 ∈ σ(Y). So A1 = X^{−1}(B1) and A2 =
Y^{−1}(B2) for some B1 ∈ Bn and B2 ∈ Bm. Define g(~x) = 1{~x∈B1} and h(~y) = 1{~y∈B2}.
Since these are measurable functions, and since A1 ∩ A2 = {X ∈ B1, Y ∈ B2}, we have
that
P[A1 ∩A2] = P[g(X) = 1, h(Y ) = 1] = P[g(X) = 1]P[h(Y ) = 1] = P[A1]P[A2].
Since this holds for all A1, A2 we have that X is independent of Y . ut
Example 15.2. The function g(x) = ⌊x⌋ is measurable, since for any b ∈ R,
g^{−1}(−∞, b] = {x : ⌊x⌋ ≤ b} = (−∞, ⌊b⌋ + 1) ∈ B.
The function g(x) = c · x is measurable, since it is continuous. The function g(x) =
max{x, 0} is measurable since g^{−1}(−∞, b] = (−∞, b] if b ≥ 0 and g^{−1}(−∞, b] = ∅ if b < 0.
In a similar way g(x) = max{−x, 0} is measurable.
Thus, if X,Y are independent random variables, then X+ := max{X, 0}, Y+ :=
max{Y, 0} are independent, and X+_n = 2^{−n}⌊2^n X+⌋ is independent of Y+_n = 2^{−n}⌊2^n Y+⌋.
15.2. Simple Random Variables
Let us start with simple random variables:
(This definition is important: every ω ∈ Ω is sent to some non-negative value.)
Definition 15.3. Let (Ω,F ,P) be a probability space. X is a simple random variable
if there exists a finite set of positive numbers A = {a1, . . . , am} ⊂ R such that
Ω = {X = a1} ⊎ · · · ⊎ {X = am} ⊎ {X = 0}.
Specifically, such a X is a discrete random variable. In this case it is intuitive that
the average of X should be Σ_{k=1}^m ak P[X = ak]. Note that
X = Σ_{k=1}^m ak 1{X=ak},
where all ak > 0 and the events X = ak are pairwise disjoint.
We want to say that simple random variables somehow approximate all random vari-
ables, so that we can define the average on these. This is the content of the following
definitions and proposition.
Definition 15.4. Let (Ω,F ,P) be a probability space. A sequence of random variables
(Xn)n is monotone non-decreasing if for all ω ∈ Ω, and all n, Xn(ω) ≤ Xn+1(ω).
Let (Ω,F ,P) be a probability space. Let (Xn)n be a sequence of random variables.
Suppose that for every ω ∈ Ω the limit X(ω) := limn→∞Xn(ω) exists. Then, the
function X is a random variable, since X = lim sup Xn = lim inf Xn, and since for any
Borel set B,
X^{−1}(B) = {ω : lim inf_{n→∞} Xn(ω) ∈ B} = ⋃_n ⋂_{k≥n} {ω : Xk(ω) ∈ B} = lim inf X_n^{−1}(B) ∈ F.
(In fact, lim X_n^{−1}(B) = X^{−1}(B).)
Definition 15.5. Let (Ω,F ,P) be a probability space. Let (Xn)n be a sequence of
random variables. Suppose that for every ω ∈ Ω the limit X(ω) := limn→∞Xn(ω) exists.
Then, the function X is a random variable. In this case we say that Xn converges
everywhere to X.
If in addition, (Xn)n is a monotone non-decreasing sequence, we say that Xn con-
verges everywhere monotonely (from below) to X, and denote this by Xn ↗ X.
Finally, a non-negative random variable X is a random variable satisfying X(ω) ≥ 0
for all ω ∈ Ω. We write this as X ≥ 0. In general, we write X ≤ Y if X(ω) ≤ Y (ω) for
all ω ∈ Ω. (So, for example, a sequence (Xn)n is monotone non-decreasing if Xn ≤ Xn+1
for all n.)
Proposition 15.6. Let (Ω,F ,P) be a probability space, and X a random variable. If
X is non-negative, then there exists a sequence (Xn)n of simple random variables such
that Xn ↗ X.
Proof. For all n, k define An,k := X^{−1}[k2^{−n}, (k + 1)2^{−n}), which is of course an event.
Note that for all n,
Ω = ⊎_{k=0}^{n2^n − 1} An,k ⊎ X^{−1}[n,∞).
For all n ≥ 1 define
Xn = Σ_{k=0}^{n2^n − 1} 2^{−n} k 1{An,k} + n · 1{X ≥ n}.
[That is, we divide the interval [0, n) into intervals of length 2^{−n}, and replace X with
the leftmost value in that interval, cutting off at n. See Figure 4.]
So (Xn)n is a sequence of simple random variables.
So (Xn)n is a sequence of simple random variables.
Note that Xn(ω) = min{n, 2^{−n}⌊2^n X(ω)⌋}. If X(ω) < n then
Xn(ω) = 2^{−n}⌊2^n X(ω)⌋ = 2^{−n}⌊(1/2) · 2^{n+1} X(ω)⌋ ≤ 2^{−(n+1)}⌊2^{n+1} X(ω)⌋ = Xn+1(ω).
(For any y, 2⌊y/2⌋ is an integer that is at most y, hence 2⌊y/2⌋ ≤ ⌊y⌋.) If n ≤ X(ω) < n + 1
then 2^{n+1} n ≤ 2^{n+1} X(ω), so
Xn(ω) = n ≤ 2^{−(n+1)}⌊2^{n+1} X(ω)⌋ = Xn+1(ω).
Finally, if X(ω) ≥ n + 1, then Xn(ω) = n ≤ n + 1 = Xn+1(ω). So Xn(ω) ≤ Xn+1(ω)
for all ω and all n, and (Xn)n is a non-decreasing sequence.
Now to show that limn Xn(ω) = X(ω): for n > X(ω) we have
|X(ω) − Xn(ω)| = |X(ω) − 2^{−n}⌊2^n X(ω)⌋| ≤ 2^{−n}.
Thus, for all ω ∈ Ω, limn→∞ Xn(ω) = X(ω). ut
Figure 4. Xn and Xn+1
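The dyadic approximation in this proof is easy to see in code; a sketch for a single fixed value x (the choice x = π is arbitrary):

```python
import math

def X_n(x, n):
    """The simple approximation X_n(x) = min(n, 2^-n * floor(2^n x)) from the proof, x >= 0."""
    return min(float(n), math.floor(2 ** n * x) / 2 ** n)

x = math.pi  # arbitrary non-negative value
vals = [X_n(x, n) for n in range(1, 20)]
# vals is non-decreasing, stays below x, and converges to x from below
```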
15.3. Expectation of Non-Negative Random Variables
We want to define the expectation on a random variable by approximating by simple
random variables as in Proposition 15.6, and taking the limit. For this we have to make
sure the limit exists and makes sense.
Let us start by defining the expectation for a simple random variable.
Definition 15.7. If X = Σ_{k=1}^m ak 1{X=ak} is a simple random variable, define
E[X] := Σ_{k=1}^m ak P[X = ak].
Proposition 15.8. The above definition does not depend on the choice of the partition
of Ω; that is, if X = Σ_{j=1}^n bj 1{Bj} where Ω = ⊎_{j=1}^n Bj is a partition of Ω, then
E[X] = Σ_{j=1}^n bj P[Bj].
Proof. Suppose that
X = Σ_{k=1}^m ak 1{X=ak} = Σ_{j=1}^n bj 1{Bj},
where Ω = ⊎_{j=1}^n Bj is a partition of Ω as in the statement of the proposition. We
assume without loss of generality that am = 0, so Ω = ⊎_{k=1}^m {X = ak}. Then, for any
k, j, if ω ∈ {X = ak} ∩ Bj then ak = X(ω) = bj. Otherwise, if {X = ak} ∩ Bj = ∅ then
P[X = ak, Bj] = 0. Thus, for all k, j,
ak P[X = ak, Bj] = bj P[X = ak, Bj].
Since Ω = ⊎_{j=1}^n Bj = ⊎_{k=1}^m {X = ak}, by the law of total probability we obtain
E[X] = Σ_{k=1}^m ak P[X = ak] = Σ_{k=1}^m Σ_{j=1}^n ak P[X = ak, Bj]
= Σ_{j=1}^n Σ_{k=1}^m bj P[X = ak, Bj] = Σ_{j=1}^n bj P[Bj]. ut
Proposition 15.9. If X,Y are simple random variables, then:
• If X ≤ Y then E[X] ≤ E[Y ].
• For all a ≥ 0, E[aX + Y ] = aE[X] + E[Y ].
Specifically, we have for a simple random variable X,
E[X] = sup{E[S] : 0 ≤ S ≤ X, S is simple}.
Proof. Assume that X = Σ_k ak 1{X=ak} and Y = Σ_j bj 1{Y=bj}. If ω ∈ Ω is such that
X(ω) = ak and Y(ω) = bj then ak ≤ bj. So ak P[X = ak, Y = bj] ≤ bj P[X = ak, Y = bj].
Thus,
E[X] = Σ_{k,j} ak P[X = ak, Y = bj] ≤ Σ_{k,j} bj P[X = ak, Y = bj] = E[Y].
Now, note that if Z = aX + Y for some a ≥ 0, then
Z = Σ_{k,j} (a · ak + bj) 1{X=ak, Y=bj},
since Ω = ⊎_k {X = ak} ⊎ {X = 0} and Ω = ⊎_j {Y = bj} ⊎ {Y = 0}. So
E[Z] = Σ_{k,j} (a · ak + bj) P[X = ak, Y = bj] = a E[X] + E[Y]. ut
Definition 15.10. If X is a general non-negative random variable, define
E[X] := sup{E[S] : 0 ≤ S ≤ X, S is simple},
(where we allow the value E[X] = ∞, and in this case we say X has infinite expectation).
This definition coincides with the previous definition for simple random variables by
the proposition above.
Note that the definition immediately implies that
0 ≤ X ≤ Y ⇒ E[X] ≤ E[Y],
because the supremum for Y is over a larger set.
15.4. Monotone Convergence
Theorem 15.11 (Monotone Convergence). Let (Xn)n be a non-decreasing sequence of
non-negative random variables. If Xn ↗ X then E[Xn] ↗ E[X].
Proof. The proof has four cases:
• Case 1: Xn are simple, X = 1A.
For this case, let ε > 0, and set An := {Xn > 1 − ε}.
If ω ∈ A then X(ω) = 1, so since Xn(ω) ↗ X(ω), we have Xn(ω) > 1 − ε
for some n. On the other hand, if Xn(ω) > 1 − ε, then X(ω) > 1 − ε and so
X(ω) = 1 and ω ∈ A. Thus, (An)n is an increasing sequence of events, such that
limn An = ⋃n An = A.
Since the Xn are non-negative, Xn ≥ (1 − ε)1{An}. Hence, (1 − ε)P[An] ≤ E[Xn] ≤
E[X] = P[A]. Continuity of probability finishes this case.
• Case 2: Xn, X simple.
Assume X = Σ_{k=0}^m ak 1{X=ak}, where 0 = a0 < a1 < · · · < am and Ω =
{X = a0} ⊎ {X = a1} ⊎ · · · ⊎ {X = am}. Fix k > 0 and consider the sequence
Yn := ak^{−1} 1{X=ak} Xn. Xn ↗ X implies that Yn ↗ 1{X=ak}. By Case 1, since
the Yn are simple, we have that E[Yn] ↗ P[X = ak].
Since Xn ≤ X we have that 0 ≤ Xn 1{X=a0} ≤ X 1{X=0} = 0. Thus, Xn =
Xn · Σ_{k=0}^m 1{X=ak} = Σ_{k=1}^m ak · ak^{−1} 1{X=ak} Xn. By linearity of expectation for
simple random variables,
E[Xn] = Σ_k ak · E[ak^{−1} 1{X=ak} Xn] ↗ Σ_k ak P[X = ak] = E[X].
• Case 3: Xn simple, X general (non-negative).
Let 0 ≤ S ≤ X be a simple random variable, and Yn := min{Xn, S}. Xn ↗ X
implies that Yn ↗ S. So Case 2 and monotonicity of expectation for simple
random variables give that E[Xn] ≥ E[Yn] ↗ E[S]. Thus, limn E[Xn] ≥ E[S] for
all simple S ≤ X. Taking the supremum we are done.
• Case 4: Xn, X general.
Define Yn := min{2^{−n}⌊2^n Xn⌋, n}. So the Yn are all simple, and Yn ≤ Xn ≤ X, and
so E[Yn] ≤ E[Xn] ≤ E[X]. If we show that Yn ↗ X then we are done by Case 3.
As in Proposition 15.6, Yn ≤ min{2^{−(n+1)}⌊2^{n+1} Xn⌋, n + 1} ≤ Yn+1, so (Yn)n
is a non-decreasing sequence.
For all ω and n such that Xn(ω) < n, we have that |Yn(ω) − Xn(ω)| ≤ 2^{−n}.
So, for any ω and any ε > 0, if n > n0(ω, ε) is large enough, then
|Yn(ω) − X(ω)| < ε. That is, Yn ↗ X.
ut
15.5. Linearity
Lemma 15.12. Let X,Y be non-negative random variables. Then, for any a > 0,
• X ≤ Y implies E[X] ≤ E[Y].
• E[aX + Y] = a E[X] + E[Y].
Proof. The first assertion is immediate from the definition, since if 0 ≤ S ≤ X is simple
then 0 ≤ S ≤ Y, so we are taking a supremum over a larger set.
Let Xn ↗ X and Yn ↗ Y be sequences of simple random variables converging
everywhere monotonely to X,Y respectively, as in Proposition 15.6. So aXn + Yn ↗ aX + Y.
Monotone convergence tells us that
E[Xn] ↗ E[X], E[Yn] ↗ E[Y] and E[aXn + Yn] ↗ E[aX + Y].
Since for all n, E[aXn + Yn] = a E[Xn] + E[Yn], we are done. ut
15.6. Expectation
Finally, let us define the expectation of a general random variable.
First, some notation. Let X be a random variable. Define X+ := max{X, 0} and
X− := max{−X, 0}. Note that X+, X− are non-negative random variables, and that
X = X+ − X− and |X| = X+ + X−.
Definition 15.13. Let X be a random variable. If E[X+] = E[X−] = ∞ we say that
the expectation of X is not defined (or X does not have expectation). If at least one of
E[X+],E[X−] is finite then define the expectation of X by
E[X] := E[X+]− E[X−].
15.6.1. Properties of Expectation. The most important property of expectation is
linearity:
Theorem 15.14. Let X,Y be random variables on some probability space (Ω,F ,P) such
that E[X],E[Y ] exist. Then:
• For all a ∈ R, E[aX + Y ] = aE[X] + E[Y ].
• X ≤ Y implies that E[X] ≤ E[Y ].
• |E[X]| ≤ E[|X|].
Proof. For linearity, note that (X+Y )+−(X+Y )− = X+Y = X+−X−+Y +−Y −, so
(X+Y )+ +X−+Y − = (X+Y )−+X+ +Y +. Since both sides are linear combinations
of non-negative random variables,
E(X + Y )+ + EX− + EY − = E(X + Y )− + EX+ + EY +
which implies that
E[X + Y ] = E(X + Y )+ − E(X + Y )− = EX+ − EX− + EY + − EY − = EX + EY.
Also, for any a > 0, (aX)+ = aX+ and (aX)− = aX−. If a < 0 then (aX)+ = −aX−
and (aX)− = −aX+. Thus, in both cases E[aX] = aE[X]. This proves linearity.
For the second assertion, let Z = Y − X. So Z ≥ 0, and thus E[Z] ≥ 0. Since
E[Z] = E[Y ]− E[X] by linearity, we are done.
Finally, for the last assertion
|E[X]| = |E[X+]− E[X−]| ≤ E[X+] + E[X−] = E[|X|],
where we have used the fact that X+ ≥ 0 and X− ≥ 0. ut
Lecture 16
16.1. Expectation - Discrete Case
Proposition 16.1. Let X be a discrete random variable, with range R and density fX.
Then,
E[X] = Σ_{r∈R} r fX(r).
Proof. For all N, let
X+_N := Σ_{r∈R∩[0,N]} r 1{X=r} and X−_N := −Σ_{r∈R∩[−N,0]} r 1{X=r}.
Note that X+_N ↗ X+ and X−_N ↗ X−. Moreover, by linearity,
E[X+_N] = Σ_{r∈R∩[0,N]} r P[X = r] and E[X−_N] = −Σ_{r∈R∩[−N,0]} r P[X = r].
Using monotone convergence we get that
E[X] = E[X+] − E[X−] = lim_{N→∞} (E[X+_N] − E[X−_N]) = Σ_{r∈R} r fX(r). ut
Example 16.2. Let us calculate the expectations of different discrete random variables:
• If X ∼ Ber(p) then
E[X] = 1 · P[X = 1] + 0 · P[X = 0] = p.
• If X ∼ Bin(n, p) then, since \binom{n}{k} · k = n · \binom{n−1}{k−1} for 1 ≤ k ≤ n,
E[X] = Σ_{k=0}^n fX(k) · k = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k} · k
= np · Σ_{k=1}^n \binom{n−1}{k−1} p^{k−1} (1 − p)^{n−k} = np.
An easier way would be to note that X = Σ_{k=1}^n Xk where Xk ∼ Ber(p) (and in
fact X1, . . . , Xn are independent). Thus, by linearity, E[X] = Σ_{k=1}^n E[Xk] = np.
• For X ∼ Poi(λ):
E[X] = Σ_{k=0}^∞ e^{−λ} λ^k/k! · k = λ · Σ_{k=1}^∞ e^{−λ} λ^{k−1}/(k − 1)! = λ.
• For X ∼ Geo(p): Note that ∂/∂p (−(1 − p)^k) = k(1 − p)^{k−1}.
E[X] = Σ_{k=1}^∞ fX(k) · k = Σ_{k=1}^∞ (1 − p)^{k−1} p · k
= −p · Σ_{k=1}^∞ ∂/∂p (1 − p)^k = −p · ∂/∂p (Σ_{k=1}^∞ (1 − p)^k)
= −p · ∂/∂p (1/p − 1) = p · 1/p² = 1/p.
Another way: Let E = E[X]. Then,
E = p + Σ_{k=2}^∞ (1 − p)^{k−1} p k = p + Σ_{k=2}^∞ (1 − p)^{k−1} p (k − 1) + Σ_{k=2}^∞ (1 − p)^{k−1} p
= p + (1 − p)E + 1 − p.
So E = 1 + (1 − p)E, and E = 1/p.
Example 16.3. A pair of independent fair dice are tossed. What is the expected
number of tosses needed to see Shesh-Besh?
Note that each toss of the dice is an independent trial, such that the probability of
seeing Shesh-Besh is 2/36 = 1/18. So if X is the number of tosses until Shesh-Besh, then
X ∼ Geo(1/18). Thus, E[X] = 18.
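A simulation sketch of this example (the seed and trial count are arbitrary choices):

```python
import random

random.seed(2)

def tosses_until_shesh_besh():
    """Toss two dice until one shows 5 and the other 6; return the number of tosses."""
    n = 0
    while True:
        n += 1
        if sorted((random.randint(1, 6), random.randint(1, 6))) == [5, 6]:
            return n

trials = 30_000
mean = sum(tosses_until_shesh_besh() for _ in range(trials)) / trials
# mean should be close to E[Geo(1/18)] = 18
```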
16.1.1. Function of a random variable. Let g : Rd → R be a measurable function.
Let (X1, . . . , Xd) be a joint distribution of d discrete random variables with range R
each. Then, Y = g(X1, . . . , Xd) is a random variable. What is its expectation?
Well, first we need the density of Y: For any y ∈ R we have that
P[Y = y] = P[(X1, . . . , Xd) ∈ g^{−1}(y)] = Σ_{(r1,...,rd)∈Rd∩g^{−1}(y)} P[(X1, . . . , Xd) = (r1, . . . , rd)].
So, if RY := g(Rd), then since
Rd = ⊎_{y∈RY} Rd ∩ g^{−1}(y),
and since Y is discrete, we get that
E[g(X1, . . . , Xd)] = Σ_{y∈RY} y P[Y = y]
= Σ_{y∈RY} Σ_{~r∈Rd∩g^{−1}(y)} f(X1,...,Xd)(~r) · g(~r)
= Σ_{~r∈Rd} f(X1,...,Xd)(~r) · g(~r).
Specifically,
Proposition 16.4. Let g : Rd → R be a measurable function, and X = (X1, . . . , Xd) a
discrete joint random variable with range Rd. Then,
E[g(X)] = Σ_{~r∈Rd} fX(~r) · g(~r).
Example 16.5. Manchester United plans to earn some money selling Wayne Rooney
jerseys. Each jersey costs the club x pounds, and is sold for y > x pounds. Suppose
that the number of people who want to buy jerseys is a discrete random variable with
range N.
What is the expected profit if the club orders N jerseys?
How many jerseys should be ordered in order to maximize this profit?
Solution. Let pk = fX(k) = P[X = k]. Let gN : N → R be the function that takes the
number of people who want to buy jerseys and gives the profit if the club ordered N
jerseys. That is,
gN(k) = ky − Nx for k ≤ N, and gN(k) = N(y − x) for k > N.
The expected profit is then
E[gN(X)] = Σ_{k=0}^∞ pk gN(k) = Σ_{k=0}^∞ pk N(y − x) − Σ_{k=0}^N pk (N − k) y = N(y − x) − Σ_{k=0}^N pk (N − k) y.
Call this G(N) := N(y − x) − Σ_{k=0}^N pk (N − k) y. We want to maximize this as a function
of N. Note that
G(N + 1) − G(N) = y − x − Σ_{k=0}^N y pk = y − x − y P[X ≤ N].
So G(N + 1) > G(N) as long as P[X ≤ N] < (y − x)/y, so the club should order N + 1 jerseys
for the largest N such that P[X ≤ N] < (y − x)/y.
Example 16.6. Some more calculations:
• X ∼ H(N, m, n) (recall that X is the number of “special” objects chosen, when
choosing uniformly n objects out of N objects, with a total of m special objects).
We use the facts that
\binom{m}{k} · k = \binom{m−1}{k−1} · m and \binom{N}{n} = \binom{N−1}{n−1} · N/n.
E[X] = \binom{N}{n}^{−1} · Σ_{k=0}^{n∧m} k · \binom{m}{k} · \binom{N−m}{n−k}
= \binom{N−1}{n−1}^{−1} · (n/N) · m · Σ_{k=1}^{n∧m} \binom{m−1}{k−1} · \binom{(N−1)−(m−1)}{(n−1)−(k−1)} = nm/N.
• X ∼ NB(m, p) (the number of trials until the m-th success). Using
\binom{k−1}{m−1} · k = \binom{k}{m} · m,
with j = k + 1 and n = m + 1,
E[X] = Σ_{k=m}^∞ k · \binom{k−1}{m−1} p^m (1 − p)^{k−m} = m p^{−1} · Σ_{j=n}^∞ \binom{j−1}{n−1} p^n (1 − p)^{j−n} = m/p.
Lecture 17
17.1. Expectation - Continuous Case
Our goal is now to prove the following theorem:
Theorem 17.1. Let X = (X1, . . . , Xd) be an absolutely continuous random variable,
and let g : Rd → R be a measurable function. Then,
E[g(X)] = ∫_{Rd} g(~x) fX(~x) d~x.
The main lemma here is
Lemma 17.2. Let X = (X1, . . . , Xd) be an absolutely continuous random variable.
Then, for any Borel set B ∈ Bd,
P[X ∈ B] = ∫_B fX(~x) d~x.
Proof. Let Q(B) = ∫_B fX(~x) d~x. Then Q is a probability measure on (Rd,Bd) that
coincides with PX on the π-system of rectangles (−∞, b1] × · · · × (−∞, bd]. Thus,
P[X ∈ B] = PX(B) = Q(B) for all B ∈ Bd. ut
Remark 17.3. We have not really defined the integral∫B fX(~x)d~x. However, for our
purposes, we can define it as P[X ∈ B], and note that this coincides with the Riemann
integral on intervals.
Proof of Theorem 17.1. First assume that g ≥ 0, so Rd = g^{−1}[0,∞). For all n define
Yn = 2^{−n}⌊2^n g(X)⌋, which are discrete non-negative random variables.
First, we show that
E[Yn] = ∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x.
Indeed, for n, k ≥ 0 let Bn,k = g^{−1}[2^{−n} k, 2^{−n}(k + 1)) ∈ Bd. Note that
Yn = Σ_{k=0}^∞ 2^{−n} k 1{X∈Bn,k} and E[Yn] = Σ_{k=0}^∞ 2^{−n} k P[X ∈ Bn,k].
Now, since
Rd = g^{−1}[0,∞) = ⊎_{k=0}^∞ g^{−1}[2^{−n} k, 2^{−n}(k + 1)) = ⊎_{k=0}^∞ Bn,k,
and since
1{Bn,k} 2^{−n}⌊2^n g⌋ fX = 1{Bn,k} 2^{−n} k fX,
we have by the lemma above that
∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x = Σ_{k=0}^∞ 2^{−n} k ∫_{Bn,k} fX(~x) d~x
= Σ_{k=0}^∞ 2^{−n} k P[X ∈ Bn,k] = E[Yn].
Here we have used the fact that
Σ_{k=0}^N 1{Bn,k} 2^{−n} k ↗ Σ_{k=0}^∞ 1{Bn,k} 2^{−n} k.
Since |2^{−n}⌊2^n g⌋ − g| ≤ 2^{−n}, we get that |Yn − g(X)| ≤ 2^{−n}. Thus,
|E[g(X)] − ∫_{Rd} g(~x) fX(~x) d~x|
≤ |E[g(X)] − E[Yn]| + |∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x − ∫_{Rd} g(~x) fX(~x) d~x| ≤ 2 · 2^{−n} → 0.
This proves the theorem for non-negative functions g.
Now, if g is a general measurable function, write g = g+ − g−. Since g+, g− are
non-negative, we have that
E[g(X)] = E[g+(X) − g−(X)] = E[g+(X)] − E[g−(X)] = ∫_{Rd} (g+(~x) − g−(~x)) fX(~x) d~x. ut
Corollary 17.4. Let X be an absolutely continuous random variable with density fX.
Then,
E[X] = ∫_{−∞}^∞ t fX(t) dt.
Compare ∫ t fX(t) dt to Σ_r r P[X = r] in the discrete case. This is another place
where fX is “like” P[X = ·] (although the latter is identically 0 in the continuous case,
as we have seen).
Example 17.5. Expectations of some absolutely continuous random variables:
• X ∼ U[0, 1]: E[X] = ∫_0^1 t dt = 1/2.
• More generally, X ∼ U[a, b]:
E[X] = ∫_a^b t · (1/(b − a)) dt = (b² − a²)/(2(b − a)) = (a + b)/2.
• X ∼ Exp(λ): We use integration by parts, since ∫ e^{−λt} dt = −λ^{−1} e^{−λt}:
E[X] = ∫_0^∞ t · λ e^{−λt} dt = −t e^{−λt} |_0^∞ + ∫_0^∞ e^{−λt} dt = 1/λ.
• X ∼ N(µ, σ):
E[X] = ∫_{−∞}^∞ (t/(√(2π) σ)) exp(−(t − µ)²/(2σ²)) dt.
Change u = t − µ, so du = dt:
E[X] = (1/(√(2π) σ)) · ∫_{−∞}^∞ u exp(−u²/(2σ²)) du + ∫_{−∞}^∞ µ fX(t) dt.
Since the function u ↦ u exp(−u²/(2σ²)) is an odd function, its integral is 0, so
E[X] = µ ∫_{−∞}^∞ fX(t) dt = µ.
A simpler way is to notice that (X − µ)/σ ∼ N(0, 1), so
E[(X − µ)/σ] = (1/√(2π)) ∫_{−∞}^∞ t e^{−t²/2} dt = 0,
as an integral of an odd function. E[X] = µ then follows from linearity of expectation.
Proposition 17.6. Let X be an absolutely continuous random variable such that E[X]
exists. Then,
E[X] = ∫_0^∞ (P[X > t] − P[X ≤ −t]) dt = ∫_0^∞ (1 − FX(t) − FX(−t)) dt.
Proof. Note that
∫_0^∞ P[X > t] dt = ∫_0^∞ ∫_t^∞ fX(s) ds dt = ∫_0^∞ ∫_0^s dt fX(s) ds = ∫_0^∞ s fX(s) ds.
Similarly,
∫_0^∞ P[X ≤ −t] dt = ∫_0^∞ ∫_{−∞}^{−t} fX(s) ds dt = ∫_{−∞}^0 ∫_0^{−s} dt fX(s) ds = ∫_{−∞}^0 (−s) fX(s) ds.
Subtracting the two we obtain the result. ut
In a similar way we can prove
Exercise 17.7. Let X be a discrete random variable with range Z such that E[X] exists.
Then,
E[X] = Σ_{k=0}^∞ (P[X > k] − P[X < −k]).
Example 17.8. Let X ∼ N(0, 1). Compute E[X²].
By the above,
E[X²] = (1/√(2π)) ∫_R x² e^{−x²/2} dx = −(1/√(2π)) x e^{−x²/2} |_{−∞}^∞ + (1/√(2π)) ∫_R e^{−x²/2} dx = 1,
where we have used integration by parts, with ∂/∂x e^{−x²/2} = −x e^{−x²/2}.
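A numeric check of E[X²] = 1 via a midpoint Riemann sum (the truncation at ±10 and the step count are arbitrary; the discarded tail is negligible):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# midpoint Riemann sum for E[X^2] = ∫ x^2 phi(x) dx, truncated to [-10, 10]
a, b, steps = -10.0, 10.0, 400_000
h = (b - a) / steps
second_moment = h * sum((a + (i + 0.5) * h) ** 2 * phi(a + (i + 0.5) * h)
                        for i in range(steps))
```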
17.2. Examples Using Linearity
Example 17.9 (Coupon Collector). Gilad collects “super-goal” cards. There are N
cards to collect altogether. Each time he buys a card, he gets one of the N uniformly
at random, independently.
What is the expected amount of cards Gilad needs to buy in order to collect all cards?
For k = 0, 1, . . . , N − 1, let Tk be the number of cards bought after getting the k-th
new card, until getting the (k + 1)-th new card. That is, when Gilad has k different
cards, he buys Tk more cards until he has k + 1 different cards.
If Gilad has k different cards, then with probability (N − k)/N he buys a card he does not
already have. So Tk ∼ Geo((N − k)/N).
Since the total number of cards Gilad buys until getting all N cards is T = T0 + T1 +
T2 + · · · + TN−1, using linearity of expectation,
E[T] = E[T0] + E[T1] + · · · + E[TN−1] = 1 + N/(N − 1) + · · · + N = N · Σ_{k=1}^N 1/k.
454
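The identity E[T] = N · Σ_{k=1}^N 1/k can be checked empirically. A small simulation sketch (the deck size, trial count, and seed are illustrative choices):

```python
import random

def coupon_collector_draws(n_cards, rng):
    """Buy uniformly random cards until all n_cards distinct cards are seen."""
    seen, draws = set(), 0
    while len(seen) < n_cards:
        seen.add(rng.randrange(n_cards))
        draws += 1
    return draws

def expected_draws(n_cards):
    """The formula derived above: N * (1 + 1/2 + ... + 1/N)."""
    return n_cards * sum(1.0 / k for k in range(1, n_cards + 1))
```

Averaging `coupon_collector_draws` over many trials approaches `expected_draws`.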
Example 17.10 (Permutation fixed points). n soldiers receive packages from their
families. The packages get mixed up at the post office, so that each ordering is equally
likely. Let X be the number of soldiers that receive the correct package from their own
family. What is the expectation of X?
Let us solve this by writing X as a sum. Let X_i be the indicator of the event that
soldier i receives the correct package. What is the probability that X_i = 1? There are
(n−1)! possible orderings for which this may happen, so E[X_i] = (n−1)!/n! = 1/n.
Note that X = Σ_{i=1}^n X_i, so by linearity, E[X] = n · 1/n = 1.
Note also that the X_i's are not independent; for example, if i ≠ j,
P[X_i = 1, X_j = 1] = (n−2)!/n! = 1/(n(n−1)) ≠ 1/n².
454
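The expectation E[X] = 1, independent of n, can be observed in simulation. A quick sketch (the values of n, the trial count, and the seed are illustrative):

```python
import random

def avg_fixed_points(n, trials, rng):
    """Average number of soldiers receiving their own package over random orderings."""
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials
```

For any n, the average hovers near 1.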
Example 17.11. We toss a die 100 times. What is the expected sum of all tosses?
Here linearity really pays off. If X_k is the outcome of the k-th toss, and X = Σ_{k=1}^{100} X_k, then
E[X] = Σ_{k=1}^{100} E[X_k] = 100 · 7/2 = 350.
454
Example 17.12. 200 random numbers are output by a computer, each distributed
uniformly on [0, 1]. What is their expected sum?
E[X] = 200 · E[U[0, 1]] = 200 · 1/2 = 100.
454
Example 17.13. Let X_n ∼ U[0, 2^{−n}], for n ≥ 0, and let S_N = Σ_{k=0}^N X_k. What is the
expectation of S_N?
Linearity of expectation gives
E[S_N] = Σ_{k=0}^N E[X_k] = Σ_{k=0}^N 2^{−(k+1)} = 1 − 2^{−(N+1)}.
Note that if S_∞ = Σ_{k=0}^∞ X_k, then S_N ↑ S_∞, so monotone convergence gives that
E[S_∞] = 1. 454
17.3. A formula for general random variables
Theorem 17.14. Let X be a random variable. Then
E[X⁺] = ∫_0^∞ (1 − F_X(t)) dt and E[X⁻] = ∫_0^∞ F_X(−t) dt.
Thus, if E[X] exists then
E[X] = ∫_0^∞ (1 − F_X(t) − F_X(−t)) dt.
Proof. First suppose that X ≥ 0 and discrete. Then, if the range of X is R_X =
{0 = r_0 < r_1 < r_2 < · · ·}, we have for any t ∈ [r_j, r_{j+1}),
P[X > t] = P[X ≥ r_{j+1}] = Σ_{n=j+1}^∞ P[X = r_n].
Thus,
∫_{r_j}^{r_{j+1}} (1 − F_X(t)) dt = (r_{j+1} − r_j) · Σ_{n=j+1}^∞ P[X = r_n].
Summing over j and interchanging the sums, we have
∫_0^∞ (1 − F_X(t)) dt = Σ_{j=0}^∞ ∫_{r_j}^{r_{j+1}} (1 − F_X(t)) dt = Σ_{j=0}^∞ Σ_{n=j+1}^∞ (r_{j+1} − r_j) · P[X = r_n]
= Σ_{n=1}^∞ P[X = r_n] Σ_{j=0}^{n−1} (r_{j+1} − r_j) = Σ_{n=1}^∞ r_n P[X = r_n] = E[X].
Since for t > 0 the positivity of X tells us that F_X(−t) = 0, we have the theorem for
discrete non-negative random variables.
We also have the formula: for t ∈ (r_j, r_{j+1}],
P[X ≥ t] = P[X ≥ r_{j+1}] = Σ_{n=j+1}^∞ P[X = r_n].
Thus,
∫_{r_j}^{r_{j+1}} P[X ≥ t] dt = (r_{j+1} − r_j) · Σ_{n=j+1}^∞ P[X = r_n] = ∫_{r_j}^{r_{j+1}} P[X > t] dt,
so
(17.1)  E[X] = ∫_0^∞ P[X > t] dt = ∫_0^∞ P[X ≥ t] dt.
Now, if X ≥ 0 is any general non-negative random variable, let X_n := 2^{−n} ⌊2^n X⌋. So
X_n is discrete, non-negative and |X_n − X| ≤ 2^{−n}. Also, because X_n ≤ X, for any t > 0,
|F_{X_n}(t) − F_X(t)| = |P[X_n ≤ t] − P[X ≤ t]| = P[X_n ≤ t, X > t]
≤ P[t < X ≤ t + 2^{−n}] = F_X(t + 2^{−n}) − F_X(t) → 0,
as n → ∞ by right continuity. Since X_n ≤ X_{n+1}, we get that 1 − F_{X_n}(t) ↑ 1 − F_X(t),
and by monotone convergence,
E[X_n] = ∫_0^∞ (1 − F_{X_n}(t)) dt ↑ ∫_0^∞ (1 − F_X(t)) dt.
However, since |E[X_n] − E[X]| ≤ 2^{−n}, we obtain the theorem for general non-negative
random variables.
Finally, if X = X⁺ − X⁻ is any random variable, we have for all t ≥ 0 that F_{X⁺}(t) =
P[X⁺ ≤ t] = P[0 ≤ X ≤ t] + P[X < 0] = P[X ≤ t] = F_X(t), and since X⁺ ≥ 0,
E[X⁺] = ∫_0^∞ (1 − F_X(t)) dt.
Also, since X⁻ = (−X)⁺, for all t ≥ 0 we have P[X⁻ ≥ t] = P[−X ≥ t] = F_X(−t), so by (17.1),
E[X⁻] = ∫_0^∞ P[X⁻ ≥ t] dt = ∫_0^∞ F_X(−t) dt.
ut
Example 17.15. Let X be a random variable with distribution function
F_X(t) = 0 for t < 0;  (t+1)/2 for 0 ≤ t < 1/2;  7/8 for 1/2 ≤ t < 10;  1 for t ≥ 10.
Is X discrete? Is X absolutely continuous?
X cannot be absolutely continuous, because it is not even continuous: 0, 1/2, 10
are discontinuity points of F_X.
X cannot be discrete, because if r is such that P[X = r] > 0 then r is a discontinuity
point of F_X, so such r can only be 0, 1/2, 10. However,
P[X ∉ {0, 1/2, 10}] ≥ P[1/8 < X ≤ 1/4] = F_X(1/4) − F_X(1/8) = 5/8 − 9/16 = 1/16 > 0.
Finally, let us compute the expectation of X. Since X ≥ 0, we have that X = X⁺, so
E[X] = ∫_0^∞ (1 − F_X(t)) dt = ∫_0^{1/2} (1−t)/2 dt + ∫_{1/2}^{10} 1/8 dt
= (1/2)(t − t²/2) |_0^{1/2} + (1/8) · (19/2) = 1/4 − 1/16 + 19/16 = 11/8.
454
Lecture 18
18.1. Jensen’s Inequality
Recall that a function g : R → R is convex on [a, b] if for any 0 < λ < 1 and any
x, y ∈ [a, b],
g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y).
For example, any twice differentiable function with g′′ ≥ 0 is convex, e.g. x² and
−log x.
Convex functions are continuous, and so always measurable.
First a small lemma regarding convex functions:
convex
funcs lie
above tan-
gentsLemma 18.1. Let a < m < b, and let g : [a, b] → R be convex. Then, there exist
A,B ∈ R such that g(m) = Am+B and for all x ∈ [a, b], g(x) ≥ Ax+B.
Proof. For all a ≤ x < m < y ≤ b we have m = λx+ (1− λ)y where λ = y−my−x ∈ (0, 1).
Thus, g(m) ≤ λg(x) + (1 − λ)g(y) which implies that g(m)−g(x)m−x ≤ g(y)−g(m)
y−m . Thus,
taking supx<mg(m)−g(x)m−x ≤ A ≤ infy>m
g(y)−g(m)y−m , we have that for all x < y ∈ [a, b],
g(x) ≥ A(x−m) + g(m) = Ax+ (g(m)−Am). ut
Theorem 18.2 (Jensen’s Inequality). Let g : R → R be convex on [a, b]. Let X be a
random variable such that P[a ≤ X ≤ b] = 1. Then,
g(E[X]) ≤ E[g(X)].
Johan Jensen
(1859–1925)
Proof. If X is constant, then there is nothing to prove, so assume X is not constant.
Thus, a < E[X] < b. Let m = E[X]. By Lemma 18.1, g(X) ≥ AX + B and g(E[X]) = A E[X] + B
for some A, B ∈ R.
[Figure 5. A convex function lies above some linear function at an inner point m.]
Specifically, since g(X)⁻ ≤ |A||X| + |B| ≤ |A| max{|a|, |b|} + |B|, we
get that E[g(X)⁻] < ∞, so E[g(X)] is well defined.
Now, since AX + B ≤ g(X),
g(E[X]) = A E[X] + B = E[AX + B] ≤ E[g(X)].
ut
Example 18.3 (Arithmetic-geometric mean inequality). Let a_1, . . . , a_n be n positive numbers. Let X be a discrete random
variable with density f_X(a_k) = 1/n.
Note that the function g(x) = −log x is a convex function on (0, ∞) (since g′′(x) =
x^{−2} > 0). So Jensen's inequality gives that
−log((1/n) Σ_{k=1}^n a_k) = −log(E[X]) ≤ E[−log(X)] = −(1/n) Σ_{k=1}^n log a_k.
Exponentiating, we get
(1/n) Σ_{k=1}^n a_k ≥ (Π_{k=1}^n a_k)^{1/n}.
This is the arithmetic-geometric mean inequality. 454
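The inequality is easy to probe numerically; the gap between the two means is always nonnegative and vanishes exactly when all the a_k are equal. A small sketch:

```python
import math

def amgm_gap(values):
    """Arithmetic mean minus geometric mean; nonnegative by the inequality above."""
    n = len(values)
    arithmetic = sum(values) / n
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    return arithmetic - geometric
```

For distinct positive inputs the gap is strictly positive; for equal inputs it is (numerically) zero.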
Example 18.4. Let p ≥ 1. The function g(x) = |x|^p is convex, because away from
0, g′′(x) = p(p − 1)|x|^{p−2} ≥ 0, and at 0 we have g(0) = 0 ≤ λg(x) + (1 − λ)g(y) for any
0 < λ < 1 and x, y.
Thus, E[|X|^p] ≥ |E[X]|^p for any random variable X such that E[X] is defined.
Furthermore, for 1 ≤ q < p, if E[|X|^p] < ∞, then
E[|X|^q] = ((E[|X|^q])^{p/q})^{q/p} ≤ (E[|X|^p])^{q/p} < ∞,
since the function x ↦ |x|^{p/q} is convex. 454
18.2. Moments
Definition 18.5. Let X be a random variable on a probability space (Ω, F, P).
The n-th moment of X is defined to be E[X^n], if this exists and is finite. If this
expectation does not exist (or is infinite), then we say that X does not have an n-th
moment.
Definition 18.6. Let X be a random variable on a probability space (Ω, F, P), such
that |E[X]| < ∞.
The variance of X is defined to be
Var[X] := E[|X − E[X]|²].
The standard deviation of X is defined to be √Var[X].
Proposition 18.7. X has a second moment if and only if |E[X]| < ∞ and Var[X] < ∞.
In fact, Var[X] = E[X²] − E[X]².
Proof. Using linearity of expectation, for any a ∈ R,
E[(X − a)²] = E[X²] − 2a E[X] + a².
So E[(X − a)²] < ∞ if and only if −∞ < E[X] < ∞ and E[X²] < ∞. Taking a = E[X]
gives Var[X] = E[X²] − 2 E[X]² + E[X]² = E[X²] − E[X]². ut
Example 18.8. Let us calculate some second moments and variances:
• X ∼ Ber(p):
E[X²] = P[X = 1] · 1² + P[X = 0] · 0² = p. So Var[X] = E[X²] − E[X]² = p(1 − p).
116
• X ∼ Bin(n, p):
We use the identity k(k − 1)(n choose k) = n(n − 1)(n−2 choose k−2), valid for all 2 ≤ k ≤ n.
E[X(X − 1)] = Σ_{k=0}^n k(k − 1) P[X = k] = Σ_{k=2}^n n(n − 1)(n−2 choose k−2) p^{k−2} p² (1 − p)^{n−k}
= p² n(n − 1) · Σ_{k=0}^{n−2} (n−2 choose k) p^k (1 − p)^{n−2−k} = p²n² − p²n.
By linearity of expectation,
p² n(n − 1) = E[X²] − E[X] = E[X²] − np.
So E[X²] = p² n(n − 1) + np and Var[X] = np − p²n = np(1 − p).
• X ∼ Geo(p):
E[X²] = Σ_{k=1}^∞ k² (1 − p)^{k−1} p = p + Σ_{k=2}^∞ (k − 1 + 1)² (1 − p)^{k−2} p (1 − p)
= p + (1 − p) Σ_{k=1}^∞ (k + 1)² (1 − p)^{k−1} p = p + (1 − p) E[(X + 1)²]
= p + (1 − p) E[X²] + (1 − p) + 2(1 − p) E[X] = 1 + 2(1 − p)/p + (1 − p) E[X²].
So
E[X²] = 1/p + 2(1 − p)/p² = (2 − p)/p².
Thus
Var[X] = E[X²] − E[X]² = (1 − p)/p².
• X ∼ Poi(λ):
E[X²] = Σ_{k=0}^∞ k² · e^{−λ} λ^k / k!
= λ · Σ_{k=1}^∞ (k − 1 + 1) e^{−λ} λ^{k−1} / (k − 1)! = λ · Σ_{k=0}^∞ (k + 1) e^{−λ} λ^k / k!
= λ · E[X + 1] = λ(λ + 1).
Thus, Var[X] = E[X²] − E[X]² = λ.
• X ∼ Exp(λ): Use integration by parts:
E[X²] = ∫_0^∞ t² λe^{−λt} dt = −t² e^{−λt} |_0^∞ + ∫_0^∞ 2t e^{−λt} dt
= (2/λ) E[X] = 2/λ².
So Var[X] = 2/λ² − 1/λ² = 1/λ².
454
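The variance formulas above lend themselves to Monte Carlo checks. A sketch for the geometric and exponential cases (sample sizes and seed are illustrative choices):

```python
import random
import statistics

def empirical_variance(draw, n_samples, rng):
    """Population variance of n_samples i.i.d. draws from draw(rng)."""
    return statistics.pvariance([draw(rng) for _ in range(n_samples)])

def draw_geometric(p, rng):
    """Number of Ber(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k
```

For Geo(p) the empirical variance approaches (1−p)/p², and for Exp(λ) (via `rng.expovariate`) it approaches 1/λ².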
Exercise 18.9. Let X be a random variable with Var[X] < ∞. Show that for any
a, b ∈ R,
Var[aX + b] = a2 Var[X].
Solution. Let µ = E[X]. So E[aX + b] = aµ+ b. Then,
Var[aX + b] = E[(aX + b − (aµ + b))²] = E[(aX − aµ)²] = a² Var[X].
ut
Example 18.10. • X ∼ U[a, b]: Y = (X − a)/(b − a) ∼ U[0, 1]. (Prove this!) So
E[Y²] = ∫_0^1 s² ds = 1/3.
Thus Var[Y] = 1/3 − 1/4 = 1/12. We have that
Var[X] = Var[(b − a)Y + a] = (b − a)² Var[Y] = (b − a)²/12.
• X ∼ N(µ, σ): Recall that Y := (X − µ)/σ has N(0, 1) distribution. Thus, since
X = σY + µ, we have that Var[X] = σ² Var[Y].
Now, using integration by parts, since ∂/∂x e^{−x²/2} = −x e^{−x²/2},
Var[Y] = E[Y²] = 1/√(2π) · ∫_{−∞}^∞ t² e^{−t²/2} dt
= −1/√(2π) · t e^{−t²/2} |_{−∞}^∞ + 1/√(2π) · ∫_{−∞}^∞ e^{−t²/2} dt = 1.
So Var[X] = σ². (Note that this implies that E[X²] = σ² + µ².)
454
Lecture 19
19.1. Covariance
Definition 19.1. Let X,Y be random variables on some probability space (Ω,F ,P)
such that E[X],E[Y ],E[XY ] are finite. The covariance of X and Y is defined to be
Cov[X,Y ] := E[(X − E[X]) · (Y − E[Y ])].
The following are immediate:
Proposition 19.2. Let X,Y, Z be random variables on some probability space (Ω,F ,P).
Then,
• Cov[X,Y ] = E[XY ]− E[X]E[Y ].
• Cov[X,Y ] = Cov[Y,X].
• Cov[aX + Y,Z] = aCov[X,Z] + Cov[Y, Z].
• Cov[X,X] = Var[X] ≥ 0.
Each equality holds whenever the relevant covariances are defined.
Note that Cov is almost an inner product on the vector space of random variables on
(Ω,F ,P) with finite expectation.
19.2. Inner Products and Cauchy-Schwarz
Reminder:
Let V be a vector space over R. An inner-product on V is a function < ·, · > : V × V → R such that
• Symmetry: For all v, u ∈ V: < u, v > = < v, u >.
• Linearity: For all α ∈ R and all v, u, w ∈ V: < αu + v, w > = α < u, w > + < v, w >.
• Positivity: For all v ∈ V: < v, v > ≥ 0.
• Definiteness: For all v ∈ V: < v, v > = 0 if and only if v = 0.
For our purposes, we would like to replace the last condition by
• Definiteness′: For all v ∈ V: < v, v > = 0 if and only if < v, u > = 0 for all u ∈ V.
A basic, but super important and fundamental result is the Cauchy-Schwarz inequality.
(Denote ||v|| = √< v, v >.)
Augustin-Louis Cauchy
(1789–1857)
Theorem 19.3 (Cauchy-Schwarz Inequality). Let < ·, · > be an inner product on V
with the above modified definiteness condition. For all v, u ∈ V we have
| < v, u > | ≤ ||v|| · ||u||.
Proof. If ||u|| = 0 then both sides are 0 (because of the modified definiteness condition),
so we can assume without loss of generality that ||u|| > 0. Set λ = ||u||^{−2} < v, u >.
Then,
0 ≤ < v − λu, v − λu > = < v, v > + λ² < u, u > − 2λ < v, u > = ||v||² − ||u||^{−2} < v, u >².
Multiplying by ||u||² completes the proof. ut
Hermann Schwarz
(1843–1921)
19.2.1. Two Inner Products. Let (Ω,F ,P) be a probability space. Consider the
following vector space over R: Let L2(Ω,F ,P) be the space of all random variables X
with finite second moment.
In the exercises we show that if Cov[X,X] = 0 then Cov[X,Y] = 0 for all Y, and if
E[X²] = 0 then E[XY] = 0 for all Y. Thus, both Cov[X,Y] and E[XY]
form (modified) inner products on L²(Ω, F, P).
Note: in order to get honest-to-goodness inner products, one needs to take random
variables modulo measure 0 (for E[XY]), and modulo measure 0 and additive constants
for Cov.
We conclude:
Theorem 19.4 (Cauchy-Schwarz inequality). Let X,Y be random variables with finite
second moment. Then,
|Cov[X,Y]|² ≤ Var[X] · Var[Y] and |E[XY]|² ≤ E[X²] · E[Y²].
19.3. Correlation
Definition 19.5. If X,Y are random variables such that Cov[X,Y ] = 0 we say that
X,Y are uncorrelated. X1, . . . , Xn are said to be pairwise uncorrelated if for any
k 6= j, Xk, Xj are uncorrelated.
We will soon see that if X, Y are independent, they are uncorrelated. The opposite,
however, is not true. Indeed:
Example 19.6. Let X, Y have a joint distribution given by
X \ Y    −1     0     1
  0      1/3    0    1/3
  1       0    1/3    0
So f_X(1) = 1/3 and f_Y(1) = f_Y(−1) = 1/3. Hence,
E[XY] = P[X = 1, Y = 1] − P[X = 1, Y = −1] = 0,
E[X] = 1/3 and E[Y] = 1/3 − 1/3 = 0. So Cov[X,Y] = E[XY] − E[X]E[Y] = 0. But
X, Y are not independent, as
P[X = 1, Y = 1] = 0 ≠ (1/3) · (1/3) = P[X = 1] · P[Y = 1].
454
We now show that independence is equivalent to all functions of X and Y being uncorrelated:
Theorem 19.7. Let X, Y be random variables on some probability space (Ω, F, P).
Then, X,Y are independent, if and only if for any measurable functions g, h such that
Cov[g(X), h(Y )] is defined, we have that Cov[g(X), h(Y )] = 0.
Lemma 19.8. Let X,Y be random variables with finite second moment. If X,Y are
independent then they are uncorrelated.
Proof. It suffices to show that if X,Y are independent, then E[XY ] = E[X]E[Y ].
If X, Y are discrete random variables with range R,
E[XY] = Σ_{r,r′∈R} P[X = r, Y = r′] r r′ = Σ_{r,r′∈R} P[X = r] r · P[Y = r′] r′ = E[X]E[Y].
Define X⁺_n := 2^{−n} ⌊2^n X⁺⌋ and Y⁺_n := 2^{−n} ⌊2^n Y⁺⌋. Since X⁺_n, Y⁺_n are independent
and discrete, we have that
E[X⁺_n Y⁺_n] = E[X⁺_n] E[Y⁺_n].
Since X⁺_n ↑ X⁺ and Y⁺_n ↑ Y⁺, and these are non-negative, we get by monotone
convergence that
E[X⁺Y⁺] = lim_{n→∞} E[X⁺_n Y⁺_n] = lim_{n→∞} E[X⁺_n] E[Y⁺_n] = E[X⁺] E[Y⁺].
In a similar way, for X⁻_n := 2^{−n} ⌊2^n X⁻⌋ and Y⁻_n := 2^{−n} ⌊2^n Y⁻⌋, we have that
X⁻_n, Y⁻_n are independent, X⁻_n, Y⁺_n are independent, and X⁺_n, Y⁻_n are independent. Thus,
E[X^ξ_n Y^ζ_n] = E[X^ξ_n] E[Y^ζ_n] for any choice of ξ, ζ ∈ {+, −}, and taking limits we get that
E[X^ξ Y^ζ] = E[X^ξ] E[Y^ζ].
Altogether,
E[XY] = E[(X⁺ − X⁻)(Y⁺ − Y⁻)] = E[X⁺Y⁺] − E[X⁺Y⁻] − E[X⁻Y⁺] + E[X⁻Y⁻]
= E[X⁺]E[Y⁺] − E[X⁺]E[Y⁻] − E[X⁻]E[Y⁺] + E[X⁻]E[Y⁻]
= (E[X⁺] − E[X⁻]) · (E[Y⁺] − E[Y⁻]) = E[X] · E[Y].
ut
ut
Proof of Theorem 19.7. We repeatedly make use of the fact that Cov[X,Y ] = 0 if and
only if E[XY ] = E[X]E[Y ].
For the “if” direction: Let A_1 ∈ σ(X), A_2 ∈ σ(Y). So A_1 = X^{−1}(B_1), A_2 = Y^{−1}(B_2)
for Borel sets B_1, B_2. For the functions g(x) = 1_{x∈B_1} and h(x) = 1_{x∈B_2} we have
that
P[A_1 ∩ A_2] = E[1_{X∈B_1, Y∈B_2}] = E[g(X)h(Y)] = E[g(X)] E[h(Y)] = P[A_1] P[A_2].
This was the “easy” direction.
Now, for the “only if” direction: We want to show that if X, Y are independent,
then E[g(X)h(Y)] = E[g(X)] E[h(Y)] for any measurable functions such that the above
expectations exist. Since g(X), h(Y) are independent, the theorem follows from Lemma
19.8. ut
Exercise 19.9. Let X_1, . . . , X_n be random variables with finite second moment. Show
that
Var[X_1 + · · · + X_n] = Σ_{k=1}^n Var[X_k] + 2 Σ_{j<k} Cov[X_j, X_k].
Deduce the Pythagorean Theorem: If X_1, . . . , X_n are all pairwise uncorrelated, then
Var[X_1 + · · · + X_n] = Σ_{k=1}^n Var[X_k].
The Pythagorean Theorem lets us calculate very easily the variance of a binomial
random variable:
Example 19.10. Let X ∼ Bin(n, p). We know that X = Σ_{k=1}^n Y_k, where Y_1, . . . , Y_n
are independent Ber(p) random variables. So
Var[X] = Σ_{k=1}^n Var[Y_k] = np(1 − p).
The straightforward calculation would be more difficult. 454
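The conclusion Var[Bin(n, p)] = np(1 − p) is simple to confirm by simulating sums of independent Bernoulli indicators. A sketch (the parameters and seed are illustrative):

```python
import random
import statistics

def binomial_sample(n, p, rng):
    """Sum of n independent Ber(p) indicators."""
    return sum(1 for _ in range(n) if rng.random() < p)

def empirical_binomial_variance(n, p, trials, rng):
    """Empirical variance of many binomial samples; should approach n*p*(1-p)."""
    return statistics.pvariance([binomial_sample(n, p, rng) for _ in range(trials)])
```

With enough trials the empirical variance sits close to np(1 − p).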
19.4. Examples
Example 19.11. Let X ∼ U[−1, 1] and Y = X².
What is the distribution of Y?
P[Y ≤ t] = P[X ∈ [−√t, √t]] = 0 for t < 0;  √t for 0 ≤ t ≤ 1;  1 for t ≥ 1.
Note that for
f_Y(t) = (1/2) t^{−1/2} for t ∈ [0, 1], and 0 for t ∉ [0, 1],
we have, for any t ∈ [0, 1],
∫_{−∞}^t f_Y(s) ds = ∫_0^t (1/2) s^{−1/2} ds = √t = F_Y(t),
so Y is absolutely continuous with density f_Y.
Are X, Y independent? Of course not; for example,
P[X ∈ [0, 1/2], Y ∈ [1/2, 1]] = 0 ≠ (1/4) · (1 − 2^{−1/2}) = P[X ∈ [0, 1/2]] · P[Y ∈ [1/2, 1]].
What is Cov(X, Y)? Well,
E[XY] = E[X³] = ∫_{−1}^1 t³ · (1/2) dt = 0.
Also, E[X] = 0, so
Cov(X, Y) = E[XY] − E[X] · E[Y] = 0.
That is, X, Y are uncorrelated.
What about Z = X^{2n} for general n? In this case,
E[XZ] = E[X^{2n+1}] = ∫_{−1}^1 t^{2n+1} · (1/2) dt = 1/(2(2n+2)) · t^{2n+2} |_{−1}^1 = 0.
So X, X^{2n} are uncorrelated for any n.
As for odd moments, if W = X^{2n+1}, then
E[XW] = ∫_{−1}^1 t^{2n+2} · (1/2) dt = 2/(2(2n+3)) = 1/(2n+3).
Since E[X] = 0 and E[W] = 0, X and W are not uncorrelated: Cov(X, W) = 1/(2n+3). 454
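The "uncorrelated but dependent" phenomenon shows up clearly in simulation: the empirical covariance of X and X² hovers near zero even though the two are functionally dependent. A sketch (sample size and seed are illustrative):

```python
import random

def empirical_cov_x_xsq(n_samples, rng):
    """Empirical Cov(X, X^2) for X ~ U[-1, 1]; the example shows the true value is 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    ys = [x * x for x in xs]
    mx = sum(xs) / n_samples
    my = sum(ys) / n_samples
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n_samples
```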
Example 19.12. Let a ∈ (−1, 1) and
A = ( 1  a ; a  1 ).
So |A| = 1 − a² and
A^{−1} = 1/(1 − a²) · ( 1  −a ; −a  1 ).
Let (X, Y) be a two dimensional absolutely continuous random variable with density
f_{X,Y}(t, s) = 1/(2π√|A|) · exp(−(1/2) ⟨A^{−1}(t, s), (t, s)⟩) = 1/(2π√(1 − a²)) · exp(−(t² + s² − 2ats)/(2(1 − a²))).
First, let us show that f_{X,Y} is indeed a density. We will use the fact that t² + s² − 2ats =
(t − as)² + s² − a²s².
∫_{−∞}^∞ f_{X,Y}(t, s) dt = 1/(2π√(1 − a²)) · exp(−s²(1 − a²)/(2(1 − a²))) · ∫_{−∞}^∞ exp(−(t − as)²/(2(1 − a²))) dt
= 1/(2π√(1 − a²)) · e^{−s²/2} · √(2π(1 − a²)) = 1/√(2π) · e^{−s²/2}.
We have used that 1/√(2π(1 − a²)) · exp(−(t − as)²/(2(1 − a²))) is the density of a N(as, √(1 − a²)) random
variable. So Y ∼ N(0, 1). Symmetrically, X ∼ N(0, 1).
Thus ∫∫ f_{X,Y} dt ds = 1 and f_{X,Y} is indeed a density. Moreover, we have calculated
the marginal densities of X and Y.
Let us calculate
Cov(X, Y) = E[XY] = ∫∫ f_{X,Y}(t, s) t s dt ds
= ∫_{−∞}^∞ s · 1/√(2π) · e^{−s²/2} ∫_{−∞}^∞ t · 1/√(2π(1 − a²)) · exp(−(t − as)²/(2(1 − a²))) dt ds
= ∫_{−∞}^∞ as · s f_Y(s) ds = E[aY²] = a.
So now we also have that
Var[X + Y] = Var[X] + Var[Y] + 2 Cov(X, Y) = 1 + 1 + 2a = 2(1 + a).
Note that X, Y are independent if and only if Cov(X, Y) = 0. 454
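The factorization of the density above says that given Y = s, X is distributed N(as, √(1 − a²)). That gives a direct way to sample from the joint density and check Cov(X, Y) = a empirically. A sketch (the value of a, sample size, and seed are illustrative):

```python
import random

def sample_pair(a, rng):
    """Sample (X, Y) from the joint density above: Y ~ N(0,1), X | Y = s ~ N(a*s, sqrt(1-a^2))."""
    y = rng.gauss(0.0, 1.0)
    x = rng.gauss(a * y, (1.0 - a * a) ** 0.5)
    return x, y

def empirical_cov(pairs):
    """Empirical covariance of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return sum((x - mx) * (y - my) for x, y in pairs) / n
```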
Example 19.13. A particle moves in the NE (north-east) quadrant of the plane, [0, ∞)².
The direction θ ∼ U[0, π/2] and the velocity V ∼ U[0, 1], independently.
What is the covariance of X(t) and V, where the particle's position at time t is given
by (X(t), Y(t))?
Note that for any t, X(t) = V t cos(θ). So we want to calculate
E[X(t)V] − E[X(t)] E[V] = t E[V²] · E[cos(θ)] − t E[V]² · E[cos(θ)] = t Var[V] · E[cos(θ)].
We know that Var[V] = 1/12. Also,
E[cos θ] = ∫_0^{π/2} (2/π) cos t dt = 2/π.
So Cov(X(t), V) = t/(6π). 454
Example 19.14. Suppose Y ∼ Exp(λ) and for any s > 0, X|Y = s ∼ U[0, s]. What is
Cov(X, Y)?
We can calculate
E[XY] = ∫∫ t s f_{X,Y}(t, s) dt ds = ∫_0^∞ s f_Y(s) ∫ t f_{X|Y}(t|s) dt ds
= ∫_0^∞ s f_Y(s) · (s/2) ds = (1/2) · E[Y²] = λ^{−2}.
Also, E[Y] = λ^{−1}. If we consider the function (t, s) ↦ t we get
E[X] = ∫∫ t f_{X,Y}(t, s) dt ds = ∫_0^∞ f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = ∫_0^∞ E[X|Y = s] f_Y(s) ds
= ∫_0^∞ (s/2) f_Y(s) ds = (1/2) E[Y] = 1/(2λ).
So altogether,
Cov(X, Y) = 1/λ² − 1/(2λ) · 1/λ = 1/(2λ²).
454
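The two-stage description of (X, Y) translates directly into a sampler, which lets us check Cov(X, Y) = 1/(2λ²) empirically. A sketch (λ, sample size, and seed are illustrative):

```python
import random

def empirical_cov_xy(lam, n_samples, rng):
    """Empirical Cov(X, Y) for Y ~ Exp(lam) and X | Y = s ~ U[0, s]."""
    pts = []
    for _ in range(n_samples):
        y = rng.expovariate(lam)
        pts.append((rng.uniform(0.0, y), y))
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    return sum((x - mx) * (y - my) for x, y in pts) / n
```

For λ = 1 the derived covariance is 1/2, and the estimate converges to it.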
19.5. Markov and Chebyshev inequalities
Theorem 19.15 (Markov's inequality). Let X ≥ 0 be a non-negative random variable.
Then, for any a > 0,
P[X ≥ a] ≤ E[X]/a.
Andrey Markov
(1856–1922)
Proof. Consider the random variable Y = a · 1_{X≥a}. Note that Y ≤ X (here we use the
non-negativity of X). So
E[X] ≥ E[Y] = a P[X ≥ a].
ut
Theorem 19.16 (Chebyshev's inequality). Let X be a random variable with finite second
moment. Then, for any a > 0,
P[|X − E[X]| ≥ a] ≤ Var[X]/a².
Pafnuty Chebyshev
(1821–1894)
Proof. Apply Markov's inequality to the non-negative random variable Y = (X − E[X])², and note
that {|X − E[X]| ≥ a} = {Y ≥ a²}. ut
If X is a random variable with E[X] = µ and Var[X] = σ², then Chebyshev's inequality
tells us that the probability that X deviates from its mean by at least kσ (i.e. k standard
deviations) is at most 1/k².
Example 19.17. The average wage in Eurasia is 1000 a month.
• A person is chosen at random. What is the probability that this person earns
at least 10, 000 a month?
By Markov's inequality, if X is the random person's salary, then
P[X ≥ 10,000] ≤ E[X]/10,000 = 1/10.
• The average wage for an Inner Party member is 10,000 a month, and the standard
deviation of the wage is 100. What is the probability that a randomly
chosen Inner Party member earns at least 11,000?
Here we have that E[X] = 10,000 and Var[X] = 100² = 10,000. So
P[X ≥ 11,000] ≤ P[|X − E[X]| ≥ 1,000] ≤ Var[X]/1,000² = 1/100.
454
Example 19.18. Let X ∼ N(0, σ). Chebyshev's inequality tells us that P[|X| > kσ] ≤ k^{−2}.
However, because σ^{−1}X ∼ N(0, 1), we can calculate, for a = t/σ,
P[|X| > t] = P[σ^{−1}X > a] + P[σ^{−1}X < −a] = ∫_{−∞}^{−a} f_{σ^{−1}X}(s) ds + ∫_a^∞ f_{σ^{−1}X}(s) ds
= 2 ∫_a^∞ 1/√(2π) · e^{−s²/2} ds ≤ 2 ∫_a^∞ (s/a) · 1/√(2π) · e^{−s²/2} ds
= 2/(a√(2π)) · (−e^{−s²/2}) |_a^∞ = 2/(a√(2π)) · e^{−a²/2}.
If we plug in t = kσ we get
P[|X| > kσ] ≤ 2/(√(2π) k) · e^{−k²/2} << k^{−2},
which is a much better bound. 454
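The two bounds can be compared directly; at k = 3 the Gaussian-specific bound is already two orders of magnitude below Chebyshev's, while still lying above the true tail. A sketch:

```python
import math

def chebyshev_bound(k):
    """Chebyshev: P[|X| > k*sigma] <= 1/k^2."""
    return 1.0 / (k * k)

def gaussian_tail_bound(k):
    """The bound derived above: 2 / (sqrt(2*pi) * k) * exp(-k^2 / 2)."""
    return 2.0 / (math.sqrt(2.0 * math.pi) * k) * math.exp(-k * k / 2.0)
```

The exact tail P[|X| > k] for a standard normal is `math.erfc(k / sqrt(2))`, which sits between the true value and both bounds can be checked against it.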
19.6. The Weierstrass Approximation Theorem
Karl Weierstrass
(1815–1897)
Here is a probabilistic proof by Bernstein for the famous Weierstrass Approximation
Theorem.
Theorem 19.19 (Weierstrass Approximation Theorem). Let f : [0, 1] → R be continuous.
For every ε > 0 there exists a polynomial p(x) such that sup_{x∈[0,1]} |p(x) − f(x)| < ε.
(That is, the polynomials are dense in L^∞([0, 1]).)
Proof. By classical analysis, f is uniformly continuous and bounded on [0, 1]; that is,
there exists M > 0 such that sup_{x∈[0,1]} |f(x)| ≤ M, and for every ε > 0 there exists
δ = δ(ε) > 0 such that if |x − y| ≤ δ then |f(x) − f(y)| < ε/2.
Fix ε > 0. Let δ = δ(ε). Let x ∈ (0, 1). Consider B ∼ Bin(n, x). Chebyshev's
inequality guarantees that
P[|B − nx| ≥ n^{2/3}] ≤ Var[B] · n^{−4/3} = x(1 − x) · n^{−1/3} ≤ (1/4) n^{−1/3}.
Thus, we may choose n = n(ε, M) large enough so that this probability is at most ε/(8M)
for any x. Assume also that n^{−1/3} < δ.
Define the polynomial
p(x) := Σ_{j=0}^n (n choose j) x^j (1 − x)^{n−j} f(j/n).
That is, p(x) = E[f(B/n)] where B ∼ Bin(n, x).
Let A = {|B/n − x| > δ}. So
P[A] ≤ P[|B − nx| > n^{2/3}] ≤ ε/(8M).
Thus, since |f(B/n) − f(x)| ≤ 2M always, and |f(B/n) − f(x)| < ε/2 on A^c,
|p(x) − f(x)| = |E[f(B/n) − f(x)]|
≤ E[|f(B/n) − f(x)| · 1_A] + E[|f(B/n) − f(x)| · 1_{A^c}]
≤ 2M · P[A] + (ε/2) · P[A^c] ≤ ε/4 + ε/2 = 3ε/4 < ε.
This bound is independent of x. ut
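The polynomial in the proof is the classical Bernstein polynomial, and it is straightforward to evaluate. A sketch (the test function and degree below are arbitrary choices):

```python
import math

def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial p(x) = E[f(B/n)] for B ~ Bin(n, x)."""
    return sum(math.comb(n, j) * x ** j * (1.0 - x) ** (n - j) * f(j / n)
               for j in range(n + 1))
```

Even for a non-smooth function such as f(t) = |t − 1/2|, the degree-200 Bernstein polynomial is already uniformly close on [0, 1].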
19.7. The Paley-Zygmund Inequality
Raymond Paley
(1907–1933)
Antoni Zygmund
(1900–1992)
Theorem 19.20 (Paley-Zygmund Inequality). Let X ≥ 0 be a non-negative random
variable with finite variance. Then, for any α ∈ [0, 1],
P[X > αE[X]] ≥ (1 − α)² · (E[X])² / E[X²].
Specifically,
P[X > 0] ≥ (E[X])² / E[X²].
Proof. Note that
X = X · 1_{X ≤ αE[X]} + X · 1_{X > αE[X]}.
Since X ≥ 0, we have that X · 1_{X ≤ αE[X]} ≤ αE[X], and by Cauchy-Schwarz,
(E[X · 1_{X > αE[X]}])² ≤ E[X²] · P[X > αE[X]].
Combining these we have
E[X] ≤ αE[X] + √(E[X²] · P[X > αE[X]]),
which implies that
(1 − α)²(E[X])² ≤ E[X²] · P[X > αE[X]].
ut
Example 19.21 (Erdos-Renyi random graph). Suppose that n students are on Face-
book. For every two different students, say x, y, they become friends with probability
p, all pairs of students being independent.
We say that students x and y are connected if there is some sequence of students
x = x1, . . . , xk = y such that xj and xj+1 are friends for every j.
What is the probability that some student has no friends? What is the probability
that all students are connected?
Paul Erdos (1913–1996)
Alfred Renyi (1921–1970)
Suppose {1, 2, . . . , n} is the set of students. Let I_{x,y} be the indicator of the event that
x and y are friends. So (I_{x,y})_{x≠y} are (n choose 2) independent Bernoulli-p random variables.
Let X be the number of students with no friends; the isolated students.
Let us begin by calculating the expected number of isolated students: A student x is
isolated if I_{x,y} = 0 for all y ≠ x. So this happens with probability (1 − p)^{n−1}. Thus, by
linearity,
E[X] = Σ_x P[x is isolated] = n(1 − p)^{n−1} =: µ.
Let us also calculate the second moment of X: First, for x ≠ y, if both are isolated,
then for all z ∉ {x, y} we have I_{x,z} = 0 and I_{y,z} = 0. Also, I_{x,y} = 0. So,
P[x is isolated, y is isolated] = (1 − p)^{2(n−2)+1} = (1 − p)^{2(n−1)} (1 − p)^{−1}.
Thus,
E[X²] = E[Σ_{x,y} 1_{x is isolated, y is isolated}]
= Σ_x P[x is isolated] + Σ_{x≠y} P[x is isolated, y is isolated]
= µ + n(n − 1)(1 − p)^{2(n−1)} (1 − p)^{−1} = µ + µ² · (1 − p)^{−1} · (1 − 1/n).
Specifically,
(E[X])² / E[X²] = (µ^{−1} + (1 − p)^{−1} · (1 − 1/n))^{−1}.
We will make use of the inequalities e^{−p/(1−p)} ≤ 1 − p ≤ e^{−p}, valid for all p ≤ 1/2.
Now, if p ≥ (1 + ε) log n/(n − 1), then
E[X] ≤ n e^{−p(n−1)} ≤ n^{−ε} → 0,
so by Markov's inequality,
P[X ≥ 1] ≤ E[X] → 0.
That is, with high probability no student is isolated. (In fact, above this threshold all
students are connected with high probability, but proving that requires more work.)
On the other hand, if p ≤ (1 − ε) log n/(n − 1), then
µ ≥ n exp(−(1 − ε) log n/(1 − p)) → ∞
(because (1 − ε)/(1 − p) < 1 for large enough n). Thus, by the Paley-Zygmund inequality,
P[X > 0] ≥ (E[X])² / E[X²] = (µ^{−1} + (1 − p)^{−1} · (1 − 1/n))^{−1} → 1.
That is, with high probability there exists a student that has no friends. 454
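The first-moment computation µ = n(1 − p)^{n−1} can be checked by sampling random graphs. A sketch (the graph size, the factor 1/2 on the threshold, and the seed are illustrative):

```python
import math
import random

def count_isolated(n, p, rng):
    """Sample a friendship graph G(n, p) and count students with no friends."""
    degree = [0] * n
    for x in range(n):
        for y in range(x + 1, n):
            if rng.random() < p:
                degree[x] += 1
                degree[y] += 1
    return sum(1 for d in degree if d == 0)
```

Averaging over many sampled graphs, the count of isolated students approaches n(1 − p)^{n−1}.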
Lecture 20
20.1. Convergence of Random Variables
As in the case of sequences of numbers, we would like to talk about convergence of
random variables. There are many ways to say that a sequence of random variables
approximates some limiting random variable. We will talk about 4 such possibilities.
20.2. a.s. and Lp Convergence
We have already seen some type of convergence; namely, Xn → X if Xn(ω)→ X(ω)
for all ω ∈ Ω. This is a strong type of convergence; we are usually not interested in
changes that have probability 0, so we will define the following type of convergence:
Let (Xn)n be a sequence of random variables on some probability space (Ω,F ,P). We
have already seen that lim inf Xn, lim supXn are both random variables. Thus, the set
ω : lim inf Xn(ω) = lim supXn(ω) is an event in F . Hence we can define: a.s. conver-
genceDefinition 20.1. Let (Xn)n be a sequence of random variables on (Ω,F ,P). Let X be
a random variable.
We say that (Xn) converges to X almost surely, or a.s., denoted
Xna.s.−→ X
if P[limXn = X] = 1; that is, if
P[ω : lim
n→∞Xn(ω) = X(ω)
] = 1.
Another type of convergence is a type of average convergence:
Definition 20.2. Let p > 0. Let (X_n)_n be a sequence of random variables on (Ω, F, P)
such that E[|X_n|^p] < ∞ for all n. Let X be a random variable such that E[|X|^p] < ∞.
We say that (X_n) converges to X in L^p, denoted
X_n →Lp X,
if
lim_{n→∞} E[|X_n − X|^p] = 0.
In the exercises we will show that for 0 < p < q, L^q convergence implies L^p convergence.
We will also show that the limit is unique; that is:
Exercise 20.3. Suppose that X_n →a.s. X and X_n →Lp Y. Show that P[X = Y] = 1.
Suppose that X_n →Lp X and X_n →Lq Y. Show that P[X = Y] = 1.
Example 20.4. Fix p > 0. Let (Ω, F, P) be the uniform measure on [0, 1]. For all n let X_n
be the discrete random variable
X_n(ω) = n^{1/p} for ω ∈ [0, 1/n], and 0 otherwise.
So X_n has density
f_{X_n}(s) = 1/n for s = n^{1/p};  1 − 1/n for s = 0;  0 otherwise.
Note that
E[|X_n|^q] = (1/n) |n^{1/p}|^q = n^{q/p − 1}.
Thus, if 0 < q < p, then E[|X_n − 0|^q] → 0, so X_n →Lq 0.
However, for q ≥ p we have that E[|X_n − 0|^q] ≥ 1 for all n, so X_n does not converge
to 0 in L^q (and thus X_n does not converge in L^q to any X, since the limit must be unique).
We claim that X_n →a.s. 0. Indeed, let ω ∈ [0, 1]. If ω > 0, then for all n > 1/ω, we get
that X_n(ω) = 0, and thus X_n(ω) → 0. Thus,
P[{ω : X_n(ω) ↛ 0}] ≤ P[{0}] = 0,
so X_n →a.s. 0.
This example has shown that:
• We can have L^q convergence without L^p convergence, for p > q.
• We can have a.s. convergence without L^p convergence.
454
Example 20.5. For all n let X_n be mutually independent Ber(1/n) random variables.
Thus, for all n and all p > 0,
E[|X_n|^p] = 1/n → 0, so X_n →Lp 0.
On the other hand, let A_n = {|X_n| > 1/2}. So (A_n)_n is a sequence of mutually
independent events, and P[A_n] = 1/n. Thus, since Σ_n P[A_n] = ∞, using the second
Borel-Cantelli Lemma,
P[lim sup_n A_n] = 1.
That is, with probability 1, there exist infinitely many n such that X_n > 1/2. Thus,
P[lim sup_n X_n ≥ 1/2] = 1.
So we cannot have that X_n →a.s. 0. Since the limit must be unique, we cannot have
X_n →a.s. X for any X.
We conclude that: It is possible for (X_n) to converge in L^p for all p, but not to
converge a.s. 454
20.3. Convergence in Probability
Definition 20.6. Let (X_n)_n be a sequence of random variables on (Ω, F, P). We say
that (X_n) converges in probability to X, denoted
X_n →P X,
if for all ε > 0,
lim_{n→∞} P[|X_n − X| > ε] = 0.
Example 20.7. Suppose that X_n →a.s. X. That is,
P[|X_n − X| → 0] = 1.
Fix ε > 0. For any n let A_n = {|X_n − X| > ε}. Note that if A_n occurs for infinitely
many n, then lim sup_n |X_n − X| ≥ ε > 0. Thus, by Fatou's Lemma,
0 ≤ lim sup_n P[A_n] ≤ P[lim sup_n A_n] = P[A_n i.o.] ≤ P[lim sup_n |X_n − X| > 0] = 0.
Thus,
lim_n P[|X_n − X| > ε] = lim sup_n P[A_n] = 0,
and so X_n →P X.
That is: a.s. convergence implies convergence in probability. 454
Example 20.8. Assume that X_n →Lp X. Then, for any ε > 0, by Markov's inequality,
P[|X_n − X| > ε] ≤ E[|X_n − X|^p]/ε^p → 0.
Thus, X_n →P X.
That is, convergence in L^p implies convergence in probability. 454
The following is in the exercises:
Exercise 20.9. Let X_n →P X and X_n →P Y. Show that P[X = Y] = 1.
Lecture 21
21.1. The Law of Large Numbers
In this section we will prove
Theorem 21.1 (Strong Law of Large Numbers). Let X_1, X_2, . . . , X_n, . . . be a sequence
of mutually independent random variables, such that E[X_n] = 0 and sup_n E[X_n²] < ∞.
For each N let
S_N = Σ_{n=1}^N X_n.
Then,
S_N/N →a.s. 0.
21.1.1. Kolmogorov’s inequality.
Lemma 21.2 (Kolmogorov's inequality). Let X_1, X_2, . . . , X_n, . . . be a sequence of mutually
independent random variables, such that E[X_n] = 0 and E[X_n²] < ∞. For each N
let
S_N = Σ_{n=1}^N X_n and M_N = max_{1≤n≤N} |S_n|.
Then, for any λ > 0,
P[M_N ≥ λ] ≤ Σ_{n=1}^N E[X_n²] / λ².
Proof. Fix λ > 0 and let
A_n = {|S_1| < λ, |S_2| < λ, . . . , |S_{n−1}| < λ, |S_n| ≥ λ}.
We will make a few observations.
First note that since (X_n) are mutually independent, they are uncorrelated, so
E[S_N] = 0 and Var[S_N] = E[S_N²] = Σ_{n=1}^N E[X_n²].
Next, note that M_N ≥ λ if and only if there exists 1 ≤ n ≤ N such that |S_n| ≥ λ and
|S_k| < λ for all 1 ≤ k < n. That is, M_N ≥ λ implies that there exists 1 ≤ n ≤ N such
that S_n² · 1_{A_n} ≥ λ².
Since (X_1, . . . , X_n) are independent of (X_{n+1}, . . . , X_N), we have that the random
variable S_n · 1_{A_n} is independent of X_{n+1} + X_{n+2} + · · · + X_N = S_N − S_n. Thus, for any
1 ≤ n ≤ N we have that
E[S_N² · 1_{A_n}] = E[(S_N − S_n + S_n)² · 1_{A_n}] = E[(S_N − S_n)² · 1_{A_n}] + E[S_n² · 1_{A_n}] + 2 E[(S_N − S_n) · S_n · 1_{A_n}]
≥ E[S_n² · 1_{A_n}],
where we have used that
E[(S_N − S_n) · S_n · 1_{A_n}] = E[S_N − S_n] · E[S_n · 1_{A_n}] = 0
by independence.
We now have that
E[S_N²] ≥ E[S_N² · 1_{M_N ≥ λ}] = E[Σ_{n=1}^N 1_{A_n} S_N²] ≥ Σ_{n=1}^N E[S_n² · 1_{A_n}].
Note that by Boole's inequality, since M_N ≥ λ implies that there exists 1 ≤ n ≤ N such
that S_n² · 1_{A_n} ≥ λ²,
P[M_N ≥ λ] ≤ Σ_{n=1}^N P[S_n² · 1_{A_n} ≥ λ²] ≤ (1/λ²) Σ_{n=1}^N E[S_n² · 1_{A_n}] ≤ (1/λ²) E[S_N²].
ut
21.1.2. Kronecker’s Criterion.
Lemma 21.3. Suppose that x_1, . . . , x_n, . . . is a sequence of numbers such that Σ_{n=1}^∞ x_n/n
converges. Then,
lim_{N→∞} (1/N) Σ_{n=1}^N x_n = 0.
Proof. Let s_N = Σ_{n=1}^N x_n/n. So s_N → s := Σ_{n=1}^∞ x_n/n. Thus, also (1/N) Σ_{n=1}^N s_n → s. Note
that for all n, x_n = n(s_n − s_{n−1}) (where we set s_0 = 0). Thus,
(1/N) Σ_{n=1}^N x_n = (1/N)(N s_N + ((N−1) − N) s_{N−1} + ((N−2) − (N−1)) s_{N−2} + · · · + (1 − 2) s_1)
= s_N − ((N−1)/N) · (1/(N−1)) Σ_{n=1}^{N−1} s_n → s − s = 0.
ut
21.1.3. Proof of Law of Large Numbers.
Lemma 21.4. Let X_1, X_2, . . . , X_n, . . . be a sequence of mutually independent random
variables such that E[X_n] = 0 for all n. If Σ_{k=1}^∞ Var[X_k] < ∞ then
P[Σ_{k=1}^∞ X_k converges] = 1.
Proof. Let S_n = Σ_{k=1}^n X_k.
For every ω ∈ Ω, Σ_{k=1}^∞ X_k(ω) converges if and only if S_n(ω) converges, which is if
and only if (S_n(ω))_n form a Cauchy sequence. Thus, it suffices to show that
P[(S_n)_n is a Cauchy sequence] = 1.
Let n > 0, and consider
S_{n+m} − S_n = X_{n+1} + X_{n+2} + · · · + X_{n+m}.
Kolmogorov's inequality gives that
P[max_{1≤m≤N} |S_{n+m} − S_n| ≥ λ] ≤ (1/λ²) · Σ_{k=1}^N E[X²_{n+k}] ≤ (1/λ²) · Σ_{k≥1} E[X²_{n+k}].
Taking N → ∞ on the left-hand side, using continuity of probability for the increasing
sequence of events,
P[sup_{m≥1} |S_{n+m} − S_n| ≥ λ] ≤ (1/λ²) · Σ_{k≥1} E[X²_{n+k}].
This last term is the tail of the convergent series Σ_k E[X_k²]. Thus, for any r > 0 there
exists n(r) so that Σ_{k≥n(r)} E[X_k²] < r^{−3}, which implies that
P[inf_n sup_{m≥1} |S_{n+m} − S_n| > r^{−1}] ≤ P[sup_{m≥1} |S_{n(r)+m} − S_{n(r)}| > r^{−1}] < r^{−1}.
Now, the events {inf_n sup_{m≥1} |S_{n+m} − S_n| > r^{−1}} are increasing in r, so taking r → ∞
with continuity of probability,
P[inf_n sup_{m≥1} |S_{n+m} − S_n| > 0] ≤ 0.
That is,
P[inf_n sup_{m≥1} |S_{n+m} − S_n| = 0] = 1.
Now let ω ∈ Ω be such that inf_n sup_{m≥1} |S_{n+m}(ω) − S_n(ω)| = 0. This implies that
for all ε > 0 there exists N = N(ε, ω) such that
sup_{m≥1} |S_{N+m}(ω) − S_N(ω)| < ε/2.
Thus, for any n > N(ε, ω) and any m ≥ 1,
|S_{n+m}(ω) − S_n(ω)| ≤ |S_{n+m}(ω) − S_N(ω)| + |S_n(ω) − S_N(ω)| < ε.
That is, for any ε > 0 there exists N = N(ε, ω) such that for all n > N(ε, ω) and all
m ≥ 1, |S_{n+m}(ω) − S_n(ω)| < ε; that is, for this ω, (S_n(ω))_n is a Cauchy sequence. We
have shown that
{ω : inf_n sup_{m≥1} |S_{n+m}(ω) − S_n(ω)| = 0} ⊂ {ω : (S_n(ω))_n is a Cauchy sequence}.
Since the first event has probability 1, so does the second event, and we are done. ut
The proof of the law of large numbers is now immediate:

Proof of Theorem 21.1. By Kronecker's Criterion it suffices to show that
P[ ∑_{n=1}^∞ X_n/n converges ] = 1.
By the above Lemma it suffices to show that
∑_{n=1}^∞ Var[X_n/n] < ∞.
Since Var[X_n/n] = 1/n^2, this is immediate. □
Example 21.5. The Central Bureau of Statistics wants to assess how many citizens use the train system. They poll N randomly and independently chosen citizens, and set X_j = 1 if the j-th citizen uses the train, and X_j = 0 otherwise. Any citizen is equally likely to be chosen, and all are independent.
As N → ∞ the sample mean
(1/N) ∑_{j=1}^N X_j
converges to E[X_j] = E[X_1], which is the probability that a randomly chosen citizen uses the train system. Since we choose each citizen uniformly, this probability is exactly the number of citizens that use the train, divided by the total number of citizens.
Thus, for large N,
C · (1/N) ∑_{j=1}^N X_j
is a good approximation for the number of citizens that use the train system, where C is the number of citizens.
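This estimator is easy to try numerically. Below is a minimal sketch in Python; the population size and the number of train users are made-up numbers for illustration.

```python
import random

random.seed(0)

C = 100_000           # hypothetical total number of citizens
train_users = 37_000  # hypothetical number of citizens who use the train

def poll_estimate(N):
    """Poll N citizens uniformly at random (with replacement) and
    rescale the sample mean by the population size C."""
    hits = sum(1 for _ in range(N) if random.randrange(C) < train_users)
    return C * hits / N

estimate = poll_estimate(100_000)
```

By the law of large numbers the estimate converges to the true count as N → ∞; the typical error is of order C · √(p(1−p)/N).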
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 22
22.1. Convergence in Distribution

Previously we talked about types of convergence that required the sequence and the limit to be defined on the same probability space. We now look at a type of convergence which does not have this requirement.

Definition 22.1. Let (X_n)_n be a sequence of random variables. We say that (X_n) converges in distribution to X, denoted X_n →^D X, if for all t ∈ R at which F_X is continuous,
lim_{n→∞} F_{X_n}(t) = F_X(t).
That is, the distribution functions of X_n converge to the distribution function of X at every continuity point of F_X.
Remark 22.2. Note that the convergence is required only at continuity points of F_X.

Example 22.3. Let X_n ∼ U[0, 1/n]. Then X_n →^D 0. Indeed, the distribution function of the constant 0 is F_0(t) = 1 for t ≥ 0 and F_0(t) = 0 for t < 0. Note that
F_{X_n}(t) = 0 for t < 0, F_{X_n}(t) = nt for 0 ≤ t ≤ 1/n, and F_{X_n}(t) = 1 for t > 1/n.
Since 0 is not a continuity point of F_0, we don't care about it. For t < 0 we have that F_{X_n}(t) = 0 = F_0(t). For t > 0 we have that for large enough n > 1/t, F_{X_n}(t) = 1 = F_0(t).
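The pointwise convergence (away from t = 0) can be checked directly from the two distribution functions; a small sketch:

```python
def F_Xn(t, n):
    """Distribution function of X_n ~ U[0, 1/n]."""
    if t < 0:
        return 0.0
    if t <= 1.0 / n:
        return n * t
    return 1.0

def F_0(t):
    """Distribution function of the constant random variable 0."""
    return 1.0 if t >= 0 else 0.0

n = 10_000
# At continuity points t != 0 of F_0 the values agree for large n:
checks = all(abs(F_Xn(t, n) - F_0(t)) < 1e-9 for t in (-1.0, -0.01, 0.05, 1.0))
# At t = 0, however, F_Xn(0) = 0 for every n while F_0(0) = 1:
gap_at_zero = F_0(0) - F_Xn(0, n)
```

This is exactly why the definition excludes discontinuity points of the limit: the gap at t = 0 never closes.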
Note that for the other types of convergence we needed all random variables to live on the same space. For convergence in distribution, this is not required. However, this is the weakest kind of convergence, as can be seen by the following proposition.

Proposition 22.4. Convergence in probability implies convergence in distribution.

Proof. Let X_n →^P X. Fix t ∈ R.
First, for all ε > 0,
P[X_n ≤ t] = P[X_n ≤ t, |X_n − X| ≤ ε] + P[X_n ≤ t, |X_n − X| > ε]
≤ P[X ≤ t + ε] + P[|X_n − X| > ε] → P[X ≤ t + ε].
So lim sup_n F_{X_n}(t) ≤ F_X(t + ε) for all ε. Since F_X is right-continuous, we get that lim sup_n F_{X_n}(t) ≤ F_X(t).
On the other hand, for any ε > 0,
P[X ≤ t − ε] = P[X ≤ t − ε, |X_n − X| ≤ ε] + P[X ≤ t − ε, |X_n − X| > ε]
≤ P[X_n ≤ t] + P[|X_n − X| > ε].
Taking lim inf we get that F_X(t − ε) ≤ lim inf_n F_{X_n}(t). Now, if F_X is continuous at t, then taking ε → 0 gives
F_X(t) ≤ lim inf_n F_{X_n}(t) ≤ lim sup_n F_{X_n}(t) ≤ F_X(t).
So X_n →^D X. □
22.2. Approximating by Smooth Functions

Lemma 22.5. Let (X_n)_n be a sequence of random variables. Let C³_b be the space of all three times continuously differentiable functions on R with bounded derivatives. If for every φ ∈ C³_b,
E[φ(X_n)] → E[φ(X)],
then X_n →^D X.

Proof. We want to choose a C³ function ψ that will approximate the step function 1_{(−∞,0]}. For this we choose a C³ function that is 1 on (−∞, 0], decreasing on [0, 1] and 0 on [1, ∞). Call this function ψ. (An explicit construction can be found in Section 22.8 below.)
For every t ∈ R and ε > 0 define
ψ_{t,ε}(x) = ψ(ε^{−1}(x − t)).
Note that for every t ∈ R and ε > 0 we have
ψ_{t−ε,ε}(x) ≤ 1_{(−∞,t]}(x) ≤ ψ_{t,ε}(x).
For convergence in distribution we need to show that for every t ∈ R such that F_X is continuous at t,
lim_{n→∞} E[1_{(−∞,t]}(X_n)] = E[1_{(−∞,t]}(X)].
Fix t ∈ R. For any ε > 0 we have that
E[1_{(−∞,t]}(X_n)] − E[1_{(−∞,t]}(X)] ≤ E[ψ_{t,ε}(X_n)] − E[1_{(−∞,t]}(X)] → E[ψ_{t,ε}(X)] − E[1_{(−∞,t]}(X)]
≤ E[1_{(−∞,t+ε]}(X)] − E[1_{(−∞,t]}(X)] = P[X ∈ (t, t + ε]] = F_X(t + ε) − F_X(t).
Similarly,
E[1_{(−∞,t]}(X)] − E[1_{(−∞,t]}(X_n)] ≤ E[1_{(−∞,t]}(X)] − E[ψ_{t−ε,ε}(X_n)] → E[1_{(−∞,t]}(X)] − E[ψ_{t−ε,ε}(X)]
≤ E[1_{(−∞,t]}(X)] − E[1_{(−∞,t−ε]}(X)] = P[X ∈ (t − ε, t]] = F_X(t) − F_X(t − ε).
We conclude that
lim sup_{n→∞} | E[1_{(−∞,t]}(X_n)] − E[1_{(−∞,t]}(X)] | ≤ |F_X(t + ε) − F_X(t)| + |F_X(t) − F_X(t − ε)|
for all ε > 0.
For any t that is a continuity point of F_X the above tends to 0 as ε → 0. □
22.3. Central Limit Theorem

We will now build the tools for the following very important approximation theorem:

Theorem 22.6 (Central Limit Theorem). Let X_1, X_2, ..., X_n, ... be mutually independent random variables, such that E[X_n] = 0 and E[X_n^2] = 1. Let
S_N = ∑_{n=1}^N X_n.
Then, S_N/√N →^D N(0, 1). That is, for all t ∈ R,
P[S_N ≤ t√N] → ∫_{−∞}^t (1/√(2π)) e^{−s²/2} ds.

In the exercises we will show that:

Exercise 22.7. Let X_1, X_2, ..., X_n, ... be mutually independent random variables, such that E[X_n] = µ and Var[X_n] = σ². Let
S_N = ∑_{n=1}^N X_n.
Then, (S_N − Nµ)/(√N σ) →^D N(0, 1).

That is, no matter what distribution we start with, we can approximate the sum of many independent trials by a normal random variable.
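The theorem is easy to see empirically. A minimal sketch, using ±1 steps (mean 0, variance 1) and comparing the empirical distribution of S_N/√N at one point with the normal distribution function:

```python
import math
import random

random.seed(1)

def normal_cdf(t):
    """P[N(0,1) <= t], written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

N, trials = 200, 5_000

def normalized_sum():
    """S_N / sqrt(N) for N independent +/-1 steps."""
    return sum(random.choice((-1, 1)) for _ in range(N)) / math.sqrt(N)

t = 1.0
empirical = sum(1 for _ in range(trials) if normalized_sum() <= t) / trials
theoretical = normal_cdf(t)  # about 0.8413
```

The agreement is already good at N = 200, up to a small discreteness error coming from the lattice values of S_N.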
22.4. Uses for the CLT

Example 22.8. Every student's grade in a course has expectation 80 and standard deviation 20. Give an approximation of the average grade for 100 students, if all students are independent. What is a good estimate for the probability of the average grade to be below 70?
If X_k is the grade of the k-th student, then the average grade is M := (1/100) ∑_{k=1}^{100} X_k.
The central limit theorem tells us that (1/(10·20)) · (100M − 100·80) = (1/2)(M − 80) is very close to a N(0, 1) random variable. Thus,
P[M < 70] = P[(1/2)(M − 80) < −5] ≈ P[N(0, 1) < −5] = ∫_{−∞}^{−5} (1/√(2π)) e^{−s²/2} ds,
which can be calculated using a standard normal distribution table.
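Instead of a table, the integral can be evaluated with the error function from Python's standard library, using Φ(t) = (1 + erf(t/√2))/2:

```python
import math

def normal_cdf(t):
    """P[N(0,1) <= t] = (1 + erf(t / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# The probability that the average grade of 100 students is below 70:
p = normal_cdf(-5.0)
```

The value is about 2.9 · 10⁻⁷, so the event is extremely unlikely.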
Example 22.9. X_1, X_2, ... are independent random variables with the same distribution, such that E[X_n] = 60 and Var[X_n] = 25.
Show that we can take N large enough so that the empirical average (1/N) ∑_{k=1}^N X_k is between 55 and 65 with probability greater than 0.99. How large should we take N if we want to ensure this?
Let S_N = ∑_{k=1}^N X_k. The central limit theorem tells us that the distribution of (1/(5√N)) (S_N − 60N) converges to the distribution of a N(0, 1) random variable. Thus,
P[55 < S_N/N < 65] = P[ −√N < (S_N − 60N)/(5√N) < √N ] ≈ P[ −√N < N(0, 1) < √N ].
The right-hand side, P[−√N < N(0, 1) < √N], goes to 1. So there is some large enough N so that this is greater than 0.99.
Example 22.10. X_1, X_2, ... are independent random variables with the same distribution, such that E[X_n] = µ and Var[X_n] = 1.
Show that there exist N_0 and c > 0 such that for all N > N_0,
P[ |S_N − Nµ| ≤ (1/2)√N ] ≥ c,
where S_N = ∑_{k=1}^N X_k.
Note that we cannot use Kolmogorov here, because that would only give
P[ max_{1≤n≤N} |S_n − nµ| > (1/2)√N ] ≤ 4.
However, the central limit theorem tells us that
P[ |S_N − Nµ| ≤ (1/2)√N ] → P[ |N(0, 1)| ≤ 1/2 ].
If we take c = (1/2) P[ |N(0, 1)| ≤ 1/2 ], we get that for all large enough N,
P[ |S_N − Nµ| ≤ (1/2)√N ] > c.
Example 22.11. The government decides to test if the lab workers at the central lab of Kupat Cholim are adequate. They are given an exam, and graded between 0 and 100. The exam is such that the expectation of a random worker's grade is 80 and the standard deviation is 20.
If the lab has 10000 workers, give a good approximation of the probability that the average grade in the lab is smaller than 70.
In this case, we can let X_1, ..., X_{10000} be the grades of the workers, so E[X_j] = 80 and Var[X_j] = 400. If A = (1/10000) ∑_{j=1}^{10000} X_j is the average grade, the central limit theorem tells us that
P[A ≤ 70] = P[100(A − 80) ≤ −10 · 100] = P[ (1/(100·20)) ∑_{j=1}^{10000} (X_j − 80) ≤ −50 ]
is very close to
P[N(0, 1) ≤ −50] = ∫_{−∞}^{−50} (1/√(2π)) e^{−s²/2} ds.
Example 22.12. Suppose there are 1.2 million people that will actually vote in the elections. We want to forecast how many mandates party A will receive in the elections. Each mandate is worth 1.2 · 10^6 / 120 = 10^4 people. There are a people that will actually vote for party A, so party A will get 10^{−4} a mandates.
Suppose we choose a voter uniformly at random. The probability that she votes for party A is then p = a/(1.2 · 10^6), and the variance is p(1 − p). If we repeat this experiment 900 times, we can count the number of people who said they would vote for party A, and the average of this number should be close to p. So if X_j is the indicator of the event that the j-th voter polled said she would vote for party A, we get that
120 · (1/900) ∑_{j=1}^{900} X_j
is close to 120p = 1.2 · 10^6 · 10^{−4} · p = 10^{−4} a, which is the number of mandates party A gets.
How good is this approximation? What is the number of mandates k so that this is close to the real number with 95% confidence?
For any k, we have that
P[ |120 · (1/900) ∑_{j=1}^{900} X_j − 10^{−4} a| > k ] = P[ |(1/900) ∑_{j=1}^{900} (X_j − p)| > k/120 ]
= P[ |(1/30) ∑_{j=1}^{900} (X_j − p)| > k/4 ],
which is close to
P[ |N(0, 1)| > k/(4√(p(1−p))) ]
by the central limit theorem. If we take k ≥ 4√(p(1−p)) · t then this is at most
2 ∫_t^∞ (1/√(2π)) e^{−s²/2} ds ≤ 2 ∫_t^∞ (s/t) (1/√(2π)) e^{−s²/2} ds = (√2/(t√π)) e^{−t²/2}.
For t = 2.05 this is smaller than 0.05. Since p(1 − p) ≤ 1/4 we can take k = 5 > 2 · 2.05 ≥ 4√(p(1−p)) · 2.05 and then
P[ |120 · (1/900) ∑_{j=1}^{900} X_j − 10^{−4} a| > k ] < 0.05.
That is, a poll of only 900 voters determines the number of mandates to within 5 with 95% confidence.
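The 95% confidence claim can be checked by simulation. A sketch with a made-up value of a (the true party-A vote count is of course unknown in practice):

```python
import random

random.seed(2)

a = 120_000        # hypothetical party-A voters out of 1.2 million
p = a / 1_200_000  # = 0.1, so the true number of mandates is 12
true_mandates = 1e-4 * a

def poll_mandates(n=900):
    """Poll n voters; estimate mandates as 120 times the sample mean."""
    hits = sum(1 for _ in range(n) if random.random() < p)
    return 120 * hits / n

trials = 2_000
within_5 = sum(1 for _ in range(trials)
               if abs(poll_mandates() - true_mandates) <= 5) / trials
```

For p = 0.1 the standard deviation of the estimate is only about 1.2 mandates, so the fraction of polls within 5 mandates is in fact much larger than 0.95.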
Example 22.13. Let (X_n)_n be independent identically distributed random variables, and let (Y_n)_n be independent identically distributed random variables, such that E[X_n] = E[Y_n] for all n, and for all n:
Var[X_n] = Var[Y_n] = 1 and Cov(X_n, Y_n) = 1/2.
Let S = ∑_{j=1}^{10^4} X_j and T = ∑_{j=1}^{10^4} Y_j. Give a good approximation for P[S > T].
We have that
S − T = ∑_{j=1}^{10^4} (X_j − Y_j),
where (X_n − Y_n)_n are independent with mean 0 and
Var[X_j − Y_j] = Var[X_j] + Var[Y_j] − 2 Cov(X_j, Y_j) = 1.
So
P[S > T] = P[S − T > 0] = P[(1/100)(S − T) > 0] ≈ P[N(0, 1) > 0] = 1/2.
Example 22.14. A gambler plays a game repeatedly, where in each game he earns Geo(0.1) Shekel and loses Poi(10) Shekel, all wins and losses independent.
• What is a good approximation for the probability that he has more than 20 Shekel after 100 games?
• What is a good approximation for the probability that he has exactly 0 Shekel?
The amount of money after 100 games is
M := ∑_{j=1}^{100} (X_j − Y_j),
where X_j ∼ Geo(0.1) and Y_j ∼ Poi(10), all independent. Since E[X_j − Y_j] = 10 − 10 = 0 and since
Var[X_j − Y_j] = Var[X_j] + Var[Y_j] = 100 · 0.9 + 10 = 100,
we have, using the central limit theorem (so Var[M] = 10^4 and M/100 is approximately standard normal),
P[M > 20] = P[(1/100) M > 0.2] ≈ P[N(0, 1) > 0.2].
For P[M = 0] we cannot use the approximation P[N(0, 1) = 0] = 0! So we use the following trick: since M takes on only integer values, M = 0 if and only if M ∈ [−1/2, 1/2]. So,
P[M = 0] = P[(1/100) M ∈ [−1/200, 1/200]] ≈ P[−1/200 ≤ N(0, 1) ≤ 1/200].
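The continuity-corrected value is easy to evaluate explicitly; note that without the trick the naive answer would be 0. A sketch:

```python
import math

def normal_cdf(t):
    """P[N(0,1) <= t] via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# P[M = 0] ~ P[-1/200 <= N(0,1) <= 1/200]
approx = normal_cdf(1 / 200) - normal_cdf(-1 / 200)

# For such a short interval this is essentially
# (interval length) x (standard normal density at 0):
density_times_length = (1 / 100) * 1 / math.sqrt(2 * math.pi)
```

Both evaluate to roughly 0.004: a lattice random variable with standard deviation 100 spreads its mass over hundreds of integers, so each single value carries only a small probability.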
22.5. Lindeberg's Method

A "hands-on" proof of the CLT (that does not use the Levy Continuity Theorem) was given by Lindeberg (Jarl Lindeberg, 1876–1932). The main idea is: if Z_1, ..., Z_n are independent N(0, 1) random variables, then n^{−1/2}(Z_1 + ··· + Z_n) ∼ N(0, 1). So we can change X_1, X_2, ..., X_n into Z_1, ..., Z_n one by one, each time comparing the sums. Hopefully each of the n differences will not be large.

First, a technical lemma regarding C³ functions.

Lemma 22.15. For any C³ function φ and any x, y ∈ R,
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ min{ ||φ″||_∞ |y|², ||φ‴||_∞ |y|³ }.

Proof. For any y > 0, by Taylor's theorem (Brook Taylor, 1685–1731), expanding around x,
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ (1/6) ||φ‴||_∞ · y³.
The second order Taylor expansion
|φ(x + y) − φ(x) − φ′(x)y| ≤ (1/2) ||φ″||_∞ · y²,
with the triangle inequality gives that
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ |φ(x + y) − φ(x) − φ′(x)y| + (1/2) ||φ″||_∞ · y² ≤ ||φ″||_∞ · y².
If y < 0, apply the previous argument to the function φ(−x), and note that the absolute values of the derivatives of the two functions are always the same. □
Lemma 22.16. Let A, X, Z be independent random variables with E[A] = E[X] = E[Z] = 0 and E[X²] = E[Z²]. Then for any C³ function φ and any ε, δ > 0,
| E[φ(A + δX)] − E[φ(A + δZ)] | ≤ M_φ δ² · ( ε E[X²] + ε E[Z²] + E[Z² 1_{|Z|>εδ^{−1}}] + E[X² 1_{|X|>εδ^{−1}}] ),
where M_φ = max{ ||φ″||_∞, ||φ‴||_∞ }. (We gain a factor ε in the first terms, and an indicator 1_{|X|>εδ^{−1}} in the second.)

Proof. Set
Y = ( φ(A + δX) − φ(A) − φ′(A) · δX − (1/2)φ″(A) · δ²X² ) − ( φ(A + δZ) − φ(A) − φ′(A) · δZ − (1/2)φ″(A) · δ²Z² ).
Since E[X] = E[Z] and E[X²] = E[Z²], we get that
E[Y] = E[φ(A + δX)] − E[φ(A + δZ)].
Also, for Y_X = φ(A + δX) − φ(A) − φ′(A) · δX − (1/2)φ″(A) · δ²X², Lemma 22.15 gives
|E[Y_X]| = | E[Y_X 1_{|X|>εδ^{−1}}] + E[Y_X 1_{|X|≤εδ^{−1}}] |
≤ ||φ″||_∞ δ² E[X² 1_{|X|>εδ^{−1}}] + ||φ‴||_∞ δ³ E[|X|³ 1_{|X|≤εδ^{−1}}]
≤ ||φ″||_∞ δ² E[X² 1_{|X|>εδ^{−1}}] + ε ||φ‴||_∞ δ² E[X²].
The same bound holds for the corresponding term with Z, and the two bounds together give the claim. □
We are now ready to prove the Central Limit Theorem.

Theorem 22.17 (Central Limit Theorem). Let (X_n)_n be independent identically distributed random variables with E[X_k] = 0 and E[X_k²] = 1. Let S_n = ∑_{k=1}^n X_k and Z ∼ N(0, 1). Then, for any C³_b function φ,
E[φ(n^{−1/2} S_n)] → E[φ(Z)],
so consequently, n^{−1/2} S_n →^D Z.

Proof. Let (Z_n)_n be independent N(0, 1) random variables, also independent of (X_n)_n. Note that n^{−1/2} ∑_{k=1}^n Z_k ∼ N(0, 1).
Fix n and let 0 ≤ k ≤ n. Define
A_k = X_1 + ··· + X_k + Z_{k+1} + ··· + Z_n.
So S_n = A_n and n^{−1/2} A_0 ∼ N(0, 1). We can write the telescopic sum
E[φ(Z)] − E[φ(n^{−1/2} S_n)] = ∑_{k=0}^{n−1} ( E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})] ).
So it suffices to bound the increments |E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})]|.
For 0 ≤ k ≤ n − 1 let A = n^{−1/2}(X_1 + ··· + X_k + Z_{k+2} + ··· + Z_n). So n^{−1/2} A_k = A + n^{−1/2} Z_{k+1} and n^{−1/2} A_{k+1} = A + n^{−1/2} X_{k+1}. Thus, by Lemma 22.16, for any ε > 0,
| E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})] | ≤ M_φ · n^{−1} · ( 2ε + E[Z_{k+1}² 1_{|Z_{k+1}|>ε√n}] + E[X_{k+1}² 1_{|X_{k+1}|>ε√n}] ).
Summing over k we get, for any ε > 0,
| E[φ(Z)] − E[φ(n^{−1/2} S_n)] | ≤ M_φ · ( 2ε + E[Z² 1_{|Z|>ε√n}] + E[X² 1_{|X|>ε√n}] ).
Taking n → ∞, and recalling that ε > 0 was arbitrary, it suffices to show that for any random variable X with finite variance, the quantity E[X² 1_{|X|>ε√n}] → 0.
Indeed, the sequence (X² 1_{|X|≤ε√n})_n is monotonically increasing to X². So monotone convergence implies that
E[X² 1_{|X|>ε√n}] = E[X²] − E[X² 1_{|X|≤ε√n}] → 0. □
22.6. Characteristic Function

The characteristic function is actually just the Fourier transform of a random variable.

22.6.1. Preliminaries. Let (X, Y) be a two-dimensional random variable. We can think of (X, Y) as a complex valued random variable and write (X, Y) = X + iY. Define the expectation of a complex valued random variable to be E[X + iY] = E[X] + i E[Y], as long as these expectations are all finite.
Note that if g : R → C is a measurable function, then we can write g = Re g + i Im g, and these are also measurable. Moreover, if X, Y are independent, and g, h : R → C are measurable, then g(X), h(Y) are independent, since Re g(X), Im g(X) are independent of Re h(Y), Im h(Y).

Definition 22.18. Let X be a random variable. Define a function φ_X : R → C by
φ_X(t) := E[e^{itX}] = E[cos(tX)] + i E[sin(tX)].
φ_X is called the characteristic function of X.

Note that φ_X is always defined, since E[|cos(tX)|] ≤ 1 and E[|sin(tX)|] ≤ 1. Moreover, we get that (as a complex number) |φ_X(t)| ≤ 1.
Note that if X is discrete with density f_X then
φ_X(t) = ∑_r f_X(r) e^{itr}.
If X is absolutely continuous with density f_X then
φ_X(t) = ∫_{−∞}^∞ e^{its} f_X(s) ds.
(This latter object is the Fourier transform of f_X.)
Example 22.19. Let's calculate some characteristic functions:
• X ∼ Ber(p):
φ_X(t) = p e^{it} + (1 − p) = 1 − p(1 − e^{it}).
• Z = X + Y for X, Y independent:
φ_Z(t) = E[e^{itX} e^{itY}] = E[e^{itX}] E[e^{itY}] = φ_X(t) · φ_Y(t).
• Y = cX for some real c ∈ R:
φ_Y(t) = E[e^{itcX}] = φ_X(tc).
• X ∼ Bin(n, p): We can write X = ∑_{k=1}^n X_k where (X_k)_{k=1}^n are independent Ber(p). Thus,
φ_X(t) = ∏_{k=1}^n φ_{X_k}(t) = (1 − p(1 − e^{it}))^n.
• X ∼ Geo(p):
φ_X(t) = ∑_{k=1}^∞ (1 − p)^{k−1} p e^{itk} = p e^{it} / (1 − (1 − p) e^{it}).
• X ∼ Poi(λ):
φ_X(t) = ∑_{k=0}^∞ e^{−λ} (λ^k/k!) e^{itk} = e^{−λ(1 − e^{it})}.
• X ∼ U[a, b]:
φ_X(t) = ∫_{−∞}^∞ e^{its} f_X(s) ds = (1/(b − a)) ∫_a^b e^{its} ds = (e^{itb} − e^{ita}) / (it(b − a)).
• X ∼ Exp(λ):
φ_X(t) = ∫_0^∞ λ e^{−λs} e^{its} ds = λ/(λ − it).
• X ∼ N(µ, σ): Write Y = (X − µ)/σ. So Y ∼ N(0, 1). We use the fact that s ↦ sin(ts) e^{−s²/2} is an odd function, so its integral over R is 0. Since e^{its} = cos(ts) + i sin(ts), we get that
φ_Y(t) = ∫_{−∞}^∞ (1/√(2π)) e^{−s²/2} e^{its} ds = ∫_{−∞}^∞ (1/√(2π)) e^{−s²/2} cos(ts) ds.
Let χ_s(t) = e^{−s²/2} cos(ts). Since |χ_s(t)| ≤ 1, since χ′_s(t) = −e^{−s²/2} s sin(ts) is continuous in t for every s, since for any t
∫_R ∫_{−ε}^ε |χ′_s(t + η)| dη ds ≤ 2ε ∫_R |s| e^{−s²/2} ds < ∞,
and since
∫_R χ′_s(t) ds = −∫_{−∞}^∞ sin(ts) · s e^{−s²/2} ds = sin(ts) e^{−s²/2} |_{−∞}^∞ − ∫_{−∞}^∞ t cos(ts) e^{−s²/2} ds = −t ∫_R χ_s(t) ds,
which is continuous in t, we can differentiate φ_Y under the integral to get
φ′_Y(t) = (1/√(2π)) ∫_R χ′_s(t) ds = −t · (1/√(2π)) ∫_R χ_s(t) ds = −t φ_Y(t).
Dividing by φ_Y we get that
(d/dt)(log φ_Y(t)) = φ′_Y(t)/φ_Y(t) = −t.
Integrating, log φ_Y(t) = −t²/2 + C, and since φ_Y(0) = 1 we get that C = 0 and φ_Y(t) = e^{−t²/2}. This completes the calculation for Y ∼ N(0, 1). For X = σY + µ we get
φ_X(t) = E[e^{itσY}] e^{itµ} = e^{itµ} φ_{σY}(t) = e^{itµ} φ_Y(σt) = e^{itµ} e^{−σ²t²/2}.
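These closed forms are easy to sanity-check numerically. A sketch comparing the defining series for the Poi(λ) characteristic function with the closed form e^{−λ(1−e^{it})}, using Python's complex arithmetic:

```python
import cmath
import math

lam, t = 3.0, 0.7

# Direct series:  phi_X(t) = sum_k e^{-lam} * lam^k / k! * e^{itk}
series = sum(math.exp(-lam) * lam**k / math.factorial(k) * cmath.exp(1j * t * k)
             for k in range(100))

# Closed form computed above:
closed = cmath.exp(-lam * (1 - cmath.exp(1j * t)))
```

Truncating the series at k = 100 is harmless here, since the Poisson(3) tail beyond 100 is astronomically small.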
Lemma 22.20. Let X be a random variable with E[X] = 0 and E[X²] = 1. Then, as t → 0,
φ_X(t) = 1 − t²/2 + o(t²).

Proof. As in Lemma 22.15, we get that for any x ∈ R and t > 0,
| e^{itx} − 1 − itx + (t²/2) x² | ≤ min{ t²x², t³|x|³ }.
Thus, for any ε > 0, using E[X] = 0 and E[X²] = 1,
| E[e^{itX}] − (1 − t²/2) | ≤ E[ min{ t²X², t³|X|³ } ] ≤ t² E[X² 1_{|X|>εt^{−1}}] + ε t².
We have already seen that for 0 < η < 1, as t → 0, E[X² 1_{|X|>t^{η−1}}] → 0 by monotone convergence. So choosing ε = t^η, as t → 0,
φ_X(t) = 1 − t²/2 + o(t²). □
22.7. Levy's Continuity Theorem

Perhaps the main theorem using characteristic functions is a theorem of Paul Levy (1886–1971), giving a necessary and sufficient condition for convergence in distribution.

Theorem 22.21 (Levy's Continuity Theorem). Let (X_n)_n be a sequence of random variables, and let X be another random variable. Then, X_n →^D X if and only if φ_{X_n}(t) → φ_X(t) for all t ∈ R.

22.7.1. Proof of CLT. We can use Levy's Continuity Theorem to give a different proof of the Central Limit Theorem.

Proof of Theorem 22.6. Assume the X_n are identically distributed, and let φ(t) = φ_{X_1}(t) be their common characteristic function. Note that
φ_{S_N/√N}(t) = φ(t/√N)^N.
By Lemma 22.20, as N → ∞,
φ(t/√N) = 1 − t²/(2N) + o(t² N^{−1}).
Thus,
lim_{N→∞} φ(t/√N)^N = e^{−t²/2}.
So φ_{S_N/√N}(t) → e^{−t²/2}, which is the characteristic function of a N(0, 1) random variable. By Levy's Continuity Theorem we get that S_N/√N →^D N(0, 1). □
22.8. An Explicit Approximation of the Step Function

In this section we construct a function ψ which is C³_b, takes the value 1 on (−∞, 0], is decreasing on [0, 1], and takes the value 0 on [1, ∞).
Some consideration leads us to understand that we would like ψ′ < 0 on (0, 1), and so we search for a function such that ψ″(x) = −ψ″(1 − x) on [0, 1], vanishes at the endpoints 0, 1, and obtains a local minimum or maximum at the endpoints 0, 1. Some thought leads to the function g(x) = x²(1 − x)²(1 − 2x). Integrating and taking a derivative we have that for
h(x) = x⁴/12 − x⁵/5 + x⁶/6 − x⁷/21,
the derivatives on (0, 1) satisfy h′(x) = (1/3) x³(1 − x)³, h″(x) = g(x) and h‴(x) = g′(x) = 2x − 12x² + 20x³ − 10x⁴.
Set A = 420 = 7 · 5 · 3 · 4. Because h(1) = A^{−1} and h(0) = 0 we take
ψ(x) = 1 − A · h(x) for x ∈ [0, 1], ψ(x) = 1 for x < 0, ψ(x) = 0 for x > 1.
We have that on (0, 1), ψ′ = −A h′ < 0; also ψ″ = −A h″ and ψ‴ = −A h‴. These three derivatives all vanish at 0 and 1, so ψ is C³_b.
Figure 6. The second derivative −20 · g(x) and the function 1−A · h(x).
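The arithmetic behind the construction (h(1) = 1/420 and the vanishing of h′, h″, h‴ at the endpoints) can be verified numerically; a small sketch:

```python
def h(x):
    return x**4 / 12 - x**5 / 5 + x**6 / 6 - x**7 / 21

def h1(x):  # h'
    return x**3 * (1 - x)**3 / 3

def h2(x):  # h'' = g
    return x**2 * (1 - x)**2 * (1 - 2 * x)

def h3(x):  # h''' = g'
    return 2 * x - 12 * x**2 + 20 * x**3 - 10 * x**4

A = 420
# psi(1) = 1 - A*h(1) should be 0, so psi glues continuously to 0 on [1, oo):
psi_at_1 = 1 - A * h(1)

# All three derivatives vanish at both endpoints, so psi is C^3:
endpoint_derivatives = [f(x) for f in (h1, h2, h3) for x in (0.0, 1.0)]
```

The endpoint values come out exactly zero, and psi_at_1 is zero up to floating-point rounding.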
Lecture 23
23.1. Concentration of Measure

Suppose (X_n)_n is a sequence of i.i.d. random variables with P[X_n = 1] = P[X_n = −1] = 1/2. Let
S_n = ∑_{k=1}^n X_k.
So E[S_n] = 0 and Var[S_n] = E[S_n²] = n.
Chebyshev's inequality tells us that
P[|S_n| ≥ λ√n] ≤ λ^{−2}.
However, the central limit theorem gives that for large n,
P[|S_n| ≥ λ√n] ≈ P[|N(0, 1)| ≥ λ] = ∫_λ^∞ (1/√(2π)) e^{−s²/2} ds + ∫_{−∞}^{−λ} (1/√(2π)) e^{−s²/2} ds
≤ √(2/π) · λ^{−1} ∫_λ^∞ s e^{−s²/2} ds = √(2/π) λ^{−1} · (−e^{−s²/2}) |_λ^∞ = √(2/π) λ^{−1} e^{−λ²/2}.
This decays much faster than λ^{−2}. However, we don't know how good our approximation by a standard normal is.
Because (X_n)_n are independent, we can obtain a very nice concentration result using a smart trick by Bernstein (Sergei Bernstein, 1880–1968).
Theorem 23.1 (Bernstein's Inequality). Let (X_n)_n be independent random variables such that for all n, |X_n| ≤ 1 a.s. and E[X_n] = 0. Let S_n = ∑_{k=1}^n X_k. Then for any λ > 0,
P[S_n ≥ λ] ≤ exp( −λ²/(2n) ),
and consequently,
P[|S_n| ≥ λ] ≤ 2 exp( −λ²/(2n) ).

Proof. There are two main ideas:
The first idea is that for a random variable X with E[X] = 0 and |X| ≤ 1 a.s. one has E[e^{αX}] ≤ e^{α²/2}. Indeed, g(x) = e^{αx} is a convex function, so for |x| ≤ 1 we can write x = β · 1 + (1 − β) · (−1), where β = (x + 1)/2, and
e^{αx} ≤ β e^α + (1 − β) e^{−α} = cosh(α) + x sinh(α).
(Here 2 cosh(α) = e^α + e^{−α} and 2 sinh(α) = e^α − e^{−α}.) Thus, because E[X] = 0, and using (2k)! ≥ 2^k k!,
E[e^{αX}] ≤ cosh(α) + E[X] sinh(α) = cosh(α) = ∑_{k=0}^∞ α^{2k}/(2k)! ≤ ∑_{k=0}^∞ α^{2k}/(2^k k!) = e^{α²/2}.
For the second idea, due to Sergei Bernstein, one applies the Chebyshev / Markov inequality to the non-negative random variable e^{αX}, and then optimizes over α.
In our case, using independence,
E[e^{αS_n}] = ∏_{k=1}^n E[e^{αX_k}] ≤ e^{α²n/2}.
Now apply Markov's inequality to the non-negative random variable e^{αS_n} to get, for any α > 0,
P[S_n ≥ λ] = P[e^{αS_n} ≥ e^{αλ}] ≤ exp( (1/2) n α² − αλ ).
Optimizing over α, we get for α = λ/n,
P[S_n ≥ λ] ≤ exp( −λ²/(2n) ). □
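A quick empirical comparison of the Bernstein bound with the true tail, for ±1 steps; the values of n and λ below are arbitrary choices for illustration:

```python
import math
import random

random.seed(3)

n, lam, trials = 200, 20.0, 5_000

def S_n():
    """Sum of n independent +/-1 steps: mean 0 and |X_k| <= 1."""
    return sum(random.choice((-1, 1)) for _ in range(n))

empirical = sum(1 for _ in range(trials) if S_n() >= lam) / trials
bernstein = math.exp(-lam * lam / (2 * n))  # = e^{-1}, about 0.368
```

The bound always holds, but it is not tight: for these parameters the true tail is roughly 0.09, well below e⁻¹.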
Example 23.2. Suppose a gambler plays a game repeatedly and independently, such that with probability p < 1/2 he earns one dollar, and he loses one dollar with probability 1 − p.
For every n let X_n be the amount he won in the n-th game. So S_n = ∑_{k=1}^n X_k is the total amount he has won after n games.
Note that E[X_k] = 2p − 1 < 0, so Y_k = (1/2)(X_k − (2p − 1)) has expectation 0 and |Y_k| ≤ 1. By Bernstein's inequality,
P[S_n > 0] = P[ ∑_{k=1}^n Y_k > (n/2)(1 − 2p) ] ≤ exp( −n(1 − 2p)²/8 ).
So the probability of being ahead is very small, even if the house has only a very small advantage.
Lecture 24
24.1. Review

Exercise 24.1. We have N urns. Urn number k contains k white balls and N − k black balls.
An urn is chosen randomly, all urns equally likely. Then, we start removing balls from that urn, all balls equally likely, all draws independent, returning each removed ball to the urn after noting its color. Let A be the event that the first n balls removed are white. Let B be the event that the (n + 1)-th ball removed is black.
Calculate P[A], P[B ∩ A], P[B|A].
Now calculate the same if the process is that each time a random urn is chosen independently, and then a ball is removed from that urn, noted, and returned.

Solution to Exercise 24.1. For the first scenario, let C_k be the event that the k-th urn was chosen. Independence gives us that
P[A|C_k] = (k/N)^n and P[B ∩ A|C_k] = (k/N)^n · (N − k)/N.
Thus, by the law of total probability,
P[A] = (1/N^{n+1}) ∑_{k=1}^N k^n and P[B ∩ A] = (1/N^{n+2}) ∑_{k=1}^N k^n (N − k).
Thus,
P[B|A] = 1 − (1/N) · ( ∑_{k=1}^N k^{n+1} ) / ( ∑_{k=1}^N k^n ).
In the second scenario, the probability that a white ball is chosen each time is
(1/N) ∑_{k=1}^N k/N = (1/N²) · (N(N + 1)/2) = (N + 1)/(2N) =: p.
All draws are independent, so
P[A] = p^n, P[B ∩ A] = p^n (1 − p), P[B|A] = 1 − p. □
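The closed form for P[B|A] in the first scenario can be cross-checked against the law-of-total-probability sums directly; a sketch for small N and n:

```python
N, n = 10, 3

# Law of total probability, conditioning on the chosen urn:
P_A = sum((k / N)**n for k in range(1, N + 1)) / N
P_BA = sum((k / N)**n * (N - k) / N for k in range(1, N + 1)) / N
P_B_given_A = P_BA / P_A

# Closed form from the solution: 1 - (1/N) * (sum k^{n+1}) / (sum k^n)
formula = 1 - (sum(k**(n + 1) for k in range(1, N + 1))
               / (N * sum(k**n for k in range(1, N + 1))))
```

Note that P[B|A] is much smaller than the unconditional probability of a black ball: a run of white draws makes the high-index (mostly white) urns more likely.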
Exercise 24.2. Random variables X, Y are equally distributed if F_X = F_Y. Show that if X, Y are equally distributed then E[X] = E[Y] if the expectations exist, and E[X] does not exist if and only if E[Y] does not exist. (Hint: start with simple random variables, go through non-negative, continue to general. First show that P_X = P_Y.)

Solution to Exercise 24.2. The first observation is that P[X ∈ B] = P[Y ∈ B] for any Borel set B ∈ B. Indeed, the probability measures P_X, P_Y agree on the π-system { (−∞, t] : t ∈ R } by definition, and this π-system generates B, so P_X, P_Y agree on all of B.
Now, note that if g : R → R is a measurable function and X, Y are equally distributed, then so are g(X), g(Y). This is because
P[g(X) ≤ t] = P[X ∈ g^{−1}((−∞, t])] = P[Y ∈ g^{−1}((−∞, t])] = P[g(Y) ≤ t].
The next step is to prove the claim for simple random variables: if X, Y are simple and equally distributed then
P[X = r] = F_X(r) − F_X(r−) = F_Y(r) − F_Y(r−) = P[Y = r].
Let R be a (common) range for X and Y. Linearity of expectation now gives us that
E[X] = ∑_{r∈R} r P[X = r] = ∑_{r∈R} r P[Y = r] = E[Y].
Now assume that X, Y ≥ 0. Let g_n(r) = min{ 2^{−n} ⌊2^n r⌋, n }. So 0 ≤ g_n(X) ↗ X and 0 ≤ g_n(Y) ↗ Y. Monotone convergence gives that
E[X] = lim_n E[g_n(X)] = lim_n E[g_n(Y)] = E[Y],
where we have used the fact that g_n(X), g_n(Y) are equally distributed simple random variables for all n. Note that the above also holds if E[X] = E[Y] = ∞.
Finally, for general equally distributed X, Y: X⁺, Y⁺ are equally distributed and non-negative. Thus, E[X⁺] = E[Y⁺]. Similarly, E[X⁻] = E[Y⁻]. Thus,
E[X] = E[X⁺] − E[X⁻] = E[Y⁺] − E[Y⁻] = E[Y],
if at least one of E[X⁺] = E[Y⁺] or E[X⁻] = E[Y⁻] is finite. If both are infinite, then both E[X] and E[Y] do not exist. □
Exercise 24.3. X ∼ U[−1, 1]. Calculate:
• P[|X| > 1/2].
• P[sin(πX/2) > 1/2].

Solution to Exercise 24.3. Note that {|X| > 1/2} = {X > 1/2} ⊎ {X < −1/2}. So,
P[|X| > 1/2] = P[X > 1/2] + P[X < −1/2] = 1/4 + 1/4 = 1/2.
Since sin is increasing on [−π/2, π/2] and sin(π/6) = 1/2, we get that
P[sin(πX/2) > 1/2] = P[πX/2 > π/6] = P[X > 1/3] = 1/3. □
Exercise 24.4. B, C are independent and B ∼ Exp(λ), C ∼ U[0, 1]. What is the probability that the equation x² + 2Bx + C = 0 has two different real solutions?

Solution to Exercise 24.4. We are looking for the probability that the discriminant is positive:
P[4B² − 4C > 0] = P[B² > C].
For any t > 0,
P[B² ≤ t] = P[B ≤ √t] = ∫_0^{√t} λ e^{−λs} ds = ∫_0^t (1/(2√u)) λ e^{−λ√u} du,
so B² is absolutely continuous with this density. Note that for all s > 0,
∫_s^∞ f_{B²}(u) du = ∫_s^∞ (1/(2√u)) λ e^{−λ√u} du = ∫_{√s}^∞ λ e^{−λt} dt = e^{−λ√s}.
Since B², C are independent, we have that
f_{B²|C}(t|s) = f_{B²}(t).
Thus,
P[B² > C] = ∫_0^1 ∫_s^∞ f_{B²,C}(t, s) dt ds = ∫_0^1 ∫_s^∞ f_{B²|C}(t|s) dt ds = ∫_0^1 e^{−λ√s} ds
= 2 ∫_0^1 u e^{−λu} du = −(2u/λ) e^{−λu} |_0^1 + (2/λ) ∫_0^1 e^{−λu} du
= −(2/λ) e^{−λ} − (2/λ²)(e^{−λ} − 1) = (2/λ²) · (1 − (1 + λ) e^{−λ}).
We now have a non-trivial inequality: (2/λ²) · (1 − (1 + λ) e^{−λ}) ≤ 1, which is equivalent to
e^λ − (1 + λ) ≤ (λ²/2) · e^λ.
This indeed holds, as 2 · k! ≤ (k + 2)! for all k and
e^λ − (1 + λ) = ∑_{k=0}^∞ λ^{k+2}/(k + 2)! ≤ ∑_{k=0}^∞ (λ^k/k!) · (λ²/2). □
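The closed form (2/λ²)(1 − (1 + λ)e^{−λ}) can be checked by Monte Carlo; a sketch with λ = 2 (random.expovariate takes the rate λ, matching the Exp(λ) density λe^{−λs} used here):

```python
import math
import random

random.seed(4)

lam = 2.0
closed_form = (2 / lam**2) * (1 - (1 + lam) * math.exp(-lam))

trials = 200_000
hits = sum(1 for _ in range(trials)
           if random.expovariate(lam)**2 > random.random())
monte_carlo = hits / trials
```

For λ = 2 the closed form is about 0.297, and the Monte Carlo estimate agrees to within sampling error of order 1/√trials.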
Exercise 24.5. Let (X, Y) be jointly absolutely continuous with
f_Y(s) = C(s³ − s − 1) for s ∈ [2, 4], and f_Y(s) = 0 otherwise,
and for all s ∈ [2, 4],
f_{X|Y}(t|s) = C(s) cos( (π/(2s)) t ) for t ∈ [0, s], and f_{X|Y}(t|s) = 0 otherwise.
Calculate:
• C and C(s) for all s ∈ [2, 4].
• E[Y], E[X|Y = s], E[X].
• Var[Y], Var[X].
• Cov(X, Y) and Var[X + Y].

Solution to Exercise 24.5. For C:
1 = C ∫_2^4 (s³ − s − 1) ds = C · ( 4⁴/4 − 4²/2 − 4 − 2⁴/4 + 2²/2 + 2 ) = C · (60 − 6 − 2) = C · 52,
so C = 1/52. For C(s):
1 = C(s) ∫_0^s cos( (π/(2s)) t ) dt = C(s) · (2s/π) sin( (π/(2s)) t ) |_0^s = C(s) · 2s/π,
so C(s) = π/(2s).
We also have
E[Y] = 52^{−1} ∫_2^4 (s⁴ − s² − s) ds = (4⁵ − 2⁵)/(5 · 52) − (4³ − 2³)/(3 · 52) − (4² − 2²)/(2 · 52) = 5212/(30 · 52).
For s ∈ [2, 4], using the fact that (x sin x + cos x)′ = sin x + x cos x − sin x = x cos x, we have
E[X|Y = s] = C(s) ∫_0^s t cos(C(s) t) dt = C(s)^{−1} ∫_0^{π/2} t cos t dt = C(s)^{−1} · (π/2 − 1) = (1 − 2/π) · s.
To calculate E[X] we use the fact that f_{X,Y}(t, s) = f_Y(s) f_{X|Y}(t|s) for all s ∈ [2, 4], and f_{X,Y}(t, s) = 0 otherwise. Thus, integrating the function (t, s) ↦ t,
E[X] = ∫_2^4 f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = (1 − 2/π) ∫_2^4 s f_Y(s) ds = (1 − 2/π) · E[Y].
Now, similarly,
E[XY] = ∫_2^4 s f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = (1 − 2/π) · E[Y²],
so we get that
Cov(X, Y) = (1 − 2/π) · E[Y²] − (1 − 2/π) · E[Y] · E[Y] = (1 − 2/π) · Var[Y]. □
Exercise 24.6. A chicken lays an egg of random size. The size of the egg has the distribution of |N| where N ∼ N(0, 5). For any s > 0, given that an egg is of size s, the time until the egg hatches is distributed Exp(s^{−2}). What is the expected time for an egg to hatch?

Solution to Exercise 24.6. Let S be the size and T the time. We are given that S ∼ |N(0, 5)|. That is, for any s > 0,
P[S ≤ s] = P[N(0, 5) ∈ [−s, s]] = ∫_{−s}^s (1/(√(2π) · 5)) e^{−t²/50} dt = 2 ∫_0^s (1/(√(2π) · 5)) e^{−t²/50} dt.
So S is absolutely continuous with density
f_S(s) = (√2/(5√π)) e^{−s²/50} for s > 0, and f_S(s) = 0 for s ≤ 0.
So we can calculate the expectation of T:
E[T] = ∫∫ t f_{T,S}(t, s) dt ds = ∫_0^∞ f_S(s) ∫_0^∞ t s^{−2} e^{−s^{−2} t} dt ds = ∫_0^∞ s² f_S(s) ds = E[S²] = E[N(0, 5)²] = 25. □
Exercise 24.7. Let X be a discrete random variable with range Q. Suppose that E[X] = µ, E[X²] = σ and E[X³] = ρ. Let Y be a discrete random variable defined by Y | X = q ∼ Poi(q²).
Calculate Cov(X, Y).

Solution to Exercise 24.7. Note that for any q ∈ Q,
E[Y | X = q] = ∑_{k=0}^∞ k e^{−q²} q^{2k}/k! = q².
Note that for q ∈ Q and k ∈ N,
f_{Y,X}(k, q) = f_X(q) e^{−q²} q^{2k}/k!.
So
E[XY] = ∑_{q∈Q} ∑_{k=0}^∞ q k f_{Y,X}(k, q) = ∑_{q∈Q} q³ f_X(q) = ρ,
and
E[Y] = ∑_{q∈Q} ∑_{k=0}^∞ k f_{Y,X}(k, q) = ∑_{q∈Q} q² f_X(q) = σ.
Thus, Cov(X, Y) = E[XY] − E[X] E[Y] = ρ − σµ. □