INTRODUCTION TO PROBABILITY
ARIEL YADIN
Course: 201.1.8001 Fall 2013-2014
Lecture notes updated: December 24, 2014
Contents
Lectures 1–24.
Lecture 1
1.1. Example: Bertrand's Paradox
We begin with an example [this is known as Bertrand's paradox, after Joseph Louis François Bertrand (1822–1900)].
Question 1.1. Consider a circle of radius 1, and an equilateral triangle inscribed in the circle, say ABC. (The side length of such a triangle is √3.) Let M be a randomly chosen chord in the circle.
What is the probability that the length of M is greater than the length of a side of the triangle (i.e. √3)?
Solution 1. How to choose a random chord M?
One way is to choose a random angle, let r be the radius at that angle, and let M be the unique chord perpendicular to r. Let x be the intersection point of M with r. (See Figure 1, left.)
Because of symmetry, we can rotate the triangle so that the chord AB is perpendicular to r. Since the side AB intersects the radius r at distance 1/2 from the center 0, M is longer than AB if and only if x is at distance at most 1/2 from 0.
r has length 1, so the probability that x is at distance at most 1/2 is 1/2. □
Solution 2. Another way: Choose two random points x, y on the circle, and let M be
the chord connecting them. (See Figure 1, right.)
Because of symmetry, we can rotate the triangle so that x coincides with the vertex
A of the triangle. So we can see that y falls in the arc BC on the circle if and only if
M is longer than a side of the triangle.
Figure 1. How to choose a random chord.
The probability of this is the length of the arc BC over 2π. That is, 1/3 (the arc BC
is one-third of the circle). □
Solution 3. A different way to choose a random chord M : Choose a random point x in
the circle, and let r be the radius through x. Then choose M to be the chord through
x perpendicular to r. (See Figure 1, left.)
Again we can rotate the triangle so that r is perpendicular to the chord AB.
Then, M will be longer than AB if and only if x lands inside the triangle; that is if
and only if the distance of x to 0 is at most 1/2.
Since the area of a disc of radius 1/2 is 1/4 of the disc of radius 1, this happens with
probability 1/4. □
How did we reach this paradox?
The original question was ill-posed: we did not define precisely what a random chord is.
The different solutions come from different ways of choosing a random chord; these are not the same.
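Although a simulation is not part of these notes, the three chord-sampling schemes above are easy to compare numerically. The following Python sketch (an illustration; the function names are ours) estimates the probability that a random chord is longer than √3 under each scheme; the three estimates approach 1/2, 1/3 and 1/4 respectively.

```python
import math
import random

def chord_length_angle():
    # Solution 1: pick the chord's distance from the center uniformly in [0, 1].
    x = random.random()
    return 2 * math.sqrt(1 - x * x)

def chord_length_endpoints():
    # Solution 2: pick two uniform points on the circle and connect them.
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_length_midpoint():
    # Solution 3: pick a uniform point in the disc as the chord's midpoint.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        d2 = x * x + y * y
        if d2 <= 1:
            return 2 * math.sqrt(1 - d2)

def estimate(sampler, n=100_000):
    side = math.sqrt(3)  # side length of the inscribed equilateral triangle
    return sum(sampler() > side for _ in range(n)) / n

random.seed(0)
for sampler in (chord_length_angle, chord_length_endpoints, chord_length_midpoint):
    print(sampler.__name__, round(estimate(sampler), 2))
```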
We now turn to precisely defining what we mean by “random”, “probability”, etc.
1.2. Sample spaces
When we do some experiment, we first want to collect all possible outcomes of the
experiment. In mathematics, a collection of objects is just a set.
The set of all possible outcomes of an experiment is called a sample space. Usually
we denote the sample space by Ω, and its elements by ω.
Example 1.2.
• A coin toss. Ω = {H, T}. Actually, it is the set of heads and tails, but H, T are easier to write.
• Tossing a coin twice. Ω = {H, T}^2. What about tossing a coin k times?
• Throwing two dice. Ω is the set of pairs of die faces; it is probably easier to use {1, 2, 3, 4, 5, 6}^2.
• The lifetime of a person. Ω = R_+. What if we only count the years? What if we want years and months?
• Bows and arrows: shooting an arrow into a target of radius 1/2 meters. Ω = {(x, y) : x^2 + y^2 ≤ 1/4}. What about Ω = {(r, θ) : r ∈ [0, 1/2], θ ∈ [0, 2π)}? What if the arrow tip has a thickness of 1 cm, so we do not care about the point up to radius 1 cm? What about missing the target altogether?
• A random real-valued continuously differentiable function on [0, 1]. Ω = C^1([0, 1]).
• Noga tosses a coin. If it comes out heads she goes to the probability lecture, and either takes notes or sleeps throughout. If the coin comes out tails, she goes running, and she records the distance she ran in meters and the time it took. Ω = ({H} × {notes, sleep}) ∪ ({T} × N × R_+). ♦
1.3. Events
Suppose we toss a die. So the sample space is, say, Ω = {1, 2, 3, 4, 5, 6}. We want to encode the outcome "the number on the die is even".
This can be done by collecting together all the possible outcomes relevant to this event; that is, a sub-collection of outcomes, or a subset of Ω. The event "the number on the die is even" corresponds to the subset {2, 4, 6} ⊂ Ω.
What do we want to require from events? We want the following properties:
• "Everything" is an event; that is, Ω is an event.
• If we can ask whether an event occurred, then we can also ask whether it did not occur; that is, if A is an event, then so is A^c.
• If we can ask whether each of many events has occurred, then we can also ask whether at least one of them has occurred; that is, if (A_n)_{n∈N} is a sequence of events, then so is ⋃_n A_n.
A word on notation: A^c = Ω \ A. One needs to be careful in which space we are taking the complement. Some books write Ā for A^c.
Thus, if we want to say what events are, we have: if F is the collection of events on a sample space Ω, then F has the following properties:
• The elements of F are subsets of Ω (i.e. F ⊂ P(Ω)).
• Ω ∈ F.
• If A ∈ F then A^c ∈ F.
• If (A_n)_{n∈N} is a sequence of elements of F, then ⋃_n A_n ∈ F.
Definition 1.3. F with the properties above is called a σ-algebra (or σ-field).
[Explain the name: algebra, σ-algebra.]
Example 1.4. Let Ω be any set (sample space). Then,
• F = {∅, Ω}
• G = 2^Ω = P(Ω)
are both σ-algebras on Ω. ♦
When Ω is a countable sample space, we will always take the full σ-algebra 2^Ω. (We will worry about the uncountable case in the future.)
To sum up: For a countable sample space Ω, any subset A ⊂ Ω is an event.
Example 1.5.
• The experiment is tossing three coins. The event A is "second toss was heads". Ω = {H, T}^3, A = {(x, H, y) : x, y ∈ {H, T}}.
• The experiment is "how many minutes passed until a goal is scored in the Manchester derby". The event A is "the first goal was scored after minute 23 and before minute 45". Ω = {0, 1, . . . , 90} (maybe extra time?), A = {23, 24, . . . , 44}. ♦
Example 1.6. We are given a computer code of 4 letters. A_i is the event that the i-th letter is 'L'.
What are the following events?
• A_1 ∩ A_2 ∩ A_3 ∩ A_4
• A_1 ∪ A_2 ∪ A_3 ∪ A_4
• A_1^c ∩ A_2^c ∩ A_3^c ∩ A_4^c
• A_1^c ∪ A_2^c ∪ A_3^c ∪ A_4^c
• A_3 ∩ A_4
• A_1 ∩ A_2^c
What are the following events in terms of the A_i's?
• There are at least 3 L's
• There is exactly one L
• There are no two L's in a row
♦
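Such event algebra can be made concrete by brute-force enumeration. The sketch below (ours, for illustration, with a reduced two-letter alphabet {'L', 'X'}) builds each A_i as a set of outcomes and expresses "at least 3 L's" and "exactly one L" through intersections, unions and complements of the A_i.

```python
from itertools import product, combinations

alphabet = "LX"  # reduced alphabet for illustration; any alphabet works the same way
omega = set(product(alphabet, repeat=4))               # sample space: 4-letter codes
A = {i: {w for w in omega if w[i] == "L"} for i in range(4)}

def complement(event):
    # Complement is always taken inside the sample space omega.
    return omega - event

assert complement(A[0]) == {w for w in omega if w[0] != "L"}

# "At least 3 L's": union, over all 3-element index sets, of triple intersections.
at_least_3 = set().union(*(A[i] & A[j] & A[k]
                           for i, j, k in combinations(range(4), 3)))
assert at_least_3 == {w for w in omega if w.count("L") >= 3}

# "Exactly one L": A_i minus the other three events, unioned over i.
exactly_one = set().union(
    *(A[i] - (A[(i + 1) % 4] | A[(i + 2) % 4] | A[(i + 3) % 4]) for i in range(4)))
assert exactly_one == {w for w in omega if w.count("L") == 1}
```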
Example 1.7. Every day products on a production line are chosen and labeled ‘-’ if
damaged or ‘+’ if good. What are the complements of the following events?
• A = at least two products are damaged
• B = at most 3 products are damaged
• C = there are at least 3 more good products than damaged products
• D = there are no damaged products
• E = most of the products are damaged
To which of the above events does the string − − − + + − + + − − belong?
What are the events A ∩ B, A ∪ B, C ∩ E, B ∩ D, B ∪ D? ♦
Lecture 2
2.1. Probability
Remark 2.1. Two sets A, B are called disjoint if A ∩ B = ∅. For a sequence (A_n)_{n∈N}, we say that (A_n) are mutually disjoint if any two are disjoint; i.e. for any n ≠ m, A_n ∩ A_m = ∅.
What is probability? It is the assignment of a value to each event, between 0 and
1, saying how likely this event is. That is, if F is the collection of all events on some
sample space, then probability is a function from the events to [0, 1]; i.e. P : F → [0, 1].
Of course, not every such assignment is good, and we would like some consistency.
What would we like of P? For one thing, we would like that the likelihood of everything is 100%; that is, P(Ω) = 1. Also, we would like that if A, B ∈ F are two events that are disjoint, i.e. A ∩ B = ∅, then the likelihood that one of A, B occurs is the sum of their likelihoods; i.e. P(A ∪ B) = P(A) + P(B).
This leads to the following definition:
[Andrey Kolmogorov (1903–1987)]
Definition 2.2 (Probability measure). Let Ω be a sample space, and F a σ-algebra on
Ω. A probability measure on (Ω,F) is a function P : F → [0, 1] such that
• P(Ω) = 1.
• If (A_n) is a sequence of mutually disjoint events in F, then
P(⋃_n A_n) = Σ_n P(A_n).
(Ω,F ,P) is called a probability space.
Example 2.3.
• A fair coin toss: Ω = {H, T}, F = 2^Ω = {∅, {H}, {T}, Ω}. P(∅) = 0, P({H}) = P({T}) = 1/2, P(Ω) = 1.
• Unfair coin toss: P(∅) = 0, P({H}) = 3/4, P({T}) = 1/4, P(Ω) = 1.
• Fair die: Ω = {1, 2, . . . , 6}, F = 2^Ω. For A ⊂ Ω let P(A) = |A|/6.
• Let Ω be any countable set. Let F = 2^Ω. Let p : Ω → [0, 1] be a function such that Σ_{ω∈Ω} p(ω) = 1. Then, for any A ⊂ Ω define
P(A) = Σ_{ω∈A} p(ω).
P is a probability measure on (Ω, F).
• Let γ = Σ_{n≥1} n^{-2}. For A ⊂ {1, 2, . . .}, define
P(A) = (1/γ) · Σ_{a∈A} a^{-2}.
P is a probability measure.
♦
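The last item can be checked numerically. The sketch below (ours, for illustration) truncates the normalizing constant γ = Σ_{n≥1} n^{-2} = π²/6 and evaluates P on a few events.

```python
import math

# Truncated normalizing constant gamma = sum_{n>=1} n^{-2} = pi^2 / 6.
N = 10**6
gamma = sum(n ** -2 for n in range(1, N + 1))

def prob(A):
    """P(A) = (1/gamma) * sum_{a in A} a^{-2}, for a finite event A."""
    return sum(a ** -2 for a in A) / gamma

print(abs(gamma - math.pi ** 2 / 6) < 1e-5)  # truncation is close to pi^2/6
print(round(prob({1}), 3))                   # P({1}) = 6/pi^2 ≈ 0.608
```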
Some properties of probability measures:
Proposition 2.4. Let (Ω, F, P) be a probability space. Then,
(1) ∅ ∈ F and P(∅) = 0.
(2) If A_1, . . . , A_n are a finite number of mutually disjoint events, then
A := ⋃_{j=1}^n A_j ∈ F and P(A) = Σ_{j=1}^n P(A_j).
(3) For any event A ∈ F, P(A^c) = 1 − P(A).
(4) If A ⊂ B are events, then P(B \ A) = P(B) − P(A).
(5) If A ⊂ B are events, then P(A) ≤ P(B).
(6) (Inclusion-exclusion principle.) For events A, B ∈ F,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Let A, B, (A_n)_{n∈N} be events in F.
(1) ∅ = Ω^c ∈ F. Set A_j = ∅ for all j ∈ N. This is a sequence of mutually disjoint events. Since ∅ = ⋃_j A_j, we have that P(∅) = Σ_j P(∅), implying that P(∅) = 0.
(2) For j > n let A_j = ∅. Then (A_j)_{j∈N} is a sequence of mutually disjoint events, so
P(⋃_{j=1}^n A_j) = P(⋃_j A_j) = Σ_j P(A_j) = Σ_{j=1}^n P(A_j).
(3) A ∪ A^c = Ω and A ∩ A^c = ∅. So the additivity of P on disjoint sets gives that P(A) + P(A^c) = P(Ω) = 1.
(4) If A ⊂ B, then B = (B \ A) ∪ A, and (B \ A) ∩ A = ∅. Thus, P(B) = P(B \ A) + P(A).
(5) Since A ⊂ B, we have that P(B) − P(A) = P(B \ A) ≥ 0.
(6) Let C = A ∩ B. Note that A ∪ B = (A \ C) ∪ (B \ C) ∪ C, where all sets in the union are mutually disjoint. Since C ⊂ A and C ⊂ B, we get
P(A ∪ B) = P(A \ C) + P(B \ C) + P(C) = P(A) − P(C) + P(B) − P(C) + P(C) = P(A) + P(B) − P(C).
□
[George Boole (1815–1864)]
Proposition 2.5 (Boole's inequality). Let (Ω, F, P) be a probability space. If (A_n)_{n∈N} are events, then
P(⋃_{n∈N} A_n) ≤ Σ_{n∈N} P(A_n).
Proof. Let B_0 = ∅, and for every n ≥ 1 let
B_n = ⋃_{j=1}^n A_j and C_n = A_n \ B_{n−1}.
Claim 1. For any n > m ≥ 1, we have that C_n ∩ C_m = ∅.
Proof. Since m ≤ n − 1, we have that A_m ⊂ B_{n−1}. Since C_m ⊂ A_m,
C_n ∩ C_m ⊂ (A_n \ B_{n−1}) ∩ B_{n−1} = ∅.
Claim 2. ⋃_n A_n ⊂ ⋃_n C_n.
Proof. Let ω ∈ ⋃_n A_n. So there exists n such that ω ∈ A_n. Let m be the smallest integer such that ω ∈ A_m. Then ω ∈ A_m, and ω ∉ A_k for all 1 ≤ k < m. In other words: ω ∈ A_m and ω ∉ ⋃_{k=1}^{m−1} A_k = B_{m−1}. Or: ω ∈ A_m \ B_{m−1} = C_m. Thus, there exists some m ≥ 1 such that ω ∈ C_m; i.e. ω ∈ ⋃_n C_n.
We conclude that (C_n)_{n≥1} is a collection of events that are pairwise disjoint, and ⋃_n A_n ⊂ ⋃_n C_n. Using (5) from Proposition 2.4, and C_n ⊂ A_n,
P(⋃_n A_n) ≤ P(⋃_n C_n) = Σ_n P(C_n) ≤ Σ_n P(A_n).
□
Example 2.6 (Monty Hall paradox). In a famous game show, the contestant is given three boxes; two of them contain nothing, but one contains the keys to a new car. All placements of the keys are equally likely. The contestant chooses a box. Then the host reveals one of the other two boxes that does not contain anything, and the contestant is given a choice whether to switch her choice or not.
What should she do?
Let A = {an empty box is chosen in the first choice} and B = {the keys are in the other (unopened) box, so she should switch}. One checks that
B ∩ A^c = ∅ and A ∩ B^c = ∅,
so A = B. Since all key placements are equally likely, the probability the keys are in the chosen box is 1/3. That is, P(A^c) = 1/3, and so P(B) = P(A) = 2/3. ♦
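A quick Monte Carlo check (ours, not part of the original notes) confirms that switching wins with probability about 2/3, while staying wins with probability about 1/3.

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the win probability under a fixed switch/stay strategy."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)       # box with the keys
        choice = random.randrange(3)    # contestant's first choice
        # Host opens an empty box that is not the contestant's choice.
        opened = next(b for b in range(3) if b != choice and b != car)
        if switch:
            choice = next(b for b in range(3) if b != choice and b != opened)
        wins += (choice == car)
    return wins / trials

random.seed(0)
print(round(monty_hall(switch=True), 2), round(monty_hall(switch=False), 2))
```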
2.2. Discrete Probability Spaces
Discrete probability spaces are those for which the sample space is countable. We have already seen that in this case we can take all subsets to be events, so F = 2^Ω. We have also implicitly seen that due to additivity on disjoint sets, the probability measure P is completely determined by its values on singletons.
That is, let (Ω, 2^Ω, P) be a probability space where Ω is countable. If we denote p(ω) = P({ω}) for all ω ∈ Ω, then for any event A ⊂ Ω we have that A = ⋃_{ω∈A} {ω}, and this is a countable disjoint union. Thus,
P(A) = Σ_{ω∈A} P({ω}) = Σ_{ω∈A} p(ω).
Exercise 2.7. Let Ω = {1, 2, . . .}, and define a probability measure on (Ω, 2^Ω) by
(1) P({ω}) = 1/((e − 1) ω!).
(2) P′({ω}) = (1/3) · (3/4)^ω.
(Extend using additivity on disjoint unions.) Show that both uniquely define probability measures.
Solution. We need to show that P(Ω) = 1, and that P is additive on disjoint unions. Indeed, since Σ_{ω=1}^∞ 1/ω! = e − 1,
P(Ω) = Σ_{ω=1}^∞ P({ω}) = Σ_{ω=1}^∞ 1/((e − 1) ω!) = 1.
Similarly, summing the geometric series,
P′(Ω) = (1/3) · Σ_{ω=1}^∞ (3/4)^ω = (1/3) · (3/4)/(1 − 3/4) = 1.
Now let (A_n) be a sequence of mutually disjoint events and let A = ⋃_n A_n. Using the fact that any subset is the disjoint union of the singletons composing it, we have that ω ∈ A if and only if there exists a unique n(ω) such that ω ∈ A_{n(ω)}. Thus,
P(A) = Σ_{ω∈A} P({ω}) = Σ_n Σ_{ω∈A_n} P({ω}) = Σ_n P(A_n).
The same holds for P′. □
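Both normalizations can be checked numerically by truncating the infinite sums (an illustrative check, not part of the original notes):

```python
import math

# Truncated sums of the two densities on {1, 2, ...}; both should be (close to) 1.
p_total = sum(1 / ((math.e - 1) * math.factorial(w)) for w in range(1, 40))
q_total = sum((1 / 3) * (3 / 4) ** w for w in range(1, 400))
print(abs(p_total - 1) < 1e-9, abs(q_total - 1) < 1e-9)
```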
The next proposition generalizes the above examples, and characterizes all discrete
probability spaces.
Proposition 2.8. Let Ω be a countable set. Let p : Ω → [0, 1] be such that Σ_{ω∈Ω} p(ω) = 1. Then, there exists a unique probability measure P on (Ω, 2^Ω) such that P({ω}) = p(ω) for all ω ∈ Ω. (p as above is sometimes called the density of P.)
Proof. Let A ⊂ Ω. Define
P(A) = Σ_{ω∈A} p(ω).
We have by the assumption on p that P(Ω) = Σ_{ω∈Ω} p(ω) = 1. Also, if (A_n) is a sequence of events that are mutually disjoint, and A = ⋃_n A_n is their union, then, for any ω ∈ A there exists n such that ω ∈ A_n. Moreover, this n is unique, since A_n ∩ A_m = ∅ for all m ≠ n, so ω ∉ A_m for all m ≠ n. So
P(A) = Σ_{ω∈A} p(ω) = Σ_n Σ_{ω∈A_n} p(ω) = Σ_n P(A_n).
This shows that P is a probability measure on (Ω, 2^Ω).
For uniqueness, let Q : 2^Ω → [0, 1] be a probability measure such that Q({ω}) = p(ω) for all ω ∈ Ω. Then, for any A ⊂ Ω,
Q(A) = Σ_{ω∈A} Q({ω}) = Σ_{ω∈A} p(ω) = P(A).
□
Example 2.9.
(1) A simple but important example is that of finite sample spaces with equally likely outcomes. Let Ω be a finite set, and suppose that all outcomes are equally likely; that is, P({ω}) = 1/|Ω| for all ω ∈ Ω. Then, for any event A ⊂ Ω we have that P(A) = |A|/|Ω|. It is simple to verify that this is a probability measure. It is known as the uniform measure on Ω.
(2) We throw two fair dice. What is the probability the sum is 7? What is the probability the sum is 6?
Solution. The sample space here is Ω = {1, 2, . . . , 6}^2 = {(i, j) : 1 ≤ i, j ≤ 6}. Since we assume the dice are fair, all outcomes are equally likely, and so the probability measure is the uniform measure on Ω.
Now, the event that the sum is 7 is
A = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}.
So P(A) = |A|/|Ω| = 6/36 = 1/6.
The event that the sum is 6 is
B = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
So P(B) = 5/36.
(3) There are 15 balls in a jar, 7 black balls and 8 white balls. When removing a ball from the jar, any ball is equally likely to be removed. Shir puts her hand in the jar and takes out two balls, one after the other. What is the probability that Shir removes one black ball and one white ball?
Solution. First, we can think of the different balls as being represented by the numbers 1, 2, . . . , 15, say with 1, . . . , 7 black and 8, . . . , 15 white. So the sample space is Ω = {(i, j) : 1 ≤ i, j ≤ 15, i ≠ j}. Note that the size of Ω is |Ω| = 15 · 14. Since it is equally likely to remove any ball, the probability measure is the uniform measure on Ω.
Let A be the event that one ball is black and one is white. How can we compute P(A) = |A|/|Ω|? We can split A into two disjoint events, and use additivity of probability measures:
Let B be the event that the first ball is white and the second ball is black, and let B′ be the event that the first ball is black and the second ball is white. It is immediate that B and B′ are disjoint. Also, A = B ∪ B′. Thus, P(A) = P(B) + P(B′), and we only need to compute the sizes of B, B′.
Note that the number of pairs (i, j) such that ball i is black and ball j is white is 7 · 8. So |B′| = 7 · 8. Similarly, |B| = 8 · 7. All in all,
P(A) = P(B) + P(B′) = (7 · 8)/(15 · 14) + (8 · 7)/(15 · 14) = 8/15.
(4) A deck of 52 cards is shuffled, so that any order is equally likely. The top 10 cards are distributed among 2 players, 5 to each one. What is the probability that at least one player has a royal flush (10-J-Q-K-A of the same suit)?
Solution. Each player receives a subset of the cards of size 5, so the sample space is
Ω = {(S_1, S_2) : S_1, S_2 ⊂ C, |S_1| = |S_2| = 5, S_1 ∩ S_2 = ∅},
where C is the set of all 52 cards. Writing C(n, k) = n!/(k!(n − k)!) for the binomial coefficient, there are C(52, 5) ways to choose S_1 and C(47, 5) ways to choose S_2, so |Ω| = C(52, 5) · C(47, 5). Also, every choice is equally likely, so P is the uniform measure.
Let A_i be the event that player i has a royal flush (i = 1, 2). For s ∈ {♠, ♦, ♥, ♣}, let B(i, s) be the event that player i has a royal flush of the suit s. So A_i = ⊎_s B(i, s). B(i, s) is the event that S_i is one specific set of 5 cards, so |B(i, s)| = C(47, 5) for any choice of i, s. Thus,
P(A_i) = Σ_s |B(i, s)|/|Ω| = 4/C(52, 5).
Now we use the inclusion-exclusion principle:
P(A) = P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2).
The event A_1 ∩ A_2 is the event that both players have a royal flush, so |A_1 ∩ A_2| = 4 · 3, since there are 4 options for the first player's suit, and then 3 for the second. Altogether,
P(A) = 4/C(52, 5) + 4/C(52, 5) − (4 · 3)/(C(52, 5) · C(47, 5)) = (8/C(52, 5)) · (1 − 3/(2 · C(47, 5))).
♦
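The counting arguments in items (2)–(4) can be verified by direct computation (an illustrative check, ours, not part of the original notes):

```python
from itertools import product
from math import comb

# (2) Two fair dice: count outcomes with a given sum.
omega = list(product(range(1, 7), repeat=2))
assert sum(i + j == 7 for i, j in omega) / len(omega) == 6 / 36
assert sum(i + j == 6 for i, j in omega) / len(omega) == 5 / 36

# (3) 7 black and 8 white balls, two removed in order: one of each color.
pairs = [(i, j) for i in range(1, 16) for j in range(1, 16) if i != j]
black = set(range(1, 8))                      # balls 1..7 are black
mixed = sum((i in black) != (j in black) for i, j in pairs)
assert mixed / len(pairs) == 8 / 15

# (4) Probability that at least one of the two 5-card hands is a royal flush.
p = 8 / comb(52, 5) * (1 - 3 / (2 * comb(47, 5)))
print(p)  # ≈ 3.08e-06
```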
Exercise 2.10. Prove that there is no uniform probability measure on N; that is, there is no probability measure on N such that P({i}) = P({j}) for all i, j.
Solution. Assume such a probability measure exists, and let p = P({j}) be the common value. By the defining properties of probability measures,
1 = P(N) = Σ_{j=0}^∞ P({j}) = Σ_{j=0}^∞ p,
which is ∞ if p > 0, and 0 if p = 0. A contradiction. □
Exercise 2.11. A deck of 52 cards is ordered randomly, all orders equally likely. What is the probability that the 14th card is an ace? What is the probability that the first ace is at the 14th card?
Solution. The sample space is the set of all possible orderings of the set C of 52 cards. So Ω is the set of all 1-1 functions f : {1, . . . , 52} → C, where f(1) is the first card, f(2) the second, and so on. So |Ω| = 52!. The measure is the uniform one.
Let A be the event that the 14th card is an ace, and let B be the event that the first ace is at the 14th card. Denote by D ⊂ C the set of the four aces.
A is the event that f(14) is an ace: A = {f ∈ Ω : f(14) ∈ D}. There are 4 choices for the ace at position 14, and 51! orderings of the remaining cards, so |A| = 4 · 51!. Thus, P(A) = 4/52 = 1/13.
B is the event that f(14) is an ace, but also f(j) is not an ace for all j < 14: B = {f ∈ Ω : f(14) ∈ D, ∀ j < 14, f(j) ∉ D}. The first 13 positions can be filled with non-aces in 48 · 47 · · · 36 ways, the 14th position with an ace in 4 ways, and the remaining 38 positions in 38! ways. So |B| = 48 · 47 · · · 36 · 4 · 38! = 4 · 48! · 38 · 37 · 36, and
P(B) = (4 · 38 · 37 · 36)/(52 · 51 · 50 · 49) ≈ 0.0312. □
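Both answers can be recomputed directly (an illustrative check, not part of the original notes):

```python
from math import factorial

# P(A): the 14th card is an ace.
p_a = 4 * factorial(51) / factorial(52)
assert p_a == 4 / 52 == 1 / 13

# P(B): the first ace is exactly the 14th card.
p_b = (4 * 38 * 37 * 36) / (52 * 51 * 50 * 49)
print(round(p_b, 4))  # → 0.0312
```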
Lecture 3
3.1. Some set theory
Recall the inclusion-exclusion principle:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
This can be demonstrated by the Venn diagram in Figure 2.
[John Venn (1834–1923)]
Figure 2. The inclusion-exclusion principle.
This diagram also illustrates the intuitive meaning of A ∪ B; namely, A ∪ B is the
event that one of A or B occurs. Similarly, A∩B is the event that both A and B occur.
Let (A_n)_{n=1}^∞ be a sequence of events in some probability space. We have
⋃_{n≥k} A_n = {ω ∈ Ω : ω ∈ A_n for at least one n ≥ k} = {ω ∈ Ω : there exists n ≥ k such that ω ∈ A_n}.
So ⋂_{k=1}^∞ ⋃_{n≥k} A_n is the set of ω ∈ Ω such that for any k, there exists n ≥ k such that ω ∈ A_n; i.e.
⋂_{k=1}^∞ ⋃_{n≥k} A_n = {ω ∈ Ω : ω ∈ A_n for infinitely many n}.
Definition 3.1. Define the set lim sup A_n as
lim sup A_n := ⋂_{k=1}^∞ ⋃_{n≥k} A_n.
The intuitive meaning of lim sup_n A_n is that infinitely many of the A_n occur.
Similarly, we have that
⋃_{k=1}^∞ ⋂_{n≥k} A_n = {ω ∈ Ω : there exists n_0 such that for all n ≥ n_0, ω ∈ A_n} = {ω ∈ Ω : ω ∈ A_n for all large enough n}.
Definition 3.2. Define
lim inf A_n := ⋃_{k=1}^∞ ⋂_{n≥k} A_n.
lim inf An means that An will occur from some large enough n onward; that is, even-
tually all An occur.
It is easy to see that
lim inf A_n ⊆ lim sup A_n.
This also fits the intuition, as if all An occur eventually, then they occur infinitely many
times.
Definition 3.3. If (An)∞n=1 is a sequence of events such that lim inf An = lim supAn
(as sets) then we define
limAn := lim inf An = lim supAn.
Example 3.4. Consider Ω = N, and the sequence
A_n = {n^j : j = 0, 1, 2, . . .}.
If m < n and m ∈ A_n, then it must be that m = n^0 = 1. Thus, ⋂_{n≥k} A_n = {1} and
lim inf_n A_n = ⋃_{k=1}^∞ ⋂_{n≥k} A_n = {1}.
Also, if m < k ≤ n and m ∈ A_n, then again m = 1, so ⋃_{n≥k} A_n does not contain any 1 < m < k; i.e. ⋃_{n≥k} A_n ⊂ {1, k, k + 1, . . .}. Hence,
lim sup A_n = ⋂_{k=1}^∞ ⋃_{n≥k} A_n = {1}.
Thus, the limit exists and lim_n A_n = {1}. ♦
Definition 3.5. A sequence of events is called increasing if An ⊂ An+1 for all n, and
decreasing if An+1 ⊂ An for all n.
Proposition 3.6. Let (A_n) be a sequence of events. Then,
(lim inf A_n)^c = lim sup A_n^c.
Moreover, if (A_n) is increasing (resp. decreasing) then lim A_n = ⋃_n A_n (resp. lim A_n = ⋂_n A_n).
Proof. The first assertion is de Morgan.
For the second assertion, note that if (A_n) is increasing, then
⋃_{n≥k} A_n = ⋃_{n≥1} A_n and ⋂_{n≥k} A_n = A_k.
So
lim sup A_n = ⋂_k ⋃_{n≥k} A_n = ⋂_k ⋃_n A_n = ⋃_n A_n,
lim inf A_n = ⋃_k ⋂_{n≥k} A_n = ⋃_k A_k.
If (A_n) is decreasing then (A_n^c) is increasing. So
lim sup A_n = (lim inf A_n^c)^c = (⋃_n A_n^c)^c = (lim sup A_n^c)^c = lim inf A_n,
and lim A_n = (⋃_n A_n^c)^c = ⋂_n A_n. □
Exercise 3.7. Show that if F is a σ-algebra, and (An) is a sequence of events in F ,
then lim inf A_n ∈ F and lim sup A_n ∈ F.
3.2. Continuity of Probability Measures
The goal of this section is to prove
Theorem 3.8 (Continuity of Probability). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events such that lim A_n exists. Then,
P(lim A_n) = lim_{n→∞} P(A_n).
We start with a restricted version:
Proposition 3.9. Let (Ω, F, P) be a probability space. Let (A_n) be an increasing (resp. decreasing) sequence of events. Then,
P(lim A_n) = lim_{n→∞} P(A_n).
Proof. We start with the increasing case. Let A = ⋃_n A_n = lim A_n. Define A_0 = ∅, and B_k = A_k \ A_{k−1} for all k ≥ 1. So (B_k) are mutually disjoint and
⊎_{k=1}^n B_k = ⋃_{k=1}^n A_k = A_n,
so ⊎_n B_n = A. Thus,
P(lim A_n) = P(⊎_n B_n) = Σ_n P(B_n) = lim_n Σ_{k=1}^n P(B_k) = lim_n P(⊎_{k=1}^n B_k) = lim_n P(A_n).
The decreasing case follows from noting that if (A_n) is decreasing, then (A_n^c) is increasing, so
P(lim A_n) = P(⋂_n A_n) = 1 − P(⋃_n A_n^c) = 1 − lim_n P(A_n^c) = lim_n P(A_n).
□
[Pierre Fatou (1878–1929)]
Lemma 3.10 (Fatou's Lemma). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events (that may not have a limit). Then,
P(lim inf A_n) ≤ lim inf_n P(A_n) and lim sup_n P(A_n) ≤ P(lim sup A_n).
Proof. For all k let B_k = ⋂_{n≥k} A_n. So lim inf_n A_n = ⋃_k B_k. Note that (B_k) is an increasing sequence of events (B_k = B_{k+1} ∩ A_k). Also, for any n ≥ k, B_k ⊂ A_n, so P(B_k) ≤ inf_{n≥k} P(A_n). Thus,
P(lim inf A_n) = P(⋃_k B_k) = lim_k P(B_k) ≤ lim_k inf_{n≥k} P(A_n) = lim inf_n P(A_n).
For the lim sup,
lim sup_n P(A_n) = lim sup_n (1 − P(A_n^c)) = 1 − lim inf_n P(A_n^c) ≤ 1 − P(lim inf A_n^c) = 1 − P((lim sup A_n)^c) = P(lim sup A_n).
□
Fatou’s Lemma immediately proves Theorem 3.8.
Proof of Theorem 3.8. Just note that
lim sup_n P(A_n) ≤ P(lim sup A_n) = P(lim A_n) = P(lim inf A_n) ≤ lim inf_n P(A_n),
so equality holds, and the limit exists. □
As a consequence we get:
[Émile Borel (1871–1956), Francesco Cantelli (1875–1966)]
Lemma 3.11 (First Borel–Cantelli Lemma). Let (Ω, F, P) be a probability space. Let (A_n) be a sequence of events. If Σ_n P(A_n) < ∞, then
P(A_n occurs for infinitely many n) = P(lim sup A_n) = 0.
Proof. Let B_k = ⋃_{n≥k} A_n. So (B_k) is decreasing, and so the decreasing sequence P(B_k) converges to P(lim sup A_n). Thus, for all k,
P(lim sup A_n) ≤ P(B_k) ≤ Σ_{n≥k} P(A_n).
Since the right hand side converges to 0 as k → ∞ by the assumption that the series is convergent, we get P(lim sup A_n) ≤ 0, hence P(lim sup A_n) = 0. □
Example 3.12. We have a bunch of bacteria in a petri dish. Every second, the bacteria give off offspring randomly, and then the parents die out. Suppose that for any n, the probability that there are no bacteria left by time n is 1 − exp(−f(n)).
What is the probability that the bacteria eventually die out if:
• f(n) = log n.
• f(n) = (n^2 − 7)/(2n^2 + 3n + 5).
Let A_n be the event that the bacteria die out by time n. So P(A_n) = 1 − e^{−f(n)} for n ≥ 1.
Note that the event that the bacteria eventually die out is the event that there exists n such that A_n occurs; i.e. the event ⋃_n A_n. Since (A_n)_n is an increasing sequence, we have that P(⋃_n A_n) = lim_n P(A_n).
In the first case this is 1. In the second case this is
lim_{n→∞} 1 − exp(−(n^2 − 7)/(2n^2 + 3n + 5)) = 1 − e^{−1/2}.
♦
Example 3.13. What if we take the previous example, and the information is that the probability that generation n has offspring is at most exp(−2 log n)?
Then, if A_n is the event that the n-th generation has offspring, we have that P(A_n) ≤ n^{−2}. Since Σ_n P(A_n) < ∞, Borel–Cantelli tells us that
P(A_n occurs for infinitely many n) = P(lim sup A_n) = 0.
That is,
P(∃ k : ∀ n ≥ k : A_n^c) = P(lim inf A_n^c) = 1.
So almost surely there exists k such that all generations n ≥ k do not have offspring, implying that the bacteria die out with probability 1. ♦
Lecture 4
4.1. Conditional Probabilities
We start with a simple observation.
Proposition 4.1. Let (Ω, F, P) be a probability space. Let F ∈ F be an event such that P(F) > 0. Define
F|_F = {A ∩ F : A ∈ F},
and define P_F : F|_F → [0, 1] by P_F(B) = P(B)/P(F) for all B ∈ F|_F, and Q : F → [0, 1] by Q(A) = P(A ∩ F)/P(F) for all A ∈ F. Then, F|_F is a σ-algebra on F, P_F is a probability measure on (F, F|_F), and Q is a probability measure on (Ω, F).
Notation. The probability measure Q above is usually denoted P(·|F).
Proof. The σ-algebra part just follows from F = F ∩ F and ⋃_n (F ∩ A_n) = F ∩ ⋃_n A_n.
P_F is a probability measure since P_F(F) = P(F)/P(F) = 1, and if (B_n) is a sequence of mutually disjoint events in F|_F, then
P_F(⋃_n B_n) = P(⋃_n B_n)/P(F) = (1/P(F)) Σ_n P(B_n) = Σ_n P_F(B_n).
Q is a probability measure since Q(Ω) = P(Ω ∩ F)/P(F) = 1, and if (A_n) is a sequence of mutually disjoint events in F, then
Q(⋃_n A_n) = (1/P(F)) Σ_n P(A_n ∩ F) = Σ_n Q(A_n). □
What is the intuitive meaning of the above measure Q? What we did is restrict all events to their intersection with F, and look at them only inside F, normalized to have total probability 1. This can be thought of as the probability of an event, given that we know that the outcome is in F.
Definition 4.2 (Conditional Probability). Let (Ω, F, P) be a probability space, and let F ∈ F be an event such that P(F) > 0.
For any event A ∈ F, the quantity P(A ∩ F)/P(F) is called the conditional probability of A given F.
The probability measure P(·|F) := P(· ∩ F)/P(F) is called the conditional probability given F.
!!! One must be careful with conditional probabilities: intuition here is often misleading.
Example 4.3. An urn contains 10 white balls, 5 yellow balls and 10 black balls. Shir takes a ball out of the urn, all balls equally likely.
• What is the probability that the ball removed is yellow?
• What is the probability the ball removed is yellow, given it is not black?
Solution. A = {the ball is yellow}. B = {the ball is not black}.
P(A) = 5/25 = 1/5. P(B) = 15/25 = 3/5. P(A ∩ B) = P(A) = 1/5. So P(A|B) = (1/5)/(3/5) = 1/3. ♦
Example 4.4. Noga has the heart of an artist and the mind of a mathematician. Given that she takes a course in probability, she will pass with probability 1/3. Given that she takes a course in contemporary art, she will pass with probability 1/2. She chooses which course to take using a fair coin toss: heads for probability and tails for art.
What is the probability that Noga takes probability and passes?
What is the probability she passes, whichever course she takes?
Solution. The sample space here is
Ω = {H, T} × {pass, fail}.
The events we are interested in are A = {Noga passes the course} = {H, T} × {pass}, and B = {Noga takes probability} = {H} × {pass, fail}, so B^c = {Noga takes art} = {T} × {pass, fail}. The event that Noga takes and passes probability is A ∩ B.
The information in the question tells us that P(A|B) = 1/3 and P(A|B^c) = 1/2. Thus,
P(A ∩ B) = P(A|B) P(B) = 1/6,
and
P(A) = P(A ∩ B) + P(A ∩ B^c) = 1/6 + 1/4 = 5/12. ♦
Example 4.5. An urn contains 8 black balls and 4 white balls. Two balls are removed one after the other, any ball being equally likely.
What is the probability that the second ball is black, given that the first is black?
What is the probability that the first is black, given that the second is black?
Solution. A = {first ball is black}, B = {second ball is black}. We know that P(A) = 8/12, and that P(A ∩ B) = (8 · 7)/(12 · 11). So (as is intuitive) P(B|A) = 7/11. However, maybe somewhat less intuitive is P(A|B). Note that P(B ∩ A^c) = (4 · 8)/(12 · 11). So
P(B) = P(B ∩ A) + P(B ∩ A^c) = (8 · (7 + 4))/(12 · 11) = 8/12.
Thus, P(A|B) = P(A ∩ B)/P(B) = 7/11. ♦
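The symmetry P(B|A) = P(A|B) = 7/11 can be checked by exact enumeration of the ordered draws (an illustrative check, not part of the original notes):

```python
from fractions import Fraction
from itertools import permutations

balls = ["B"] * 8 + ["W"] * 4                 # 8 black, 4 white
draws = list(permutations(range(12), 2))      # ordered draws of two distinct balls

A = {d for d in draws if balls[d[0]] == "B"}  # first ball black
B = {d for d in draws if balls[d[1]] == "B"}  # second ball black

def cond(E, F):
    # P(E|F) under the uniform measure on `draws`.
    return Fraction(len(E & F), len(F))

print(cond(B, A), cond(A, B))  # → 7/11 7/11
```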
Exercise 4.6. Prove that for all events A_1, . . . , A_n such that P(A_1 ∩ · · · ∩ A_n) > 0,
P(A_1 ∩ · · · ∩ A_n) = P(A_n|A_1 ∩ · · · ∩ A_{n−1}) · P(A_{n−1}|A_1 ∩ · · · ∩ A_{n−2}) · · · P(A_2|A_1) · P(A_1)
= P(A_1) · Π_{j=1}^{n−1} P(A_{j+1}|A_1 ∩ · · · ∩ A_j).
Solution. By induction. The base is simple. The induction step follows from
P(A_1 ∩ · · · ∩ A_{n+1}) = P(A_{n+1}|A_1 ∩ · · · ∩ A_n) · P(A_1 ∩ · · · ∩ A_n). □
4.2. Bayes' Rule
Let (Ω, F, P) be a probability space. Let (A_n)_n be a partition of Ω; that is, (A_n) are mutually disjoint and ⋃_n A_n = Ω. Assume further that P(A_n) > 0 for all n. Let A, B be events of positive probability. Then, since (B ∩ A_n)_n are mutually disjoint, by additivity we have
P(B) = Σ_n P(B ∩ A_n) = Σ_n P(B|A_n) P(A_n).
This is called the law of total probability.
Figure 3. The law of total probability. In this case Ω = A_1 ⊎ · · · ⊎ A_{12}, and
P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_5)P(A_5) + P(B|A_6)P(A_6) + P(B|A_7)P(A_7) + P(B|A_9)P(A_9) + P(B|A_{10})P(A_{10}).
[Thomas Bayes (1701–1761)]
Another observation is Bayes' rule:
P(B|A) = P(A|B) · P(B)/P(A).
Combining the two, we have that
P(A_n|B) = P(B|A_n) P(A_n) / Σ_m P(B|A_m) P(A_m).
Exercise 4.7. A new machine is invented that determines if a student will become a millionaire in the next ten years or not. Given that a student will become a millionaire in ten years, the machine succeeds in predicting this 95% of the time. Given that the student will not become a millionaire in the next ten years, the machine predicts (wrongly) that he will become a millionaire in 1% of the cases (this is known as a false positive). Only 0.5% of the students will become millionaires in the next ten years. Given that Zuckerberg has been predicted to become a millionaire by the machine, what is the probability that Zuckerberg will actually become one?
Solution. A = {Zuckerberg will become a millionaire in the next ten years}. B = {Zuckerberg is predicted to become a millionaire by the machine}. The information is:
P(A) = 0.005, P(B|A) = 0.95, P(B|A^c) = 0.01.
So,
P(A|B) = P(B|A) P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c)) = (0.95 · 0.005)/(0.95 · 0.005 + 0.01 · 0.995) = 95/(95 + 199) ≈ 0.323 < 1/3.
Not such a great machine after all... □
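The arithmetic can be recomputed directly from the law of total probability and Bayes' rule (an illustrative check, not part of the original notes):

```python
# Bayes' rule for the millionaire-machine example.
p_a = 0.005          # P(A): student becomes a millionaire
p_b_given_a = 0.95   # P(B|A): true positive rate
p_b_given_ac = 0.01  # P(B|A^c): false positive rate

p_b = p_b_given_a * p_a + p_b_given_ac * (1 - p_a)   # law of total probability
p_a_given_b = p_b_given_a * p_a / p_b                # Bayes' rule
print(round(p_a_given_b, 3))  # → 0.323
```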
George Polya
(1887–1985)
Exercise 4.8 (Polya’s Urn). An urn contains one black ball and one white ball. At
every time step, a ball is chosen randomly from the urn, and returned with another new
ball of the same color. (So at time t there are a total of t+ 2 balls.) Calculate
p(t, b) = P( there are b black balls at time t ).
Solution. First let's see how p(t, b) develops when we change t and b by 1.
If at time t there are b black balls out of a total of t + 2, then at time t + 1 there are b + 1 black balls with probability b/(t + 2), and b black balls with probability 1 − b/(t + 2). Thus,
p(t + 1, b) = (b − 1)/(t + 2) · p(t, b − 1) + (1 − b/(t + 2)) · p(t, b).
Also, the initial condition is p(0, 1) = 1. Let q(t, b) = (t + 1)! · p(t, b). Then, q(0, 1) = 1 and for t > 0,
q(t, b) = (b − 1) q(t − 1, b − 1) + (t + 1 − b) q(t − 1, b).
Check that q(t, b) = t! solves this. So p(t, b) = 1/(t + 1). ut
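The recursion and the closed form can be checked mechanically. A short sketch in Python, iterating the recursion with exact arithmetic (the horizon T = 20 is an arbitrary choice):

```python
# Iterate p(t+1, b) = (b-1)/(t+2) * p(t, b-1) + (1 - b/(t+2)) * p(t, b),
# starting from p(0, 1) = 1, and check the closed form p(t, b) = 1/(t+1).
from fractions import Fraction

T = 20
p = {(0, 1): Fraction(1)}
for t in range(T):
    for b in range(1, t + 3):  # at time t+1 there are 1..t+2 black balls
        p[(t + 1, b)] = (Fraction(b - 1, t + 2) * p.get((t, b - 1), Fraction(0))
                         + (1 - Fraction(b, t + 2)) * p.get((t, b), Fraction(0)))

for t in range(T + 1):
    for b in range(1, t + 2):
        assert p[(t, b)] == Fraction(1, t + 1)  # uniform over 1..t+1
print("p(t, b) = 1/(t + 1) verified up to t =", T)
```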
Example 4.9. The probability a family has n children is pn, where Σ_{n=0}^∞ pn = 1. Given that a family has n children, all possibilities for the sexes of these children are equally likely. What is the probability that a family has one child, given that the family has no girls?
Solution. The natural sample space to take here is
Ω = {(s1, s2, . . . , sn) : sj ∈ {boy, girl}, n ≥ 1} ∪ {no children}.
Let An = {n children}, for n ≥ 0, and B = {no girls}. The information in the question is:
For all n ≥ 1, P({(s1, . . . , sn)} | An) = 2^{−n},
for any combination of sexes s1, . . . , sn ∈ {boy, girl}. Thus, P(B|An) = 2^{−n} for all n ≥ 1. Also, P(B|A0) = 1 = 2^{−0}. The law of total probability gives
P(B) = Σ_n P(B|An) P(An) = Σ_{n=0}^∞ 2^{−n} pn.
Using Bayes' rule,
P(A1|B) = P(B|A1) · P(A1) / P(B) = (1/2 · p1) / Σ_{n=0}^∞ 2^{−n} pn.
ut
Example 4.10. There are 6 machines M1, . . . ,M6 in the basement of the math depart-
ment. When pressing the red button on machine Mj , the machine outputs a random
number in N, with the probability that the number is k being j/(j + 1) · (j + 1)^{−k}.
A bored mathematician has access to these machines. She tosses a die, and uses
machine Mj if the outcome of the die is j, and writes down the resulting number.
What is the probability the resulting number is greater than 0?
What is the probability the outcome of the die is 1 given that the resulting number is greater than 3?
Solution. Let Aj = {the die shows j}, j = 1, . . . , 6, Bk = {the resulting number is greater than k}, and Ck = {the resulting number is k}. So (Ck) are mutually disjoint, and Bk = ⋃_{n>k} Cn.
The information is:
P(Bk|Aj) = Σ_{n>k} P(Cn|Aj) = Σ_{n>k} j/(j + 1) · (j + 1)^{−n} = j/(j + 1) · (j + 1)^{−(k+1)} · Σ_{n=0}^∞ (j + 1)^{−n} = (j + 1)^{−(k+1)}.
Also, P(Aj) = 1/6 for all j.
By the law of total probability,
P(Bk) = (1/6) Σ_{j=1}^6 (j + 1)^{−(k+1)} = (1/6) (2^{−(k+1)} + 3^{−(k+1)} + · · · + 7^{−(k+1)}).
So P(B0) = (1/6)(1/2 + 1/3 + · · · + 1/7).
Now, by Bayes,
P(A1|B3) = P(B3|A1) · P(A1) / P(B3) = 2^{−4} / (2^{−4} + 3^{−4} + · · · + 7^{−4}).
ut
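These quantities are easy to evaluate exactly; a sketch in Python with exact rational arithmetic:

```python
# P(B_k) = (1/6) * sum_{j=1}^{6} (j+1)^-(k+1)  (law of total probability),
# P(A_1|B_3) = P(B_3|A_1) P(A_1) / P(B_3)      (Bayes' rule).
from fractions import Fraction

def p_Bk(k):
    return Fraction(1, 6) * sum(Fraction(1, (j + 1) ** (k + 1))
                                for j in range(1, 7))

p_B0 = p_Bk(0)                                      # (1/6)(1/2 + ... + 1/7)
p_A1_given_B3 = Fraction(1, 2 ** 4) * Fraction(1, 6) / p_Bk(3)
print(p_B0, float(p_A1_given_B3))
```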
4.3. The Erdős–Ko–Rado Theorem
Consider n elements, say {0, 1, . . . , n − 1} without loss of generality. We want to “collect” subsets of size k ≤ n/2 so that any two of these subsets intersect one another. We want to collect as many such subsets as possible.
One strategy is as follows: Suppose w.l.o.g. A = {0, 1, . . . , k − 1} is our first set. Any other set that we collect must intersect it, and we want to give ourselves as much freedom as possible. So let's take all subsets of {0, 1, . . . , n − 1} of size k that contain 0. These all must intersect pairwise.
How many sets did we collect? (n−1 choose k−1), because we are actually just choosing k − 1 elements out of {1, . . . , n − 1} for every such subset.
Is this the best possible? The Erdős–Ko–Rado theorem tells us that it is.
Theorem 4.11 (Erdős–Ko–Rado). Let n ≥ 2k. Let A ⊂ 2^{{0,...,n−1}} be a collection of subsets such that for all A, B ∈ A: |A| = k and A ∩ B ≠ ∅. Then, there are at most (n−1 choose k−1) sets in A; that is, |A| ≤ (n−1 choose k−1).
This elementary proof is by Katona.
Proof. Let A be as in the theorem.
Let us choose a uniform element σ from Sn, the set of all permutations on {0, 1, . . . , n − 1}, and independently a uniform index I from {0, 1, . . . , n − 1}. Set A = {σ(I), σ(I + 1), . . . , σ(I + k − 1)}, where addition is modulo n. That is, P[σ = s, I = i] = 1/(n! · n).
For any given set B = {b0, . . . , b_{k−1}} ⊂ {0, . . . , n − 1} with |B| = k, by the law of total probability,
P[A = B, I = j] = Σ_{π ∈ Sn} P[σ = π, I = j, {π(j), π(j + 1), . . . , π(j + k − 1)} = B]
= 1/(n! · n) · #{π ∈ Sn : {π(j), π(j + 1), . . . , π(j + k − 1)} = B}.
This last set is of size k!(n − k)!, because there are k choices for π(j), then k − 1 choices for π(j + 1), etc., until only 1 choice for π(j + k − 1); and then n − k choices for π(j + k), n − k − 1 choices for π(j + k + 1), and so on for all of j + k, . . . , j + n − 1. Thus,
P[A = B] = Σ_{j=0}^{n−1} P[A = B, I = j] = Σ_{j=0}^{n−1} (n choose k)^{−1} · (1/n) = (n choose k)^{−1}.
We have found an alternative description of choosing a subset of {0, . . . , n − 1} of size exactly k uniformly.
Now the second part: For π ∈ Sn and an integer s, consider the subsets Xπ,s := {π(s), π(s + 1), . . . , π(s + k − 1)}, where as usual addition is modulo n.
Note that for any m, Xπ,m ∩ Xπ,m+k = ∅ (here we use the assumption that 2k ≤ n). Indeed, if x ∈ Xπ,m ∩ Xπ,m+k, then there exist 0 ≤ i, j ≤ k − 1 such that π(m + j) = x = π(m + k + i). Since π is a permutation, this implies that j = k + i modulo n, and this is impossible because 0 ≤ i, j ≤ k − 1 and k ≤ n/2.
Now, for given π, s, the only subsets Xπ,m that can intersect Xπ,s are (Xπ,m : m = s − k + 1, . . . , s + k − 1). Leaving out the case m = s, these can be written as k − 1 pairs: (Xπ,s−m, Xπ,s−m+k), m = 1, . . . , k − 1. For each pair, since Xπ,s−m ∩ Xπ,s−m+k = ∅, at most one of the pair can be in the intersecting family A.
Since all subsets in A must intersect one another, if Xπ,s ∈ A for some s, then every j with Xπ,j ∈ A satisfies Xπ,s ∩ Xπ,j ≠ ∅, hence lies among s − k + 1, . . . , s + k − 1, and each of the k − 1 pairs contributes at most one member of A. Thus,
Σ_{j=0}^{n−1} 1{Xπ,j ∈ A} ≤ 1{Xπ,s ∈ A} + Σ_{m=1}^{k−1} (1{Xπ,s−m ∈ A} + 1{Xπ,s−m+k ∈ A}) ≤ 1 + (k − 1) = k.
(If no Xπ,s is in A, the sum is 0.) We conclude that for any π ∈ Sn,
#{0 ≤ j ≤ n − 1 : Xπ,j ∈ A} ≤ k.
Now, if σ = π and A ∈ A, then it must be that I = j for some j such that Xπ,j ∈ A. So, by Boole's inequality,
P[A ∈ A, σ = π] ≤ P[σ = π, I ∈ {0 ≤ j ≤ n − 1 : Xπ,j ∈ A}] ≤ Σ_{j=0}^{n−1} 1{Xπ,j ∈ A} · P[σ = π, I = j] ≤ k/(n! · n).
Summing over all π ∈ Sn, by the law of total probability,
P[A ∈ A] ≤ k/n.
We now combine this bound with the fact that A is uniform over all k-subsets of {0, . . . , n − 1}. Thus,
|A| / (n choose k) = P[A ∈ A] ≤ k/n.
This is exactly |A| ≤ (n choose k) · k/n = (n−1 choose k−1). ut
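For small parameters the theorem can be confirmed by exhaustive search. A brute-force sketch in Python (a branch-and-bound over intersecting families; feasible only for tiny n, k):

```python
# Find the size of a largest intersecting family of k-subsets of {0,...,n-1}
# and compare with the Erdos-Ko-Rado bound C(n-1, k-1).
from itertools import combinations
from math import comb

def max_intersecting_family(n, k):
    sets_ = [frozenset(c) for c in combinations(range(n), k)]

    def extend(size, candidates, best):
        if size + len(candidates) <= best:   # cannot beat the current best
            return best
        best = max(best, size)
        for i, s in enumerate(candidates):
            # keep only later candidates that intersect s
            rest = [t for t in candidates[i + 1:] if s & t]
            best = extend(size + 1, rest, best)
        return best

    return extend(0, sets_, 0)

for n, k in [(4, 2), (5, 2), (6, 2), (6, 3)]:
    assert max_intersecting_family(n, k) == comb(n - 1, k - 1)
print("EKR bound attained for all tested (n, k)")
```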
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 5
5.1. Independence
Until now, we have only used the measure theoretic properties of probability mea-
sures. The main difference between probability and measure theory is the notion of
independence.
This is perhaps the most important definition in probability.
Definition 5.1. Let A,B be events in a probability space (Ω,F ,P). We say that A and
B are independent if
P(A ∩B) = P(A)P(B).
If (An)n is a collection of events, we say that (An) are mutually independent if for any
finite number of events in the collection, An1 , An2 , . . . , Ank we have that
P(An1 ∩An2 ∩ · · · ∩Ank) = P(An1) · P(An2) · · ·P(Ank).
Example 5.2. A card is chosen randomly out of a deck of 52, all cards equally likely.
A is the event that the card is an ace. B is the event that the card is a spade. C is the
event that the card is a jack of diamonds or a 2 of hearts. D is the event that the card
is an even number.
Then, A, B are independent, since
P(A ∩ B) = 1/52 = (4/52) · (13/52) = P(A) · P(B).
C, D are not independent, since
P(C ∩ D) = 1/52 ≠ P(C) · P(D) = (2/52) · (20/52).
Example 5.3. Two dice are tossed, all outcomes equally likely. A = {the first die is 4}, B = {the sum of the dice is 6}, C = {the sum of the dice is 7}.
A, B are not independent:
P(A ∩ B) = 1/36 ≠ P(A) · P(B) = (1/6) · (5/36).
However, A, C are independent:
P(A ∩ C) = 1/36 = (1/6) · (6/36) = P(A) · P(C).
Proposition 5.4. Let (Ω,F ,P) be a probability space.
(1) Any event A is independent of Ω and of ∅.
(2) If A, B are independent, then Ac, B are independent.
(3) If A,B are independent and P(B) > 0 then P(A|B) = P(A).
Proof. If A,B are independent,
P(B ∩Ac) = P(B)− P(B ∩A) = P(B)(1− P(A)) = P(B)P(Ac).
If also P(B) > 0, then P(A|B) = P(A)P(B)/P(B) = P(A). ut
Exercise 5.5. Prove that if A,B,C are mutually independent, then
(1) A and B ∩ C are independent.
(2) A and B \ C are independent.
Solution.
P(A ∩B ∩ C) = P(A)P(B)P(C) = P(A)P(B ∩ C).
This proves the first assertion. For the second assertion, since B \C = B ∩Cc and since
A ∩B is independent of C, then also A ∩B and Cc are independent. So,
P(A ∩ (B \ C)) = P(A ∩B ∩ Cc) = P(A ∩B)P(Cc) = P(A)P(B)(1− P(C)).
Finally,
P(B \ C) = P(B \ (B ∩ C)) = P(B)− P(B ∩ C) = P(B)(1− P(C)),
since B,C are independent. ut
Exercise 5.6. Show that if B is an event with P[B] = 0 then for any event A, A and
B are independent.
Solution. 0 ≤ P[B ∩A] ≤ P[B] = 0 and P[B] · P[A] = 0. ut
Example 5.7. Consider the uniform probability measure on Ω = {0, 1}^2. Let
A = {(x, y) ∈ Ω : x = 1}, B = {(x, y) ∈ Ω : y = 1},
and
C = {(x, y) ∈ Ω : x ≠ y}.
Any two of these events are independent; indeed,
P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4,
P(A) = P(B) = P(C) = 1/2
(because C = {(0, 1), (1, 0)}). However, it is not the case that (A, B, C) are mutually independent, since
P(A ∩ B ∩ C) = 0 ≠ 1/8 = P(A) · P(B) · P(C).
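Pairwise independence without mutual independence is easy to verify by enumerating Ω; a small sketch in Python:

```python
# Uniform measure on {0,1}^2 with A = {x = 1}, B = {y = 1}, C = {x != y}:
# any two are independent, but the triple is not mutually independent.
from itertools import product

omega = list(product([0, 1], repeat=2))
def P(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == 1
B = lambda w: w[1] == 1
C = lambda w: w[0] != w[1]

for E, F in [(A, B), (A, C), (B, C)]:
    assert P(lambda w: E(w) and F(w)) == P(E) * P(F)   # pairwise independent

triple = P(lambda w: A(w) and B(w) and C(w))
assert triple == 0 and P(A) * P(B) * P(C) == 1 / 8      # not mutually indep.
print("pairwise independent but not mutually independent")
```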
5.2. An example using independence: Riemann zeta function and Euler’s
formula
Some number theory via probability:
Bernhard Riemann
(1826-1866)
Let 1 < s ∈ R be a parameter. Let
ζ(s) = Σ_{n=1}^∞ n^{−s}.
Let X be a random variable with range {1, 2, . . .} and distribution given by
P(X = n) = (1/ζ(s)) · n^{−s}.
Let Dm = mZ = the set of positive integers divisible by m. So
PX(Dm) = Σ_n (mn)^{−s} / Σ_n n^{−s} = m^{−s}.
Now, let p1, p2, . . . , pk be k different prime numbers. Then,
Dp1 ∩Dp2 ∩ · · · ∩Dpk = Dp1p2···pk .
So
PX(Dp1 ∩Dp2 ∩ · · · ∩Dpk) = (p1p2 · · · pk)−s = PX(Dp1) · PX(Dp2) · · ·PX(Dpk).
Since this is true for any finite number of primes, we have that the collection (Dp)p∈P
are mutually independent, where P is the set of primes.
Since the only positive integer that is not divisible by any prime is 1, we have that {1} = ⋂_{p∈P} Dp^c, and so
P(X = 1) = PX({1}) = Π_{p∈P} (1 − PX(Dp)) = Π_{p∈P} (1 − p^{−s}).
Since P(X = 1) = 1/ζ(s), we conclude Euler's formula: for any s > 1,
Σ_{n=1}^∞ n^{−s} = Π_{p∈P} (1 − p^{−s})^{−1}.
Leonhard Euler
(1707–1783)
We can also let s → 1 from above, and we get
0 = lim_{s→1} ζ(s)^{−1} = Π_{p∈P} (1 − p^{−1}).
Taking logs,
Σ_{p∈P} log(1 − p^{−1}) = −∞.
Since log(1 − x) ≥ −3x/2 for 0 < x ≤ 1/2, and 1/p ≤ 1/2 for every prime p, we have that
−∞ ≥ −(3/2) Σ_{p∈P} p^{−1},
and so
Σ_{p∈P} 1/p = ∞.
This is a non-trivial result.
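Euler's formula can be sanity-checked numerically. A sketch in Python for s = 2 (the truncation limits are arbitrary; both sides should be near ζ(2) = π²/6):

```python
# Compare a truncation of zeta(s) with a truncated Euler product over primes.
def primes_up_to(n):
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i:: i] = [False] * len(sieve[i * i:: i])
    return [i for i, is_p in enumerate(sieve) if is_p]

s = 2.0
zeta = sum(n ** -s for n in range(1, 1_000_000))
product = 1.0
for p in primes_up_to(1000):
    product *= 1.0 / (1.0 - p ** -s)
print(zeta, product)  # both approximately 1.645
```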
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 6
6.1. Uncountable Sample Spaces
Up to now we have dealt with the case of countable sample spaces, but there is no
reason to restrict to those. For example, what if we wish to experiment by throwing a
dart at a target? The collection of possible outcomes is the collection of points in the
target, which is uncountable.
The basic theory to define such spaces is measure theory.
Consider the sample space Ω = [0, 1]. Why not take a probability measure on all
subsets of [0, 1]?
Well, for example, we would like to consider the experiment of sampling a number from
[0, 1] such that it is uniform over the interval; that is, such that for any 0 ≤ a < b ≤ 1
the probability the number is in (a, b) is b− a.
How can we do this?
The following proposition shows that this cannot be done using all subsets of [0, 1]
(!).
Proposition 6.1. Let Ω = [0, 1]. Let P be a probability measure on (Ω, F), where F is some σ-algebra on Ω, such that for any x ∈ [0, 1] and A ∈ F, P(A) = P(A + x), where A + x = {a + x (mod 1) : a ∈ A}. Then, there exists a subset S ⊂ Ω such that S ∉ F.
Proof. Define the equivalence relation x ∼ y iff x − y ∈ Q. So we can choose a representative for each different equivalence class (using the axiom of choice!). Let R be the set of these representatives.
Define Rq = R + q = {r + q (mod 1) : r ∈ R}, for all q ∈ Q ∩ [0, 1). For q ≠ p ∈ Q ∩ [0, 1) we have that Rq ∩ Rp = ∅, since if x ∈ Rp ∩ Rq then x = r + q = r′ + p (mod 1) for r, r′ ∈ R, and this implies that r ∼ r′ with r ≠ r′, a contradiction. Thus, (Rq)_{q∈Q∩[0,1)} is a countable collection of mutually disjoint events whose union is [0, 1). If R ∈ F, translation invariance would give 1 = P([0, 1)) = P(⋃q Rq) = Σq P(Rq) = Σq P(R), which is impossible: the last sum is 0 if P(R) = 0, and ∞ if P(R) > 0. ut
Thus, we have to develop a theory of sets which we allow to be events, so that
everything is well defined etc.
6.2. σ-algebras Revisited
Recall the definition of a σ-algebra:
Definition 6.2. Let Ω be a sample space. A collection of subsets F ⊂ 2Ω is called a
σ-algebra if
• Ω ∈ F .
• If A ∈ F then Ac ∈ F .
• If (An) is a collection of events in F then⋃nAn ∈ F .
These were the most basic properties we wanted of events: Everything can occur, if
something can occur, it can also not occur, and if there are many possible events, we
can ask if there is one of them that has occured.
Exercise 6.3. Show that F ⊂ 2Ω is a σ-algebra if and only if it has the following
properties.
• ∅ ∈ F .
• If A ∈ F then Ac ∈ F .
• If (An) is a collection of events in F then⋂nAn ∈ F .
Solution. Use de-Morgan. ut
Suppose we have a collection of subsets of Ω. We want the smallest σ-algebra con-
taining these subsets.
Proposition 6.4. Let Ω be a sample space, and let K ⊂ 2Ω be a collection of subsets.
There exists a unique σ−algebra on Ω, denoted by σ(K), such that
• K ⊂ σ(K).
• If F is a σ-algebra such that K ⊂ F, then σ(K) ⊂ F.
So σ(K) is the smallest σ-algebra containing K. Moreover, σ(K) = K if and only if K is itself a σ-algebra.
Proof. Let
Γ(K) = {F : F is a σ-algebra such that K ⊂ F},
and define σ(K) := ⋂ Γ(K). (Γ(K) is non-empty, since 2^Ω ∈ Γ(K).) Of course the second property holds. σ(K) is a σ-algebra: Ω ∈ F for all F ∈ Γ(K), so Ω ∈ σ(K); if A ∈ σ(K), then A ∈ F for all F ∈ Γ(K), so Ac ∈ F for all F ∈ Γ(K), and thus Ac ∈ σ(K). Also, if (An) is a collection of events in σ(K), then An ∈ F for all n and all F ∈ Γ(K). Thus, ⋃n An ∈ F for every F ∈ Γ(K), and so ⋃n An ∈ σ(K).
As for uniqueness, if G is another σ-algebra with the above properties, then σ(K) ⊂ G because K ⊂ G, and similarly G ⊂ σ(K) because K ⊂ σ(K). ut
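On a finite sample space, σ(K) can be computed by literally closing K under complements and finite unions until nothing new appears; for finite Ω this already yields a σ-algebra. A sketch in Python (the sample space and generators are made up for illustration):

```python
# Close a collection of subsets of a finite Omega under complement and union.
def generated_sigma_algebra(omega, generators):
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in generators}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # closed: this is sigma(K)
            return F
        F |= new

omega = {1, 2, 3, 4}
# Example 6.6: sigma({A}) = {emptyset, A, A^c, Omega}
F1 = generated_sigma_algebra(omega, [{1}])
assert F1 == {frozenset(), frozenset({1}), frozenset({2, 3, 4}), frozenset(omega)}
# two generators: atoms {1}, {2}, {3, 4} give 2^3 = 8 events
F2 = generated_sigma_algebra(omega, [{1}, {2}])
print(len(F2))  # prints 8
```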
Definition 6.5. σ(K) as above is called the σ-algebra generated by K.
σ(K) is intuitively all the information carried by the events in K.
Example 6.6. Let Ω be a sample space and A ⊂ Ω. Then, σ({A}) = {∅, A, Ac, Ω}.
Definition 6.7. Let (Ω,F ,P) be a probability space, and let G ⊂ F be a sub-σ-algebra
of F . We say that an event A ∈ F is independent of G if for any B ∈ G, we have that
A,B are independent.
Let (Bn)n be a collection of events. We say that A is independent of (Bn)n if A is independent of the σ-algebra σ((Bn)n).
Example 6.8. If A, B are independent, then we have already seen that A, Bc are also independent. Since A, ∅ are always independent, and also A, Ω are always independent, we have that A is independent of σ({B}) = {∅, B, Bc, Ω}.
6.3. Borel σ-algebra
Let E be a topological space. We can consider the σ-algebra generated by all open sets. This is called the Borel σ-algebra on E, and an event in this σ-algebra is called a Borel set.
A special case is the case E = R^d, and more specifically E = R. In this case we denote the Borel σ-algebra by B = B(R), and it can be shown that B = σ({(a, b) : a < b ∈ R}); that is, B is generated by all intervals.
Exercise 6.9. Show that
B(R) = σ({(a, b] : a < b ∈ R}) = σ({[a, b) : a < b ∈ R}) = σ({[a, b] : a < b ∈ R}).
Solution. Let a < b ∈ R. Since (a, b) = ⋃n (a, b − 1/n], we get that (a, b) ∈ σ({(a, b] : a < b ∈ R}). Since this holds for all a < b ∈ R, we have that σ({(a, b) : a < b ∈ R}) ⊂ σ({(a, b] : a < b ∈ R}). Conversely, (a, b] = ⋂n (a, b + 1/n), which gives the reverse inclusion.
The other cases are similar, so all the mentioned σ-algebras are equal. ut
6.4. Lebesgue Measure
The following basic theorem, due to Lebesgue, is shown in measure theory.
Henri Lebesgue
(1875–1941)
Theorem 6.10 (Lebesgue). Let Ω = [0, 1], and let B be the Borel σ-algebra on Ω. Let
F : [0, 1]→ [0, 1] be a right-continuous, non-decreasing function such that F (1) = 1 and
F (0) = 0. Then, there exists a unique probability measure, denoted P = dF , on (Ω,B)
such that for any 0 ≤ a < b ≤ 1, P((a, b]) = F (b)− F (a).
We will not go into the proof of this theorem, but it gives us many new probability
measures on non-discrete sample spaces that we can define.
Example 6.11. Take F (x) = x in the above theorem. So, the resulting probability
measure, sometimes denoted P = dx, has the property that for any interval 0 ≤ a <
b ≤ 1, P((a, b]) = b − a. One can also check that this measure is translation invariant,
i.e. for any x ∈ [0, 1],
P(x + (a, b] (mod 1)) = P({x + y (mod 1) : a < y ≤ b}) = b − a.
This is also sometimes called the uniform measure on [0, 1].
Exercise 6.12. Show that for the uniform measure on [0, 1], for any a ∈ [0, 1], dx({a}) = 0.
Deduce that for all 0 ≤ a ≤ b ≤ 1,
dx((a, b)) = dx([a, b]) = dx([a, b)) = dx((a, b]) = b − a.
Solution. Let a ∈ [0, 1]. For all n, let An = (a − 1/n, a] ∩ [0, 1]. So P(An) ≤ 1/n. Since (An) is a decreasing sequence of events, and since {a} = ⋂n An, we have by the continuity of probability measures,
P({a}) = lim P(An) = 0.
We can now note that, since the union is disjoint,
P([a, b]) = P((a, b) ∪ {a} ∪ {b}) = P((a, b)) + P({a}) + P({b}) = P((a, b)).
Similarly for the other types of intervals. ut
Example 6.13. Let α < β ∈ R. Let Ω = [α, β]. For a set A ∈ B([0, 1]) let
(β − α)A + α = {(β − α)a + α : a ∈ A}.
Define F = {(β − α)A + α : A ∈ B([0, 1])}. We have that (β − α)[0, 1] + α = Ω, ((β − α)A + α)c = (β − α)Ac + α, and
⋃n ((β − α)An + α) = (β − α)(⋃n An) + α.
So F is a σ-algebra on Ω. Also, if dx is the uniform measure on [0, 1], then
P((β − α)A + α) := dx(A)
defines a probability measure on (Ω, F), because A ∩ B = ∅ if and only if ((β − α)A + α) ∩ ((β − α)B + α) = ∅.
In this case, note that for any α ≤ a < b ≤ β,
P((a, b]) = P((β − α)((a − α)/(β − α), (b − α)/(β − α)] + α) = dx(((a − α)/(β − α), (b − α)/(β − α)]) = (b − a)/(β − α).
Example 6.14. Let Ω be a sample space, and let f : Ω → [0, 1] be a 1-1 function. Let F = {f^{−1}(B) : B ∈ B([0, 1])}. This is a σ-algebra on Ω, since f^{−1}([0, 1]) = Ω, (f^{−1}(B))c = f^{−1}(Bc), and
⋃n f^{−1}(Bn) = f^{−1}(⋃n Bn).
For A ∈ F, since B = f(f^{−1}(B)), we have that f(A) ∈ B, and we can define P(A) = dx(f(A)). This results in a probability measure on (Ω, F), since P(Ω) = 1 and if (An) are mutually disjoint events then so are (f(An)) (because f is 1-1), and
P(⋃n An) = dx(f(⋃n An)) = dx(⋃n f(An)) = Σn dx(f(An)) = Σn P(An).
Example 6.15. A dart is thrown to a target of radius 1 meter. Points are awarded
according to the distance to the center of the target. The probability of being between
distance a and distance b is b2 − a2.
What is the probability of hitting the inner circle of radius 1/2? What about radius 1/4?
Solution. Let F : [0, 1] → [0, 1] be F (x) = x2. Then, dF is a probability measure
on ([0, 1],B) such that dF ((a, b)) = F (b) − F (a) = b2 − a2. Thus, we can model the
dart throwing experiment by the probability space ([0, 1],B, dF ), where the outcome
ω ∈ [0, 1] is the distance to the target.
Now, the probability of hitting the inner circle of radius 1/2 is dF ([0, 1/2]) = (1/2)2 =
1/4.
Similarly the probability of hitting the circle of radius 1/4 is 1/16. ut
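The measure dF with F(x) = x² can also be sampled: if U is uniform on [0, 1], then F^{−1}(U) = √U has distribution dF (inverse-transform sampling). A quick empirical sketch in Python:

```python
# Sample the dart distance via the inverse transform: F(x) = x^2, so
# F^{-1}(u) = sqrt(u), and P(distance <= 1/2) should be near F(1/2) = 1/4.
import random

random.seed(0)
samples = [random.random() ** 0.5 for _ in range(100_000)]
freq = sum(1 for x in samples if x <= 0.5) / len(samples)
print(freq)  # close to 0.25
```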
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 7
7.1. Random Variables
Up until now we dealt with abstract spaces (such as dice, cards, etc.). In order to
say things about objects, we prefer to map them to numbers (or vectors) so that we can
perform calculations.
For example, maybe some biological experiment is being carried out in the lab, but in the end, the biologist may only be interested in the number of bacteria in the dish, or other such measurements.
Measurements of experiments are carried out via random variables.
Suppose we have an experiment described by a probability space (Ω,F ,P). A mea-
surement is just a function taking outcomes to numbers; i.e. a function X : Ω → R.
The natural way to define a probability space on R would now be to take the natural σ-algebra on R, i.e. the Borel σ-algebra B, and then define the measure of a Borel set B to be the probability that the outcome ω is mapped into B:
∀ B ∈ B: PX(B) := P({ω : X(ω) ∈ B}) = P(X^{−1}(B)).
But wait! What if the set X−1(B) is not in F? Then P is not defined on that set.
This is a technical detail, but one that needs to be addressed, otherwise we will run into
paradoxes as in the case of the unmeasurable sets on [0, 1].
Definition 7.1. Let (Ω, F), (Ω′, F′) be measurable spaces. A function X : Ω → Ω′ is
called a measurable map from (Ω,F) to (Ω′,F ′), if for all A′ ∈ F ′, X−1(A′) ∈ F .
We usually denote this by X : (Ω,F) → (Ω′,F ′) to stress the dependence on F and
F ′.
Exercise 7.2. Show that if X : (Ω, F) → (Ω′, F′) and Y : (Ω′, F′) → (Ω′′, F′′) are measurable functions, then Y ∘ X : Ω → Ω′′ is measurable with respect to F, F′′.
To determine if a function X : (Ω,F)→ (Ω′,F ′) is measurable, it is enough to check
for a family of sets generating F ′:
Proposition 7.3. Let (Ω, F), (Ω′, F′) be two measurable spaces, such that F′ = σ(Π′) for some Π′ ⊂ 2^{Ω′}. Then, X : Ω → Ω′ is measurable if and only if for any A′ ∈ Π′, X^{−1}(A′) ∈ F.
Proof. The “only if” part is clear. For the “if” part, consider
G′ = {A′ ⊂ Ω′ : X^{−1}(A′) ∈ F}.
We are given that Π′ ⊂ G′. If we show that G′ is a σ-algebra, then F′ = σ(Π′) ⊂ G′, so for any A′ ∈ F′ we have that X^{−1}(A′) ∈ F, and X is measurable.
So we only need to show that G′ is a σ-algebra. Indeed, since X^{−1}(∅) = ∅ ∈ F, we have that ∅ ∈ G′. If A′ ∈ G′ then X^{−1}(A′) ∈ F, so also X^{−1}(Ω′ \ A′) = Ω \ X^{−1}(A′) ∈ F, and thus Ω′ \ A′ ∈ G′. So G′ is closed under complements.
If (A′n)n is a sequence in G′, then for every n we have that X^{−1}(A′n) ∈ F, and so X^{−1}(⋃n A′n) = ⋃n X^{−1}(A′n) ∈ F. Thus, ⋃n A′n ∈ G′, and G′ is closed under countable unions. ut
Corollary 7.4. If Ω,Ω′ are topological spaces, and B(Ω),B(Ω′) are the Borel σ-algebras,
then any continuous map f : Ω→ Ω′ is measurable.
Proof. f is continuous implies that for any open set O′ ⊂ Ω′, f−1(O′) is an open set in
Ω. Thus, since B(Ω′) = σ(O′), where O′ is the set of open sets in Ω′, we have that for
any O′ ∈ O′, f−1(O′) ∈ B(Ω). So f is measurable. ut
Definition 7.5. A (real valued) random variable on a probability space (Ω,F ,P)
is a measurable function X : (Ω,F)→ (R,B).
The function PX(B) = P(X−1(B)) is a probability measure on (R,B), and is called
the distribution of X.
Example 7.6. Let (Ω, F, P) be a probability space. Let A ∈ F be an event. This is not a random variable, but there is a natural random variable connected to A: Define a function
IA(ω) = 1 if ω ∈ A, and IA(ω) = 0 if ω ∉ A.
Note that for any B ∈ B: if 1 ∉ B and 0 ∉ B, then IA^{−1}(B) = ∅; if 1 ∈ B and 0 ∉ B, then IA^{−1}(B) = A; if 1 ∉ B and 0 ∈ B, then IA^{−1}(B) = Ac; if 1 ∈ B and 0 ∈ B, then IA^{−1}(B) = Ω. So IA is measurable.
IA is known as the indicator of the event A. Indicators are always measurable, and they are the simplest kind of random variables, taking only the two values 0, 1. We denote the indicator of an event A by 1A.
Exercise 7.7. Let (Ω,F ,P) be a probability space. Prove that if X : (Ω,F) → R is a
measurable function then indeed PX(B) = P(X−1(B)) is a probability measure on (R,B).
Proof. PX(R) = P(X^{−1}(R)) = P(Ω) = 1. If (Bn) is a sequence of mutually disjoint events, then (X^{−1}(Bn)) are also mutually disjoint, and
PX(⋃n Bn) = P(⋃n X^{−1}(Bn)) = Σn P(X^{−1}(Bn)) = Σn PX(Bn).
ut
Notation. To simplify the notation, we will omit the ω's; e.g. instead of writing P({ω : X(ω) ∈ A}) we write P(X ∈ A).
Example 7.8. Three balls are removed from an urn containing 20 balls numbered 1 to 20. All possibilities are equally likely. What is the probability that at least one has a number 17 or higher?
The sample space here is Ω = {S ⊂ {1, 2, . . . , 20} : |S| = 3}, and the measure is the uniform measure on this set. We define the random variable X : Ω → R by X({i, j, k}) = max{i, j, k}.
Why is X a measurable function? For any Borel set B ∈ B, we have that X^{−1}(B) is a subset of Ω, and the σ-algebra here is 2^Ω.
What is the probability that there exists a ball numbered 17 or higher? This is
P({ω ∈ Ω : X(ω) ≥ 17}) = P(X ≥ 17) = 1 − P(X < 17).
Since the number of S ∈ Ω with max S < 17 is (16 choose 3), we have that P(X < 17) = (16 · 15 · 14)/(20 · 19 · 18) = (4 · 7)/(3 · 19) = 28/57.
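The count is small enough to check by enumerating the whole sample space; a sketch in Python:

```python
# P(max of 3 balls from {1,...,20} >= 17) = 1 - C(16,3)/C(20,3).
from itertools import combinations
from math import comb

p_formula = 1 - comb(16, 3) / comb(20, 3)
count = sum(1 for S in combinations(range(1, 21), 3) if max(S) >= 17)
p_enumerated = count / comb(20, 3)
print(p_formula)  # 29/57, approximately 0.5088
```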
Example 7.9. Two dice are thrown. So Ω = {(i, j) : 1 ≤ i, j ≤ 6}, and P is the uniform measure on this set. Define the random variable X(i, j) = i + j. Note that 2 ≤ X(i, j) ≤ 12 for any i, j, so
P(X ∈ [2, 12] ∩ N) = 1,
and X is a discrete random variable.
What is the probability that the sum is 7?
P(X = 7) = 6/36 = 1/6.
7.2. Distribution Function
Proposition 7.10. Let (Ω, F) be a measurable space, and let X : Ω → R. Then X is measurable if and only if
{X ≤ r} = X^{−1}((−∞, r]) ∈ F for all r ∈ R.
Proof. The “only if” part is clear.
For the “if” part: Since {X > r} = {X ≤ r}c ∈ F, we have that for any a < b ∈ R,
X^{−1}((a, b]) = X^{−1}((−∞, b] \ (−∞, a]) = {a < X ≤ b} = {X ≤ b} ∩ {X > a} ∈ F.
Let
G = {B ∈ B : X^{−1}(B) ∈ F}.
Since X^{−1}(R) ∈ F, X^{−1}(Bc) = (X^{−1}(B))c and X^{−1}(⋃n Bn) = ⋃n X^{−1}(Bn), we get that G is a σ-algebra.
From the above, (a, b] ∈ G for all a < b ∈ R. Thus, B = σ({(a, b] : a < b ∈ R}) ⊂ G. So for any B ∈ B, we have that B ∈ G, and so X^{−1}(B) ∈ F. ut
Definition 7.11. Let (Ω,F ,P) be a probability space, and let X be a random variable.
The (cumulative) distribution function of X is the function
FX : R→ [0, 1] FX(t) = P(X ≤ t).
Example 7.12. X is the random variable that is the sum of two dice. FX(t) = ?
FX(t) = 0 for t < 2, and FX(t) = 1 for t ≥ 12. FX(t) = 1/36 for t ∈ [2, 3). FX(t) = 1/36 + 2/36 = 3/36 for t ∈ [3, 4). In general, FX(t) = Σ_{j=2}^k P(X = j) = k(k − 1)/72 for t ∈ [k, k + 1) and 2 ≤ k ≤ 7.
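The values of FX are quick to tabulate by enumerating the 36 outcomes; a sketch in Python with exact fractions:

```python
# CDF of the sum of two fair dice, from first principles.
from fractions import Fraction
from itertools import product

sums = [i + j for i, j in product(range(1, 7), repeat=2)]
def F(t):
    return Fraction(sum(1 for x in sums if x <= t), 36)

assert F(1.9) == 0 and F(12) == 1
assert F(2) == Fraction(1, 36)
assert F(3) == Fraction(3, 36)   # P(X=2) + P(X=3)
# for integer k with 2 <= k <= 7, F(k) = k(k-1)/72
assert all(F(k) == Fraction(k * (k - 1), 72) for k in range(2, 8))
print("two-dice CDF verified")
```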
Proposition 7.13. Let X be a random variable, and FX its distribution function. Then,
• FX is non-decreasing and right-continuous.
• FX(t) tends to 0 as t→ −∞.
• FX(t) tends to 1 as t→∞.
Proof.
Since for a ≤ b we have that (−∞, a] ⊂ (−∞, b], we get
FX(a) = PX((−∞, a]) ≤ PX((−∞, b]) = FX(b).
For right continuity, let hn → 0 monotonically from the right; then ((−∞, a + hn])n is a decreasing sequence of events with ⋂n (−∞, a + hn] = (−∞, a], so
FX(a) = PX(⋂n (−∞, a + hn]) = limn PX((−∞, a + hn]) = limn FX(a + hn).
The limits at ∞ and −∞ are treated similarly. If an → ∞ then limn (−∞, an] = (−∞, ∞). So by the continuity of probability measures, PX((−∞, an]) → PX(R) = 1. Similarly for an → −∞. ut
Exercise 7.14. Give an example of a random variable X whose distribution function
FX is not left-continuous.
Solution. Let (Ω, F, P) be a probability space, and let A ∈ F. Let X = 1A. What is FX?
FX(t) = 0 for t < 0; FX(t) = P(Ac) = 1 − P(A) for 0 ≤ t < 1; FX(t) = 1 for t ≥ 1.
So if 0 < P(A) < 1, then
lim_{s→0−} FX(s) = 0 and FX(0) = 1 − P(A) > 0,
lim_{s→1−} FX(s) = 1 − P(A) and FX(1) = 1 > 1 − P(A).
So both 0 and 1 are points where FX is not left-continuous. ut
Of course, one sees that the right-continuity stems from the fact that we define FX(t) = P(X ≤ t). Had we defined FX(t) = P(X ≥ t), it would have been left-continuous. This is not important; however, we stick to the classical distribution function.
Exercise 7.15. Let X be a random variable with distribution FX. Show that for any a < b ∈ R,
• P(a < X ≤ b) = FX(b) − FX(a).
• P(a < X < b) = FX(b−) − FX(a).
• P(X = x) = FX(x) − FX(x−). Deduce that FX is continuous at x if and only if P(X = x) = 0.
Solution.
• Since (a, b] = (−∞, b] \ (−∞, a], and (−∞, a] ⊂ (−∞, b],
FX(b) − FX(a) = PX((−∞, b]) − PX((−∞, a]) = PX((a, b]) = P(a < X ≤ b).
• The events (a, b − 1/n] satisfy limn (a, b − 1/n] = (a, b), and by continuity of probability,
P(a < X < b) = PX(limn (a, b − 1/n]) = limn PX((a, b − 1/n]) = FX(b−) − FX(a).
• Similarly, {x} = limn (x − 1/n, x], so
P(X = x) = PX({x}) = FX(x) − limn FX(x − 1/n) = FX(x) − FX(x−).
So, FX is left-continuous at x if and only if FX(x−) = FX(x), which is if and only if P(X = x) = 0. Since FX is always right-continuous, this is if and only if FX is continuous at x.
ut
Since segments of the form (a, b], (a, b), (−∞, a] and (−∞, a) are enough to generate
all sets in B(R) it can be shown that the distribution function uniquely determines the
probability measure of all Borel sets, and thus it determines the random variable X.
In the same way as Lebesgue measure is shown to exist, it is shown in measure
theory that there is a 1-1 correspondence between distribution functions and probability
measures on (R,B(R)). That is:
Thomas Stieltjes
(1856–1894)
Theorem 7.16 (Lebesgue-Stieltjes). Every function F : R → [0, 1] that is right-
continuous everywhere, non-decreasing and limt→∞ F (t) = 1, limt→−∞ F (t) = 0 gives
rise to a unique probability measure PF on (R,B(R)) satisfying PF ((a, b]) = F (b)−F (a)
for all a < b ∈ R. (That is, for such F there is always a random variable X such that
F = FX .)
Conversely, we have already seen that if X is a random variable, then PX((a, b]) =
FX(b)− FX(a) for all a < b ∈ R, and FX has the properties mentioned above.
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 8
8.1. Discrete Distributions
Definition 8.1. A random variable X is called discrete if there exists a countable set
R ⊂ R such that P(X ∈ R) = 1.
If X is such a random variable, and we specify P(X = r) for all r ∈ R, then the distribution of X can be calculated by
FX(t) = P(X ≤ t) = Σ_{R ∋ r ≤ t} P(X = r).
The set {r ∈ R : P(X = r) > 0} is sometimes called the range of X, and the function fX(x) = P(X = x) is called the density of X (sometimes: probability density function). Note that
FX(t) = Σ_{R ∋ x ≤ t} fX(x).
8.1.1. Bernoulli Distribution. Let 0 < p < 1. X has the Bernoulli-p distribution, denoted X ∼ Ber(p), if:
FX(t) = 0 for t < 0; FX(t) = 1 − p for 0 ≤ t < 1; FX(t) = 1 for 1 ≤ t.
That is, P(X ∈ {0, 1}) = 1, and P(X = 0) = 1 − p, P(X = 1) = p.
Jacob Bernoulli
(1654–1705)
The Bernoulli distribution models a coin toss of a biased coin - probability p to fall
head (or 1) and 1− p to fall tails (or 0).
8.1.2. Binomial Distribution. Suppose we have a biased coin with probability p to
fall heads (or 1). What if we toss that coin n times in a row, all tosses mutually
independent?
The sample space would be Ω = {0, 1}^n. Because of independence, the measure would be
P({ω}) = p^{number of ones in ω} · (1 − p)^{number of zeros in ω}.
So if X : Ω → R is the number of ones, then, since for 0 ≤ k ≤ n there are (n choose k) different ω ∈ Ω with exactly k ones,
fX(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}, for 0 ≤ k ≤ n.
Such an X is said to have the Binomial-(n, p) distribution, denoted X ∼ Bin(n, p).
What is FX in this case?
FX(t) = Σ_{k ≤ t, k = 0, 1, . . . , n} (n choose k) p^k (1 − p)^{n−k}.
So FX(t) = 0 for t < 0 and FX(t) = 1 for t ≥ n.
Example 8.2. A mathematician sits at the bar. For the next hour, every 5 minutes he
orders a beer with probability 2/3 and drinks it. With the remaining probability of 1/3
he does nothing for those 5 minutes. All orders of beer are mutually independent.
What is the probability that he drinks no more than one beer, so he can still drive
home (legally)? 454
Solution. If X is the number of beers drunk, then X ∼ Bin(12, 2/3), since every beer is drunk with probability 2/3 independently, in 12 = 60/5 trials. So
P(X ≤ 1) = P(X = 0) + P(X = 1) = (12 choose 0)(2/3)^0 (1/3)^{12} + (12 choose 1)(2/3)^1 (1/3)^{11} = (1/3)^{12} + 12 · 2 · (1/3)^{12} = 25/3^{12} < 1/20,000.
ut
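The arithmetic checks out exactly; a sketch in Python:

```python
# P(X <= 1) for X ~ Bin(12, 2/3), exactly.
from fractions import Fraction
from math import comb

p = Fraction(2, 3)
def pmf(k):
    return comb(12, k) * p ** k * (1 - p) ** (12 - k)

prob = pmf(0) + pmf(1)
print(float(prob))  # about 4.7e-05
```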
8.1.3. Geometric Distribution.
Example 8.3. A student is watching television. Every minute she tosses a biased coin: if it comes up heads she goes to study for the probability exam; if tails, she continues watching TV. All coin tosses are mutually independent.
What is the probability she will continue watching TV forever?
What is the probability she watches for exactly k minutes and then starts studying?
The sample space here is Ω = N ∪ {∞}. Let T be the number of minutes it takes her to go study. The event that T = k is the event that the first k − 1 coins are tails, and the k-th coin is heads, where all coins are independent. So
P(T = k) = (1 − p)^{k−1} p.
Note that
P(T = ∞) = 1 − P(T < ∞) = 1 − Σ_{k=1}^∞ P(T = k) = 1 − p Σ_{k=1}^∞ (1 − p)^{k−1} = 0.
So the range of T is N.
The distribution of T above is called the Geometric-p distribution; that is, X has the Geometric-p distribution, denoted X ∼ Geo(p), if
fX(k) = P(X = k) = (1 − p)^{k−1} p for k ∈ {1, 2, . . .}, and 0 otherwise.
What is FX? For t ≥ 0,
FX(t) = Σ_{1 ≤ k ≤ t, k ∈ N} (1 − p)^{k−1} p = 1 − (1 − p)^{⌊t⌋}.
So the Geometric distribution is the number of trials of independent identical experiments until the first success.
Example 8.4. An urn contains b black balls and w white balls. We randomly remove a ball, all balls equally likely, and then return that ball.
Let T be the number of trials until we see a black ball. What is the distribution of T?
Well, every time we try, there is independently a probability of p := b/(b + w) to remove a black ball. So the distribution of T is Geometric-p.
Proposition 8.5 (Memoryless property for geometric distribution). Let X be a Geometric-
p random variable. For any k > m ≥ 1,
P(X = k|X > m) = P(X = k −m).
Proof. By definition, since {X = k} ⊂ {X > m} and P(X > m) = (1 − p)^m,
P(X = k|X > m) = P(X = k, X > m)/P(X > m) = (1 − p)^{k−1} p / (1 − p)^m = (1 − p)^{k−m−1} p = P(X = k − m).
ut
That is, if we are given that we haven't succeeded up to time m, the probability that the first success comes after another j = k − m steps is the same as the probability of first succeeding after j steps to begin with.
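The memoryless property is a finite computation for any fixed p; a sketch in Python with exact arithmetic (p = 1/3 is an arbitrary choice):

```python
# Check P(X = k | X > m) = P(X = k - m) for X ~ Geo(p).
from fractions import Fraction

p = Fraction(1, 3)
def pmf(k):                 # P(X = k) = (1-p)^(k-1) p
    return (1 - p) ** (k - 1) * p

def tail(m):                # P(X > m) = (1-p)^m
    return (1 - p) ** m

for m in range(1, 6):
    for k in range(m + 1, m + 8):
        # {X = k} is contained in {X > m}, so P(X = k, X > m) = P(X = k)
        assert pmf(k) / tail(m) == pmf(k - m)
print("memoryless property verified")
```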
8.1.4. Poisson Distribution. A random variable X has the Poisson-λ distribution (0 < λ ∈ R) if the range is N and
fX(k) = P(X = k) = e^{−λ} · λ^k / k!.
Thus,
FX(t) = Σ_{k=0}^{⌊t⌋} e^{−λ} · λ^k / k!.
Simeon Poisson
(1781–1840)
The importance of the Poisson distribution is in that it seems to occur naturally in
many practical situations (the number of phone calls arriving at a call center per minute,
the number of goals per game in football, the number of mutations in a given stretch of
DNA after a certain amount of radiation, the number of requests sent to a printer from
a network...).
The reason this may happen is that the Poisson is the limit of Binomial distributions when the number of trials goes to infinity: Suppose Xn ∼ Bin(n, λ/n). Then,
P(Xn = k) = C(n, k) · (λ/n)^k · (1 − λ/n)^{n−k}
= (λ^k / k!) · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n) · (1 − λ/n)^{n−k} → e^{−λ} λ^k / k!.
So Poisson is like doing infinitely many trials of an experiment that has small proba-
bility of success, and asking how many successes are observed.
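This convergence is easy to see numerically. A small Python sketch (an illustration, not from the notes) compares the Bin(n, λ/n) density to the Poisson-λ density at a fixed k:

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 3.0, 4
# The Bin(n, lam/n) density at k approaches the Poisson-lam density as n grows.
diffs = [abs(binom_pmf(n, k, lam / n) - poisson_pmf(k, lam)) for n in (10, 100, 10000)]
assert diffs[0] > diffs[1] > diffs[2]
assert diffs[2] < 1e-3
```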
Example 8.6. Suppose the number of people at the checkout counter at the supermar-
ket has Poisson distribution. Checkout A has Poisson-5 and Checkout B has Poisson-2.
Both counters are independent.
Noga arrives at the counters and chooses the shorter line (and chooses A if they are
tied). What is the probability she chooses B?
What is the probability both checkouts are empty?
Solution. Let X be the number of people at A, and Y the number at B. We want the
probability of X > Y .
The information we have is that
P(Y = k, X = m) = P(Y = k) P(X = m),
for all k, m ∈ N.
So
P(Y < X) = ∑_{m=0}^∞ P(X = m) ∑_{k<m} P(Y = k) = ∑_{m=1}^∞ e^{−5} (5^m/m!) ∑_{k=0}^{m−1} e^{−2} (2^k/k!).
The probability that both checkouts are empty is
P(X = 0, Y = 0) = P(X = 0) P(Y = 0) = e^{−5} e^{−2} = e^{−7}.  □
8.1.5. Hypergeometric. Suppose we have an urn with N ≥ 1 balls, 0 ≤ m ≤ N are
black and the rest are white. If we choose 1 ≤ n ≤ N balls from the urn randomly and
uniformly, what is the number of black balls we get?
The space here is uniform measure on
Ω = {ω ⊂ {1, . . . , N} : |ω| = n}.
We ask for the random variable X(ω) := |ω ∩ {1, . . . , m}|.
A simple calculation reveals that for any 0 ≤ k ≤ min{n, m},
P[X = k] = C(m, k) · C(N − m, n − k) / C(N, n).
(We interpret C(a, b) for b > a as 0, so actually, max{0, m + n − N} ≤ k ≤ min{m, n}.)
Such a random variable X is said to have hypergeometric distribution with param-
eters N,m, n, and we write X ∼ H(N,m, n).
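A short Python sketch (illustrative only) of the hypergeometric density; note that Python's `math.comb` already returns 0 when the lower index exceeds the upper one, matching the convention above. The parameters below are hypothetical:

```python
import math

def hypergeom_pmf(k, N, m, n):
    """P[X = k] for X ~ H(N, m, n).

    math.comb(a, b) returns 0 when b > a, matching the convention
    for binomial coefficients used in the notes."""
    return math.comb(m, k) * math.comb(N - m, n - k) / math.comb(N, n)

# Hypothetical parameters: an urn with 12 balls, 5 black, draw 4.
N, m, n = 12, 5, 4
total = sum(hypergeom_pmf(k, N, m, n) for k in range(0, min(m, n) + 1))
assert abs(total - 1) < 1e-12  # Vandermonde's identity
```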
Example 8.7. Chagai chooses 5 cards from a deck, all possibilities equal. How much
more likely is it for him to choose 3 aces than it is to choose 4 aces? How much more
likely is it to choose 4 diamonds than it is to choose 5 diamonds?
Let X = the number of aces chosen. The size of the deck is N = 52, the number of
cards chosen is n = 5 and the number of “special” cards is m = 4. So
P[X = 3] / P[X = 4] = [C(m, 3) · C(N − m, n − 3)] / [C(m, 4) · C(N − m, n − 4)] = [4 · (N − m − n + 4)] / [(m − 3)(n − 3)] = (4 · 47)/(1 · 2) = 94.
For Y = the number of diamonds chosen, we get N = 52, n = 5, m = 13. So
P[Y = 4] / P[Y = 5] = [C(m, 4) · C(N − m, n − 4)] / [C(m, 5) · C(N − m, n − 5)] = [5 · (N − m − n + 5)] / [(m − 4)(n − 4)] = (5 · 39)/(9 · 1) = 65/3 = 21 + 2/3.
8.1.6. Negative Binomial. We say that X ∼ NB(m, p), or X has negative binomial
distribution (or Pascal distribution) if X is the number of trials until the m-th success,
with each success with probability p.
Blaise Pascal (1623–1662)
The event {X = k} is the event that there are m − 1 successes in the first k − 1 trials,
and then one success in the k-th trial. So
P[X = k] = C(k − 1, m − 1) p^m (1 − p)^{(k−1)−(m−1)} = C(k − 1, m − 1) p^m (1 − p)^{k−m},
for k ≥ m.
Example 8.8. Let X ∼ NB(1, p). Then X ∼ Geo(p). Indeed, for any k ≥ 1,
P[X = k] = C(k − 1, 0) p (1 − p)^{k−1} = (1 − p)^{k−1} p.
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 9
9.1. Continuous Random Variables
Definition 9.1. Let (Ω,F ,P) be a probability space. A random variable X : Ω→ R is
called continuous if the distribution function FX is continuous at all points.
Remark 9.2. Recall that we showed that for all x ∈ R,
P(X = x) = lim_{n→∞} PX((x − 1/n, x]) = FX(x) − lim_{n→∞} FX(x − 1/n) = FX(x) − FX(x−).
Thus, if FX is continuous at x, then P(X = x) = 0.
So, if X is a continuous random variable then P(X = x) = 0 for all x ∈ R.
This is very different than in the discrete case!
Continuous random variables can be very pathological, and we need measure theory
to deal with some aspects regarding these. We restrict ourselves for this course to a
“nicer” family of random variables.
Definition 9.3. Let (Ω,F ,P) be a probability space. A random variable X : Ω → R
is called absolutely continuous if there exists an integrable non-negative function
fX : R→ R+ such that for all x ∈ R,
P(X ≤ x) = FX(x) = ∫_{−∞}^x fX(t) dt.
Such a function fX is called the probability density function (or just density function,
or PDF) of X.
Remark 9.4. Recall that the fundamental theorem of calculus says that if fX is continuous, then FX is differentiable and F′X = fX.
Also, a result of calculus tells us that the function x ↦ ∫_{−∞}^x f(t) dt is a continuous
function of x. So any absolutely continuous random variable is also continuous.
Under the carpet: We don’t know how to prove that an integrable function fX :
R → R+ uniquely defines a probability measure PX on (R,B), since we would need
some measure theory for this (actually we could use the Lebesgue–Stieltjes Theorem).
However, given that such a measure is uniquely defined, we have the following properties:
Proposition 9.5. Let X be an absolutely continuous random variable with PDF fX .
(1) By definition,
PX((a, b]) = FX(b) − FX(a) = ∫_{−∞}^b fX(t) dt − ∫_{−∞}^a fX(t) dt = ∫_a^b fX(t) dt.
(2) We don’t always know which sets we can integrate over, but if the integral exists
we have
PX(A) = ∫_A fX(t) dt.
For example, if A = ⊎_{j=1}^n (aj, bj] this holds.
(3) Continuity of probability gives us that PX({a}) = 0 for all a (because X is a
continuous random variable, so FX is continuous at all points). Thus,
PX((a, b]) = PX([a, b)) = PX((a, b)) = PX([a, b]).
(4) We always have ∫_R fX(t) dt = PX(R) = 1.
Proof. Just calculus of the Riemann integral, and continuity of probability measures.
□
Example 9.6. Let X be a random variable with PDF
fX(t) = C(2t − t^2) for 0 ≤ t ≤ 2, and 0 otherwise.
What is C? What is the probability that X > 1?
To compute C we use the fact that ∫_R fX(t) dt = 1. So
1 = ∫_{−∞}^∞ fX(t) dt = C ∫_0^2 (2t − t^2) dt = C (t^2 − t^3/3) |_0^2 = C (4 − 8/3) = C · 4/3.
So C = 3/4.
The probability that X > 1 is
PX((1, ∞)) = 1 − PX((−∞, 1]) = 1 − ∫_0^1 C(2t − t^2) dt = 1 − (3/4) · (2/3) = 1/2.
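Both computations can be confirmed numerically; the following Python sketch (not part of the notes) integrates the density with a simple midpoint rule:

```python
# Numerically confirm C = 3/4 and P(X > 1) = 1/2 for f_X(t) = C(2t - t^2) on [0, 2].
def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

C = 1 / integrate(lambda t: 2 * t - t * t, 0, 2)          # normalization constant
p_gt_1 = integrate(lambda t: C * (2 * t - t * t), 1, 2)   # P(X > 1)
assert abs(C - 0.75) < 1e-6
assert abs(p_gt_1 - 0.5) < 1e-6
```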
9.1.1. Uniform Distribution. An absolutely continuous random variable X has the
uniform distribution on [a, b], denoted X ∼ U([a, b]), if X has PDF
fX(t) = 1/(b − a) for a ≤ t ≤ b, and 0 otherwise.
Indeed, ∫_R fX(t) dt = 1.
Example 9.7. A bus arrives at the station every 15 minutes, starting at 5:00. A
passenger arrives between 7:00 and 7:30, uniformly distributed. What is the probability
that he does not wait more than 5 minutes for a bus?
If we take X = the time the passenger arrives, then X has uniform distribution on
[0, 30]. Also, the event that the passenger waits at most 5 minutes is the disjoint union
{X = 0} ⊎ {10 ≤ X ≤ 15} ⊎ {25 ≤ X ≤ 30}.
Thus the probability is
P(X = 0) + P(10 ≤ X ≤ 15) + P(25 ≤ X ≤ 30) = 0 + 5/30 + 5/30 = 1/3.
The uniform distribution is very symmetric in the following sense:
Exercise 9.8. Let X ∼ U([a, b]). Show that Y = X + c has the uniform distribution on
[a+ c, b+ c].
Solution. We need to show that
FY (x) = P({ω : Y (ω) ≤ x}) = ∫_{−∞}^x 1/(b + c − (a + c)) · 1_{[a+c, b+c]}(t) dt.
Indeed, since {ω : Y (ω) ≤ x} = {ω : X(ω) ≤ x − c}, we have that
FY (x) = FX(x − c) = ∫_{−∞}^{x−c} 1/(b − a) · 1_{[a,b]}(t) dt = 1/(b + c − (a + c)) · ∫_{−∞}^x 1_{[a+c, b+c]}(u) du.  □
Exercise 9.9. What is the distribution function of a U([a, b]) random variable?
Proof. For a ≤ x ≤ b we have
FX(x) = 1/(b − a) ∫_a^x dt = (x − a)/(b − a).
For x < a, FX(x) = 0, and for x > b, FX(x) = 1.  □
9.1.2. Exponential Distribution. Let 0 < λ ∈ R. An absolutely continuous random
variable X is said to have exponential distribution of parameter λ, denoted X ∼ Exp(λ), if it has PDF
fX(t) = 1_{t≥0} · λ e^{−λt}.
What is the distribution function in this case?
FX(x) = λ ∫_0^x e^{−λt} dt = (−e^{−λt}) |_0^x = 1 − e^{−λx},
for x ≥ 0. For x < 0, FX(x) = 0.
Example 9.10. The duration of the wait in line at Kupat Cholim is exponentially
distributed with parameter 1/5. What is the probability that the wait is longer than 10
minutes?
Suppose we are given that the wait is longer than 20 minutes. What is the probability
given this information that the wait is longer than 30 minutes?
Let X = waiting time. So X ∼ Exp(1/5).
P(X > 10) = ∫_{10}^∞ (1/5) e^{−t/5} dt = (−e^{−t/5}) |_{10}^∞ = e^{−2}.
Now, using the definition of conditional probability,
P(X > 30 | X > 20) = P(X > 30) / P(X > 20).
Since
P(X > x) = (−e^{−t/5}) |_x^∞ = e^{−x/5},
we get that
P(X > 30 | X > 20) = e^{−6} / e^{−4} = e^{−2}.
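A minimal Python sketch (illustrative, using the survival function P(X > x) = e^{−λx}) confirming both answers and the memoryless identity:

```python
import math

lam = 1 / 5  # waiting time X ~ Exp(1/5), as in the example

def survival(x):
    """P(X > x) = e^{-lam * x} for x >= 0."""
    return math.exp(-lam * x)

assert abs(survival(10) - math.exp(-2)) < 1e-12      # P(X > 10) = e^{-2}
cond = survival(30) / survival(20)                   # P(X > 30 | X > 20)
assert abs(cond - math.exp(-2)) < 1e-12
assert abs(cond - survival(10)) < 1e-12              # memorylessness
```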
Is it a coincidence that we got the same answer? No. Recall that the geometric
distribution has the memoryless property; that is, a geometric random variable X satisfies
P(X > m + n | X > m) = P(X > n).
Similarly, the exponential distribution has the memoryless property:
Proposition 9.11. Let X ∼ Exp(λ). Then for any x, y ∈ R+,
P(X > x+ y|X > x) = P(X > y).
That is, the knowledge of waiting for some time does not give information about how
much more we have to wait.
Proof. First,
P(X > t) = 1 − FX(t) = e^{−λt}.
So
P(X > x + y | X > x) = P(X > x + y) / P(X > x) = e^{−λ(x+y)} / e^{−λx} = e^{−λy} = P(X > y).  □
However, it turns out that the exponential distribution is the only continuous distribution with the memoryless property:
Proposition 9.12. Let X be a continuous random variable, such that for all x, y ∈ R+,
P(X > x+ y|X > x) = P(X > y).
Then, X ∼ Exp(λ) for some λ > 0.
Proof. X is continuous, so FX is a continuous function. Let g(t) = 1 − FX(t) = P(X > t). So g is a continuous function as well.
Now, for any x, y ∈ R+,
g(x + y) = g(x) g(y).
Let λ ≥ 0 be such that g(1) = e^{−λ} (recall that g : R → [0, 1]). For any 0 < n ∈ N we
have that
g(n) = g(n − 1) g(1) = · · · = g(1)^n.
Since g(n) → 0 we have that g(1) < 1 and λ > 0.
For any 0 < m ∈ N we have that
g(1) = g(1/m) g(1 − 1/m) = · · · = g(1/m)^m,
so g(1/m) = g(1)^{1/m}. In a similar way, we get that for any rational number n/m,
g(n/m) = g(1)^{n/m} = e^{−λ n/m}.
Since g is continuous and coincides with t ↦ e^{−λt} on all of Q, we get that g(t) = e^{−λt} for all t.  □

9.1.3. Normal Distribution.
Some calculus. Let us look at the following integral:
I = ∫_{−∞}^∞ e^{−t²} dt.
How can we compute this value?
We use two-dimensional calculus to compute it: Note that
I² = ∫_{R²} e^{−(x²+y²)} dx dy.
Setting x = r cos θ and y = r sin θ (or r = √(x² + y²) and θ = arctan(y/x)), the Jacobian matrix is
J = [ ∂x/∂r = cos θ, ∂x/∂θ = −r sin θ ; ∂y/∂r = sin θ, ∂y/∂θ = r cos θ ],
so det(J) = r. Thus,
I² = ∫_0^{2π} ∫_0^∞ e^{−r²} r dr dθ.
Substituting s = r², so that ds = 2r dr, we have that
I² = ∫_0^{2π} ∫_0^∞ (1/2) e^{−s} ds dθ = π.
So I = √π.
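The value I = √π can be confirmed numerically; a Python sketch (not from the notes), truncating the integral to [−10, 10] where the tail is negligible:

```python
import math

def integrate(f, a, b, steps=200000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# e^{-t^2} is negligible outside [-10, 10], so truncating there is harmless.
I = integrate(lambda t: math.exp(-t * t), -10.0, 10.0)
assert abs(I - math.sqrt(math.pi)) < 1e-6
```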
Now, we can use another trick to compute
I = ∫_{−∞}^∞ exp(−(t − µ)² / (2σ²)) dt.
Letting x = (t − µ)/(√2 σ), we have that dt = √2 σ dx. Thus,
I = √2 σ ∫_{−∞}^∞ e^{−x²} dx = √(2π) σ.
We are ready to define the normal distribution:
Carl Friedrich Gauss
(1777–1855)
Let µ ∈ R and σ ∈ R+. An absolutely continuous random variable has normal-(µ, σ)
distribution, denoted X ∼ N(µ, σ), if it has PDF
fX(t) = (1/√(2πσ²)) exp(−(t − µ)² / (2σ²)).
A normal-(0, 1) random variable is called a standard normal random variable.
Example 9.13. The distribution of the lifetime of people in Oceania is distributed as
a normal-(75, 8) random variable. What is the probability a person lives more than 100
years? Show that it is less than e^{−ξ²/2} · (1/√(2π)), where ξ = 25/8.
George Orwell
(1903–1950)
If X = lifetime, then X ∼ N(75, 8). So
P(X > 100) = ∫_{100}^∞ (1/(√(2π) · 8)) e^{−(t−75)²/(2·8²)} dt.
Taking s = (t − 75)/8, we have that dt = 8 ds, so
P(X > 100) = ∫_ξ^∞ (1/√(2π)) e^{−s²/2} ds.
Since on (ξ, ∞) we have that 1 < s (as ξ > 1), we get that
P(X > 100) < ∫_ξ^∞ (1/√(2π)) s e^{−s²/2} ds.
Substituting u = s²/2, so du = s ds, we have that
P(X > 100) < ∫_{ξ²/2}^∞ (1/√(2π)) e^{−u} du = e^{−ξ²/2} · (1/√(2π)).
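A Python sketch (illustrative only) checking the bound against the exact tail, which is available through the complementary error function:

```python
import math

xi = 25 / 8
# Exact tail of a standard normal: P(Z > xi) = erfc(xi / sqrt(2)) / 2.
tail = math.erfc(xi / math.sqrt(2)) / 2
bound = math.exp(-xi ** 2 / 2) / math.sqrt(2 * math.pi)
assert 0 < tail < bound  # the bound from the example indeed dominates the tail
```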
A nice property of the normal distribution is the following:
Exercise 9.14. Let X ∼ N(µ, σ) and let a > 0. Define Y (ω) = aX(ω) + b. Show that Y ∼ N(aµ + b, aσ).
Solution. We need to prove that for any y ∈ R,
P(Y ≤ y) = ∫_{−∞}^y (1/(√(2π) · aσ)) exp(−(t − aµ − b)² / (2a²σ²)) dt.
Since
{ω : Y (ω) ≤ y} = {ω : X(ω) ≤ (y − b)/a},
we have that
P(Y ≤ y) = P(X ≤ (y − b)/a) = ∫_{−∞}^{(y−b)/a} (1/(√(2π) · σ)) exp(−(t − µ)² / (2σ²)) dt.
Substituting t = (s − b)/a (so that t − µ = (s − aµ − b)/a), we have that dt = (1/a) ds and
P(Y ≤ y) = ∫_{−∞}^y (1/(√(2π) · aσ)) exp(−(s − aµ − b)² / (2a²σ²)) ds.  □
Corollary 9.15. If X is a normal-(µ, σ) random variable, then (X − µ)/σ is a standard
normal random variable.
Example 9.16. Let X ∼ N(µ, σ). Show that FX(t) > 0 for all t.
Indeed,
FX(t) = ∫_{−∞}^t fX(s) ds ≥ ∫_{t−1}^t fX(s) ds ≥ inf_{s∈[t−1,t]} fX(s).
Since fX is increasing up to µ and decreasing after µ, the infimum on
[t − 1, t] is attained at one of the endpoints. So
FX(t) ≥ min{fX(t − 1), fX(t)} > 0.
Example 9.17. Let X ∼ N(0, 1). Show that X2 is absolutely continuous. Is X2
normally distributed?
First, to show that X² is absolutely continuous we need to show that
FX²(t) = ∫_{−∞}^t f(s) ds
for some function f : R → R+. Indeed, if t < 0 then {X² ≤ t} = ∅, so FX²(t) = 0.
If t ≥ 0,
FX²(t) = P[X² ≤ t] = P[−√t ≤ X ≤ √t] = ∫_{−√t}^{√t} (1/√(2π)) e^{−s²/2} ds.
Since the integrand is even (symmetric around 0), we have that
FX²(t) = 2 ∫_0^{√t} (1/√(2π)) e^{−s²/2} ds = ∫_0^t (1/√(2πx)) e^{−x/2} dx,
where we have used the change of variables x = s², so ds = dx/(2√x).
Now, if we define
f(s) = (1/√(2πs)) e^{−s/2} if s > 0, and 0 if s ≤ 0,
then we get from all the above that
FX²(t) = ∫_{−∞}^t f(s) ds.
So X² is absolutely continuous with density fX² = f.
Is X² normal? No. This can be seen since FX²(0) = 0, but FN(0) > 0 for any
N ∼ N(µ, σ).
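The identity FX²(t) = ∫_0^t f(s) ds can be checked numerically; a Python sketch (not part of the notes; the midpoint rule handles the integrable singularity of f at 0 only roughly, hence the loose tolerance):

```python
import math

def phi(s):
    """Standard normal density."""
    return math.exp(-s * s / 2) / math.sqrt(2 * math.pi)

def f(s):
    """Candidate density of X^2 from the example."""
    return math.exp(-s / 2) / math.sqrt(2 * math.pi * s) if s > 0 else 0.0

def integrate(g, a, b, steps=200000):
    """Midpoint rule; only rough near the singularity of f at 0."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

t = 2.0
lhs = integrate(phi, -math.sqrt(t), math.sqrt(t))  # P(-sqrt(t) <= X <= sqrt(t))
rhs = integrate(f, 0.0, t)                         # integral of the candidate density
assert abs(lhs - rhs) < 1e-2
```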
Lecture 10
10.1. Independence Revisited
10.1.1. Some reminders. Let (Ω,F ,P) be a probability space. Given a collection of
subsets K ⊂ F , recall that the σ-algebra generated by K is
σ(K) = ⋂ {G : G is a σ-algebra, K ⊂ G},
and this σ-algebra is the smallest σ-algebra containing K. σ(K) can be thought of as all
possible information that can be generated by the sets in K.
Recall that a collection of events (An)n are mutually independent if for any finite
number of these events A1, . . . , An we have that
P(A1 ∩ · · · ∩An) = P(A1) · · ·P(An).
We can also define the independence of families of events:
Definition 10.1. Let (Ω,F ,P) be a probability space. Let (Kn)n be a collection of
families of events in F . Then, (Kn)n are mutually independent if for any finite
number of families from this collection, K1, . . . ,Kn, and any events A1 ∈ K1, A2 ∈ K2, . . . , An ∈ Kn, we have that
P(A1 ∩ · · · ∩ An) = P(A1) · · ·P(An).
For example, we would like to think that (An) are mutually independent, if all the
information from each event in the sequence is independent from the information from
the other events in the sequence. This is the content of the next proposition.
Proposition 10.2. Let (An) be a sequence of events on a probability space (Ω,F ,P).
Then, (An) are mutually independent if and only if the σ-algebras (σ(An))n are mutually
independent.
It is not difficult to prove this by induction:
Proof. By induction on n we show that (σ(A1), . . . , σ(An), An+1 , . . .) are mutually
independent.
The base n = 0 is just the assumption.
So assume that (σ(A1), . . . , σ(An−1), An , An+1 , . . .) are mutually independent.
Let n1 < n2 < . . . < nk < nk+1 < . . . < nm be a finite number of indices such that
nk < n < nk+1. Let Bj ∈ σ(Anj ) for j = 1, . . . , k, and Bj = Anj for j = k + 1, . . . ,m.
Let B ∈ σ(An). If B = ∅ then P(B1 ∩ · · · ∩ Bm ∩ B) = 0 = P(B1) · · ·P(Bm) · P(B). If
B = Ω then P(B1 ∩ · · · ∩ Bm ∩ B) = P(B1 ∩ · · · ∩ Bm) = P(B1) · · ·P(Bm), by the induction
hypothesis. If B = An then this also holds by the induction hypothesis. So we only
have to deal with the case B = Acn. In this case, for X = B1 ∩ · · · ∩ Bm,
P(X ∩ B) = P(X \ (X ∩ An)) = P(X) − P(X ∩ An) = P(X)(1 − P(An)) = P(X)P(B),
where we have used the induction hypothesis to say that X and An are independent.
Since this holds for any choice of a finite number of events, we have the induction
step.  □
However, we will take a more winding road to get stronger results (Corollary 10.8).
10.2. π-systems and Independence
Definition 10.3. Let K be a family of subsets of Ω.
• We say that K is a π-system if ∅ ∈ K and K is closed under intersections; that
is, for all A,B ∈ K, A ∩B ∈ K.
• We say that K is a Dynkin system (or λ-system) if K is closed under set complements and countable disjoint unions; that is, for any A ∈ K and any sequence
(An) in K of pairwise disjoint subsets, we have that Ac ∈ K and ⊎_n An ∈ K.
The main goal now is to show that probability measures are uniquely defined once
they are defined on a π-system. This is the content of Theorem 10.7.
Proposition 10.4. If F is Dynkin system on Ω and F is also a π-system on Ω, then
F is a σ−algebra.
Proof. Since F is a Dynkin system, it is closed under complements, and Ω = ∅c ∈ F .
Let (An) be a sequence of subsets in F . Set B1 = A1, C1 = ∅, and for n > 1,
Cn = ⋃_{j=1}^{n−1} Aj and Bn = An \ Cn.
Since F is a π-system and closed under complements, Cn = (⋂_{j=1}^{n−1} Acj)^c ∈ F , and so
Bn = An ∩ Ccn ∈ F . Since (Bn) is a sequence of pairwise disjoint sets, and since F is a
Dynkin system,
⋃_n An = ⊎_n Bn ∈ F .  □
Proposition 10.5. If (Dα)α is a collection of Dynkin systems (not necessarily countable), then D = ⋂_α Dα is a Dynkin system.
Proof. Since ∅ ∈ Dα for all α, we have that ∅ ∈ D.
If A ∈ D, then A ∈ Dα for all α. So Ac ∈ Dα for all α, and thus Ac ∈ D.
If (An)n is a countable sequence of pairwise disjoint sets in D, then An ∈ Dα for all
α and all n. Thus, for any α, ⊎_n An ∈ Dα. So ⊎_n An ∈ D.  □
Eugene Dynkin (1924–)
Lemma 10.6 (Dynkin’s Lemma). If a Dynkin system D contains a π-system K, then
σ(K) ⊂ D.
Proof. Let
F = ⋂ {D′ : D′ is a Dynkin system containing K}.
So F is a Dynkin system and K ⊂ F ⊂ D. We will show that F is a σ-algebra, so
σ(K) ⊂ F ⊂ D.
Suppose we know that F is closed under intersections (which is Claim 3 below). Since
∅ ∈ K ⊂ F , we will then have that F is a π-system. Being both a Dynkin system and a
π−system, F is a σ−algebra.
Thus, to show that F is a σ−algebra, it suffices to show that F is closed under
intersections.
Note that F is closed under complements (because all Dynkin systems are).
Claim 1. If A ⊂ B are subsets in F , then B \A ∈ F .
Proof. If A, B ∈ F , then since F is a Dynkin system, also Bc ∈ F . Since A ⊂ B, we
have that A, Bc are disjoint, so A ⊎ Bc ∈ F and so B \ A = Ac ∩ B = (A ⊎ Bc)c ∈ F .  □
Claim 2. For any K ∈ K, if A ∈ F then A ∩K ∈ F .
Proof. Let E = {A : A ∩ K ∈ F}. Let A ∈ E and (An) be a sequence of pairwise disjoint subsets in E.
Since K ∈ F and A ∩K ∈ F , by Claim 1 we have that Ac ∩K = K \ (A ∩K) ∈ F .
So Ac ∈ E.
Since (An ∩K)n is a sequence of pairwise disjoint subsets in F , we get that
(⊎_n An) ∩ K = ⊎_n (An ∩ K) ∈ F .
So we conclude that E is a Dynkin system. Since K is closed under intersections, E
contains K. Thus, by definition F ⊂ E.
So for any A ∈ F we have that A ∈ E, and A ∩ K ∈ F .  □
Claim 3. For any B ∈ F , if A ∈ F then A ∩B ∈ F .
Proof. Let E = {A : A ∩ B ∈ F}. Let A ∈ E and (An) be a sequence of pairwise disjoint subsets in E.
Since B ∈ F and A∩B ∈ F , by Claim 1 we have that Ac ∩B = B \ (A∩B) ∈ F . So
Ac ∈ E.
Since (An ∩B)n is a sequence of pairwise disjoint subsets in F , we get that
(⊎_n An) ∩ B = ⊎_n (An ∩ B) ∈ F .
So we conclude that E is a Dynkin system. By Claim 2, K is contained in E. So by
definition, F ⊂ E.  □
Since F is closed under intersections, this completes the proof.  □
The next theorem tells us that a probability measure on (Ω,F) is determined by its
values on a π-system generating F .
Theorem 10.7 (Uniqueness of Extension). Let K be a π−system on Ω, and let F =
σ(K) be the σ−algebra generated by K. Let P,Q be two probability measures on (Ω,F),
such that for all A ∈ K, P (A) = Q(A). Then, P (B) = Q(B) for any B ∈ F .
Proof. Let D = {A ∈ F : P (A) = Q(A)}. So K ⊂ D. We will show that D is a Dynkin
system, and since it contains K, by Dynkin’s Lemma it must contain F = σ(K).
If A ∈ D, then P (Ac) = 1− P (A) = 1−Q(A) = Q(Ac), so Ac ∈ D.
Let (An) be a sequence of pairwise disjoint sets in D. Then,
P(⊎_n An) = ∑_n P (An) = ∑_n Q(An) = Q(⊎_n An).
So ⊎_n An ∈ D.  □
Corollary 10.8. Let (Ω,F ,P) be a probability space. Let (Πn)n be a sequence of π-
systems, and let Fn = σ(Πn). Then, (Πn)n are mutually independent if and only if
(Fn)n are mutually independent.
Proof. We will prove by induction on n that for any n ≥ 0, the collection (F1,F2, . . . ,Fn,Πn+1,Πn+2, . . . , )
are mutually independent.
For n = 0 this is the assumption. For n ≥ 1, let n1 < n2 < . . . < nk < nk+1 < . . . <
nm be a finite number of indices such that nk < n < nk+1. Let Aj ∈ Fnj , j = 1, . . . , k,
and Aj ∈ Πnj , j = k + 1, . . . ,m.
For any A ∈ Fn, if P(A1 ∩ · · · ∩Am) = 0 then A is independent of A1 ∩ · · · ∩Am. So
assume that P(A1 ∩ · · · ∩Am) > 0.
For any A ∈ Fn define the probability measure
Q(A) := P(A | A1 ∩ · · · ∩ Am).
By induction, the collection (F1, . . . ,Fn−1,Πn,Πn+1, . . .) are mutually independent, so
Q(A) = P(A) for any A ∈ Πn. Since Πn is a π-system generating Fn, we have by Theorem 10.7 that Q(A) = P(A) for any A ∈ Fn. Since this holds for any choice of a finite
number of events A1, . . . , Am, we get that the collection (F1, . . . ,Fn−1,Fn,Πn+1, . . .)
are mutually independent.  □
Corollary 10.9. Let (Ω,F ,P) be a probability space, and let (An)n be a sequence of
mutually independent events. Then, the σ-algebras (σ(An))n are mutually independent.
10.3. Second Borel Cantelli Lemma
We conclude this lecture with
Lemma 10.10 (Second Borel–Cantelli Lemma). Let (An) be a sequence of mutually
independent events. If ∑_n P(An) = ∞ then P(lim sup_n An) = P(An i.o.) = 1.
Proof. It suffices to show that P(lim inf Acn) = 0.
For fixed m > n, note that by independence
P(⋂_{k=n}^m Ack) = ∏_{k=n}^m (1 − P(Ak)).
Since lim_m ⋂_{k=n}^m Ack = ⋂_{k≥n} Ack, we have that
P(⋂_{k≥n} Ack) = lim_{m→∞} ∏_{k=n}^m (1 − P(Ak)) = ∏_{k≥n} (1 − P(Ak)) ≤ exp(−∑_{k≥n} P(Ak)),
where we have used 1 − x ≤ e^{−x}.
Since ∑_n P(An) = ∞, we have that ∑_{k≥n} P(Ak) = ∞ for all fixed n. Thus,
P(⋂_{k≥n} Ack) = 0 for all fixed n, and
P(lim inf Acn) = P(⋃_n ⋂_{k≥n} Ack) = P(lim_n ⋂_{k≥n} Ack) = lim_{n→∞} P(⋂_{k≥n} Ack) = 0.  □
Example 10.11. A biased coin is tossed infinitely many times, all tosses mutually
independent. What is the probability that the sequence 01010 appears infinitely many
times?
Let p be the probability the coin’s outcome is 1, with 0 < p < 1. Let an be the outcome of the n-th toss.
Let An be the event that an = 0, an+1 = 1, an+2 = 0, an+3 = 1, an+4 = 0. Since the tosses
are mutually independent, the sequence of events (A5n+1)n≥0 is mutually
independent. Since P(A5n+1) = p²(1 − p)³ > 0, we have that ∑_{n=0}^∞ P(A5n+1) = ∞.
So the second Borel–Cantelli Lemma tells us that P(A5n+1 i.o.) = 1. Since {A5n+1 i.o.} implies
{An i.o.}, we are done.
Lecture 11
11.1. Independent Random Variables
Let X : (Ω,F) → (R,B) be a random variable. Recall that in the definition of
a random variable we require that X is a measurable function; i.e. for any Borel set
B ∈ B we have X−1(B) ∈ F . We want to define a σ-algebra that is all the possible
information that can be infered from X.
Definition 11.1. Let (Ω,F ,P) be a probability space and let (Xα)α∈I : Ω → R be a
collection of random variables. The σ-algebra generated by (Xα)α∈I is defined as
σ(Xα : α ∈ I) := σ(X−1α (B) : α ∈ I,B ∈ B
).
Note that for a random variable X : Ω → R, the collectionX−1(B) : B ∈ B
is a
σ-algebra, so σ(X) =X−1(B) : B ∈ B
.
We now define independent random variables.
Definition 11.2. Let (Xn)n, (Yn)n be collections of random variables on some proba-
bility space (Ω,F ,P). We say that (Xn)n are mutually independent if the σ-algebras
(σ(Xn))n are mutually independent.
We say that (Xn)n are independent of (Yn)n if the σ-algebras σ((Xn)n) and σ((Yn)n)
are independent.
An important property (that is a consequence of the π-system argument above):
Proposition 11.3. Let (Xn)n be a collection of random variables on (Ω,F ,P). (Xn) are
mutually independent, if for any finite number of random variables from the collection,
X1, . . . , Xn, and any real numbers a1, a2, . . . , an, we have that
P[X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an] = P[X1 ≤ a1] · P[X2 ≤ a2] · · ·P[Xn ≤ an].
Proof. Let X1, . . . , Xn be a finite number of random variables.
Define two probability measures on the Borel sets of Rn: let P1(B) = P((X1, . . . , Xn) ∈ B) for B ∈ Bn, and let P2 = PX1 × · · · × PXn be the product measure, so that
P2(B1 × · · · × Bn) = P(X1 ∈ B1) · · ·P(Xn ∈ Bn).
Since these two measures are identical on the π-system
{(−∞, a1] × · · · × (−∞, an] : a1, a2, . . . , an ∈ R},
and since this π-system generates all Borel sets on Rn, we get that P1 = P2 on all Borel
sets of Rn.
Thus, if Aj ∈ σ(Xj), j = 1, . . . , n, then for all j, Aj = X−1j (Bj) for some Borel set Bj on R, so
P(A1 ∩ · · · ∩ An) = P((X1, . . . , Xn) ∈ B1 × · · · × Bn) = P(X1 ∈ B1) · · ·P(Xn ∈ Bn) = P(A1) · · ·P(An).  □
Example 11.4. We toss two dice. Y is the number on the first die and X is the sum
of both dice.
Here the probability space is the uniform measure on all pairs of dice results, so
Ω = {1, 2, . . . , 6}².
What is σ(X)?
σ(X) = {{(x, y) ∈ Ω : x + y ∈ A} : A ⊂ {2, . . . , 12}}.
(We have to know that all subsets of {2, . . . , 12} are Borel sets. This follows from the
fact that {r} = ⋂_n (r − 1/n, r].) On the other hand,
σ(Y ) = {{(x, y) ∈ Ω : x ∈ A} : A ⊂ {1, . . . , 6}}.
Now note
P[X = 7, Y = 3] = 1/36 = P[X = 7] · P[Y = 3].
This is mainly due to P[X = 7] = 1/6. Similarly, for any y ∈ {1, . . . , 6}, we have that
the events {X = 7} and {Y = y} are independent.
Are X and Y independent random variables? NO!
For example,
P[X = 6, Y = 3] = 1/36 ≠ (5/36) · (1/6) = P[X = 6] · P[Y = 3].
For independence we need all information from X to be independent of the information
from Y !
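Since Ω is finite, these claims can be verified by brute-force enumeration; a Python sketch (illustrative only):

```python
from itertools import product
from fractions import Fraction

# Enumerate the uniform space Omega = {1,...,6}^2; Y = first die, X = the sum.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return sum(Fraction(1, 36) for w in omega if event(w))

X = lambda w: w[0] + w[1]
Y = lambda w: w[0]

# {X = 7} is independent of {Y = 3} ...
assert prob(lambda w: X(w) == 7 and Y(w) == 3) == prob(lambda w: X(w) == 7) * prob(lambda w: Y(w) == 3)
# ... but {X = 6} is not:
assert prob(lambda w: X(w) == 6 and Y(w) == 3) != prob(lambda w: X(w) == 6) * prob(lambda w: Y(w) == 3)
```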
Example 11.5. Let (Xn)n be independent random variables such that Xn ∼ Exp(1)
for all n. Then, for any α > 0,
P[Xn > α log n] = e^{−α log n} = n^{−α}.
Note that in this case,
∑_{n≥1} P[Xn > α log n] = ∞ if α ≤ 1, and < ∞ if α > 1.
For every n the event {Xn > α log n} ∈ σ(Xn). So the sequence ({Xn > α log n})n is
independent. By the Borel–Cantelli Lemma (both parts),
P[Xn > α log n i.o.] = 1 if α ≤ 1, and 0 if α > 1.
Lecture 12
12.1. Joint Distributions
In the same way that we constructed the Borel sets on R, we can consider the σ-
algebra of subsets of Rd, generated by sets of the form (a1, b1]× (a2, b2]× · · · × (ad, bd].
This is the Borel σ-algebra on Rd, usually denoted by Bd or B(Rd).
Suppose now that we have d random variables X1, . . . , Xd on some probability space
(Ω,F ,P). Then, we can define a function X : Ω → Rd by X(ω) = (X1(ω), X2(ω), . . . , Xd(ω)).
Suppose we are told that this function is measurable from (Ω,F) to (Rd,Bd). In this
case X is called an Rd-valued random variable, and we say that X1, . . . , Xd have a
joint distribution on (Ω,F ,P).
Proposition 12.1. X1, . . . , Xd are random variables if and only if X = (X1, . . . , Xd)
is an Rd-valued random variable.
Proof. Note that given a measurable function X : (Ω,F) → (Rd,Bd) we can always
write X = (X1, . . . , Xd), and each Xj is a random variable, since for any B ∈ B,
X−1j (B) = X−1(R× · · · ×B × · · · × R) ∈ F .
Now suppose X1, . . . , Xd are random variables on (Ω,F ,P). We need to show that
X−1(B) ∈ F for all B ∈ Bd. Since Bd = σ({(a1, b1] × (a2, b2] × · · · × (ad, bd] : aj < bj , j = 1, . . . , d}), by Proposition 7.3 it suffices to prove that
X−1((a1, b1] × (a2, b2] × · · · × (ad, bd]) ∈ F
for all aj < bj , j = 1, 2, . . . , d. This follows from the fact that
X−1((a1, b1] × (a2, b2] × · · · × (ad, bd]) = ⋂_{j=1}^d X−1j ((aj , bj ]) ∈ F .  □
The probability measure PX on (Rd,Bd) defined by PX(B) = P(X ∈ B) = P(X−1(B))
is called the distribution of X. We can also define the joint distribution function
of X1, . . . , Xd:
F(X1,...,Xd)(t1, . . . , td) = FX(~t) = P(X1 ≤ t1, X2 ≤ t2, . . . , Xd ≤ td).
A restatement of Proposition 11.3 is:
Proposition 12.2. Let (X1, . . . , Xd) be d random variables on a probability space (Ω,F ,P)
with a joint distribution. Then, (X1, . . . , Xd) are mutually independent if and only if
∀ (t1, . . . , td) ∈ Rd FX(t1, . . . , td) = FX1(t1) · · ·FXd(td).
Example 12.3. An urn contains 3 red balls, 4 white balls and 5 blue balls. We remove
three balls from the urn, all balls equally likely. Let X be the number of red balls
removed and Y the number of white balls removed.
What is the joint distribution of (X,Y )?
A natural probability space here is the uniform measure on Ω = {S ⊂ {1, . . . , 12} : |S| = 3}.
Suppose f : {1, . . . , 12} → {red, white, blue} is the function assigning a color to each
ball in the urn.
So
(X, Y )(S) = (#{s ∈ S : f(s) = red}, #{s ∈ S : f(s) = white}).
So
P(X,Y )((0, 0)) = P({S ∈ Ω : f(s) = blue ∀ s ∈ S}) = C(5, 3) / C(12, 3).
For any S ∈ Ω, we can write S = Sr ⊎ Sw ⊎ Sb where Sr = {s ∈ S : f(s) = red} and
similarly for white and blue. So |S| = |Sr| + |Sw| + |Sb|.
If (X, Y )(S) = (0, 1) then |Sr| = 0, |Sw| = 1 and |Sb| = 2. So
P(X,Y )((0, 1)) = [C(3, 0) · C(4, 1) · C(5, 2)] / C(12, 3).
Similarly, if (X, Y )(S) = (1, 1) then |Sr| = 1, |Sw| = 1 and |Sb| = 1. So
P(X,Y )((1, 1)) = [C(3, 1) · C(4, 1) · C(5, 1)] / C(12, 3).
Generally, for integers x, y ≥ 0 such that x + y ≤ 3,
P(X,Y )((x, y)) = [C(3, x) · C(4, y) · C(5, 3 − x − y)] / C(12, 3).
Of course F(X,Y ) can be calculated by summing these.
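A Python sketch (not part of the notes) computing this joint density and checking that it sums to 1, as Vandermonde's identity guarantees:

```python
import math
from fractions import Fraction

def joint(x, y):
    """P_{(X,Y)}((x, y)) for the 3-red / 4-white / 5-blue urn, 3 balls drawn."""
    if x < 0 or y < 0 or x + y > 3:
        return Fraction(0)
    return Fraction(math.comb(3, x) * math.comb(4, y) * math.comb(5, 3 - x - y),
                    math.comb(12, 3))

assert joint(0, 0) == Fraction(math.comb(5, 3), math.comb(12, 3))
assert sum(joint(x, y) for x in range(4) for y in range(4)) == 1
```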
Example 12.4. Two fair coins are tossed, their outcomes independent. X ∈ {0, 1} is the outcome of the first coin. Y ∈ {0, 1} is the outcome of the second coin. Z = X + Y (mod 2). Show that X, Z are independent.
First, P(X = 0) = P(X = 1) = 1/2. Also,
P(Z = 0) = P(X = Y = 0) + P(X = Y = 1) = 1/4 + 1/4 = 1/2.
Similarly, P(Z = 1) = 1/2.
Now,
P(X = 0, Z = 0) = P(X = 0, Y = 0) = 1/4 = P(X = 0) · P(Z = 0).
Similarly for other values.
As in the case of 1-dimensional random variables, we would like to define the infor-
mation a random variable carries as some σ-algebra. Thus, for a d-dimensional random
variable X we define
σ(X) := {X−1(B) : B ∈ Bd}.
This is a σ-algebra.
Recall that if X = (X1, . . . , Xd) then
σ(X1, . . . , Xd) = σ({X−1j (B) : B ∈ B, 1 ≤ j ≤ d})
= σ({X−1(R × · · · × R × B × R × · · · × R) : B ∈ B}).
The following proposition shows that these notations agree.
Proposition 12.5. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space. Then, σ(X) = σ(X1, . . . , Xd).
Proof. Define G = {B ∈ Bd : X−1(B) ∈ σ(X1, . . . , Xd)}. Then G is a σ-algebra,
and the set of rectangles K = {(a1, b1] × · · · × (ad, bd] : aj < bj ∈ R} is contained
in G. Since Bd is generated by K, we get that Bd ⊂ G, and so σ(X) ⊂ σ(X1, . . . , Xd).
On the other hand, if B ∈ B then X−1j (B) = X−1(R × · · · × B × · · · × R) ∈ σ(X). So
σ(X1, . . . , Xd) ⊂ σ(X) and they are equal.  □
12.2. Marginals
Lemma 12.6. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space (Ω,F ,P). Let Y = (X1, . . . , Xd−1) and let Z = (X2, . . . , Xd) (which
are (d − 1)-dimensional). Then, for any ~t ∈ Rd−1,
FY (~t) = lim_{t→∞} FX(~t, t) and FZ(~t) = lim_{t→∞} FX(t, ~t).
Proof. Fix ~t = (t1, . . . , td−1) ∈ Rd−1. Let A = {X1 ≤ t1, . . . , Xd−1 ≤ td−1}.
Let sn ↗ ∞, and set An = {Xd ≤ sn}. Since (sn) is an increasing sequence, we have
that An ⊂ An+1, so (An)n is an increasing sequence, and thus (A ∩ An)n is also an
increasing sequence. Note that
lim_n An = ⋃_n An = {Xd < ∞} = Ω,
so lim_n (A ∩ An) = A. Thus, using continuity of probability,
FY (~t) = P[A] = lim_n P[A ∩ An] = lim_{n→∞} FX(~t, sn).
Since this holds for any sequence sn ↗ ∞, we have the first assertion.
The proof of the second assertion is almost identical, switching the roles of the first
and last coordinates.  □
Corollary 12.7. Let X = (X1, . . . , Xd) be a d-dimensional random variable on some
probability space (Ω,F ,P). Then, for any t ∈ R,
FXj (t) = lim_{ti→∞, i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td).
FXj above is called the marginal distribution function of Xj .
Lecture 13
13.1. Discrete Joint Distributions
Recall that a random variable is discrete if there exists a countable set R such that
P(X ∈ R) = 1. R is the range of X, and the function fX(r) = P(X = r) is the density
of X.
Suppose that X,Y are discrete random variables, with ranges RX , RY and densities
fX , fY . Note that R := RX ∪RY is also a range for both X and Y .
Now, we can define the joint density of X and Y by
fX,Y (x, y) = P(X = x, Y = y) = P((X,Y ) = (x, y)).
Note that
P((X, Y ) ∉ R²) = P({X ∉ R} ∪ {Y ∉ R}) ≤ P(X ∉ R) + P(Y ∉ R) = 0.
So P((X, Y ) ∈ R²) = 1, and we can think of (X, Y ) as an R²-valued discrete random
variable.
This of course can be generalized:
If X1, . . . , Xn are discrete random variables, then (X1, . . . , Xn) is an Rn-valued random
variable, supported on some countable set in Rn, and the density of (X1, . . . , Xn) is
defined to be
f(X1,...,Xn)(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn).
Exercise 13.1. Let X1, . . . , Xd be random variables. Show that X = (X1, . . . , Xd) is a
jointly discrete random variable if and only if X1, . . . , Xd are each discrete.
Assume that X1, . . . , Xd are discrete random variables. Show that X1, . . . , Xd are
mutually independent if and only if
for all (t1, . . . , td) ∈ Rd: f(X1,...,Xd)(t1, . . . , td) = fX1(t1) · fX2(t2) · · · fXd(td).
13.2. Discrete Marginal Densities
Proposition 13.2. If X = (X1, . . . , Xd) is a discrete joint random variable with range
Rd and density fX , then the density of Xj is given by
fXj(x) = Σ_{xi∈R : i≠j} fX(x1, . . . , xj−1, x, xj+1, . . . , xd).
Proof. This follows from
FX(t1, . . . , td) = Σ_{R∋rj≤tj : j=1,...,d} fX(r1, . . . , rd).
So
lim_{ti→∞ : i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td) = Σ_{R∋rj≤t} Σ_{ri∈R : i≠j} fX(r1, . . . , rd).
Thus,
fXj(r) = FXj(r) − FXj(r−)
= Σ_{R∋rj≤r} Σ_{ri∈R : i≠j} fX(r1, . . . , rd) − Σ_{R∋rj<r} Σ_{ri∈R : i≠j} fX(r1, . . . , rd)
= Σ_{ri∈R : i≠j} fX(r1, . . . , rj−1, r, rj+1, . . . , rd). ut
ut
The above density is called the marginal density of Xj .
Example 13.3. Noga and Shir come home from school and decide whether to do homework
or watch TV.
Let X = 1 if Noga does homework and X = 0 if she watches TV. Let Y = 1 if Shir
does homework and Y = 0 if she watches TV.
We are given that
P((X,Y) = (1, 1)) = 1/4, P((X,Y) = (0, 0)) = 1/2, P((X,Y) = (1, 0)) = 3/16.
What are the marginal densities of X and Y ?
Who is more likely to watch TV?
Note that since the density must sum to 1, P((X,Y) = (0, 1)) = 1/16.
Well, let’s calculate the marginal densities:
fX(1) = P((X,Y) = (1, 0)) + P((X,Y) = (1, 1)) = 7/16.
So it must be that fX(0) = 9/16 (why?). Similarly,
fY(1) = P((X,Y) = (0, 1)) + P((X,Y) = (1, 1)) = 5/16.
So fY(0) = 11/16.
We see that Shir is more likely to watch TV.
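The marginal computation is just summing the joint density table over the other coordinate; a minimal sketch in code (the joint table is the one from this example):

```python
# joint density of (X, Y) from the example, with P((X,Y) = (0,1)) = 1/16
joint = {(1, 1): 4 / 16, (0, 0): 8 / 16, (1, 0): 3 / 16, (0, 1): 1 / 16}

def marginal(joint, axis):
    """Marginal density: sum the joint density over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

f_X = marginal(joint, 0)  # {1: 7/16, 0: 9/16}
f_Y = marginal(joint, 1)  # {1: 5/16, 0: 11/16}
```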
13.3. Conditional Distributions (Discrete)
Example 13.4. After mating season sea turtles land on the beach and lay their eggs.
After hatching, the baby turtles must make it back to the sea on their own, however,
there are many dangers and predators awaiting.
Each female lays a Poisson-λ number of eggs. After hatching, each baby turtle
has probability p of making it back to the sea, all turtles mutually independent.
What is the probability that exactly k turtles make it back to the sea?
How do we solve this? We would like to use conditional probabilities: conditioned
on there being n eggs, the number of surviving turtles is Binomial-(n, p).
So we want to define distributions conditioned on the results of other distributions.
Definition 13.5. Let X,Y be discrete random variables defined on some probability
space (Ω,F ,P). For y such that fY (y) > 0 define the random variable X|Y = y as the
random variable whose density is
fX|Y (x|y) = fX|Y=y(x) = P(X = x|Y = y).
Note that the correct way to do this would be to define a different density fX|y(·)
for every y. This is captured in the notation fX|Y (·|y).
Example 13.6. Let the density of X,Y be given by:
X \ Y 0 1
0 0.4 0.1
1 0.2 0.3
So
P(Y = 1) = 0.4 and P(Y = 0) = 0.6.
Thus,
fX|Y (1|1) = P(X = 1|Y = 1) = 0.3/0.4 = 3/4,
fX|Y (0|1) = P(X = 0|Y = 1) = 0.1/0.4 = 1/4,
fX|Y (1|0) = 0.2/0.6 = 1/3,
fX|Y (0|0) = 0.4/0.6 = 2/3.
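Definition 13.5 in code: the conditional density is a row of the joint table divided by the marginal; a sketch using the table of this example:

```python
# joint density of (X, Y) from Example 13.6
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def conditional_density(joint, y):
    """f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y), defined when f_Y(y) > 0."""
    f_y = sum(p for (x, y2), p in joint.items() if y2 == y)
    return {x: p / f_y for (x, y2), p in joint.items() if y2 == y}

given_1 = conditional_density(joint, 1)  # {0: 1/4, 1: 3/4}
given_0 = conditional_density(joint, 0)  # {0: 2/3, 1: 1/3}
```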
Let’s return to solve the example above:
Solution of Example 13.4. Let X be the number of eggs, and let Y be the number of
turtles that survive.
What we are given is that X ∼ Poi(λ) and that Y |X = n ∼ Bin(n, p).
That is, for all k ≤ n,
P(Y = k|X = n) = \binom{n}{k} p^k (1 − p)^{n−k}.
Thus,
P(Y = k) = Σ_{n=0}^∞ P(Y = k|X = n) P(X = n) = Σ_{n=k}^∞ \binom{n}{k} p^k (1 − p)^{n−k} · e^{−λ} λ^n/n!
= (p^k λ^k/k!) · e^{−λ} · Σ_{n=k}^∞ (λ(1 − p))^{n−k}/(n − k)!
= ((λp)^k/k!) · e^{−λ} e^{λ(1−p)} = e^{−λp} · (λp)^k/k!.
That is, the number of surviving turtles has the distribution of a Poisson-λp random
variable. ut
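This "Poisson thinning" conclusion is easy to check by simulation; a minimal sketch (the values of λ, p, the seed, and the crude sampler are arbitrary choices, not part of the notes):

```python
import math
import random

random.seed(0)
lam, p, trials = 3.0, 0.4, 200_000

def poisson(lam):
    """Sample Poi(lam) by multiplying uniforms until the product drops below e^{-lam}."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

survivors = []
for _ in range(trials):
    n = poisson(lam)                                              # number of eggs
    survivors.append(sum(random.random() < p for _ in range(n)))  # surviving turtles

# empirical density of the number of survivors; should match Poi(lam * p)
empirical = [survivors.count(k) / trials for k in range(6)]
theoretical = [math.exp(-lam * p) * (lam * p) ** k / math.factorial(k) for k in range(6)]
```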
Exercise 13.7. Let X,Y be discrete random variables. Show that
fX,Y (x, y) = fX|Y (x|y)fY (y),
for all y such that fY (y) > 0.
13.4. Sums of Independent Random Variables (Discrete Case)
Suppose that X,Y are random variables on some probability space. One can define
the random variable Z(ω) = X(ω) + Y (ω), or in short Z = X + Y . (This is always a
random variable, because Z = φ(X,Y ), where φ(x, y) = x+y, and since φ is continuous
it is measurable).
Suppose X,Y are independent discrete random variables, with densities fX , fY .
First, from independence, we have that
P(X = x, Y = y) = P(X = x)P(Y = y),
for all x, y. Thus, f(X,Y )(x, y) = fX(x)fY (y).
Now, what is the density of X + Y ? Well,
fX+Y(z) = P(X + Y = z) = P(⊎_y {X + Y = z, Y = y}) = Σ_y P(X + Y = z, Y = y)
= Σ_y P(X = z − y) P(Y = y) = Σ_y fX(z − y) fY(y).
This leads to the following definition:
Definition 13.8. Let f, g be two real valued functions supported on a countable set
R ⊂ R. The (discrete) convolution of f, g, denoted f ∗ g, is the real valued function
f ∗ g : R→ R,
(f ∗ g)(z) = Σ_{y∈R∪(z−R)} f(z − y) g(y).
Note that f ∗ g = g ∗ f by change of variables.
We conclude with the observation:
Proposition 13.9. Let X,Y be two independent discrete random variables on some
probability space. Let Z = X + Y . Then, fZ = fX ∗ fY .
Example 13.10. Consider X1, . . . , Xn mutually independent random variables where
Xj ∼ Poi(λj) for some λ1, . . . , λn > 0.
Let Z = X1 + · · ·+Xn. What is the distribution of Z?
Well, since fX1(x) = 0 for x < 0,
fX1+X2(z) = (fX1 ∗ fX2)(z) = Σ_{x=0}^∞ fX1(z − x) fX2(x)
= Σ_{x=0}^z e^{−λ1} λ1^{z−x}/(z − x)! · e^{−λ2} λ2^x/x!
= e^{−(λ1+λ2)} · (1/z!) Σ_{x=0}^z \binom{z}{x} λ1^{z−x} λ2^x
= e^{−(λ1+λ2)} · (λ1 + λ2)^z/z!.
This is the density of the Poisson-(λ1 + λ2) distribution!
So X1 +X2 ∼ Poi(λ1 + λ2). Continuing inductively, we conclude that Z ∼ Poi(λ1 +
λ2 + · · ·+ λn).
That is: the sum of independent Poisson random variables is again a Poisson random
variable.
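Proposition 13.9 can be checked numerically; a sketch with a naive O(n²) discrete convolution (the rates 1.5 and 2.5 are arbitrary, and the supports are truncated at 40, which is harmless for small k):

```python
import math

def poisson_density(lam, n_max):
    """Poi(lam) density on {0, ..., n_max}."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n_max + 1)]

def convolve(f, g):
    """Discrete convolution (f * g)(z) = sum_y f(z - y) g(y), supports on 0..len-1."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

f1, f2 = poisson_density(1.5, 40), poisson_density(2.5, 40)
fz = convolve(f1, f2)                  # density of X1 + X2
f_sum = poisson_density(4.0, 40)       # Poi(1.5 + 2.5) density
# fz[k] should agree with f_sum[k] for k well below the truncation
```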
Example 13.11. Let X ∼ Poi(α) and Y ∼ Poi(β). Assume X,Y are independent.
What is the distribution of X|X + Y = n?
Let's calculate explicitly: Recall that X + Y ∼ Poi(α + β). For k ≤ n,
P(X = k|X + Y = n) = P(X = k) P(Y = n − k) / P(X + Y = n)
= e^{−α} α^k/k! · e^{−β} β^{n−k}/(n − k)! · (e^{−(α+β)} (α + β)^n/n!)^{−1}
= \binom{n}{k} α^k β^{n−k}/(α + β)^n = \binom{n}{k} (α/(α + β))^k (1 − α/(α + β))^{n−k}.
So X conditioned on X + Y = n has the Binomial-(n, α/(α + β)) distribution.
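A quick numerical check of this identity (the choices α = 2, β = 3, n = 7 are arbitrary):

```python
import math

def pois(lam, k):
    """Poi(lam) density at k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

alpha, beta, n = 2.0, 3.0, 7
# conditional density P(X = k | X + Y = n) computed from the Poisson densities
cond = [pois(alpha, k) * pois(beta, n - k) / pois(alpha + beta, n) for k in range(n + 1)]
# Binomial(n, alpha/(alpha+beta)) density
q = alpha / (alpha + beta)
binom = [math.comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(n + 1)]
```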
Lecture 14
14.1. Continuous Joint Distributions
Definition 14.1. Let X1, . . . , Xd be random variables on some probability space (Ω,F ,P).
We say that X1, . . . , Xd have an absolutely continuous joint distribution (or that
X = (X1, . . . , Xd) is an Rd-valued absolutely continuous random variable) if there exists
a non-negative integrable function fX = fX1,...,Xd : Rd → R+, such that for all
~t = (t1, . . . , td),
FX(~t) = ∫_{−∞}^{t1} · · · ∫_{−∞}^{td} fX(~s) d~s.
Marginal distributions are as in the discrete case: If X = (X1, . . . , Xd) is a joint
random variable then the marginal distribution of Xj is
FXj(t) = lim_{tk→∞ : k≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td).
Specifically, if X is absolutely continuous, then
fXj(t) = ∫ · · · ∫ fX(t1, . . . , tj−1, t, tj+1, . . . , td) dt1 · · · dtj−1 dtj+1 · · · dtd.
!!! It is not true that if X1, . . . , Xd are absolutely continuous, then necessarily
X = (X1, . . . , Xd) is absolutely continuous as an Rd-valued random variable.
Example 14.2. If X = Y are the same absolutely continuous random variable, then
(X,Y) is not absolutely continuous: if it were, then for the Borel set A = {(x, y) ∈ R2 : x = y},
1 = ∫∫ fX,Y(x, y) dx dy = ∫∫_A fX,Y(x, y) dx dy = 0,
since fX,Y(x, y) = 0 if (x, y) ∉ A and since A has measure 0 in the plane.
A more elementary way to see this (within the scope of these notes): Suppose fX,Y is the
density of a pair (X,Y) such that X = Y is an absolutely continuous random variable.
Let gs(x) = ∫_{−∞}^s fX,Y(x, y) dy. Then,
FX,Y(∞, s) = P[X ≤ ∞, Y ≤ s] = P[X ≤ s, Y ≤ s] = FX,Y(s, s).
So
∫_{−∞}^∞ gs(x) dx = FX,Y(∞, s) = FX,Y(s, s) = ∫_{−∞}^s gs(x) dx,
which implies that
∫_s^∞ gs(x) dx = 0.
Thus, gs(x) = 0 for all (actually a.e.) x ≥ s (since gs is a non-negative function). Now,
if x ≥ s then
0 = gs(x) = ∫_{−∞}^s fX,Y(x, y) dy,
so fX,Y(x, y) = 0 for all (actually a.e.) x ≥ s, y ≤ s. Since this holds for all s, we have
that fX,Y = 0 for all (or a.e.) x ≥ y. Exchanging the roles of X,Y we get that fX,Y = 0
(a.e.), which is not a density function, a contradiction!
However, we have the following criterion for independence:
Proposition 14.3. X1, . . . , Xn are mutually independent absolutely continuous random
variables if and only if X = (X1, . . . , Xn) is an Rn-valued absolutely continuous random
variable with density fX(t1, . . . , tn) = fX1(t1) · · · fXn(tn).
Proof. The “if” part follows from Proposition 11.3.
For the “only if” part: Let f(t1, . . . , tn) = fX1(t1) · · · fXn(tn). We need to show that
P[X1 ≤ t1, . . . , Xn ≤ tn] = ∫_{−∞}^{t1} · · · ∫_{−∞}^{tn} f(s1, . . . , sn) ds1 · · · dsn.
Indeed, by independence the left-hand side equals ∏_{j=1}^n P[Xj ≤ tj] = ∏_{j=1}^n ∫_{−∞}^{tj} fXj(sj) dsj,
which is exactly the right-hand side. ut
14.2. Conditional Distributions (Continuous)
Recall that if X,Y are discrete random variables, then
fX|Y (x|y) = P[X = x|Y = y].
For continuous random variables this is not defined, since P[Y = y] = 0. However, we
still can define the following:
Let X,Y be absolutely continuous random variables. Let y ∈ R be such that fY (y) >
0. Define a function fX|Y (·|y) : R→ R+ by
fX|Y(x|y) := f(X,Y)(x, y) / fY(y).
Then, for every y such that fY(y) > 0, this function is a density; indeed, recall that
fY(y) = ∫ f(X,Y)(x, y) dx, so
∫_{−∞}^∞ fX|Y(x|y) dx = ∫ f(X,Y)(x, y)/fY(y) dx = 1.
Thus, fX|Y (·|y) defines an absolutely continuous random variable, denoted X|Y = y,
and called X conditioned on Y = y. fX|Y (·|y) is the conditional density of X
conditioned on Y = y. We have the identity
f(X,Y )(x, y) = fX|Y (x|y) · fY (y).
For any Borel set B ∈ B, we define
P[X ∈ B|Y = y] := P[(X|Y = y) ∈ B] = ∫_B fX|Y(x|y) dx.
In many cases, the information is given as conditional densities, rather than joint
densities.
Example 14.4. Let X,Y have joint density
fX,Y(x, y) = (1/y) e^{−x/y} e^{−y} for x, y ≥ 0, and fX,Y(x, y) = 0 otherwise.
Let's compute the density of X|Y: First, the marginal density of Y is
fY(y) = ∫_0^∞ (1/y) e^{−x/y} e^{−y} dx = e^{−y},
for y ≥ 0 and 0 otherwise. For x, y ≥ 0,
fX|Y(x|y) = (1/y) e^{−x/y} e^{−y} / e^{−y} = (1/y) e^{−x/y},
and fX|Y(x|y) = 0 for x < 0. So X|Y = y ∼ Exp(1/y).
Thus, for example,
P[X > 1|Y = y] = ∫_1^∞ fX|Y(x|y) dx = e^{−1/y}.
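The conditional probability at the end can be checked by numeric integration; a sketch (the integration cutoff and step count are arbitrary choices; the tail beyond the cutoff is negligible):

```python
import math

def cond_density(x, y):
    """Conditional density f_{X|Y}(x|y) = (1/y) e^{-x/y} for x >= 0, from this example."""
    return math.exp(-x / y) / y if x >= 0 else 0.0

def prob_greater(a, y, upper=100.0, steps=200_000):
    """Midpoint Riemann sum approximating P[X > a | Y = y] over [a, upper]."""
    h = (upper - a) / steps
    return h * sum(cond_density(a + (i + 0.5) * h, y) for i in range(steps))

# analytically P[X > 1 | Y = y] = e^{-1/y}, e.g. prob_greater(1.0, 2.0) ≈ e^{-1/2}
```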
14.3. Sums (Continuous)
Let X,Y be independent absolutely continuous random variables. Let Z = X + Y .
What is FZ?
P[Z ≤ z] = P[X + Y ≤ z] = ∫∫_{x+y≤z} fX,Y(x, y) dx dy
= ∫_{−∞}^∞ ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
= ∫_{−∞}^∞ FX(z − y) fY(y) dy = (FX ∗ fY)(z).
Now, note that (substituting x = s − y in the inner integral)
FZ(z) = ∫_{−∞}^∞ ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
= ∫_{−∞}^z ∫_{−∞}^∞ fX(s − y) fY(y) dy ds = ∫_{−∞}^z (fX ∗ fY)(s) ds.
Thus, Z = X + Y is an absolutely continuous random variable with density fX ∗ fY .
Example 14.5. What is the distribution of X + Y, for independent X,Y, when:
• X ∼ N(0, 1) and Y ∼ N(0, 1)?
• X ∼ Exp(λ) and Y ∼ Exp(α), with α ≠ λ?
In the normal case we have that ((x − y)² + y²)/2 = x²/4 + (y − x/2)². So,
fX+Y(x) = (1/(2π)) e^{−x²/4} · ∫_{−∞}^∞ e^{−(y−x/2)²} dy = (1/(2π)) e^{−x²/4} · √π = (1/(√(2π) · √2)) e^{−x²/4}.
So X + Y ∼ N(0, √2).
In the exponential case, for x > 0,
fX+Y(x) = (fX ∗ fY)(x) = αλ e^{−λx} ∫_{−∞}^∞ 1{[0,∞)}(x − y) 1{[0,∞)}(y) e^{−(α−λ)y} dy
= (αλ/(α − λ)) e^{−λx} (1 − e^{−(α−λ)x}) = (αλ/(α − λ)) (e^{−λx} − e^{−αx}).
Since P[X + Y < 0] = 0, this is the density of X + Y.
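A Monte Carlo check of the exponential-case density (the rates λ = 1, α = 2 and the seed are arbitrary; we compare the empirical frequency of {Z ≤ 1} with the integral of the claimed density over [0, 1]):

```python
import math
import random

random.seed(1)
lam, alpha, n = 1.0, 2.0, 400_000
samples = [random.expovariate(lam) + random.expovariate(alpha) for _ in range(n)]

# empirical P[Z <= 1]
emp = sum(s <= 1.0 for s in samples) / n

# P[Z <= 1] from the claimed density (alpha*lam/(alpha-lam)) (e^{-lam x} - e^{-alpha x}):
theo = alpha * lam / (alpha - lam) * ((1 - math.exp(-lam)) / lam
                                      - (1 - math.exp(-alpha)) / alpha)
```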
Lecture 15
In this lecture we want to define the average value of a random variable X. If X is
discrete, it would be intuitive to define the average as Σ_x P[X = x] · x. If X is absolutely
continuous, perhaps one would analogously define the average as ∫ x fX(x) dx. But what
about the general case?
We will follow Lebesgue integration theory for this task.
15.1. Preliminaries
Proposition 15.1. Let X = (X1, . . . , Xn) and Y = (Y1, . . . , Ym) be random variables
on some probability space (Ω,F ,P). Then, X is independent of Y if and only if for
any measurable functions g : Rn → R and h : Rm → R we have that g(X), h(Y ) are
independent.
Proof. For the “only if” direction: Let g, h be measurable functions as in the
proposition. Let A1 ∈ σ(g(X)), A2 ∈ σ(h(Y)). Note that there exist Borel sets B1, B2
such that A1 = X^{−1}g^{−1}(B1) and A2 = Y^{−1}h^{−1}(B2). Since g is measurable, g^{−1}(B1) ∈ Bn,
and similarly h^{−1}(B2) ∈ Bm. Thus, A1 ∈ σ(X) and A2 ∈ σ(Y), so A1, A2 are
independent. Since this holds for all A1, A2, we have that g(X), h(Y) are independent.
For the “if” direction: Let A1 ∈ σ(X), A2 ∈ σ(Y). So A1 = X^{−1}(B1) and A2 =
Y^{−1}(B2) for some B1 ∈ Bn and B2 ∈ Bm. Define g(~x) = 1{~x∈B1} and h(~y) = 1{~y∈B2}.
Since these are measurable functions, and since A1 ∩ A2 = {X ∈ B1, Y ∈ B2}, we have
that
P[A1 ∩A2] = P[g(X) = 1, h(Y ) = 1] = P[g(X) = 1]P[h(Y ) = 1] = P[A1]P[A2].
Since this holds for all A1, A2 we have that X is independent of Y . ut
Example 15.2. The function g(x) = ⌊x⌋ is measurable, since for any b ∈ R,
g^{−1}(−∞, b] = {x : ⌊x⌋ ≤ b} = (−∞, ⌊b⌋ + 1) ∈ B.
The function g(x) = c · x is measurable, since it is continuous. The function g(x) =
max{x, 0} is measurable since g^{−1}(−∞, b] = (−∞, b] if b ≥ 0 and g^{−1}(−∞, b] = ∅ if b < 0.
In a similar way g(x) = max{−x, 0} is measurable.
Thus, if X,Y are independent random variables, then X+ := max{X, 0}, Y+ :=
max{Y, 0} are independent, and X+_n = 2^{−n}⌊2^n X+⌋ is independent of Y+_n = 2^{−n}⌊2^n Y+⌋.
15.2. Simple Random Variables
Let us start with simple random variables:
(This definition is important: every ω ∈ Ω is sent to some non-negative value.)
Definition 15.3. Let (Ω,F ,P) be a probability space. X is a simple random variable
if there exists a finite set of positive numbers A = {a1, . . . , am} ⊂ R such that
Ω = {X = a1} ⊎ · · · ⊎ {X = am} ⊎ {X = 0}.
Specifically, such a X is a discrete random variable. In this case it is intuitive that
the average of X should be Σ_{k=1}^m ak P[X = ak]. Note that
X = Σ_{k=1}^m ak 1{X=ak},
where all ak > 0 and the events X = ak are pairwise disjoint.
We want to say that simple random variables somehow approximate all random vari-
ables, so that we can define the average on these. This is the content of the following
definitions and proposition.
Definition 15.4. Let (Ω,F ,P) be a probability space. A sequence of random variables
(Xn)n is monotone non-decreasing if for all ω ∈ Ω, and all n, Xn(ω) ≤ Xn+1(ω).
Let (Ω,F ,P) be a probability space. Let (Xn)n be a sequence of random variables.
Suppose that for every ω ∈ Ω the limit X(ω) := limn→∞Xn(ω) exists. Then, the
function X is a random variable, since X = lim sup Xn = lim inf Xn, and since for any
Borel set B,
X^{−1}(B) = {ω : lim inf_{n→∞} Xn(ω) ∈ B} = ⋃_n ⋂_{k≥n} {ω : Xk(ω) ∈ B} = lim inf X_n^{−1}(B) ∈ F.
(In fact, lim X_n^{−1}(B) = X^{−1}(B).)
Definition 15.5. Let (Ω,F ,P) be a probability space. Let (Xn)n be a sequence of
random variables. Suppose that for every ω ∈ Ω the limit X(ω) := limn→∞Xn(ω) exists.
Then, the function X is a random variable. In this case we say that Xn converges
everywhere to X.
If in addition, (Xn)n is a monotone non-decreasing sequence, we say that Xn con-
verges everywhere monotonely (from below) to X, and denote this by Xn ↗ X.
Finally, a non-negative random variable X is a random variable satisfying X(ω) ≥ 0
for all ω ∈ Ω. We write this as X ≥ 0. In general, we write X ≤ Y if X(ω) ≤ Y (ω) for
all ω ∈ Ω. (So, for example, a sequence (Xn)n is monotone non-decreasing if Xn ≤ Xn+1
for all n.)
Proposition 15.6. Let (Ω,F ,P) be a probability space, and X a random variable. If
X is non-negative, then there exists a sequence (Xn)n of simple random variables such
that Xn ↗ X.
Proof. For all n, k define An,k := X^{−1}[k2^{−n}, (k + 1)2^{−n}), which is of course an event.
Note that for all n,
Ω = ⊎_{k=0}^{n2^n − 1} An,k ⊎ X^{−1}[n,∞).
For all n ≥ 1 define
Xn = Σ_{k=0}^{n2^n − 1} 2^{−n} k 1{An,k} + n · 1{X ≥ n}.
[That is, we divide the interval [0, n) into intervals of length 2^{−n}, and replace X with
the leftmost value in that interval, cutting off at n. See Figure 4.]
So (Xn)n is a sequence of simple random variables.
So (Xn)n is a sequence of simple random variables.
Note that Xn(ω) = min{n, 2^{−n}⌊2^n X(ω)⌋}. If X(ω) < n then
Xn(ω) = 2^{−n}⌊2^n X(ω)⌋ = 2^{−n}⌊(1/2) · 2^{n+1} X(ω)⌋ ≤ 2^{−(n+1)}⌊2^{n+1} X(ω)⌋ = Xn+1(ω).
(For any y, 2⌊y/2⌋ is an integer that is at most y, hence 2⌊y/2⌋ ≤ ⌊y⌋.) If n ≤ X(ω) < n + 1
then 2^{n+1} n ≤ 2^{n+1} X(ω), so
Xn(ω) = n ≤ 2^{−(n+1)}⌊2^{n+1} X(ω)⌋ = Xn+1(ω).
Finally, if X(ω) ≥ n + 1, then Xn(ω) = n ≤ n + 1 = Xn+1(ω). So Xn(ω) ≤ Xn+1(ω)
for all ω and all n, and (Xn)n is a non-decreasing sequence.
Now to show that limn Xn(ω) = X(ω): for n > X(ω) we have
|X(ω) − Xn(ω)| = |X(ω) − 2^{−n}⌊2^n X(ω)⌋| ≤ 2^{−n}.
Thus, for all ω ∈ Ω, limn→∞ Xn(ω) = X(ω). ut
Figure 4. Xn and Xn+1
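The dyadic approximation in this proof is easy to see in code; a sketch for a single fixed value x (the choice x = π is arbitrary):

```python
import math

def X_n(x, n):
    """The simple approximation X_n(x) = min(n, 2^-n * floor(2^n x)) from the proof, x >= 0."""
    return min(float(n), math.floor(2 ** n * x) / 2 ** n)

x = math.pi  # arbitrary non-negative value
vals = [X_n(x, n) for n in range(1, 20)]
# vals is non-decreasing, stays below x, and converges to x from below
```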
15.3. Expectation of Non-Negative Random Variables
We want to define the expectation on a random variable by approximating by simple
random variables as in Proposition 15.6, and taking the limit. For this we have to make
sure the limit exists and makes sense.
Let us start by defining the expectation for a simple random variable.
Definition 15.7. If X = Σ_{k=1}^m ak 1{X=ak} is a simple random variable, define
E[X] := Σ_{k=1}^m ak P[X = ak].
Proposition 15.8. The above definition does not depend on the choice of the partition
of Ω; that is, if X = Σ_{j=1}^n bj 1{Bj} where Ω = ⊎_{j=1}^n Bj is a partition of Ω, then
E[X] = Σ_{j=1}^n bj P[Bj].
Proof. Suppose that
X = Σ_{k=1}^m ak 1{X=ak} = Σ_{j=1}^n bj 1{Bj},
where Ω = ⊎_{j=1}^n Bj is a partition of Ω as in the statement of the proposition. We
assume without loss of generality that am = 0, so Ω = ⊎_{k=1}^m {X = ak}. Then, for any
k, j, if ω ∈ {X = ak} ∩ Bj then ak = X(ω) = bj. Otherwise, if {X = ak} ∩ Bj = ∅ then
P[X = ak, Bj] = 0. Thus, for all k, j,
ak P[X = ak, Bj] = bj P[X = ak, Bj].
Since Ω = ⊎_{j=1}^n Bj = ⊎_{k=1}^m {X = ak}, by the law of total probability we obtain
E[X] = Σ_{k=1}^m ak P[X = ak] = Σ_{k=1}^m Σ_{j=1}^n ak P[X = ak, Bj]
= Σ_{j=1}^n Σ_{k=1}^m bj P[X = ak, Bj] = Σ_{j=1}^n bj P[Bj]. ut
Proposition 15.9. If X,Y are simple random variables, then:
• If X ≤ Y then E[X] ≤ E[Y ].
• For all a ≥ 0, E[aX + Y ] = aE[X] + E[Y ].
Specifically, we have for a simple random variable X,
E[X] = sup{E[S] : 0 ≤ S ≤ X, S is simple}.
Proof. Assume that X = Σ_k ak 1{X=ak} and Y = Σ_j bj 1{Y=bj}. If ω ∈ Ω is such that
X(ω) = ak and Y(ω) = bj then ak ≤ bj. So ak P[X = ak, Y = bj] ≤ bj P[X = ak, Y = bj].
Thus,
E[X] = Σ_{k,j} ak P[X = ak, Y = bj] ≤ Σ_{k,j} bj P[X = ak, Y = bj] = E[Y].
Now, note that if Z = aX + Y for some a ≥ 0, then
Z = Σ_{k,j} (a · ak + bj) 1{X=ak, Y=bj},
since Ω = ⊎_k {X = ak} ⊎ {X = 0} and Ω = ⊎_j {Y = bj} ⊎ {Y = 0}. So
E[Z] = Σ_{k,j} (a · ak + bj) P[X = ak, Y = bj] = a E[X] + E[Y]. ut
Definition 15.10. If X is a general non-negative random variable, define
E[X] := sup{E[S] : 0 ≤ S ≤ X, S is simple},
(where we allow the value E[X] = ∞, and in this case we say X has infinite expectation).
This definition coincides with the previous definition for simple random variables by
the proposition above.
Note that the definition immediately implies that
0 ≤ X ≤ Y ⇒ E[X] ≤ E[Y],
because the supremum for Y is over a larger set.
15.4. Monotone Convergence
Theorem 15.11 (Monotone Convergence). Let (Xn)n be a non-decreasing sequence of
non-negative random variables. If Xn ↗ X then E[Xn] ↗ E[X].
Proof. The proof has four cases:
• Case 1: Xn are simple, X = 1A.
For this case, let ε > 0, and set An := {Xn > 1 − ε}.
If ω ∈ A then X(ω) = 1, so since Xn(ω) ↗ X(ω), we have Xn(ω) > 1 − ε
for some n. On the other hand, if Xn(ω) > 1 − ε, then X(ω) > 1 − ε and so
X(ω) = 1 and ω ∈ A. Thus, (An)n is an increasing sequence of events, such that
limn An = ⋃n An = A.
Since the Xn are non-negative, Xn ≥ (1 − ε)1{An}. Hence, (1 − ε)P[An] ≤ E[Xn] ≤
E[X] = P[A]. Continuity of probability finishes this case.
• Case 2: Xn, X simple.
Assume X = Σ_{k=0}^m ak 1{X=ak}, where 0 = a0 < a1 < · · · < am and Ω =
{X = a0} ⊎ {X = a1} ⊎ · · · ⊎ {X = am}. Fix k > 0 and consider the sequence
Yn := ak^{−1} 1{X=ak} Xn. Xn ↗ X implies that Yn ↗ 1{X=ak}. By Case 1, since
the Yn are simple, we have that E[Yn] ↗ P[X = ak].
Since Xn ≤ X we have that 0 ≤ Xn 1{X=a0} ≤ X 1{X=0} = 0. Thus, Xn =
Xn · Σ_{k=0}^m 1{X=ak} = Σ_{k=1}^m ak · ak^{−1} 1{X=ak} Xn. By linearity of expectation for
simple random variables,
E[Xn] = Σ_k ak · E[ak^{−1} 1{X=ak} Xn] ↗ Σ_k ak P[X = ak] = E[X].
• Case 3: Xn simple, X general (non-negative).
Let 0 ≤ S ≤ X be a simple random variable, and Yn := min{Xn, S}. Xn ↗ X
implies that Yn ↗ S. So Case 2 and monotonicity of expectation for simple
random variables give that E[Xn] ≥ E[Yn] ↗ E[S]. Thus, limn E[Xn] ≥ E[S] for
all simple S ≤ X. Taking the supremum we are done.
• Case 4: Xn, X general.
Define Yn := min{2^{−n}⌊2^n Xn⌋, n}. So the Yn are all simple, and Yn ≤ Xn ≤ X, and
so E[Yn] ≤ E[Xn] ≤ E[X]. If we show that Yn ↗ X then we are done by Case 3.
As in Proposition 15.6, Yn ≤ min{2^{−(n+1)}⌊2^{n+1} Xn⌋, n + 1} ≤ Yn+1, so (Yn)n
is a non-decreasing sequence.
For all ω and n such that Xn(ω) < n, we have that |Yn(ω) − Xn(ω)| ≤ 2^{−n}.
So, for any ω and any ε > 0, if n > n0(ω, ε) is large enough, then
|Yn(ω) − X(ω)| < ε. That is, Yn ↗ X.
ut
15.5. Linearity
Lemma 15.12. Let X,Y be non-negative random variables. Then, for any a > 0,
• X ≤ Y implies E[X] ≤ E[Y].
• E[aX + Y] = a E[X] + E[Y].
Proof. The first assertion is immediate from the definition, since if 0 ≤ S ≤ X is simple
then 0 ≤ S ≤ Y, so we are taking a supremum over a larger set.
Let Xn ↗ X and Yn ↗ Y be sequences of simple random variables converging
everywhere monotonely to X,Y respectively, as in Proposition 15.6. So aXn + Yn ↗ aX + Y.
Monotone convergence tells us that
E[Xn] ↗ E[X], E[Yn] ↗ E[Y] and E[aXn + Yn] ↗ E[aX + Y].
Since for all n, E[aXn + Yn] = a E[Xn] + E[Yn], we are done. ut
15.6. Expectation
Finally, let us define the expectation of a general random variable.
First, some notation. Let X be a random variable. Define X+ := max{X, 0} and
X− := max{−X, 0}. Note that X+, X− are non-negative random variables, and that
X = X+ − X− and |X| = X+ + X−.
Definition 15.13. Let X be a random variable. If E[X+] = E[X−] = ∞ we say that
the expectation of X is not defined (or X does not have expectation). If at least one of
E[X+],E[X−] is finite then define the expectation of X by
E[X] := E[X+]− E[X−].
15.6.1. Properties of Expectation. The most important property of expectation is
linearity:
Theorem 15.14. Let X,Y be random variables on some probability space (Ω,F ,P) such
that E[X],E[Y ] exist. Then:
• For all a ∈ R, E[aX + Y ] = aE[X] + E[Y ].
• X ≤ Y implies that E[X] ≤ E[Y ].
• |E[X]| ≤ E[|X|].
Proof. For linearity, note that (X+Y )+−(X+Y )− = X+Y = X+−X−+Y +−Y −, so
(X+Y )+ +X−+Y − = (X+Y )−+X+ +Y +. Since both sides are linear combinations
of non-negative random variables,
E(X + Y )+ + EX− + EY − = E(X + Y )− + EX+ + EY +
which implies that
E[X + Y ] = E(X + Y )+ − E(X + Y )− = EX+ − EX− + EY + − EY − = EX + EY.
Also, for any a > 0, (aX)+ = aX+ and (aX)− = aX−. If a < 0 then (aX)+ = −aX−
and (aX)− = −aX+. Thus, in both cases E[aX] = aE[X]. This proves linearity.
For the second assertion, let Z = Y − X. So Z ≥ 0, and thus E[Z] ≥ 0. Since
E[Z] = E[Y ]− E[X] by linearity, we are done.
Finally, for the last assertion
|E[X]| = |E[X+]− E[X−]| ≤ E[X+] + E[X−] = E[|X|],
where we have used the fact that X+ ≥ 0 and X− ≥ 0. ut
Lecture 16
16.1. Expectation - Discrete Case
Proposition 16.1. Let X be a discrete random variable, with range R and density fX.
Then,
E[X] = Σ_{r∈R} r fX(r).
Proof. For all N, let
X+_N := Σ_{r∈R∩[0,N]} r 1{X=r} and X−_N := −Σ_{r∈R∩[−N,0]} r 1{X=r}.
Note that X+_N ↗ X+ and X−_N ↗ X−. Moreover, by linearity,
E[X+_N] = Σ_{r∈R∩[0,N]} r P[X = r] and E[X−_N] = −Σ_{r∈R∩[−N,0]} r P[X = r].
Using monotone convergence we get that
E[X] = E[X+] − E[X−] = lim_{N→∞} (E[X+_N] − E[X−_N]) = Σ_{r∈R} r fX(r). ut
Example 16.2. Let us calculate the expectations of different discrete random variables:
• If X ∼ Ber(p) then
E[X] = 1 · P[X = 1] + 0 · P[X = 0] = p.
• If X ∼ Bin(n, p) then, since \binom{n}{k} · k = n · \binom{n−1}{k−1} for 1 ≤ k ≤ n,
E[X] = Σ_{k=0}^n fX(k) · k = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k} · k
= np · Σ_{k=1}^n \binom{n−1}{k−1} p^{k−1} (1 − p)^{n−k} = np.
An easier way would be to note that X = Σ_{k=1}^n Xk where Xk ∼ Ber(p) (and in
fact X1, . . . , Xn are independent). Thus, by linearity, E[X] = Σ_{k=1}^n E[Xk] = np.
• For X ∼ Poi(λ):
E[X] = Σ_{k=0}^∞ e^{−λ} λ^k/k! · k = λ · Σ_{k=1}^∞ e^{−λ} λ^{k−1}/(k − 1)! = λ.
• For X ∼ Geo(p): Note that ∂/∂p (−(1 − p)^k) = k(1 − p)^{k−1}.
E[X] = Σ_{k=1}^∞ fX(k) · k = Σ_{k=1}^∞ (1 − p)^{k−1} p · k
= −p · Σ_{k=1}^∞ ∂/∂p (1 − p)^k = −p · ∂/∂p (Σ_{k=1}^∞ (1 − p)^k)
= −p · ∂/∂p (1/p − 1) = p · 1/p² = 1/p.
Another way: Let E = E[X]. Then,
E = p + Σ_{k=2}^∞ (1 − p)^{k−1} p k = p + Σ_{k=2}^∞ (1 − p)^{k−1} p (k − 1) + Σ_{k=2}^∞ (1 − p)^{k−1} p
= p + (1 − p)E + 1 − p.
So E = 1 + (1 − p)E, and E = 1/p.
Example 16.3. A pair of independent fair dice are tossed. What is the expected
number of tosses needed to see Shesh-Besh?
Note that each toss of the dice is an independent trial, such that the probability of
seeing Shesh-Besh is 2/36 = 1/18. So if X is the number of tosses until Shesh-Besh, then
X ∼ Geo(1/18). Thus, E[X] = 18.
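A simulation sketch of this example (the seed and trial count are arbitrary choices):

```python
import random

random.seed(2)

def tosses_until_shesh_besh():
    """Toss two dice until one shows 5 and the other 6; return the number of tosses."""
    n = 0
    while True:
        n += 1
        if sorted((random.randint(1, 6), random.randint(1, 6))) == [5, 6]:
            return n

trials = 30_000
mean = sum(tosses_until_shesh_besh() for _ in range(trials)) / trials
# mean should be close to E[Geo(1/18)] = 18
```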
16.1.1. Function of a random variable. Let g : Rd → R be a measurable function.
Let (X1, . . . , Xd) be a joint distribution of d discrete random variables with range R
each. Then, Y = g(X1, . . . , Xd) is a random variable. What is its expectation?
Well, first we need the density of Y: For any y ∈ R we have that
P[Y = y] = P[(X1, . . . , Xd) ∈ g^{−1}(y)] = Σ_{(r1,...,rd)∈Rd∩g^{−1}(y)} P[(X1, . . . , Xd) = (r1, . . . , rd)].
So, if RY := g(Rd), then since
Rd = ⊎_{y∈RY} Rd ∩ g^{−1}(y),
and since Y is discrete, we get that
E[g(X1, . . . , Xd)] = Σ_{y∈RY} y P[Y = y]
= Σ_{y∈RY} Σ_{~r∈Rd∩g^{−1}(y)} f(X1,...,Xd)(~r) · g(~r)
= Σ_{~r∈Rd} f(X1,...,Xd)(~r) · g(~r).
Specifically,
Proposition 16.4. Let g : Rd → R be a measurable function, and X = (X1, . . . , Xd) a
discrete joint random variable with range Rd. Then,
E[g(X)] = Σ_{~r∈Rd} fX(~r) · g(~r).
Example 16.5. Manchester United plans to earn some money selling Wayne Rooney
jerseys. Each jersey costs the club x pounds, and is sold for y > x pounds. Suppose
that the number of people who want to buy jerseys is a discrete random variable with
range N.
What is the expected profit if the club orders N jerseys?
How many jerseys should be ordered in order to maximize this profit?
Solution. Let pk = fX(k) = P[X = k]. Let gN : N → R be the function that takes the
number of people who want to buy jerseys and gives the profit if the club ordered N
jerseys. That is,
gN(k) = ky − Nx for k ≤ N, and gN(k) = N(y − x) for k > N.
The expected profit is then
E[gN(X)] = Σ_{k=0}^∞ pk gN(k) = Σ_{k=0}^∞ pk N(y − x) − Σ_{k=0}^N pk (N − k) y = N(y − x) − Σ_{k=0}^N pk (N − k) y.
Call this G(N) := N(y − x) − Σ_{k=0}^N pk (N − k) y. We want to maximize this as a function
of N. Note that
G(N + 1) − G(N) = y − x − Σ_{k=0}^N y pk = y − x − y P[X ≤ N].
So G(N + 1) > G(N) as long as P[X ≤ N] < (y − x)/y, so the club should order N + 1 jerseys
for the largest N such that P[X ≤ N] < (y − x)/y.
Example 16.6. Some more calculations:
• X ∼ H(N, m, n) (recall that X is the number of “special” objects chosen, when
choosing uniformly n objects out of N objects, with a total of m special objects).
We use the facts that
\binom{m}{k} · k = \binom{m−1}{k−1} · m and \binom{N}{n} = \binom{N−1}{n−1} · N/n.
E[X] = \binom{N}{n}^{−1} · Σ_{k=0}^{n∧m} k · \binom{m}{k} · \binom{N−m}{n−k}
= \binom{N−1}{n−1}^{−1} · (n/N) · m · Σ_{k=1}^{n∧m} \binom{m−1}{k−1} · \binom{(N−1)−(m−1)}{(n−1)−(k−1)} = nm/N.
• X ∼ NB(m, p) (the number of trials until the m-th success). Using
\binom{k−1}{m−1} · k = \binom{k}{m} · m,
with j = k + 1 and n = m + 1,
E[X] = Σ_{k=m}^∞ k · \binom{k−1}{m−1} p^m (1 − p)^{k−m} = m p^{−1} · Σ_{j=n}^∞ \binom{j−1}{n−1} p^n (1 − p)^{j−n} = m/p.
Lecture 17
17.1. Expectation - Continuous Case
Our goal is now to prove the following theorem:
Theorem 17.1. Let X = (X1, . . . , Xd) be an absolutely continuous random variable,
and let g : Rd → R be a measurable function. Then,
E[g(X)] = ∫_{Rd} g(~x) fX(~x) d~x.
The main lemma here is
Lemma 17.2. Let X = (X1, . . . , Xd) be an absolutely continuous random variable.
Then, for any Borel set B ∈ Bd,
P[X ∈ B] = ∫_B fX(~x) d~x.
Proof. Let Q(B) = ∫_B fX(~x) d~x. Then Q is a probability measure on (Rd,Bd) that
coincides with PX on the π-system of rectangles (−∞, b1] × · · · × (−∞, bd]. Thus,
P[X ∈ B] = PX(B) = Q(B) for all B ∈ Bd. ut
Remark 17.3. We have not really defined the integral∫B fX(~x)d~x. However, for our
purposes, we can define it as P[X ∈ B], and note that this coincides with the Riemann
integral on intervals.
Proof of Theorem 17.1. First assume that g ≥ 0, so Rd = g^{−1}[0,∞). For all n define
Yn = 2^{−n}⌊2^n g(X)⌋, which are discrete non-negative random variables.
First, we show that
E[Yn] = ∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x.
Indeed, for n, k ≥ 0 let Bn,k = g^{−1}[2^{−n} k, 2^{−n}(k + 1)) ∈ Bd. Note that
Yn = Σ_{k=0}^∞ 2^{−n} k 1{X∈Bn,k} and E[Yn] = Σ_{k=0}^∞ 2^{−n} k P[X ∈ Bn,k].
Now, since
Rd = g^{−1}[0,∞) = ⊎_{k=0}^∞ g^{−1}[2^{−n} k, 2^{−n}(k + 1)) = ⊎_{k=0}^∞ Bn,k,
and since
1{Bn,k} 2^{−n}⌊2^n g⌋ fX = 1{Bn,k} 2^{−n} k fX,
we have by the lemma above that
∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x = Σ_{k=0}^∞ 2^{−n} k ∫_{Bn,k} fX(~x) d~x
= Σ_{k=0}^∞ 2^{−n} k P[X ∈ Bn,k] = E[Yn].
Here we have used the fact that
Σ_{k=0}^N 1{Bn,k} 2^{−n} k ↗ Σ_{k=0}^∞ 1{Bn,k} 2^{−n} k.
Since |2^{−n}⌊2^n g⌋ − g| ≤ 2^{−n}, we get that |Yn − g(X)| ≤ 2^{−n}. Thus,
|E[g(X)] − ∫_{Rd} g(~x) fX(~x) d~x|
≤ |E[g(X)] − E[Yn]| + |∫_{Rd} 2^{−n}⌊2^n g(~x)⌋ fX(~x) d~x − ∫_{Rd} g(~x) fX(~x) d~x| ≤ 2 · 2^{−n} → 0.
This proves the theorem for non-negative functions g.
Now, if g is a general measurable function, write g = g+ − g−. Since g+, g− are
non-negative, we have that
E[g(X)] = E[g+(X) − g−(X)] = E[g+(X)] − E[g−(X)] = ∫_{Rd} (g+(~x) − g−(~x)) fX(~x) d~x. ut
Corollary 17.4. Let X be an absolutely continuous random variable with density fX.
Then,
E[X] = ∫_{−∞}^∞ t fX(t) dt.
Compare ∫ t fX(t) dt to Σ_r r P[X = r] in the discrete case. This is another place
where fX is “like” P[X = ·] (although the latter is identically 0 in the continuous case,
as we have seen).
Example 17.5. Expectations of some absolutely continuous random variables:
• X ∼ U[0, 1]: E[X] = ∫_0^1 t dt = 1/2.
• More generally, X ∼ U[a, b]:
E[X] = ∫_a^b t · (1/(b − a)) dt = (b² − a²)/(2(b − a)) = (a + b)/2.
• X ∼ Exp(λ): We use integration by parts, since ∫ e^{−λt} dt = −λ^{−1} e^{−λt}:
E[X] = ∫_0^∞ t · λ e^{−λt} dt = −t e^{−λt} |_0^∞ + ∫_0^∞ e^{−λt} dt = 1/λ.
• X ∼ N(µ, σ):
E[X] = ∫_{−∞}^∞ (t/(√(2π) σ)) exp(−(t − µ)²/(2σ²)) dt.
Change u = t − µ, so du = dt:
E[X] = (1/(√(2π) σ)) · ∫_{−∞}^∞ u exp(−u²/(2σ²)) du + ∫_{−∞}^∞ µ fX(t) dt.
Since the function u ↦ u exp(−u²/(2σ²)) is an odd function, its integral is 0, so
E[X] = µ ∫_{−∞}^∞ fX(t) dt = µ.
A simpler way is to notice that (X − µ)/σ ∼ N(0, 1), so
E[(X − µ)/σ] = (1/√(2π)) ∫_{−∞}^∞ t e^{−t²/2} dt = 0,
as an integral of an odd function. E[X] = µ then follows from linearity of expectation.
Proposition 17.6. Let X be an absolutely continuous random variable such that E[X]
exists. Then,
E[X] = ∫_0^∞ (P[X > t] − P[X ≤ −t]) dt = ∫_0^∞ (1 − FX(t) − FX(−t)) dt.
Proof. Note that
∫_0^∞ P[X > t] dt = ∫_0^∞ ∫_t^∞ fX(s) ds dt = ∫_0^∞ ∫_0^s dt fX(s) ds = ∫_0^∞ s fX(s) ds.
Similarly,
∫_0^∞ P[X ≤ −t] dt = ∫_0^∞ ∫_{−∞}^{−t} fX(s) ds dt = ∫_{−∞}^0 ∫_0^{−s} dt fX(s) ds = ∫_{−∞}^0 (−s) fX(s) ds.
Subtracting the two we obtain the result. ut
In a similar way we can prove
Exercise 17.7. Let X be a discrete random variable with range Z such that E[X] exists.
Then,
E[X] = Σ_{k=0}^∞ (P[X > k] − P[X < −k]).
Example 17.8. Let X ∼ N(0, 1). Compute E[X²].
By the above,
E[X²] = (1/√(2π)) ∫_R x² e^{−x²/2} dx = −(1/√(2π)) x e^{−x²/2} |_{−∞}^∞ + (1/√(2π)) ∫_R e^{−x²/2} dx = 1,
where we have used integration by parts, with ∂/∂x e^{−x²/2} = −x e^{−x²/2}.
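A numeric check of E[X²] = 1 via a midpoint Riemann sum (the truncation at ±10 and the step count are arbitrary; the discarded tail is negligible):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# midpoint Riemann sum for E[X^2] = ∫ x^2 phi(x) dx, truncated to [-10, 10]
a, b, steps = -10.0, 10.0, 400_000
h = (b - a) / steps
second_moment = h * sum((a + (i + 0.5) * h) ** 2 * phi(a + (i + 0.5) * h)
                        for i in range(steps))
```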
17.2. Examples Using Linearity
Example 17.9 (Coupon Collector). Gilad collects “super-goal” cards. There are N
cards to collect altogether. Each time he buys a card, he gets one of the N uniformly
at random, independently.
What is the expected amount of cards Gilad needs to buy in order to collect all cards?
For k = 0, 1, . . . , N − 1, let Tk be the number of cards bought after getting the k-th
new card, until getting the (k + 1)-th new card. That is, when Gilad has k different
cards, he buys Tk more cards until he has k + 1 different cards.
If Gilad has k different cards, then with probability (N − k)/N he buys a card he does not
already have. So Tk ∼ Geo((N − k)/N).
Since the total number of cards Gilad buys until getting all N cards is T = T0 + T1 +
T2 + · · · + TN−1, using linearity of expectation,
E[T] = E[T0] + E[T1] + · · · + E[TN−1] = 1 + N/(N − 1) + · · · + N = N · Σ_{k=1}^N 1/k.
454
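The identity E[T] = N · Σ_{k=1}^N 1/k can be checked empirically. A small simulation sketch (the deck size, trial count, and seed are illustrative choices):

```python
import random

def coupon_collector_draws(n_cards, rng):
    """Buy uniformly random cards until all n_cards distinct cards are seen."""
    seen, draws = set(), 0
    while len(seen) < n_cards:
        seen.add(rng.randrange(n_cards))
        draws += 1
    return draws

def expected_draws(n_cards):
    """The formula derived above: N * (1 + 1/2 + ... + 1/N)."""
    return n_cards * sum(1.0 / k for k in range(1, n_cards + 1))
```

Averaging `coupon_collector_draws` over many trials approaches `expected_draws`.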
Example 17.10 (Permutation fixed points). n soldiers receive packages from their
families. The packages get mixed up at the post office, so that each ordering is equally
likely. Let X be the number of soldiers that receive the correct package from their own
family. What is the expectation of X?
Let us solve this by writing X as a sum. Let X_i be the indicator of the event that
soldier i receives the correct package. What is the probability that X_i = 1? There are
(n−1)! possible orderings for which this may happen, so E[X_i] = (n−1)!/n! = 1/n.
Note that X = Σ_{i=1}^n X_i, so by linearity, E[X] = n · 1/n = 1.
Note also that the X_i's are not independent; for example, if i ≠ j,
P[X_i = 1, X_j = 1] = (n−2)!/n! = 1/(n(n−1)) ≠ 1/n².
454
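The expectation E[X] = 1, independent of n, can be observed in simulation. A quick sketch (the values of n, the trial count, and the seed are illustrative):

```python
import random

def avg_fixed_points(n, trials, rng):
    """Average number of soldiers receiving their own package over random orderings."""
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials
```

For any n, the average hovers near 1.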
Example 17.11. We toss a die 100 times. What is the expected sum of all tosses?
Here linearity really pays off. If X_k is the outcome of the k-th toss, and X = Σ_{k=1}^{100} X_k, then
E[X] = Σ_{k=1}^{100} E[X_k] = 100 · 7/2 = 350.
454
Example 17.12. 200 random numbers are output by a computer, each distributed
uniformly on [0, 1]. What is their expected sum?
E[X] = 200 · E[U[0, 1]] = 200 · 1/2 = 100.
454
Example 17.13. Let X_n ∼ U[0, 2^{−n}], for n ≥ 0, and let S_N = Σ_{k=0}^N X_k. What is the
expectation of S_N?
Linearity of expectation gives
E[S_N] = Σ_{k=0}^N E[X_k] = Σ_{k=0}^N 2^{−(k+1)} = 1 − 2^{−(N+1)}.
Note that if S_∞ = Σ_{k=0}^∞ X_k, then S_N ↑ S_∞, so monotone convergence gives that
E[S_∞] = 1. 454
17.3. A formula for general random variables
Theorem 17.14. Let X be a random variable. Then
E[X⁺] = ∫_0^∞ (1 − F_X(t)) dt and E[X⁻] = ∫_0^∞ F_X(−t) dt.
Thus, if E[X] exists then
E[X] = ∫_0^∞ (1 − F_X(t) − F_X(−t)) dt.
Proof. First suppose that X ≥ 0 and discrete. Then, if the range of X is R_X =
{0 = r_0 < r_1 < r_2 < · · ·}, we have for any t ∈ [r_j, r_{j+1}),
P[X > t] = P[X ≥ r_{j+1}] = Σ_{n=j+1}^∞ P[X = r_n].
Thus,
∫_{r_j}^{r_{j+1}} (1 − F_X(t)) dt = (r_{j+1} − r_j) · Σ_{n=j+1}^∞ P[X = r_n].
Summing over j and interchanging the sums, we have
∫_0^∞ (1 − F_X(t)) dt = Σ_{j=0}^∞ ∫_{r_j}^{r_{j+1}} (1 − F_X(t)) dt = Σ_{j=0}^∞ Σ_{n=j+1}^∞ (r_{j+1} − r_j) · P[X = r_n]
= Σ_{n=1}^∞ P[X = r_n] Σ_{j=0}^{n−1} (r_{j+1} − r_j) = Σ_{n=1}^∞ r_n P[X = r_n] = E[X].
Since for t > 0 the positivity of X tells us that F_X(−t) = 0, we have the theorem for
discrete non-negative random variables.
We also have the formula: for t ∈ (r_j, r_{j+1}],
P[X ≥ t] = P[X ≥ r_{j+1}] = Σ_{n=j+1}^∞ P[X = r_n].
Thus,
∫_{r_j}^{r_{j+1}} P[X ≥ t] dt = (r_{j+1} − r_j) · Σ_{n=j+1}^∞ P[X = r_n] = ∫_{r_j}^{r_{j+1}} P[X > t] dt,
so
(17.1)  E[X] = ∫_0^∞ P[X > t] dt = ∫_0^∞ P[X ≥ t] dt.
Now, if X ≥ 0 is any general non-negative random variable, let X_n := 2^{−n} ⌊2^n X⌋. So
X_n is discrete, non-negative and |X_n − X| ≤ 2^{−n}. Also, because X_n ≤ X, for any t > 0,
|F_{X_n}(t) − F_X(t)| = |P[X_n ≤ t] − P[X ≤ t]| = P[X_n ≤ t, X > t]
≤ P[t < X ≤ t + 2^{−n}] = F_X(t + 2^{−n}) − F_X(t) → 0,
as n → ∞ by right continuity. Since X_n ≤ X_{n+1}, we get that 1 − F_{X_n}(t) ↑ 1 − F_X(t),
and by monotone convergence,
E[X_n] = ∫_0^∞ (1 − F_{X_n}(t)) dt ↑ ∫_0^∞ (1 − F_X(t)) dt.
However, since |E[X_n] − E[X]| ≤ 2^{−n}, we obtain the theorem for general non-negative
random variables.
Finally, if X = X⁺ − X⁻ is any random variable, we have for all t ≥ 0 that F_{X⁺}(t) =
P[X⁺ ≤ t] = P[0 ≤ X ≤ t] + P[X < 0] = P[X ≤ t] = F_X(t), and since X⁺ ≥ 0,
E[X⁺] = ∫_0^∞ (1 − F_X(t)) dt.
Also, since X⁻ = (−X)⁺, for all t ≥ 0 we have P[X⁻ ≥ t] = P[−X ≥ t] = F_X(−t), so by (17.1),
E[X⁻] = ∫_0^∞ P[X⁻ ≥ t] dt = ∫_0^∞ F_X(−t) dt.
ut
Example 17.15. Let X be a random variable with distribution function
F_X(t) = 0 for t < 0;  (t+1)/2 for 0 ≤ t < 1/2;  7/8 for 1/2 ≤ t < 10;  1 for t ≥ 10.
Is X discrete? Is X absolutely continuous?
X cannot be absolutely continuous, because it is not even continuous: 0, 1/2, 10
are discontinuity points of F_X.
X cannot be discrete, because if r is such that P[X = r] > 0 then r is a discontinuity
point of F_X, so such r can only be 0, 1/2, 10. However,
P[X ∉ {0, 1/2, 10}] ≥ P[1/8 < X ≤ 1/4] = F_X(1/4) − F_X(1/8) = 5/8 − 9/16 = 1/16 > 0.
Finally, let us compute the expectation of X. Since X ≥ 0, we have that X = X⁺, so
E[X] = ∫_0^∞ (1 − F_X(t)) dt = ∫_0^{1/2} (1−t)/2 dt + ∫_{1/2}^{10} 1/8 dt
= (1/2)(t − t²/2) |_0^{1/2} + (1/8) · (19/2) = 1/4 − 1/16 + 19/16 = 11/8.
454
Lecture 18
18.1. Jensen’s Inequality
Recall that a function g : R → R is convex on [a, b] if for any 0 < λ < 1 and any
x, y ∈ [a, b],
g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y).
For example, any twice differentiable function with g′′ ≥ 0 is convex, e.g. x² and
−log x.
Convex functions are continuous, and so always measurable.
First a small lemma regarding convex functions:
convex
funcs lie
above tan-
gentsLemma 18.1. Let a < m < b, and let g : [a, b] → R be convex. Then, there exist
A,B ∈ R such that g(m) = Am+B and for all x ∈ [a, b], g(x) ≥ Ax+B.
Proof. For all a ≤ x < m < y ≤ b we have m = λx+ (1− λ)y where λ = y−my−x ∈ (0, 1).
Thus, g(m) ≤ λg(x) + (1 − λ)g(y) which implies that g(m)−g(x)m−x ≤ g(y)−g(m)
y−m . Thus,
taking supx<mg(m)−g(x)m−x ≤ A ≤ infy>m
g(y)−g(m)y−m , we have that for all x < y ∈ [a, b],
g(x) ≥ A(x−m) + g(m) = Ax+ (g(m)−Am). ut
Theorem 18.2 (Jensen’s Inequality). Let g : R → R be convex on [a, b]. Let X be a
random variable such that P[a ≤ X ≤ b] = 1. Then,
g(E[X]) ≤ E[g(X)].
Johan Jensen
(1859–1925)
Proof. If X is constant, then there is nothing to prove, so assume X is not constant.
Thus, a < E[X] < b. Let m = E[X]. By Lemma 18.1, g(X) ≥ AX + B and g(E[X]) = A E[X] + B
for some A, B ∈ R.
[Figure 5. A convex function lies above some linear function at an inner point m.]
Specifically, since g(X)⁻ ≤ |A||X| + |B| ≤ |A| max{|a|, |b|} + |B|, we
get that E[g(X)⁻] < ∞, so E[g(X)] is well defined.
Now, since AX + B ≤ g(X),
g(E[X]) = A E[X] + B = E[AX + B] ≤ E[g(X)].
ut
Example 18.3 (Arithmetic-geometric mean inequality). Let a_1, . . . , a_n be n positive numbers. Let X be a discrete random
variable with density f_X(a_k) = 1/n.
Note that the function g(x) = −log x is a convex function on (0, ∞) (since g′′(x) =
x^{−2} > 0). So Jensen's inequality gives that
−log((1/n) Σ_{k=1}^n a_k) = −log(E[X]) ≤ E[−log(X)] = −(1/n) Σ_{k=1}^n log a_k.
Exponentiating, we get
(1/n) Σ_{k=1}^n a_k ≥ (Π_{k=1}^n a_k)^{1/n}.
This is the arithmetic-geometric mean inequality. 454
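The inequality is easy to probe numerically; the gap between the two means is always nonnegative and vanishes exactly when all the a_k are equal. A small sketch:

```python
import math

def amgm_gap(values):
    """Arithmetic mean minus geometric mean; nonnegative by the inequality above."""
    n = len(values)
    arithmetic = sum(values) / n
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    return arithmetic - geometric
```

For distinct positive inputs the gap is strictly positive; for equal inputs it is (numerically) zero.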
Example 18.4. Let p ≥ 1. The function g(x) = |x|^p is convex, because away from
0, g′′(x) = p(p − 1)|x|^{p−2} ≥ 0, and at 0 we have g(0) = 0 ≤ λg(x) + (1 − λ)g(y) for any
0 < λ < 1 and x, y.
Thus, E[|X|^p] ≥ |E[X]|^p for any random variable X such that E[X] is defined.
Furthermore, for 1 ≤ q < p, if E[|X|^p] < ∞, then
E[|X|^q] = ((E[|X|^q])^{p/q})^{q/p} ≤ (E[|X|^p])^{q/p} < ∞,
since the function x ↦ |x|^{p/q} is convex. 454
18.2. Moments
Definition 18.5. Let X be a random variable on a probability space (Ω, F, P).
The n-th moment of X is defined to be E[X^n], if this exists and is finite. If this
expectation does not exist (or is infinite), then we say that X does not have an n-th
moment.
Definition 18.6. Let X be a random variable on a probability space (Ω, F, P), such
that |E[X]| < ∞.
The variance of X is defined to be
Var[X] := E[|X − E[X]|²].
The standard deviation of X is defined to be √Var[X].
Proposition 18.7. X has a second moment if and only if |E[X]| < ∞ and Var[X] < ∞.
In fact, Var[X] = E[X²] − E[X]².
Proof. Using linearity of expectation, for any a ∈ R,
E[(X − a)²] = E[X²] − 2a E[X] + a².
So E[(X − a)²] < ∞ if and only if −∞ < E[X] < ∞ and E[X²] < ∞. Taking a = E[X]
gives Var[X] = E[X²] − 2 E[X]² + E[X]² = E[X²] − E[X]². ut
Example 18.8. Let us calculate some second moments and variances:
• X ∼ Ber(p):
E[X²] = P[X = 1] · 1² + P[X = 0] · 0² = p. So Var[X] = E[X²] − E[X]² = p(1 − p).
116
• X ∼ Bin(n, p):
We use the identity k(k − 1)(n choose k) = n(n − 1)(n−2 choose k−2), valid for all 2 ≤ k ≤ n.
E[X(X − 1)] = Σ_{k=0}^n k(k − 1) P[X = k] = Σ_{k=2}^n n(n − 1)(n−2 choose k−2) p^{k−2} p² (1 − p)^{n−k}
= p² n(n − 1) · Σ_{k=0}^{n−2} (n−2 choose k) p^k (1 − p)^{n−2−k} = p²n² − p²n.
By linearity of expectation,
p² n(n − 1) = E[X²] − E[X] = E[X²] − np.
So E[X²] = p² n(n − 1) + np and Var[X] = np − p²n = np(1 − p).
• X ∼ Geo(p):
E[X²] = Σ_{k=1}^∞ k² (1 − p)^{k−1} p = p + Σ_{k=2}^∞ (k − 1 + 1)² (1 − p)^{k−2} p (1 − p)
= p + (1 − p) Σ_{k=1}^∞ (k + 1)² (1 − p)^{k−1} p = p + (1 − p) E[(X + 1)²]
= p + (1 − p) E[X²] + (1 − p) + 2(1 − p) E[X] = 1 + 2(1 − p)/p + (1 − p) E[X²].
So
E[X²] = 1/p + 2(1 − p)/p² = (2 − p)/p².
Thus
Var[X] = E[X²] − E[X]² = (1 − p)/p².
• X ∼ Poi(λ):
E[X²] = Σ_{k=0}^∞ k² · e^{−λ} λ^k / k!
= λ · Σ_{k=1}^∞ (k − 1 + 1) e^{−λ} λ^{k−1} / (k − 1)! = λ · Σ_{k=0}^∞ (k + 1) e^{−λ} λ^k / k!
= λ · E[X + 1] = λ(λ + 1).
Thus, Var[X] = E[X²] − E[X]² = λ.
• X ∼ Exp(λ): Use integration by parts:
E[X²] = ∫_0^∞ t² λe^{−λt} dt = −t² e^{−λt} |_0^∞ + ∫_0^∞ 2t e^{−λt} dt
= (2/λ) E[X] = 2/λ².
So Var[X] = 2/λ² − 1/λ² = 1/λ².
454
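The variance formulas above lend themselves to Monte Carlo checks. A sketch for the geometric and exponential cases (sample sizes and seed are illustrative choices):

```python
import random
import statistics

def empirical_variance(draw, n_samples, rng):
    """Population variance of n_samples i.i.d. draws from draw(rng)."""
    return statistics.pvariance([draw(rng) for _ in range(n_samples)])

def draw_geometric(p, rng):
    """Number of Ber(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k
```

For Geo(p) the empirical variance approaches (1−p)/p², and for Exp(λ) (via `rng.expovariate`) it approaches 1/λ².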
Exercise 18.9. Let X be a random variable with Var[X] < ∞. Show that for any
a, b ∈ R,
Var[aX + b] = a2 Var[X].
Solution. Let µ = E[X]. So E[aX + b] = aµ+ b. Then,
Var[aX + b] = E[(aX + b − (aµ + b))²] = E[(aX − aµ)²] = a² Var[X].
ut
Example 18.10. • X ∼ U[a, b]: Y = (X − a)/(b − a) ∼ U[0, 1]. (Prove this!) So
E[Y²] = ∫_0^1 s² ds = 1/3.
Thus Var[Y] = 1/3 − 1/4 = 1/12. We have that
Var[X] = Var[(b − a)Y + a] = (b − a)² Var[Y] = (b − a)²/12.
• X ∼ N(µ, σ): Recall that Y := (X − µ)/σ has N(0, 1) distribution. Thus, since
X = σY + µ, we have that Var[X] = σ² Var[Y].
Now, using integration by parts, since ∂/∂x e^{−x²/2} = −x e^{−x²/2},
Var[Y] = E[Y²] = 1/√(2π) · ∫_{−∞}^∞ t² e^{−t²/2} dt
= −1/√(2π) · t e^{−t²/2} |_{−∞}^∞ + 1/√(2π) · ∫_{−∞}^∞ e^{−t²/2} dt = 1.
So Var[X] = σ². (Note that this implies that E[X²] = σ² + µ².)
454
Lecture 19
19.1. Covariance
Definition 19.1. Let X,Y be random variables on some probability space (Ω,F ,P)
such that E[X],E[Y ],E[XY ] are finite. The covariance of X and Y is defined to be
Cov[X,Y ] := E[(X − E[X]) · (Y − E[Y ])].
The following are immediate:
Proposition 19.2. Let X,Y, Z be random variables on some probability space (Ω,F ,P).
Then,
• Cov[X,Y ] = E[XY ]− E[X]E[Y ].
• Cov[X,Y ] = Cov[Y,X].
• Cov[aX + Y,Z] = aCov[X,Z] + Cov[Y, Z].
• Cov[X,X] = Var[X] ≥ 0.
Each equality holds whenever the relevant covariances are defined.
Note that Cov is almost an inner product on the vector space of random variables on
(Ω,F ,P) with finite expectation.
19.2. Inner Products and Cauchy-Schwarz
Reminder:
Let V be a vector space over R. An inner-product on V is a function < ·, · > : V × V → R such that
• Symmetry: For all v, u ∈ V: < u, v > = < v, u >.
• Linearity: For all α ∈ R and all v, u, w ∈ V: < αu + v, w > = α < u, w > + < v, w >.
• Positivity: For all v ∈ V: < v, v > ≥ 0.
• Definiteness: For all v ∈ V: < v, v > = 0 if and only if v = 0.
For our purposes, we would like to replace the last condition by
• Definiteness′: For all v ∈ V: < v, v > = 0 if and only if < v, u > = 0 for all u ∈ V.
A basic, but super important and fundamental result is the Cauchy-Schwarz inequality.
(Denote ||v|| = √< v, v >.)
Augustin-Louis Cauchy
(1789–1857)
Theorem 19.3 (Cauchy-Schwarz Inequality). Let < ·, · > be an inner product on V
with the above modified definiteness condition. For all v, u ∈ V we have
| < v, u > | ≤ ||v|| · ||u||.
Proof. If ||u|| = 0 then both sides are 0 (because of the modified definiteness condition),
so we can assume without loss of generality that ||u|| > 0. Set λ = ||u||^{−2} < v, u >.
Then,
0 ≤ < v − λu, v − λu > = < v, v > + λ² < u, u > − 2λ < v, u > = ||v||² − ||u||^{−2} < v, u >².
Multiplying by ||u||² completes the proof. ut
Hermann Schwarz
(1843–1921)
19.2.1. Two Inner Products. Let (Ω,F ,P) be a probability space. Consider the
following vector space over R: Let L2(Ω,F ,P) be the space of all random variables X
with finite second moment.
In the exercises we show that if Cov[X,X] = 0 then Cov[X,Y] = 0 for all Y, and if
E[X²] = 0 then E[XY] = 0 for all Y. Thus, both Cov[X,Y] and E[XY]
form (modified) inner products on L²(Ω, F, P).
Note: in order to get honest-to-goodness inner products, one needs to take random
variables modulo measure 0 (for E[XY]), and modulo measure 0 and additive constants
for Cov.
We conclude:
Theorem 19.4 (Cauchy-Schwarz inequality). Let X,Y be random variables with finite
second moment. Then,
|Cov[X,Y]|² ≤ Var[X] · Var[Y] and |E[XY]|² ≤ E[X²] · E[Y²].
19.3. Correlation
Definition 19.5. If X,Y are random variables such that Cov[X,Y ] = 0 we say that
X,Y are uncorrelated. X1, . . . , Xn are said to be pairwise uncorrelated if for any
k 6= j, Xk, Xj are uncorrelated.
We will soon see that if X, Y are independent, they are uncorrelated. The opposite,
however, is not true. Indeed:
Example 19.6. Let X, Y have a joint distribution given by
X \ Y    −1     0     1
  0      1/3    0    1/3
  1       0    1/3    0
So f_X(1) = 1/3 and f_Y(1) = f_Y(−1) = 1/3. Hence,
E[XY] = P[X = 1, Y = 1] − P[X = 1, Y = −1] = 0,
E[X] = 1/3 and E[Y] = 1/3 − 1/3 = 0. So Cov[X,Y] = E[XY] − E[X]E[Y] = 0. But
X, Y are not independent, as
P[X = 1, Y = 1] = 0 ≠ (1/3) · (1/3) = P[X = 1] · P[Y = 1].
454
We now show that independence is equivalent to all functions of X and Y being uncorrelated:
Theorem 19.7. Let X, Y be random variables on some probability space (Ω, F, P).
Then, X,Y are independent, if and only if for any measurable functions g, h such that
Cov[g(X), h(Y )] is defined, we have that Cov[g(X), h(Y )] = 0.
Lemma 19.8. Let X,Y be random variables with finite second moment. If X,Y are
independent then they are uncorrelated.
Proof. It suffices to show that if X,Y are independent, then E[XY ] = E[X]E[Y ].
If X, Y are discrete random variables with range R,
E[XY] = Σ_{r,r′∈R} P[X = r, Y = r′] r r′ = Σ_{r,r′∈R} P[X = r] r · P[Y = r′] r′ = E[X]E[Y].
Define X⁺_n := 2^{−n} ⌊2^n X⁺⌋ and Y⁺_n := 2^{−n} ⌊2^n Y⁺⌋. Since X⁺_n, Y⁺_n are independent
and discrete, we have that
E[X⁺_n Y⁺_n] = E[X⁺_n] E[Y⁺_n].
Since X⁺_n ↑ X⁺ and Y⁺_n ↑ Y⁺, and these are non-negative, we get by monotone
convergence that
E[X⁺Y⁺] = lim_{n→∞} E[X⁺_n Y⁺_n] = lim_{n→∞} E[X⁺_n] E[Y⁺_n] = E[X⁺] E[Y⁺].
In a similar way, for X⁻_n := 2^{−n} ⌊2^n X⁻⌋ and Y⁻_n := 2^{−n} ⌊2^n Y⁻⌋, we have that
X⁻_n, Y⁻_n are independent, X⁻_n, Y⁺_n are independent, and X⁺_n, Y⁻_n are independent. Thus,
E[X^ξ_n Y^ζ_n] = E[X^ξ_n] E[Y^ζ_n] for any choice of ξ, ζ ∈ {+, −}, and taking limits we get that
E[X^ξ Y^ζ] = E[X^ξ] E[Y^ζ].
Altogether,
E[XY] = E[(X⁺ − X⁻)(Y⁺ − Y⁻)] = E[X⁺Y⁺] − E[X⁺Y⁻] − E[X⁻Y⁺] + E[X⁻Y⁻]
= E[X⁺]E[Y⁺] − E[X⁺]E[Y⁻] − E[X⁻]E[Y⁺] + E[X⁻]E[Y⁻]
= (E[X⁺] − E[X⁻]) · (E[Y⁺] − E[Y⁻]) = E[X] · E[Y].
ut
ut
Proof of Theorem 19.7. We repeatedly make use of the fact that Cov[X,Y ] = 0 if and
only if E[XY ] = E[X]E[Y ].
For the “if” direction: Let A_1 ∈ σ(X), A_2 ∈ σ(Y). So A_1 = X^{−1}(B_1), A_2 = Y^{−1}(B_2)
for Borel sets B_1, B_2. For the functions g(x) = 1_{x∈B_1} and h(x) = 1_{x∈B_2} we have
that
P[A_1 ∩ A_2] = E[1_{X∈B_1, Y∈B_2}] = E[g(X)h(Y)] = E[g(X)] E[h(Y)] = P[A_1] P[A_2].
This was the “easy” direction.
Now, for the “only if” direction: We want to show that if X, Y are independent,
then E[g(X)h(Y)] = E[g(X)] E[h(Y)] for any measurable functions such that the above
expectations exist. Since g(X), h(Y) are independent, the theorem follows from Lemma
19.8. ut
Exercise 19.9. Let X_1, . . . , X_n be random variables with finite second moment. Show
that
Var[X_1 + · · · + X_n] = Σ_{k=1}^n Var[X_k] + 2 Σ_{j<k} Cov[X_j, X_k].
Deduce the Pythagorean Theorem: If X_1, . . . , X_n are all pairwise uncorrelated, then
Var[X_1 + · · · + X_n] = Σ_{k=1}^n Var[X_k].
The Pythagorean Theorem lets us calculate very easily the variance of a binomial
random variable:
Example 19.10. Let X ∼ Bin(n, p). We know that X = Σ_{k=1}^n Y_k, where Y_1, . . . , Y_n
are independent Ber(p) random variables. So
Var[X] = Σ_{k=1}^n Var[Y_k] = np(1 − p).
The straightforward calculation would be more difficult. 454
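The conclusion Var[Bin(n, p)] = np(1 − p) is simple to confirm by simulating sums of independent Bernoulli indicators. A sketch (the parameters and seed are illustrative):

```python
import random
import statistics

def binomial_sample(n, p, rng):
    """Sum of n independent Ber(p) indicators."""
    return sum(1 for _ in range(n) if rng.random() < p)

def empirical_binomial_variance(n, p, trials, rng):
    """Empirical variance of many binomial samples; should approach n*p*(1-p)."""
    return statistics.pvariance([binomial_sample(n, p, rng) for _ in range(trials)])
```

With enough trials the empirical variance sits close to np(1 − p).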
19.4. Examples
Example 19.11. Let X ∼ U[−1, 1] and Y = X².
What is the distribution of Y?
P[Y ≤ t] = P[X ∈ [−√t, √t]] = 0 for t < 0;  √t for 0 ≤ t ≤ 1;  1 for t ≥ 1.
Note that for
f_Y(t) = (1/2) t^{−1/2} for t ∈ [0, 1], and 0 for t ∉ [0, 1],
we have, for any t ∈ [0, 1],
∫_{−∞}^t f_Y(s) ds = ∫_0^t (1/2) s^{−1/2} ds = √t = F_Y(t),
so Y is absolutely continuous with density f_Y.
Are X, Y independent? Of course not; for example,
P[X ∈ [0, 1/2], Y ∈ [1/2, 1]] = 0 ≠ (1/4) · (1 − 2^{−1/2}) = P[X ∈ [0, 1/2]] · P[Y ∈ [1/2, 1]].
What is Cov(X, Y)? Well,
E[XY] = E[X³] = ∫_{−1}^1 t³ · (1/2) dt = 0.
Also, E[X] = 0, so
Cov(X, Y) = E[XY] − E[X] · E[Y] = 0.
That is, X, Y are uncorrelated.
What about Z = X^{2n} for general n? In this case,
E[XZ] = E[X^{2n+1}] = ∫_{−1}^1 t^{2n+1} · (1/2) dt = 1/(2(2n+2)) · t^{2n+2} |_{−1}^1 = 0.
So X, X^{2n} are uncorrelated for any n.
As for odd moments, if W = X^{2n+1}, then
E[XW] = ∫_{−1}^1 t^{2n+2} · (1/2) dt = 2/(2(2n+3)) = 1/(2n+3).
Since E[X] = 0 and E[W] = 0, X and W are not uncorrelated: Cov(X, W) = 1/(2n+3). 454
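The "uncorrelated but dependent" phenomenon shows up clearly in simulation: the empirical covariance of X and X² hovers near zero even though the two are functionally dependent. A sketch (sample size and seed are illustrative):

```python
import random

def empirical_cov_x_xsq(n_samples, rng):
    """Empirical Cov(X, X^2) for X ~ U[-1, 1]; the example shows the true value is 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    ys = [x * x for x in xs]
    mx = sum(xs) / n_samples
    my = sum(ys) / n_samples
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n_samples
```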
Example 19.12. Let a ∈ (−1, 1) and
A = ( 1  a ; a  1 ).
So |A| = 1 − a² and
A^{−1} = 1/(1 − a²) · ( 1  −a ; −a  1 ).
Let (X, Y) be a two dimensional absolutely continuous random variable with density
f_{X,Y}(t, s) = 1/(2π√|A|) · exp(−(1/2) ⟨A^{−1}(t, s), (t, s)⟩) = 1/(2π√(1 − a²)) · exp(−(t² + s² − 2ats)/(2(1 − a²))).
First, let us show that f_{X,Y} is indeed a density. We will use the fact that t² + s² − 2ats =
(t − as)² + s² − a²s².
∫_{−∞}^∞ f_{X,Y}(t, s) dt = 1/(2π√(1 − a²)) · exp(−s²(1 − a²)/(2(1 − a²))) · ∫_{−∞}^∞ exp(−(t − as)²/(2(1 − a²))) dt
= 1/(2π√(1 − a²)) · e^{−s²/2} · √(2π(1 − a²)) = 1/√(2π) · e^{−s²/2}.
We have used that 1/√(2π(1 − a²)) · exp(−(t − as)²/(2(1 − a²))) is the density of a N(as, √(1 − a²)) random
variable. So Y ∼ N(0, 1). Symmetrically, X ∼ N(0, 1).
Thus ∫∫ f_{X,Y} dt ds = 1 and f_{X,Y} is indeed a density. Moreover, we have calculated
the marginal densities of X and Y.
Let us calculate
Cov(X, Y) = E[XY] = ∫∫ f_{X,Y}(t, s) t s dt ds
= ∫_{−∞}^∞ s · 1/√(2π) · e^{−s²/2} ∫_{−∞}^∞ t · 1/√(2π(1 − a²)) · exp(−(t − as)²/(2(1 − a²))) dt ds
= ∫_{−∞}^∞ as · s f_Y(s) ds = E[aY²] = a.
So now we also have that
Var[X + Y] = Var[X] + Var[Y] + 2 Cov(X, Y) = 1 + 1 + 2a = 2(1 + a).
Note that X, Y are independent if and only if Cov(X, Y) = 0. 454
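The factorization of the density above says that given Y = s, X is distributed N(as, √(1 − a²)). That gives a direct way to sample from the joint density and check Cov(X, Y) = a empirically. A sketch (the value of a, sample size, and seed are illustrative):

```python
import random

def sample_pair(a, rng):
    """Sample (X, Y) from the joint density above: Y ~ N(0,1), X | Y = s ~ N(a*s, sqrt(1-a^2))."""
    y = rng.gauss(0.0, 1.0)
    x = rng.gauss(a * y, (1.0 - a * a) ** 0.5)
    return x, y

def empirical_cov(pairs):
    """Empirical covariance of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return sum((x - mx) * (y - my) for x, y in pairs) / n
```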
Example 19.13. A particle moves in the NE (north-east) quadrant of the plane, [0, ∞)².
The direction θ ∼ U[0, π/2] and the velocity V ∼ U[0, 1], independently.
What is the covariance of X(t) and V, where the particle's position at time t is given
by (X(t), Y(t))?
Note that for any t, X(t) = V t cos(θ). So we want to calculate
E[X(t)V] − E[X(t)] E[V] = t E[V²] · E[cos(θ)] − t E[V]² · E[cos(θ)] = t Var[V] · E[cos(θ)].
We know that Var[V] = 1/12. Also,
E[cos θ] = ∫_0^{π/2} (2/π) cos t dt = 2/π.
So Cov(X(t), V) = t/(6π). 454
Example 19.14. Suppose Y ∼ Exp(λ) and for any s > 0, X|Y = s ∼ U[0, s]. What is
Cov(X, Y)?
We can calculate
E[XY] = ∫∫ t s f_{X,Y}(t, s) dt ds = ∫_0^∞ s f_Y(s) ∫ t f_{X|Y}(t|s) dt ds
= ∫_0^∞ s f_Y(s) · (s/2) ds = (1/2) · E[Y²] = λ^{−2}.
Also, E[Y] = λ^{−1}. If we consider the function (t, s) ↦ t we get
E[X] = ∫∫ t f_{X,Y}(t, s) dt ds = ∫_0^∞ f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = ∫_0^∞ E[X|Y = s] f_Y(s) ds
= ∫_0^∞ (s/2) f_Y(s) ds = (1/2) E[Y] = 1/(2λ).
So altogether,
Cov(X, Y) = 1/λ² − 1/(2λ) · 1/λ = 1/(2λ²).
454
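The two-stage description of (X, Y) translates directly into a sampler, which lets us check Cov(X, Y) = 1/(2λ²) empirically. A sketch (λ, sample size, and seed are illustrative):

```python
import random

def empirical_cov_xy(lam, n_samples, rng):
    """Empirical Cov(X, Y) for Y ~ Exp(lam) and X | Y = s ~ U[0, s]."""
    pts = []
    for _ in range(n_samples):
        y = rng.expovariate(lam)
        pts.append((rng.uniform(0.0, y), y))
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    return sum((x - mx) * (y - my) for x, y in pts) / n
```

For λ = 1 the derived covariance is 1/2, and the estimate converges to it.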
19.5. Markov and Chebyshev inequalities
Theorem 19.15 (Markov's inequality). Let X ≥ 0 be a non-negative random variable.
Then, for any a > 0,
P[X ≥ a] ≤ E[X]/a.
Andrey Markov
(1856–1922)
Proof. Consider the random variable Y = a · 1_{X≥a}. Note that Y ≤ X (here we use the
non-negativity of X). So
E[X] ≥ E[Y] = a P[X ≥ a].
ut
Theorem 19.16 (Chebyshev's inequality). Let X be a random variable with finite second
moment. Then, for any a > 0,
P[|X − E[X]| ≥ a] ≤ Var[X]/a².
Pafnuty Chebyshev
(1821–1894)
Proof. Apply Markov's inequality to the non-negative random variable Y = (X − E[X])², and note
that {|X − E[X]| ≥ a} = {Y ≥ a²}. ut
If X is a random variable with E[X] = µ and Var[X] = σ², then Chebyshev's inequality
tells us that the probability that X deviates from its mean by at least kσ (i.e. k standard
deviations) is at most 1/k².
Example 19.17. The average wage in Eurasia is 1000 a month.
• A person is chosen at random. What is the probability that this person earns
at least 10, 000 a month?
By Markov's inequality, if X is the random person's salary, then
P[X ≥ 10,000] ≤ E[X]/10,000 = 1/10.
• The average wage for an Inner Party member is 10,000 a month, and the standard
deviation of the wage is 100. What is the probability that a randomly
chosen Inner Party member earns at least 11,000?
Here we have that E[X] = 10,000 and Var[X] = 100² = 10,000. So
P[X ≥ 11,000] ≤ P[|X − E[X]| ≥ 1,000] ≤ Var[X]/1,000² = 1/100.
454
Example 19.18. Let X ∼ N(0, σ). Chebyshev's inequality tells us that P[|X| > kσ] ≤ k^{−2}.
However, because σ^{−1}X ∼ N(0, 1), we can calculate, for a = t/σ,
P[|X| > t] = P[σ^{−1}X > a] + P[σ^{−1}X < −a] = ∫_{−∞}^{−a} f_{σ^{−1}X}(s) ds + ∫_a^∞ f_{σ^{−1}X}(s) ds
= 2 ∫_a^∞ 1/√(2π) · e^{−s²/2} ds ≤ 2 ∫_a^∞ (s/a) · 1/√(2π) · e^{−s²/2} ds
= 2/(a√(2π)) · (−e^{−s²/2}) |_a^∞ = 2/(a√(2π)) · e^{−a²/2}.
If we plug in t = kσ we get
P[|X| > kσ] ≤ 2/(√(2π) k) · e^{−k²/2} << k^{−2},
which is a much better bound. 454
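The two bounds can be compared directly; at k = 3 the Gaussian-specific bound is already two orders of magnitude below Chebyshev's, while still lying above the true tail. A sketch:

```python
import math

def chebyshev_bound(k):
    """Chebyshev: P[|X| > k*sigma] <= 1/k^2."""
    return 1.0 / (k * k)

def gaussian_tail_bound(k):
    """The bound derived above: 2 / (sqrt(2*pi) * k) * exp(-k^2 / 2)."""
    return 2.0 / (math.sqrt(2.0 * math.pi) * k) * math.exp(-k * k / 2.0)
```

The exact tail P[|X| > k] for a standard normal is `math.erfc(k / sqrt(2))`, which sits between the true value and both bounds can be checked against it.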
19.6. The Weierstrass Approximation Theorem
Karl Weierstrass
(1815–1897)
Here is a probabilistic proof by Bernstein for the famous Weierstrass Approximation
Theorem.
Theorem 19.19 (Weierstrass Approximation Theorem). Let f : [0, 1] → R be continuous.
For every ε > 0 there exists a polynomial p(x) such that sup_{x∈[0,1]} |p(x) − f(x)| < ε.
(That is, the polynomials are dense in L^∞([0, 1]).)
Proof. By classical analysis, f is uniformly continuous and bounded on [0, 1]; that is,
there exists M > 0 such that sup_{x∈[0,1]} |f(x)| ≤ M, and for every ε > 0 there exists
δ = δ(ε) > 0 such that if |x − y| ≤ δ then |f(x) − f(y)| < ε/2.
Fix ε > 0. Let δ = δ(ε). Let x ∈ (0, 1). Consider B ∼ Bin(n, x). Chebyshev's
inequality guarantees that
P[|B − nx| ≥ n^{2/3}] ≤ Var[B] · n^{−4/3} = x(1 − x) · n^{−1/3} ≤ (1/4) n^{−1/3}.
Thus, we may choose n = n(ε, M) large enough so that this probability is at most ε/(8M)
for any x. Assume also that n^{−1/3} < δ.
Define the polynomial
p(x) := Σ_{j=0}^n (n choose j) x^j (1 − x)^{n−j} f(j/n).
That is, p(x) = E[f(B/n)] where B ∼ Bin(n, x).
Let A = {|B/n − x| > δ}. So
P[A] ≤ P[|B − nx| > n^{2/3}] ≤ ε/(8M).
Thus, since |f(B/n) − f(x)| ≤ 2M always, and |f(B/n) − f(x)| < ε/2 on A^c,
|p(x) − f(x)| = |E[f(B/n) − f(x)]|
≤ E[|f(B/n) − f(x)| · 1_A] + E[|f(B/n) − f(x)| · 1_{A^c}]
≤ 2M · P[A] + (ε/2) · P[A^c] ≤ ε/4 + ε/2 = 3ε/4 < ε.
This bound is independent of x. ut
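The polynomial in the proof is the classical Bernstein polynomial, and it is straightforward to evaluate. A sketch (the test function and degree below are arbitrary choices):

```python
import math

def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial p(x) = E[f(B/n)] for B ~ Bin(n, x)."""
    return sum(math.comb(n, j) * x ** j * (1.0 - x) ** (n - j) * f(j / n)
               for j in range(n + 1))
```

Even for a non-smooth function such as f(t) = |t − 1/2|, the degree-200 Bernstein polynomial is already uniformly close on [0, 1].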
19.7. The Paley-Zygmund Inequality
Raymond Paley
(1907–1933)
Antoni Zygmund
(1900–1992)
Theorem 19.20 (Paley-Zygmund Inequality). Let X ≥ 0 be a non-negative random
variable with finite variance. Then, for any α ∈ [0, 1],
P[X > αE[X]] ≥ (1 − α)² · (E[X])² / E[X²].
Specifically,
P[X > 0] ≥ (E[X])² / E[X²].
Proof. Note that
X = X · 1_{X ≤ αE[X]} + X · 1_{X > αE[X]}.
Since X ≥ 0, we have that X · 1_{X ≤ αE[X]} ≤ αE[X], and by Cauchy-Schwarz,
(E[X · 1_{X > αE[X]}])² ≤ E[X²] · P[X > αE[X]].
Combining these we have
E[X] ≤ αE[X] + √(E[X²] · P[X > αE[X]]),
which implies that
(1 − α)²(E[X])² ≤ E[X²] · P[X > αE[X]].
ut
Example 19.21 (Erdos-Renyi random graph). Suppose that n students are on Face-
book. For every two different students, say x, y, they become friends with probability
p, all pairs of students being independent.
We say that students x and y are connected if there is some sequence of students
x = x1, . . . , xk = y such that xj and xj+1 are friends for every j.
What is the probability that some student has no friends? What is the probability
that all students are connected?
Paul Erdos (1913–1996)
Alfred Renyi (1921–1970)
Suppose {1, 2, . . . , n} is the set of students. Let I_{x,y} be the indicator of the event that
x and y are friends. So (I_{x,y})_{x≠y} are (n choose 2) independent Bernoulli-p random variables.
Let X be the number of students with no friends; the isolated students.
Let us begin by calculating the expected number of isolated students: A student x is
isolated if I_{x,y} = 0 for all y ≠ x. So this happens with probability (1 − p)^{n−1}. Thus, by
linearity,
E[X] = Σ_x P[x is isolated] = n(1 − p)^{n−1} =: µ.
Let us also calculate the second moment of X: First, for x ≠ y, if both are isolated,
then for all z ∉ {x, y} we have I_{x,z} = 0 and I_{y,z} = 0. Also, I_{x,y} = 0. So,
P[x is isolated, y is isolated] = (1 − p)^{2(n−2)+1} = (1 − p)^{2(n−1)} (1 − p)^{−1}.
Thus,
E[X²] = E[Σ_{x,y} 1_{x is isolated, y is isolated}]
= Σ_x P[x is isolated] + Σ_{x≠y} P[x is isolated, y is isolated]
= µ + n(n − 1)(1 − p)^{2(n−1)} (1 − p)^{−1} = µ + µ² · (1 − p)^{−1} · (1 − 1/n).
Specifically,
(E[X])² / E[X²] = (µ^{−1} + (1 − p)^{−1} · (1 − 1/n))^{−1}.
We will make use of the inequalities e^{−p/(1−p)} ≤ 1 − p ≤ e^{−p}, valid for all p ≤ 1/2.
Now, if p ≥ (1 + ε) log n/(n − 1), then
E[X] ≤ n e^{−p(n−1)} ≤ n^{−ε} → 0,
so by Markov's inequality,
P[X ≥ 1] ≤ E[X] → 0.
That is, with high probability no student is isolated. (In fact, above this threshold all
students are connected with high probability, but proving that requires more work.)
On the other hand, if p ≤ (1 − ε) log n/(n − 1), then
µ ≥ n exp(−(1 − ε) log n/(1 − p)) → ∞
(because (1 − ε)/(1 − p) < 1 for large enough n). Thus, by the Paley-Zygmund inequality,
P[X > 0] ≥ (E[X])² / E[X²] = (µ^{−1} + (1 − p)^{−1} · (1 − 1/n))^{−1} → 1.
That is, with high probability there exists a student that has no friends. 454
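The first-moment computation µ = n(1 − p)^{n−1} can be checked by sampling random graphs. A sketch (the graph size, the factor 1/2 on the threshold, and the seed are illustrative):

```python
import math
import random

def count_isolated(n, p, rng):
    """Sample a friendship graph G(n, p) and count students with no friends."""
    degree = [0] * n
    for x in range(n):
        for y in range(x + 1, n):
            if rng.random() < p:
                degree[x] += 1
                degree[y] += 1
    return sum(1 for d in degree if d == 0)
```

Averaging over many sampled graphs, the count of isolated students approaches n(1 − p)^{n−1}.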
Lecture 20
20.1. Convergence of Random Variables
As in the case of sequences of numbers, we would like to talk about convergence of
random variables. There are many ways to say that a sequence of random variables
approximates some limiting random variable. We will talk about 4 such possibilities.
20.2. a.s. and Lp Convergence
We have already seen some type of convergence; namely, Xn → X if Xn(ω)→ X(ω)
for all ω ∈ Ω. This is a strong type of convergence; we are usually not interested in
changes that have probability 0, so we will define the following type of convergence:
Let (Xn)n be a sequence of random variables on some probability space (Ω,F ,P). We
have already seen that lim inf Xn, lim supXn are both random variables. Thus, the set
ω : lim inf Xn(ω) = lim supXn(ω) is an event in F . Hence we can define: a.s. conver-
genceDefinition 20.1. Let (Xn)n be a sequence of random variables on (Ω,F ,P). Let X be
a random variable.
We say that (Xn) converges to X almost surely, or a.s., denoted
Xna.s.−→ X
if P[limXn = X] = 1; that is, if
P[ω : lim
n→∞Xn(ω) = X(ω)
] = 1.
Another type of convergence is a type of average convergence:
Definition 20.2. Let p > 0. Let (X_n)_n be a sequence of random variables on (Ω, F, P)
such that E[|X_n|^p] < ∞ for all n. Let X be a random variable such that E[|X|^p] < ∞.
We say that (X_n) converges to X in L^p, denoted
X_n →Lp X,
if
lim_{n→∞} E[|X_n − X|^p] = 0.
In the exercises we will show that for 0 < p < q, L^q convergence implies L^p convergence.
We will also show that the limit is unique; that is:
Exercise 20.3. Suppose that X_n →a.s. X and X_n →Lp Y. Show that P[X = Y] = 1.
Suppose that X_n →Lp X and X_n →Lq Y. Show that P[X = Y] = 1.
Example 20.4. Fix p > 0. Let (Ω, F, P) be the uniform measure on [0, 1]. For all n let X_n
be the discrete random variable
X_n(ω) = n^{1/p} for ω ∈ [0, 1/n], and 0 otherwise.
So X_n has density
f_{X_n}(s) = 1/n for s = n^{1/p};  1 − 1/n for s = 0;  0 otherwise.
Note that
E[|X_n|^q] = (1/n) |n^{1/p}|^q = n^{q/p − 1}.
Thus, if 0 < q < p, then E[|X_n − 0|^q] → 0, so X_n →Lq 0.
However, for q ≥ p we have that E[|X_n − 0|^q] ≥ 1 for all n, so X_n does not converge
to 0 in L^q (and thus X_n does not converge in L^q to any X, since the limit must be unique).
We claim that X_n →a.s. 0. Indeed, let ω ∈ [0, 1]. If ω > 0, then for all n > 1/ω, we get
that X_n(ω) = 0, and thus X_n(ω) → 0. Thus,
P[{ω : X_n(ω) ↛ 0}] ≤ P[{0}] = 0,
so X_n →a.s. 0.
This example has shown that:
• We can have L^q convergence without L^p convergence, for p > q.
• We can have a.s. convergence without L^p convergence.
454
Example 20.5. For all n let X_n be mutually independent Ber(1/n) random variables.
Thus, for all n and all p > 0,
E[|X_n|^p] = 1/n → 0, so X_n →Lp 0.
On the other hand, let A_n = {|X_n| > 1/2}. So (A_n)_n is a sequence of mutually
independent events, and P[A_n] = 1/n. Thus, since Σ_n P[A_n] = ∞, using the second
Borel-Cantelli Lemma,
P[lim sup_n A_n] = 1.
That is, with probability 1, there exist infinitely many n such that X_n > 1/2. Thus,
P[lim sup_n X_n ≥ 1/2] = 1.
So we cannot have that X_n →a.s. 0. Since the limit must be unique, we cannot have
X_n →a.s. X for any X.
We conclude that: It is possible for (X_n) to converge in L^p for all p, but not to
converge a.s. 454
20.3. Convergence in Probability
Definition 20.6. Let (X_n)_n be a sequence of random variables on (Ω, F, P). We say
that (X_n) converges in probability to X, denoted
X_n →P X,
if for all ε > 0,
lim_{n→∞} P[|X_n − X| > ε] = 0.
Example 20.7. Suppose that X_n →a.s. X. That is,
P[|X_n − X| → 0] = 1.
Fix ε > 0. For any n let A_n = {|X_n − X| > ε}. Note that if A_n occurs for infinitely
many n, then lim sup_n |X_n − X| ≥ ε > 0. Thus, by Fatou's Lemma,
0 ≤ lim sup_n P[A_n] ≤ P[lim sup_n A_n] = P[A_n i.o.] ≤ P[lim sup_n |X_n − X| > 0] = 0.
Thus,
lim_n P[|X_n − X| > ε] = lim sup_n P[A_n] = 0,
and so X_n →P X.
That is: a.s. convergence implies convergence in probability. 454
Example 20.8. Assume that X_n →Lp X. Then, for any ε > 0, by Markov's inequality,
P[|X_n − X| > ε] ≤ E[|X_n − X|^p]/ε^p → 0.
Thus, X_n →P X.
That is, convergence in L^p implies convergence in probability. 454
The following is in the exercises:
Exercise 20.9. Let X_n →P X and X_n →P Y. Show that P[X = Y] = 1.
Lecture 21
21.1. The Law of Large Numbers
In this section we will prove
Theorem 21.1 (Strong Law of Large Numbers). Let X_1, X_2, . . . , X_n, . . . be a sequence
of mutually independent random variables, such that E[X_n] = 0 and sup_n E[X_n²] < ∞.
For each N let
S_N = Σ_{n=1}^N X_n.
Then,
S_N/N →a.s. 0.
21.1.1. Kolmogorov’s inequality.
Lemma 21.2 (Kolmogorov's inequality). Let X_1, X_2, . . . , X_n, . . . be a sequence of mutually
independent random variables, such that E[X_n] = 0 and E[X_n²] < ∞. For each N
let
S_N = Σ_{n=1}^N X_n and M_N = max_{1≤n≤N} |S_n|.
Then, for any λ > 0,
P[M_N ≥ λ] ≤ Σ_{n=1}^N E[X_n²] / λ².
Proof. Fix λ > 0 and let
A_n = {|S_1| < λ, |S_2| < λ, . . . , |S_{n−1}| < λ, |S_n| ≥ λ}.
We will make a few observations.
First note that since (X_n) are mutually independent, they are uncorrelated, so
E[S_N] = 0 and Var[S_N] = E[S_N²] = Σ_{n=1}^N E[X_n²].
Next, note that M_N ≥ λ if and only if there exists 1 ≤ n ≤ N such that |S_n| ≥ λ and
|S_k| < λ for all 1 ≤ k < n. That is, M_N ≥ λ implies that there exists 1 ≤ n ≤ N such
that S_n² · 1_{A_n} ≥ λ².
Since (X_1, . . . , X_n) are independent of (X_{n+1}, . . . , X_N), we have that the random
variable S_n · 1_{A_n} is independent of X_{n+1} + X_{n+2} + · · · + X_N = S_N − S_n. Thus, for any
1 ≤ n ≤ N we have that
E[S_N² · 1_{A_n}] = E[(S_N − S_n + S_n)² · 1_{A_n}] = E[(S_N − S_n)² · 1_{A_n}] + E[S_n² · 1_{A_n}] + 2 E[(S_N − S_n) · S_n · 1_{A_n}]
≥ E[S_n² · 1_{A_n}],
where we have used that
E[(S_N − S_n) · S_n · 1_{A_n}] = E[S_N − S_n] · E[S_n · 1_{A_n}] = 0
by independence.
We now have that
E[S_N²] ≥ E[S_N² · 1_{M_N ≥ λ}] = E[Σ_{n=1}^N 1_{A_n} S_N²] ≥ Σ_{n=1}^N E[S_n² · 1_{A_n}].
Note that by Boole's inequality, since M_N ≥ λ implies that there exists 1 ≤ n ≤ N such
that S_n² · 1_{A_n} ≥ λ²,
P[M_N ≥ λ] ≤ Σ_{n=1}^N P[S_n² · 1_{A_n} ≥ λ²] ≤ (1/λ²) Σ_{n=1}^N E[S_n² · 1_{A_n}] ≤ (1/λ²) E[S_N²].
ut
21.1.2. Kronecker’s Criterion.
Lemma 21.3. Suppose that x_1, . . . , x_n, . . . is a sequence of numbers such that Σ_{n=1}^∞ x_n/n
converges. Then,
lim_{N→∞} (1/N) Σ_{n=1}^N x_n = 0.
Proof. Let s_N = Σ_{n=1}^N x_n/n. So s_N → s := Σ_{n=1}^∞ x_n/n. Thus, also (1/N) Σ_{n=1}^N s_n → s. Note
that for all n, x_n = n(s_n − s_{n−1}) (where we set s_0 = 0). Thus,
(1/N) Σ_{n=1}^N x_n = (1/N)(N s_N + ((N−1) − N) s_{N−1} + ((N−2) − (N−1)) s_{N−2} + · · · + (1 − 2) s_1)
= s_N − ((N−1)/N) · (1/(N−1)) Σ_{n=1}^{N−1} s_n → s − s = 0.
ut
21.1.3. Proof of Law of Large Numbers.
Lemma 21.4. Let X_1, X_2, . . . , X_n, . . . be a sequence of mutually independent random
variables such that E[X_n] = 0 for all n. If Σ_{k=1}^∞ Var[X_k] < ∞ then
P[Σ_{k=1}^∞ X_k converges] = 1.
Proof. Let S_n = Σ_{k=1}^n X_k.
For every ω ∈ Ω, Σ_{k=1}^∞ X_k(ω) converges if and only if S_n(ω) converges, which is if
and only if (S_n(ω))_n form a Cauchy sequence. Thus, it suffices to show that
P[(S_n)_n is a Cauchy sequence] = 1.
Let n > 0, and consider
S_{n+m} − S_n = X_{n+1} + X_{n+2} + · · · + X_{n+m}.
Kolmogorov's inequality gives that
P[max_{1≤m≤N} |S_{n+m} − S_n| ≥ λ] ≤ (1/λ²) · Σ_{k=1}^N E[X²_{n+k}] ≤ (1/λ²) · Σ_{k≥1} E[X²_{n+k}].
Taking N → ∞ on the left-hand side, using continuity of probability for the increasing
sequence of events,
P[sup_{m≥1} |S_{n+m} − S_n| ≥ λ] ≤ (1/λ²) · Σ_{k≥1} E[X²_{n+k}].
This last term is the tail of the convergent series Σ_k E[X_k²]. Thus, for any r > 0 there
exists n(r) so that Σ_{k≥n(r)} E[X_k²] < r^{−3}, which implies that
P[inf_n sup_{m≥1} |S_{n+m} − S_n| > r^{−1}] ≤ P[sup_{m≥1} |S_{n(r)+m} − S_{n(r)}| > r^{−1}] < r^{−1}.
Now, the events {inf_n sup_{m≥1} |S_{n+m} − S_n| > r^{−1}} are increasing in r, so taking r → ∞
with continuity of probability,
P[inf_n sup_{m≥1} |S_{n+m} − S_n| > 0] ≤ 0.
That is,
P[inf_n sup_{m≥1} |S_{n+m} − S_n| = 0] = 1.
Now let ω ∈ Ω be such that inf_n sup_{m≥1} |S_{n+m}(ω) − S_n(ω)| = 0. This implies that
for all ε > 0 there exists N = N(ε, ω) such that
sup_{m≥1} |S_{N+m}(ω) − S_N(ω)| < ε/2.
Thus, for any n > N(ε, ω) and any m ≥ 1,
|S_{n+m}(ω) − S_n(ω)| ≤ |S_{n+m}(ω) − S_N(ω)| + |S_n(ω) − S_N(ω)| < ε.
That is, for any ε > 0 there exists N = N(ε, ω) such that for all n > N(ε, ω) and all
m ≥ 1, |S_{n+m}(ω) − S_n(ω)| < ε; that is, for this ω, (S_n(ω))_n is a Cauchy sequence. We
have shown that
{ω : inf_n sup_{m≥1} |S_{n+m}(ω) − S_n(ω)| = 0} ⊂ {ω : (S_n(ω))_n is a Cauchy sequence}.
Since the first event has probability 1, so does the second event, and we are done. ut
The proof of the law of large numbers is now immediate:

Proof of Theorem 21.1. By Kronecker's Criterion it suffices to show that
P[ ∑_{n=1}^∞ X_n/n converges ] = 1.
By the above Lemma it suffices to show that
∑_{n=1}^∞ Var[X_n/n] < ∞.
Since Var[X_n/n] = 1/n^2, this is immediate. □
Example 21.5. The Central Bureau of Statistics wants to assess how many citizens use the train system. They poll N randomly and independently chosen citizens, and set X_j = 1 if the j-th citizen uses the train, and X_j = 0 otherwise. Any citizen is equally likely to be chosen, and all are independent.
As N → ∞ the sample mean
(1/N) ∑_{j=1}^N X_j
converges to E[X_j] = E[X_1], which is the probability that a randomly chosen citizen uses the train system. Since we choose each citizen uniformly, this probability is exactly the number of citizens that use the train, divided by the total number of citizens.
Thus, for large N,
C · (1/N) ∑_{j=1}^N X_j
is a good approximation for the number of citizens that use the train system, where C is the number of citizens.
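This estimator is easy to try numerically. Below is a minimal sketch in Python; the population size and the number of train users are made-up numbers for illustration.

```python
import random

random.seed(0)

C = 100_000           # hypothetical total number of citizens
train_users = 37_000  # hypothetical number of citizens who use the train

def poll_estimate(N):
    """Poll N citizens uniformly at random (with replacement) and
    rescale the sample mean by the population size C."""
    hits = sum(1 for _ in range(N) if random.randrange(C) < train_users)
    return C * hits / N

estimate = poll_estimate(100_000)
```

By the law of large numbers the estimate converges to the true count as N → ∞; the typical error is of order C · √(p(1−p)/N).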
Introduction to Probability
201.1.8001
Ariel Yadin
Lecture 22
22.1. Convergence in Distribution

Previously we talked about types of convergence that required the sequence and the limit to be defined on the same probability space. We now look at a type of convergence which does not have this requirement.

Definition 22.1. Let (X_n)_n be a sequence of random variables. We say that (X_n) converges in distribution to X, denoted X_n →^D X, if for all t ∈ R at which F_X is continuous,
lim_{n→∞} F_{X_n}(t) = F_X(t).
That is, the distribution functions of X_n converge to the distribution function of X at every continuity point of F_X.
Remark 22.2. Note that the convergence is required only at continuity points of F_X.

Example 22.3. Let X_n ∼ U[0, 1/n]. Then X_n →^D 0. Indeed, the distribution function of the constant 0 is F_0(t) = 1 for t ≥ 0 and F_0(t) = 0 for t < 0. Note that
F_{X_n}(t) = 0 for t < 0, F_{X_n}(t) = nt for 0 ≤ t ≤ 1/n, and F_{X_n}(t) = 1 for t > 1/n.
Since 0 is not a continuity point of F_0, we don't care about it. For t < 0 we have that F_{X_n}(t) = 0 = F_0(t). For t > 0 we have that for large enough n > 1/t, F_{X_n}(t) = 1 = F_0(t).
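The pointwise convergence (away from t = 0) can be checked directly from the two distribution functions; a small sketch:

```python
def F_Xn(t, n):
    """Distribution function of X_n ~ U[0, 1/n]."""
    if t < 0:
        return 0.0
    if t <= 1.0 / n:
        return n * t
    return 1.0

def F_0(t):
    """Distribution function of the constant random variable 0."""
    return 1.0 if t >= 0 else 0.0

n = 10_000
# At continuity points t != 0 of F_0 the values agree for large n:
checks = all(abs(F_Xn(t, n) - F_0(t)) < 1e-9 for t in (-1.0, -0.01, 0.05, 1.0))
# At t = 0, however, F_Xn(0) = 0 for every n while F_0(0) = 1:
gap_at_zero = F_0(0) - F_Xn(0, n)
```

This is exactly why the definition excludes discontinuity points of the limit: the gap at t = 0 never closes.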
Note that for the other types of convergence we needed all random variables to live on the same space. For convergence in distribution, this is not required. However, this is the weakest kind of convergence, as can be seen by the following proposition.

Proposition 22.4. Convergence in probability implies convergence in distribution.

Proof. Let X_n →^P X. Fix t ∈ R.
First, for all ε > 0,
P[X_n ≤ t] = P[X_n ≤ t, |X_n − X| ≤ ε] + P[X_n ≤ t, |X_n − X| > ε]
≤ P[X ≤ t + ε] + P[|X_n − X| > ε] → P[X ≤ t + ε].
So lim sup_n F_{X_n}(t) ≤ F_X(t + ε) for all ε. Since F_X is right-continuous, we get that lim sup_n F_{X_n}(t) ≤ F_X(t).
On the other hand, for any ε > 0,
P[X ≤ t − ε] = P[X ≤ t − ε, |X_n − X| ≤ ε] + P[X ≤ t − ε, |X_n − X| > ε]
≤ P[X_n ≤ t] + P[|X_n − X| > ε].
Taking lim inf we get that F_X(t − ε) ≤ lim inf_n F_{X_n}(t). Now, if F_X is continuous at t, then taking ε → 0 gives
F_X(t) ≤ lim inf_n F_{X_n}(t) ≤ lim sup_n F_{X_n}(t) ≤ F_X(t).
So X_n →^D X. □
22.2. Approximating by Smooth Functions

Lemma 22.5. Let (X_n)_n be a sequence of random variables. Let C³_b be the space of all three times continuously differentiable functions on R with bounded derivatives. If for every φ ∈ C³_b,
E[φ(X_n)] → E[φ(X)],
then X_n →^D X.

Proof. We want to choose a C³ function ψ that will approximate the step function 1_{(−∞,0]}. For this we choose a C³ function that is 1 on (−∞, 0], decreasing on [0, 1] and 0 on [1, ∞). Call this function ψ. (An explicit construction can be found in Section 22.8 below.)
For every t ∈ R and ε > 0 define
ψ_{t,ε}(x) = ψ(ε^{−1}(x − t)).
Note that for every t ∈ R and ε > 0 we have
ψ_{t−ε,ε}(x) ≤ 1_{(−∞,t]}(x) ≤ ψ_{t,ε}(x).
For convergence in distribution we need to show that for every t ∈ R such that F_X is continuous at t,
lim_{n→∞} E[1_{(−∞,t]}(X_n)] = E[1_{(−∞,t]}(X)].
Fix t ∈ R. For any ε > 0 we have that
E[1_{(−∞,t]}(X_n)] − E[1_{(−∞,t]}(X)] ≤ E[ψ_{t,ε}(X_n)] − E[1_{(−∞,t]}(X)] → E[ψ_{t,ε}(X)] − E[1_{(−∞,t]}(X)]
≤ E[1_{(−∞,t+ε]}(X)] − E[1_{(−∞,t]}(X)] = P[X ∈ (t, t + ε]] = F_X(t + ε) − F_X(t).
Similarly,
E[1_{(−∞,t]}(X)] − E[1_{(−∞,t]}(X_n)] ≤ E[1_{(−∞,t]}(X)] − E[ψ_{t−ε,ε}(X_n)] → E[1_{(−∞,t]}(X)] − E[ψ_{t−ε,ε}(X)]
≤ E[1_{(−∞,t]}(X)] − E[1_{(−∞,t−ε]}(X)] = P[X ∈ (t − ε, t]] = F_X(t) − F_X(t − ε).
We conclude that
lim sup_{n→∞} | E[1_{(−∞,t]}(X_n)] − E[1_{(−∞,t]}(X)] | ≤ |F_X(t + ε) − F_X(t)| + |F_X(t) − F_X(t − ε)|
for all ε > 0.
For any t that is a continuity point of F_X the above tends to 0 as ε → 0. □
22.3. Central Limit Theorem

We will now build the tools for the following very important approximation theorem:

Theorem 22.6 (Central Limit Theorem). Let X_1, X_2, ..., X_n, ... be mutually independent random variables, such that E[X_n] = 0 and E[X_n^2] = 1. Let
S_N = ∑_{n=1}^N X_n.
Then, S_N/√N →^D N(0, 1). That is, for all t ∈ R,
P[S_N ≤ t√N] → ∫_{−∞}^t (1/√(2π)) e^{−s²/2} ds.

In the exercises we will show that:

Exercise 22.7. Let X_1, X_2, ..., X_n, ... be mutually independent random variables, such that E[X_n] = µ and Var[X_n] = σ². Let
S_N = ∑_{n=1}^N X_n.
Then, (S_N − Nµ)/(√N σ) →^D N(0, 1).

That is, no matter what distribution we start with, we can approximate the sum of many independent trials by a normal random variable.
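The theorem is easy to see empirically. A minimal sketch, using ±1 steps (mean 0, variance 1) and comparing the empirical distribution of S_N/√N at one point with the normal distribution function:

```python
import math
import random

random.seed(1)

def normal_cdf(t):
    """P[N(0,1) <= t], written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

N, trials = 200, 5_000

def normalized_sum():
    """S_N / sqrt(N) for N independent +/-1 steps."""
    return sum(random.choice((-1, 1)) for _ in range(N)) / math.sqrt(N)

t = 1.0
empirical = sum(1 for _ in range(trials) if normalized_sum() <= t) / trials
theoretical = normal_cdf(t)  # about 0.8413
```

The agreement is already good at N = 200, up to a small discreteness error coming from the lattice values of S_N.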
22.4. Uses for the CLT

Example 22.8. Every student's grade in a course has expectation 80 and standard deviation 20. Give an approximation of the average grade for 100 students, if all students are independent. What is a good estimate for the probability of the average grade to be below 70?
If X_k is the grade of the k-th student, then the average grade is M := (1/100) ∑_{k=1}^{100} X_k.
The central limit theorem tells us that (1/(10·20)) · (100M − 100·80) = (1/2)(M − 80) is very close to a N(0, 1) random variable. Thus,
P[M < 70] = P[(1/2)(M − 80) < −5] ≈ P[N(0, 1) < −5] = ∫_{−∞}^{−5} (1/√(2π)) e^{−s²/2} ds,
which can be calculated using a standard normal distribution table.
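Instead of a table, the integral can be evaluated with the error function from Python's standard library, using Φ(t) = (1 + erf(t/√2))/2:

```python
import math

def normal_cdf(t):
    """P[N(0,1) <= t] = (1 + erf(t / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# The probability that the average grade of 100 students is below 70:
p = normal_cdf(-5.0)
```

The value is about 2.9 · 10⁻⁷, so the event is extremely unlikely.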
Example 22.9. X_1, X_2, ... are independent random variables with the same distribution, such that E[X_n] = 60 and Var[X_n] = 25.
Show that we can take N large enough so that the empirical average (1/N) ∑_{k=1}^N X_k is between 55 and 65 with probability greater than 0.99. How large should we take N if we want to ensure this?
Let S_N = ∑_{k=1}^N X_k. The central limit theorem tells us that the distribution of (1/(5√N)) (S_N − 60N) converges to the distribution of a N(0, 1) random variable. Thus,
P[55 < S_N/N < 65] = P[ −√N < (S_N − 60N)/(5√N) < √N ] ≈ P[ −√N < N(0, 1) < √N ].
The right-hand side, P[−√N < N(0, 1) < √N], goes to 1. So there is some large enough N so that this is greater than 0.99.
Example 22.10. X_1, X_2, ... are independent random variables with the same distribution, such that E[X_n] = µ and Var[X_n] = 1.
Show that there exist N_0 and c > 0 such that for all N > N_0,
P[ |S_N − Nµ| ≤ (1/2)√N ] ≥ c,
where S_N = ∑_{k=1}^N X_k.
Note that we cannot use Kolmogorov here, because that would only give
P[ max_{1≤n≤N} |S_n − nµ| > (1/2)√N ] ≤ 4.
However, the central limit theorem tells us that
P[ |S_N − Nµ| ≤ (1/2)√N ] → P[ |N(0, 1)| ≤ 1/2 ].
If we take c = (1/2) P[ |N(0, 1)| ≤ 1/2 ], we get that for all large enough N,
P[ |S_N − Nµ| ≤ (1/2)√N ] > c.
Example 22.11. The government decides to test if the lab workers at the central lab of Kupat Cholim are adequate. They are given an exam, and graded between 0 and 100. The exam is such that the expectation of a random worker's grade is 80 and the standard deviation is 20.
If the lab has 10000 workers, give a good approximation of the probability that the average grade in the lab is smaller than 70.
In this case, we can let X_1, ..., X_{10000} be the grades of the workers, so E[X_j] = 80 and Var[X_j] = 400. If A = (1/10000) ∑_{j=1}^{10000} X_j is the average grade, the central limit theorem tells us that
P[A ≤ 70] = P[100(A − 80) ≤ −10 · 100] = P[ (1/(100·20)) ∑_{j=1}^{10000} (X_j − 80) ≤ −50 ]
is very close to
P[N(0, 1) ≤ −50] = ∫_{−∞}^{−50} (1/√(2π)) e^{−s²/2} ds.
Example 22.12. Suppose there are 1.2 million people that will actually vote in the elections. We want to forecast how many mandates party A will receive in the elections. Each mandate is worth 1.2 · 10^6 / 120 = 10^4 people. There are a people that will actually vote for party A, so party A will get 10^{−4} a mandates.
Suppose we choose a voter uniformly at random. The probability that she votes for party A is then p = a/(1.2 · 10^6), and the variance is p(1 − p). If we repeat this experiment 900 times, we can count the number of people who said they would vote for party A, and the average of this number should be close to p. So if X_j is the indicator of the event that the j-th voter polled said she would vote for party A, we get that
120 · (1/900) ∑_{j=1}^{900} X_j
is close to 120p = 1.2 · 10^6 · 10^{−4} · p = 10^{−4} a, which is the number of mandates party A gets.
How good is this approximation? What is the number of mandates k so that this is close to the real number with 95% confidence?
For any k, we have that
P[ |120 · (1/900) ∑_{j=1}^{900} X_j − 10^{−4} a| > k ] = P[ |(1/900) ∑_{j=1}^{900} (X_j − p)| > k/120 ]
= P[ |(1/30) ∑_{j=1}^{900} (X_j − p)| > k/4 ],
which is close to
P[ |N(0, 1)| > k/(4√(p(1−p))) ]
by the central limit theorem. If we take k ≥ 4√(p(1−p)) · t then this is at most
2 ∫_t^∞ (1/√(2π)) e^{−s²/2} ds ≤ 2 ∫_t^∞ (s/t) (1/√(2π)) e^{−s²/2} ds = (√2/(t√π)) e^{−t²/2}.
For t = 2.05 this is smaller than 0.05. Since p(1 − p) ≤ 1/4 we can take k = 5 > 2 · 2.05 ≥ 4√(p(1−p)) · 2.05 and then
P[ |120 · (1/900) ∑_{j=1}^{900} X_j − 10^{−4} a| > k ] < 0.05.
That is, a poll of only 900 voters determines the number of mandates to within 5 with 95% confidence.
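The 95% confidence claim can be checked by simulation. A sketch with a made-up value of a (the true party-A vote count is of course unknown in practice):

```python
import random

random.seed(2)

a = 120_000        # hypothetical party-A voters out of 1.2 million
p = a / 1_200_000  # = 0.1, so the true number of mandates is 12
true_mandates = 1e-4 * a

def poll_mandates(n=900):
    """Poll n voters; estimate mandates as 120 times the sample mean."""
    hits = sum(1 for _ in range(n) if random.random() < p)
    return 120 * hits / n

trials = 2_000
within_5 = sum(1 for _ in range(trials)
               if abs(poll_mandates() - true_mandates) <= 5) / trials
```

For p = 0.1 the standard deviation of the estimate is only about 1.2 mandates, so the fraction of polls within 5 mandates is in fact much larger than 0.95.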
Example 22.13. Let (X_n)_n be independent identically distributed random variables, and let (Y_n)_n be independent identically distributed random variables, such that E[X_n] = E[Y_n] for all n, and for all n:
Var[X_n] = Var[Y_n] = 1 and Cov(X_n, Y_n) = 1/2.
Let S = ∑_{j=1}^{10^4} X_j and T = ∑_{j=1}^{10^4} Y_j. Give a good approximation for P[S > T].
We have that
S − T = ∑_{j=1}^{10^4} (X_j − Y_j),
where (X_n − Y_n)_n are independent with mean 0 and
Var[X_j − Y_j] = Var[X_j] + Var[Y_j] − 2 Cov(X_j, Y_j) = 1.
So
P[S > T] = P[S − T > 0] = P[(1/100)(S − T) > 0] ≈ P[N(0, 1) > 0] = 1/2.
Example 22.14. A gambler plays a game repeatedly, where in each game he earns Geo(0.1) Shekel and loses Poi(10) Shekel, all wins and losses independent.
• What is a good approximation for the probability that he has more than 20 Shekel after 100 games?
• What is a good approximation for the probability that he has exactly 0 Shekel?
The amount of money after 100 games is
M := ∑_{j=1}^{100} (X_j − Y_j),
where X_j ∼ Geo(0.1) and Y_j ∼ Poi(10), all independent. Since E[X_j − Y_j] = 10 − 10 = 0 and since
Var[X_j − Y_j] = Var[X_j] + Var[Y_j] = 100 · 0.9 + 10 = 100,
we have, using the central limit theorem (so Var[M] = 10^4 and M/100 is approximately standard normal),
P[M > 20] = P[(1/100) M > 0.2] ≈ P[N(0, 1) > 0.2].
For P[M = 0] we cannot use the approximation P[N(0, 1) = 0] = 0! So we use the following trick: since M takes on only integer values, M = 0 if and only if M ∈ [−1/2, 1/2]. So,
P[M = 0] = P[(1/100) M ∈ [−1/200, 1/200]] ≈ P[−1/200 ≤ N(0, 1) ≤ 1/200].
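The continuity-corrected value is easy to evaluate explicitly; note that without the trick the naive answer would be 0. A sketch:

```python
import math

def normal_cdf(t):
    """P[N(0,1) <= t] via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# P[M = 0] ~ P[-1/200 <= N(0,1) <= 1/200]
approx = normal_cdf(1 / 200) - normal_cdf(-1 / 200)

# For such a short interval this is essentially
# (interval length) x (standard normal density at 0):
density_times_length = (1 / 100) * 1 / math.sqrt(2 * math.pi)
```

Both evaluate to roughly 0.004: a lattice random variable with standard deviation 100 spreads its mass over hundreds of integers, so each single value carries only a small probability.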
22.5. Lindeberg's Method

A "hands-on" proof of the CLT (that does not use the Levy Continuity Theorem) was given by Lindeberg (Jarl Lindeberg, 1876–1932). The main idea is: if Z_1, ..., Z_n are independent N(0, 1) random variables, then n^{−1/2}(Z_1 + ··· + Z_n) ∼ N(0, 1). So we can change X_1, X_2, ..., X_n into Z_1, ..., Z_n one by one, each time comparing the sums. Hopefully each of the n differences will not be large.

First, a technical lemma regarding C³ functions.

Lemma 22.15. For any C³ function φ and any x, y ∈ R,
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ min{ ||φ″||_∞ |y|², ||φ‴||_∞ |y|³ }.

Proof. For any y > 0, by Taylor's theorem (Brook Taylor, 1685–1731), expanding around x,
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ (1/6) ||φ‴||_∞ · y³.
The second order Taylor expansion
|φ(x + y) − φ(x) − φ′(x)y| ≤ (1/2) ||φ″||_∞ · y²,
with the triangle inequality gives that
|φ(x + y) − φ(x) − φ′(x)y − (1/2)φ″(x)y²| ≤ |φ(x + y) − φ(x) − φ′(x)y| + (1/2) ||φ″||_∞ · y² ≤ ||φ″||_∞ · y².
If y < 0, apply the previous argument to the function φ(−x), and note that the absolute values of the derivatives of the two functions are always the same. □
Lemma 22.16. Let A, X, Z be independent random variables with E[A] = E[X] = E[Z] = 0 and E[X²] = E[Z²]. Then for any C³ function φ and any ε, δ > 0,
| E[φ(A + δX)] − E[φ(A + δZ)] | ≤ M_φ δ² · ( ε E[X²] + ε E[Z²] + E[Z² 1_{|Z|>εδ^{−1}}] + E[X² 1_{|X|>εδ^{−1}}] ),
where M_φ = max{ ||φ″||_∞, ||φ‴||_∞ }. (We gain a factor ε in the first terms, and an indicator 1_{|X|>εδ^{−1}} in the second.)

Proof. Set
Y = ( φ(A + δX) − φ(A) − φ′(A) · δX − (1/2)φ″(A) · δ²X² ) − ( φ(A + δZ) − φ(A) − φ′(A) · δZ − (1/2)φ″(A) · δ²Z² ).
Since E[X] = E[Z] and E[X²] = E[Z²], we get that
E[Y] = E[φ(A + δX)] − E[φ(A + δZ)].
Also, for Y_X = φ(A + δX) − φ(A) − φ′(A) · δX − (1/2)φ″(A) · δ²X², Lemma 22.15 gives
|E[Y_X]| = | E[Y_X 1_{|X|>εδ^{−1}}] + E[Y_X 1_{|X|≤εδ^{−1}}] |
≤ ||φ″||_∞ δ² E[X² 1_{|X|>εδ^{−1}}] + ||φ‴||_∞ δ³ E[|X|³ 1_{|X|≤εδ^{−1}}]
≤ ||φ″||_∞ δ² E[X² 1_{|X|>εδ^{−1}}] + ε ||φ‴||_∞ δ² E[X²].
The same bound holds for the corresponding term with Z, and the two bounds together give the claim. □
We are now ready to prove the Central Limit Theorem.

Theorem 22.17 (Central Limit Theorem). Let (X_n)_n be independent identically distributed random variables with E[X_k] = 0 and E[X_k²] = 1. Let S_n = ∑_{k=1}^n X_k and Z ∼ N(0, 1). Then, for any C³_b function φ,
E[φ(n^{−1/2} S_n)] → E[φ(Z)],
so consequently, n^{−1/2} S_n →^D Z.

Proof. Let (Z_n)_n be independent N(0, 1) random variables, also independent of (X_n)_n. Note that n^{−1/2} ∑_{k=1}^n Z_k ∼ N(0, 1).
Fix n and let 0 ≤ k ≤ n. Define
A_k = X_1 + ··· + X_k + Z_{k+1} + ··· + Z_n.
So S_n = A_n and n^{−1/2} A_0 ∼ N(0, 1). We can write the telescopic sum
E[φ(Z)] − E[φ(n^{−1/2} S_n)] = ∑_{k=0}^{n−1} ( E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})] ).
So it suffices to bound the increments |E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})]|.
For 0 ≤ k ≤ n − 1 let A = n^{−1/2}(X_1 + ··· + X_k + Z_{k+2} + ··· + Z_n). So n^{−1/2} A_k = A + n^{−1/2} Z_{k+1} and n^{−1/2} A_{k+1} = A + n^{−1/2} X_{k+1}. Thus, by Lemma 22.16, for any ε > 0,
| E[φ(n^{−1/2} A_k)] − E[φ(n^{−1/2} A_{k+1})] | ≤ M_φ · n^{−1} · ( 2ε + E[Z_{k+1}² 1_{|Z_{k+1}|>ε√n}] + E[X_{k+1}² 1_{|X_{k+1}|>ε√n}] ).
Summing over k we get, for any ε > 0,
| E[φ(Z)] − E[φ(n^{−1/2} S_n)] | ≤ M_φ · ( 2ε + E[Z² 1_{|Z|>ε√n}] + E[X² 1_{|X|>ε√n}] ).
Taking n → ∞, and recalling that ε > 0 was arbitrary, it suffices to show that for any random variable X with finite variance, the quantity E[X² 1_{|X|>ε√n}] → 0.
Indeed, the sequence (X² 1_{|X|≤ε√n})_n is monotonically increasing to X². So monotone convergence implies that
E[X² 1_{|X|>ε√n}] = E[X²] − E[X² 1_{|X|≤ε√n}] → 0. □
22.6. Characteristic Function

The characteristic function is actually just the Fourier transform of a random variable.

22.6.1. Preliminaries. Let (X, Y) be a two-dimensional random variable. We can think of (X, Y) as a complex valued random variable and write (X, Y) = X + iY. Define the expectation of a complex valued random variable to be E[X + iY] = E[X] + i E[Y], as long as these expectations are all finite.
Note that if g : R → C is a measurable function, then we can write g = Re g + i Im g, and these are also measurable. Moreover, if X, Y are independent, and g, h : R → C are measurable, then g(X), h(Y) are independent, since Re g(X), Im g(X) are independent of Re h(Y), Im h(Y).

Definition 22.18. Let X be a random variable. Define a function φ_X : R → C by
φ_X(t) := E[e^{itX}] = E[cos(tX)] + i E[sin(tX)].
φ_X is called the characteristic function of X.

Note that φ_X is always defined, since E[|cos(tX)|] ≤ 1 and E[|sin(tX)|] ≤ 1. Moreover, we get that (as a complex number) |φ_X(t)| ≤ 1.
Note that if X is discrete with density f_X then
φ_X(t) = ∑_r f_X(r) e^{itr}.
If X is absolutely continuous with density f_X then
φ_X(t) = ∫_{−∞}^∞ e^{its} f_X(s) ds.
(This latter object is the Fourier transform of f_X.)
Example 22.19. Let's calculate some characteristic functions:
• X ∼ Ber(p):
φ_X(t) = p e^{it} + (1 − p) = 1 − p(1 − e^{it}).
• Z = X + Y for X, Y independent:
φ_Z(t) = E[e^{itX} e^{itY}] = E[e^{itX}] E[e^{itY}] = φ_X(t) · φ_Y(t).
• Y = cX for some real c ∈ R:
φ_Y(t) = E[e^{itcX}] = φ_X(tc).
• X ∼ Bin(n, p): We can write X = ∑_{k=1}^n X_k where (X_k)_{k=1}^n are independent Ber(p). Thus,
φ_X(t) = ∏_{k=1}^n φ_{X_k}(t) = (1 − p(1 − e^{it}))^n.
• X ∼ Geo(p):
φ_X(t) = ∑_{k=1}^∞ (1 − p)^{k−1} p e^{itk} = p e^{it} / (1 − (1 − p) e^{it}).
• X ∼ Poi(λ):
φ_X(t) = ∑_{k=0}^∞ e^{−λ} (λ^k/k!) e^{itk} = e^{−λ(1 − e^{it})}.
• X ∼ U[a, b]:
φ_X(t) = ∫_{−∞}^∞ e^{its} f_X(s) ds = (1/(b − a)) ∫_a^b e^{its} ds = (e^{itb} − e^{ita}) / (it(b − a)).
• X ∼ Exp(λ):
φ_X(t) = ∫_0^∞ λ e^{−λs} e^{its} ds = λ/(λ − it).
• X ∼ N(µ, σ): Write Y = (X − µ)/σ. So Y ∼ N(0, 1). We use the fact that s ↦ sin(ts) e^{−s²/2} is an odd function, so its integral over R is 0. Since e^{its} = cos(ts) + i sin(ts), we get that
φ_Y(t) = ∫_{−∞}^∞ (1/√(2π)) e^{−s²/2} e^{its} ds = ∫_{−∞}^∞ (1/√(2π)) e^{−s²/2} cos(ts) ds.
Let χ_s(t) = e^{−s²/2} cos(ts). Since |χ_s(t)| ≤ 1, since χ′_s(t) = −e^{−s²/2} s sin(ts) is continuous in t for every s, since for any t
∫_R ∫_{−ε}^ε |χ′_s(t + η)| dη ds ≤ 2ε ∫_R |s| e^{−s²/2} ds < ∞,
and since
∫_R χ′_s(t) ds = −∫_{−∞}^∞ sin(ts) · s e^{−s²/2} ds = sin(ts) e^{−s²/2} |_{−∞}^∞ − ∫_{−∞}^∞ t cos(ts) e^{−s²/2} ds = −t ∫_R χ_s(t) ds,
which is continuous in t, we can differentiate φ_Y under the integral to get
φ′_Y(t) = (1/√(2π)) ∫_R χ′_s(t) ds = −t · (1/√(2π)) ∫_R χ_s(t) ds = −t φ_Y(t).
Dividing by φ_Y we get that
(d/dt)(log φ_Y(t)) = φ′_Y(t)/φ_Y(t) = −t.
Integrating, log φ_Y(t) = −t²/2 + C, and since φ_Y(0) = 1 we get that C = 0 and φ_Y(t) = e^{−t²/2}. This completes the calculation for Y ∼ N(0, 1). For X = σY + µ we get
φ_X(t) = E[e^{itσY}] e^{itµ} = e^{itµ} φ_{σY}(t) = e^{itµ} φ_Y(σt) = e^{itµ} e^{−σ²t²/2}.
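These closed forms are easy to sanity-check numerically. A sketch comparing the defining series for the Poi(λ) characteristic function with the closed form e^{−λ(1−e^{it})}, using Python's complex arithmetic:

```python
import cmath
import math

lam, t = 3.0, 0.7

# Direct series:  phi_X(t) = sum_k e^{-lam} * lam^k / k! * e^{itk}
series = sum(math.exp(-lam) * lam**k / math.factorial(k) * cmath.exp(1j * t * k)
             for k in range(100))

# Closed form computed above:
closed = cmath.exp(-lam * (1 - cmath.exp(1j * t)))
```

Truncating the series at k = 100 is harmless here, since the Poisson(3) tail beyond 100 is astronomically small.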
Lemma 22.20. Let X be a random variable with E[X] = 0 and E[X²] = 1. Then, as t → 0,
φ_X(t) = 1 − t²/2 + o(t²).

Proof. As in Lemma 22.15, we get that for any x ∈ R and t > 0,
| e^{itx} − 1 − itx + (t²/2) x² | ≤ min{ t²x², t³|x|³ }.
Thus, for any ε > 0, using E[X] = 0 and E[X²] = 1,
| E[e^{itX}] − (1 − t²/2) | ≤ E[ min{ t²X², t³|X|³ } ] ≤ t² E[X² 1_{|X|>εt^{−1}}] + ε t².
We have already seen that for 0 < η < 1, as t → 0, E[X² 1_{|X|>t^{η−1}}] → 0 by monotone convergence. So choosing ε = t^η, as t → 0,
φ_X(t) = 1 − t²/2 + o(t²). □
22.7. Levy's Continuity Theorem

Perhaps the main theorem using characteristic functions is a theorem of Paul Levy (1886–1971), giving a necessary and sufficient condition for convergence in distribution.

Theorem 22.21 (Levy's Continuity Theorem). Let (X_n)_n be a sequence of random variables, and let X be another random variable. Then, X_n →^D X if and only if φ_{X_n}(t) → φ_X(t) for all t ∈ R.

22.7.1. Proof of CLT. We can use Levy's Continuity Theorem to give a different proof of the Central Limit Theorem.

Proof of Theorem 22.6. Assume the X_n are identically distributed, and let φ(t) = φ_{X_1}(t) be their common characteristic function. Note that
φ_{S_N/√N}(t) = φ(t/√N)^N.
By Lemma 22.20, as N → ∞,
φ(t/√N) = 1 − t²/(2N) + o(t² N^{−1}).
Thus,
lim_{N→∞} φ(t/√N)^N = e^{−t²/2}.
So φ_{S_N/√N}(t) → e^{−t²/2}, which is the characteristic function of a N(0, 1) random variable. By Levy's Continuity Theorem we get that S_N/√N →^D N(0, 1). □
22.8. An Explicit Approximation of the Step Function

In this section we construct a function ψ which is C³_b, takes the value 1 on (−∞, 0], is decreasing on [0, 1], and takes the value 0 on [1, ∞).
Some consideration leads us to understand that we would like ψ′ < 0 on (0, 1), and so we search for a function such that ψ″(x) = −ψ″(1 − x) on [0, 1], vanishes at the endpoints 0, 1, and obtains a local minimum or maximum at the endpoints 0, 1. Some thought leads to the function g(x) = x²(1 − x)²(1 − 2x). Integrating and taking a derivative we have that for
h(x) = x⁴/12 − x⁵/5 + x⁶/6 − x⁷/21,
the derivatives on (0, 1) satisfy h′(x) = (1/3) x³(1 − x)³, h″(x) = g(x) and h‴(x) = g′(x) = 2x − 12x² + 20x³ − 10x⁴.
Set A = 420 = 7 · 5 · 3 · 4. Because h(1) = A^{−1} and h(0) = 0 we take
ψ(x) = 1 − A · h(x) for x ∈ [0, 1], ψ(x) = 1 for x < 0, ψ(x) = 0 for x > 1.
We have that on (0, 1), ψ′ = −A h′ < 0; also ψ″ = −A h″ and ψ‴ = −A h‴. These three derivatives all vanish at 0 and 1, so ψ is C³_b.
Figure 6. The second derivative −20 · g(x) and the function 1−A · h(x).
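The arithmetic behind the construction (h(1) = 1/420 and the vanishing of h′, h″, h‴ at the endpoints) can be verified numerically; a small sketch:

```python
def h(x):
    return x**4 / 12 - x**5 / 5 + x**6 / 6 - x**7 / 21

def h1(x):  # h'
    return x**3 * (1 - x)**3 / 3

def h2(x):  # h'' = g
    return x**2 * (1 - x)**2 * (1 - 2 * x)

def h3(x):  # h''' = g'
    return 2 * x - 12 * x**2 + 20 * x**3 - 10 * x**4

A = 420
# psi(1) = 1 - A*h(1) should be 0, so psi glues continuously to 0 on [1, oo):
psi_at_1 = 1 - A * h(1)

# All three derivatives vanish at both endpoints, so psi is C^3:
endpoint_derivatives = [f(x) for f in (h1, h2, h3) for x in (0.0, 1.0)]
```

The endpoint values come out exactly zero, and psi_at_1 is zero up to floating-point rounding.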
Lecture 23
23.1. Concentration of Measure

Suppose (X_n)_n is a sequence of i.i.d. random variables with P[X_n = 1] = P[X_n = −1] = 1/2. Let
S_n = ∑_{k=1}^n X_k.
So E[S_n] = 0 and Var[S_n] = E[S_n²] = n.
Chebyshev's inequality tells us that
P[|S_n| ≥ λ√n] ≤ λ^{−2}.
However, the central limit theorem gives that for large n,
P[|S_n| ≥ λ√n] ≈ P[|N(0, 1)| ≥ λ] = ∫_λ^∞ (1/√(2π)) e^{−s²/2} ds + ∫_{−∞}^{−λ} (1/√(2π)) e^{−s²/2} ds
≤ √(2/π) · λ^{−1} ∫_λ^∞ s e^{−s²/2} ds = √(2/π) λ^{−1} · (−e^{−s²/2}) |_λ^∞ = √(2/π) λ^{−1} e^{−λ²/2}.
This decays much faster than λ^{−2}. However, we don't know how good our approximation by a standard normal is.
Because (X_n)_n are independent, we can obtain a very nice concentration result using a smart trick by Bernstein (Sergei Bernstein, 1880–1968).
Theorem 23.1 (Bernstein's Inequality). Let (X_n)_n be independent random variables such that for all n, |X_n| ≤ 1 a.s. and E[X_n] = 0. Let S_n = ∑_{k=1}^n X_k. Then for any λ > 0,
P[S_n ≥ λ] ≤ exp( −λ²/(2n) ),
and consequently,
P[|S_n| ≥ λ] ≤ 2 exp( −λ²/(2n) ).

Proof. There are two main ideas:
The first idea is that for a random variable X with E[X] = 0 and |X| ≤ 1 a.s. one has E[e^{αX}] ≤ e^{α²/2}. Indeed, g(x) = e^{αx} is a convex function, so for |x| ≤ 1 we can write x = β · 1 + (1 − β) · (−1), where β = (x + 1)/2, and
e^{αx} ≤ β e^α + (1 − β) e^{−α} = cosh(α) + x sinh(α).
(Here 2 cosh(α) = e^α + e^{−α} and 2 sinh(α) = e^α − e^{−α}.) Thus, because E[X] = 0, and using (2k)! ≥ 2^k k!,
E[e^{αX}] ≤ cosh(α) + E[X] sinh(α) = cosh(α) = ∑_{k=0}^∞ α^{2k}/(2k)! ≤ ∑_{k=0}^∞ α^{2k}/(2^k k!) = e^{α²/2}.
For the second idea, due to Sergei Bernstein, one applies the Chebyshev / Markov inequality to the non-negative random variable e^{αX}, and then optimizes over α.
In our case, using independence,
E[e^{αS_n}] = ∏_{k=1}^n E[e^{αX_k}] ≤ e^{α²n/2}.
Now apply Markov's inequality to the non-negative random variable e^{αS_n} to get, for any α > 0,
P[S_n ≥ λ] = P[e^{αS_n} ≥ e^{αλ}] ≤ exp( (1/2) n α² − αλ ).
Optimizing over α, we get for α = λ/n,
P[S_n ≥ λ] ≤ exp( −λ²/(2n) ). □
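A quick empirical comparison of the Bernstein bound with the true tail, for ±1 steps; the values of n and λ below are arbitrary choices for illustration:

```python
import math
import random

random.seed(3)

n, lam, trials = 200, 20.0, 5_000

def S_n():
    """Sum of n independent +/-1 steps: mean 0 and |X_k| <= 1."""
    return sum(random.choice((-1, 1)) for _ in range(n))

empirical = sum(1 for _ in range(trials) if S_n() >= lam) / trials
bernstein = math.exp(-lam * lam / (2 * n))  # = e^{-1}, about 0.368
```

The bound always holds, but it is not tight: for these parameters the true tail is roughly 0.09, well below e⁻¹.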
Example 23.2. Suppose a gambler plays a game repeatedly and independently, such that with probability p < 1/2 he earns one dollar, and he loses one dollar with probability 1 − p.
For every n let X_n be the amount he won in the n-th game. So S_n = ∑_{k=1}^n X_k is the total amount he has won after n games.
Note that E[X_k] = 2p − 1 < 0, so Y_k = (1/2)(X_k − (2p − 1)) has expectation 0 and |Y_k| ≤ 1. By Bernstein's inequality,
P[S_n > 0] = P[ ∑_{k=1}^n Y_k > (n/2)(1 − 2p) ] ≤ exp( −n(1 − 2p)²/8 ).
So the probability of being ahead is very small, even if the house has only a very small advantage.
Lecture 24
24.1. Review

Exercise 24.1. We have N urns. Urn number k contains k white balls and N − k black balls.
An urn is chosen randomly, all urns equally likely. Then, we start removing balls from that urn, all balls equally likely, all draws independent, returning each removed ball to the urn after noting its color. Let A be the event that the first n balls removed are white. Let B be the event that the (n + 1)-th ball removed is black.
Calculate P[A], P[B ∩ A], P[B|A].
Now calculate the same if the process is that each time a random urn is chosen independently, and then a ball is removed from that urn, noted, and returned.

Solution to Exercise 24.1. For the first scenario, let C_k be the event that the k-th urn was chosen. Independence gives us that
P[A|C_k] = (k/N)^n and P[B ∩ A|C_k] = (k/N)^n · (N − k)/N.
Thus, by the law of total probability,
P[A] = (1/N^{n+1}) ∑_{k=1}^N k^n and P[B ∩ A] = (1/N^{n+2}) ∑_{k=1}^N k^n (N − k).
Thus,
P[B|A] = 1 − (1/N) · ( ∑_{k=1}^N k^{n+1} ) / ( ∑_{k=1}^N k^n ).
In the second scenario, the probability that a white ball is chosen each time is
(1/N) ∑_{k=1}^N k/N = (1/N²) · (N(N + 1)/2) = (N + 1)/(2N) =: p.
All draws are independent, so
P[A] = p^n, P[B ∩ A] = p^n (1 − p), P[B|A] = 1 − p. □
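The closed form for P[B|A] in the first scenario can be cross-checked against the law-of-total-probability sums directly; a sketch for small N and n:

```python
N, n = 10, 3

# Law of total probability, conditioning on the chosen urn:
P_A = sum((k / N)**n for k in range(1, N + 1)) / N
P_BA = sum((k / N)**n * (N - k) / N for k in range(1, N + 1)) / N
P_B_given_A = P_BA / P_A

# Closed form from the solution: 1 - (1/N) * (sum k^{n+1}) / (sum k^n)
formula = 1 - (sum(k**(n + 1) for k in range(1, N + 1))
               / (N * sum(k**n for k in range(1, N + 1))))
```

Note that P[B|A] is much smaller than the unconditional probability of a black ball: a run of white draws makes the high-index (mostly white) urns more likely.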
Exercise 24.2. Random variables X, Y are equally distributed if F_X = F_Y. Show that if X, Y are equally distributed then E[X] = E[Y] if the expectations exist, and E[X] does not exist if and only if E[Y] does not exist. (Hint: start with simple random variables, go through non-negative, continue to general. First show that P_X = P_Y.)

Solution to Exercise 24.2. The first observation is that P[X ∈ B] = P[Y ∈ B] for any Borel set B ∈ B. Indeed, the probability measures P_X, P_Y agree on the π-system { (−∞, t] : t ∈ R } by definition, and this π-system generates B, so P_X, P_Y agree on all of B.
Now, note that if g : R → R is a measurable function and X, Y are equally distributed, then so are g(X), g(Y). This is because
P[g(X) ≤ t] = P[X ∈ g^{−1}((−∞, t])] = P[Y ∈ g^{−1}((−∞, t])] = P[g(Y) ≤ t].
The next step is to prove the claim for simple random variables: if X, Y are simple and equally distributed then
P[X = r] = F_X(r) − F_X(r−) = F_Y(r) − F_Y(r−) = P[Y = r].
Let R be a (common) range for X and Y. Linearity of expectation now gives us that
E[X] = ∑_{r∈R} r P[X = r] = ∑_{r∈R} r P[Y = r] = E[Y].
Now assume that X, Y ≥ 0. Let g_n(r) = min{ 2^{−n} ⌊2^n r⌋, n }. So 0 ≤ g_n(X) ↗ X and 0 ≤ g_n(Y) ↗ Y. Monotone convergence gives that
E[X] = lim_n E[g_n(X)] = lim_n E[g_n(Y)] = E[Y],
where we have used the fact that g_n(X), g_n(Y) are equally distributed simple random variables for all n. Note that the above also holds if E[X] = E[Y] = ∞.
Finally, for general equally distributed X, Y: X⁺, Y⁺ are equally distributed and non-negative. Thus, E[X⁺] = E[Y⁺]. Similarly, E[X⁻] = E[Y⁻]. Thus,
E[X] = E[X⁺] − E[X⁻] = E[Y⁺] − E[Y⁻] = E[Y],
if at least one of E[X⁺] = E[Y⁺] or E[X⁻] = E[Y⁻] is finite. If both are infinite, then both E[X] and E[Y] do not exist. □
Exercise 24.3. X ∼ U[−1, 1]. Calculate:
• P[|X| > 1/2].
• P[sin(πX/2) > 1/2].

Solution to Exercise 24.3. Note that {|X| > 1/2} = {X > 1/2} ⊎ {X < −1/2}. So,
P[|X| > 1/2] = P[X > 1/2] + P[X < −1/2] = 1/4 + 1/4 = 1/2.
Since sin is increasing on [−π/2, π/2] and sin(π/6) = 1/2, we get that
P[sin(πX/2) > 1/2] = P[πX/2 > π/6] = P[X > 1/3] = 1/3. □
Exercise 24.4. B, C are independent and B ∼ Exp(λ), C ∼ U[0, 1]. What is the probability that the equation x² + 2Bx + C = 0 has two different real solutions?

Solution to Exercise 24.4. We are looking for the probability that the discriminant is positive:
P[4B² − 4C > 0] = P[B² > C].
For any t > 0,
P[B² ≤ t] = P[B ≤ √t] = ∫_0^{√t} λ e^{−λs} ds = ∫_0^t (1/(2√u)) λ e^{−λ√u} du,
so B² is absolutely continuous with this density. Note that for all s > 0,
∫_s^∞ f_{B²}(u) du = ∫_s^∞ (1/(2√u)) λ e^{−λ√u} du = ∫_{√s}^∞ λ e^{−λt} dt = e^{−λ√s}.
Since B², C are independent, we have that
f_{B²|C}(t|s) = f_{B²}(t).
Thus,
P[B² > C] = ∫_0^1 ∫_s^∞ f_{B²,C}(t, s) dt ds = ∫_0^1 ∫_s^∞ f_{B²|C}(t|s) dt ds = ∫_0^1 e^{−λ√s} ds
= 2 ∫_0^1 u e^{−λu} du = −(2u/λ) e^{−λu} |_0^1 + (2/λ) ∫_0^1 e^{−λu} du
= −(2/λ) e^{−λ} − (2/λ²)(e^{−λ} − 1) = (2/λ²) · (1 − (1 + λ) e^{−λ}).
We now have a non-trivial inequality: (2/λ²) · (1 − (1 + λ) e^{−λ}) ≤ 1, which is equivalent to
e^λ − (1 + λ) ≤ (λ²/2) · e^λ.
This indeed holds, as 2 · k! ≤ (k + 2)! for all k and
e^λ − (1 + λ) = ∑_{k=0}^∞ λ^{k+2}/(k + 2)! ≤ ∑_{k=0}^∞ (λ^k/k!) · (λ²/2). □
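The closed form (2/λ²)(1 − (1 + λ)e^{−λ}) can be checked by Monte Carlo; a sketch with λ = 2 (random.expovariate takes the rate λ, matching the Exp(λ) density λe^{−λs} used here):

```python
import math
import random

random.seed(4)

lam = 2.0
closed_form = (2 / lam**2) * (1 - (1 + lam) * math.exp(-lam))

trials = 200_000
hits = sum(1 for _ in range(trials)
           if random.expovariate(lam)**2 > random.random())
monte_carlo = hits / trials
```

For λ = 2 the closed form is about 0.297, and the Monte Carlo estimate agrees to within sampling error of order 1/√trials.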
Exercise 24.5. Let (X, Y) be jointly absolutely continuous with
f_Y(s) = C(s³ − s − 1) for s ∈ [2, 4], and f_Y(s) = 0 otherwise,
and for all s ∈ [2, 4],
f_{X|Y}(t|s) = C(s) cos( (π/(2s)) t ) for t ∈ [0, s], and f_{X|Y}(t|s) = 0 otherwise.
Calculate:
• C and C(s) for all s ∈ [2, 4].
• E[Y], E[X|Y = s], E[X].
• Var[Y], Var[X].
• Cov(X, Y) and Var[X + Y].

Solution to Exercise 24.5. For C:
1 = C ∫_2^4 (s³ − s − 1) ds = C · ( 4⁴/4 − 4²/2 − 4 − 2⁴/4 + 2²/2 + 2 ) = C · (60 − 6 − 2) = C · 52,
so C = 1/52. For C(s):
1 = C(s) ∫_0^s cos( (π/(2s)) t ) dt = C(s) · (2s/π) sin( (π/(2s)) t ) |_0^s = C(s) · 2s/π,
so C(s) = π/(2s).
We also have
E[Y] = 52^{−1} ∫_2^4 (s⁴ − s² − s) ds = (4⁵ − 2⁵)/(5 · 52) − (4³ − 2³)/(3 · 52) − (4² − 2²)/(2 · 52) = 5212/(30 · 52).
For s ∈ [2, 4], using the fact that (x sin x + cos x)′ = sin x + x cos x − sin x = x cos x, we have
E[X|Y = s] = C(s) ∫_0^s t cos(C(s) t) dt = C(s)^{−1} ∫_0^{π/2} t cos t dt = C(s)^{−1} · (π/2 − 1) = (1 − 2/π) · s.
To calculate E[X] we use the fact that f_{X,Y}(t, s) = f_Y(s) f_{X|Y}(t|s) for all s ∈ [2, 4], and f_{X,Y}(t, s) = 0 otherwise. Thus, integrating the function (t, s) ↦ t,
E[X] = ∫_2^4 f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = (1 − 2/π) ∫_2^4 s f_Y(s) ds = (1 − 2/π) · E[Y].
Now, similarly,
E[XY] = ∫_2^4 s f_Y(s) ∫_0^s t f_{X|Y}(t|s) dt ds = (1 − 2/π) · E[Y²],
so we get that
Cov(X, Y) = (1 − 2/π) · E[Y²] − (1 − 2/π) · E[Y] · E[Y] = (1 − 2/π) · Var[Y]. □
Exercise 24.6. A chicken lays an egg of random size. The size of the egg has the distribution of |N| where N ∼ N(0, 5). For any s > 0, given that an egg is of size s, the time until the egg hatches is distributed Exp(s^{−2}). What is the expected time for an egg to hatch?

Solution to Exercise 24.6. Let S be the size and T the time. We are given that S ∼ |N(0, 5)|. That is, for any s > 0,
P[S ≤ s] = P[N(0, 5) ∈ [−s, s]] = ∫_{−s}^s (1/(√(2π) · 5)) e^{−t²/50} dt = 2 ∫_0^s (1/(√(2π) · 5)) e^{−t²/50} dt.
So S is absolutely continuous with density
f_S(s) = (√2/(5√π)) e^{−s²/50} for s > 0, and f_S(s) = 0 for s ≤ 0.
So we can calculate the expectation of T:
E[T] = ∫∫ t f_{T,S}(t, s) dt ds = ∫_0^∞ f_S(s) ∫_0^∞ t s^{−2} e^{−s^{−2} t} dt ds = ∫_0^∞ s² f_S(s) ds = E[S²] = E[N(0, 5)²] = 25. □
Exercise 24.7. Let X be a discrete random variable with range Q. Suppose that E[X] = µ, E[X²] = σ and E[X³] = ρ. Let Y be a discrete random variable defined by Y | X = q ∼ Poi(q²).
Calculate Cov(X, Y).

Solution to Exercise 24.7. Note that for any q ∈ Q,
E[Y | X = q] = ∑_{k=0}^∞ k e^{−q²} q^{2k}/k! = q².
Note that for q ∈ Q and k ∈ N,
f_{Y,X}(k, q) = f_X(q) e^{−q²} q^{2k}/k!.
So
E[XY] = ∑_{q∈Q} ∑_{k=0}^∞ q k f_{Y,X}(k, q) = ∑_{q∈Q} q³ f_X(q) = ρ,
and
E[Y] = ∑_{q∈Q} ∑_{k=0}^∞ k f_{Y,X}(k, q) = ∑_{q∈Q} q² f_X(q) = σ.
Thus, Cov(X, Y) = E[XY] − E[X] E[Y] = ρ − σµ. □