Introduction to Probability

Motoya Machida

January 17, 2006

This material is designed to provide a foundation in the mathematics of probability. We begin with the basic concepts of probability, such as events, random variables, independence, and conditional probability. Having developed these concepts, the remainder of the material concentrates on the methods of calculation in probability and their applications through exercises. The topics treated here are divided into four major parts: (I) probability and distributions, (II) discrete distributions, (III) continuous distributions, and (IV) joint distributions. When completed, these topics will unify the understanding of density and distribution functions of various kinds, calculation and interpretation of moments, and distributions related to the normal distribution.

1 Sample space and events

The set of all possible outcomes of an experiment is called the sample space, and is typically denoted by Ω. For example, if the outcome of an experiment is the order of finish in a race among 3 boys, Jim, Mike and Tom, then the sample space becomes

Ω = {(J, M, T), (J, T, M), (M, J, T), (M, T, J), (T, J, M), (T, M, J)}.

In another example, suppose that a researcher is interested in the lifetime of a transistor, and measures it in minutes. Then the sample space is represented by

Ω = {all nonnegative real numbers}.

Any subset of the sample space is called an event. In the who-wins-the-race example, “Mike wins the race” is an event, which we denote by A. Then we can write

A = {(M, J, T), (M, T, J)}.

In the transistor example, “the transistor does not last longer than 2500 minutes” is an event, which we denote by E. And we can write

E = {x : 0 ≤ x ≤ 2500}.

1.1 Set operations

Once we have events A, B, . . ., we can define a new event from these events by using set operations—union, intersection, and complement. The event A ∪ B, called the union of A and B, means that either A or B or both occurs. Consider the “who-wins-the-race” example, and let B denote the event “Jim wins the second place”, that is,

B = {(M, J, T), (T, J, M)}.


Then A ∪ B means that “either Mike wins the first, or Jim wins the second, or both,” that is,

A ∪ B = {(M, T, J), (T, J, M), (M, J, T)}.

The event A ∩ B, called the intersection of A and B, means that both A and B occur. In our example, A ∩ B means that “Mike wins the first and Jim wins the second,” that is,

A ∩ B = {(M, J, T)}.

The event A^c, called the complement of A, means that A does not occur. In our example, the event A^c means that “Mike does not win the race”, that is,

A^c = {(J, M, T), (J, T, M), (T, J, M), (T, M, J)}.

Now suppose that C is the event “Tom wins the race.” Then what is the event A ∩ C? It is impossible that Mike wins and Tom wins at the same time, so the event A ∩ C contains no outcomes. The set containing no elements is called the empty set, denoted by ∅, and we can express this in the form

A ∩ C = ∅.

If the two events A and B satisfy A ∩ B = ∅, then they are said to be disjoint. For example, “Mike wins the race” and “Tom wins the race” are disjoint events.

1.2 Exercises

1. Two six-sided dice are thrown sequentially, and the face values that come up are recorded.

(a) List the sample space Ω.

(b) List the elements that make up the following events:

i. A = the sum of the two values is at least 5
ii. B = the value of the first die is higher than the value of the second
iii. C = the first value is 4

(c) List the elements that make up the following events:

i. A ∩ C

ii. B ∪ C

iii. A ∩ (B ∪ C)

2 Probability and combinatorial methods

A probability P is a “function” defined over all the “events” (that is, all the subsets of Ω), and must satisfy the following properties:

1. If A ⊂ Ω, then 0 ≤ P(A) ≤ 1.

2. P(Ω) = 1.

3. If A1, A2, . . . are events and Ai ∩ Aj = ∅ for all i ≠ j, then P(⋃_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai).

Property 3 is called the “axiom of countable additivity,” which is clearly motivated by the property:

P (A1 ∪ · · · ∪ An) = P (A1) + · · · + P (An) if A1, A2, . . . , An are mutually exclusive events.

Quite frankly, there is nothing in one’s intuitive notion that requires this axiom.


2.1 Equally likely outcomes

When the sample space

Ω = {ω1, . . . , ωN}

consists of N outcomes, it is often natural to assume that P(ω1) = · · · = P(ωN). Then we can obtain the probability of a single outcome by

P(ωi) = 1/N for i = 1, . . . , N.

Assuming this, we can compute the probability of any event A by counting the number of outcomes in A, and obtain

P(A) = (number of outcomes in A) / N.

A problem involving equally likely outcomes can be solved by counting. The mathematics of counting is known as combinatorial analysis. Here we summarize several basic facts—the multiplication principle, permutations, and combinations.

2.2 Multiplication principle

If one experiment has m outcomes, and a second has n outcomes, then there are m × n outcomes for the two experiments.

The multiplication principle can be extended and used in the following experiment. Suppose that we have n differently numbered balls in an urn. In the first trial, we pick up a ball from the urn, record its number, and put it back into the urn. In the second trial, we again pick up, record, and put back a ball. Continue the same procedure until the r-th trial. The whole process is called a sampling with replacement. By generalizing the multiplication principle, the number of all possible outcomes becomes

n × · · · × n (r factors) = n^r.
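To make the counting concrete, the following Python sketch (the values n = 5 and r = 3 are arbitrary choices for illustration) enumerates every ordered sample drawn with replacement and checks that the count equals n^r.

    from itertools import product

    n, r = 5, 3                      # 5 numbered balls, 3 draws (arbitrary illustration)
    balls = range(1, n + 1)

    # Sampling with replacement: each draw can be any of the n balls,
    # so the outcomes are exactly the ordered r-tuples of ball numbers.
    outcomes = list(product(balls, repeat=r))

    print(len(outcomes))             # 125
    print(n ** r)                    # 125, agreeing with the multiplication principle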

Example 1. How many different 7-place license plates are possible if the first 3 places are to be occupied by letters and the final 4 by numbers?

Example 2. How many different functions f defined on {1, . . . , n} are possible if the value f(i) takes either 0 or 1?

2.3 Permutations

Consider again “n balls in an urn.” In this experiment, we pick up a ball from the urn, record its number, but do not put it back into the urn. In the second trial, we again pick up, record, and do not put back a ball. Continue the same procedure r times. The whole process is called a sampling without replacement. Then, the number of all possible outcomes becomes

n × (n − 1) × (n − 2) × · · · × (n − r + 1) (r factors).

Given a set of n different elements, the number of all the possible “ordered” sets of size r is called the r-element permutation of an n-element set. In particular, the n-element permutation of an n-element set is expressed as

n × (n − 1) × (n − 2) × · · · × 2 × 1 (n factors) = n!

Lecture Note Page 3

Page 4: Introduction to Probability - tntech.edu...Introduction to Probability Motoya Machida January 17, 2006 This material is designed to provide a foundation in the mathematics of probability

Introduction to Probability

The above mathematical symbol “n!” is called “n factorial.” We define 0! = 1, since there is one way to order 0 elements.
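As a quick sanity check of the permutation count, the short Python sketch below (n = 5 and r = 3 are again arbitrary) compares a brute-force enumeration of ordered samples without replacement with the formula n × (n − 1) × · · · × (n − r + 1) = n!/(n − r)!.

    from itertools import permutations
    from math import factorial, perm

    n, r = 5, 3                                      # arbitrary illustration
    balls = range(1, n + 1)

    ordered_samples = list(permutations(balls, r))   # sampling without replacement, order kept

    print(len(ordered_samples))                      # 60
    print(perm(n, r))                                # 60, i.e. n!/(n-r)!
    print(factorial(n) // factorial(n - r))          # 60 again, matching the formula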

Example 3. How many different 7-place license plates are possible if the first 3 places are using different letters and the final 4 using different numbers?

Example 4. How many different functions f defined on {1, . . . , n} are possible if the function f takes values on {1, . . . , n} and satisfies f(i) ≠ f(j) for all i ≠ j?

2.4 Combinations

In an experiment, we have n differently numbered balls in an urn, and pick up r balls “as a group.” Then there are

n! / (n − r)!

ways of selecting the group if the order is relevant (r-element permutation). However, the order is irrelevant when you choose r balls as a group, and any particular r-element group has been counted exactly r! times. Thus, the number of r-element groups can be calculated as

C(n, r) = n! / ((n − r)! r!).

You should read the above symbol as “n choose r.”
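The following sketch (again with arbitrary n = 5 and r = 3) counts the r-element groups directly, compares the result with n!/((n − r)! r!), and illustrates that each group appears r! times among the ordered samples.

    from itertools import combinations
    from math import comb, factorial, perm

    n, r = 5, 3                                       # arbitrary illustration
    groups = list(combinations(range(1, n + 1), r))   # unordered r-element groups

    print(len(groups))                                # 10
    print(comb(n, r))                                 # 10, "n choose r"
    print(perm(n, r) // factorial(r))                 # 10: ordered samples divided by r! orderings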

Example 5. A committee of 3 is to be formed from a group of 20 people. How many different committees are possible?

Example 6. From a group of 5 women and 7 men, how many different committees consisting of 2 women and 3 men can be formed?

2.5 Binomial theorem

The term “n choose r” is often referred to as a binomial coefficient, because of the following identity:

(a + b)^n = ∑_{i=0}^{n} C(n, i) a^i b^{n−i}.

In fact, we can give a proof of the binomial theorem by using combinatorial analysis. For a while we pretend that the “commutative law” cannot be used for multiplication. Then, for example, the expansion of (a + b)^2 becomes

aa + ab + ba + bb,

and consists of the 4 terms, aa, ab, ba, and bb. In general, how many terms in the expansion of (a + b)^n should contain a’s in exactly i places? The answer to this question indicates the binomial theorem.
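A small numerical check of the identity, with arbitrary numbers a = 1.7, b = 2.3 and n = 6: the expansion built from the binomial coefficients agrees with (a + b)^n.

    from math import comb, isclose

    a, b, n = 1.7, 2.3, 6                              # arbitrary illustration
    expansion = sum(comb(n, i) * a**i * b**(n - i) for i in range(n + 1))

    print(expansion, (a + b) ** n)                     # both 4096.0, since (1.7 + 2.3)^6 = 4^6
    print(isclose(expansion, (a + b) ** n))            # True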

2.6 Optional topic

A lot of n items contains k defectives, and m are selected randomly and inspected. How should the value of m be chosen so that the probability that at least one defective item turns up is 0.9? Apply your answer to the following cases:

1. n = 1000 and k = 10;

Lecture Note Page 4

Page 5: Introduction to Probability - tntech.edu...Introduction to Probability Motoya Machida January 17, 2006 This material is designed to provide a foundation in the mathematics of probability

Introduction to Probability

2. n = 10,000 and k = 100.

Suppose that we have a lot of size n containing k defectives. If we sample and inspect m random items, what is the probability that we will find i defectives in our sample? This probability is called a hypergeometric distribution, and expressed by

p(i) = C(k, i) C(n − k, m − i) / C(n, m),   i = max{0, m + k − n}, . . . , min{m, k}.
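A sketch of how the hypergeometric probabilities and the optional topic can be computed; the function names here are just illustrative. Since P(at least one defective) = 1 − C(n − k, m)/C(n, m), the smallest suitable m can be found by a direct search.

    from math import comb

    def hypergeom_pmf(i, n, k, m):
        """P(i defectives in a sample of m from a lot of n items containing k defectives)."""
        return comb(k, i) * comb(n - k, m - i) / comb(n, m)

    def smallest_sample_size(n, k, target=0.9):
        """Smallest m with P(at least one defective) = 1 - C(n-k, m)/C(n, m) >= target."""
        for m in range(1, n + 1):
            if 1 - comb(n - k, m) / comb(n, m) >= target:
                return m

    print(smallest_sample_size(1000, 10))      # case 1 of the optional topic
    print(smallest_sample_size(10_000, 100))   # case 2 of the optional topic
    print(hypergeom_pmf(1, 1000, 10, 50))      # e.g. exactly one defective in a sample of 50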

2.7 Exercises

1. How many ways are there to encode the 26-letter English alphabet into 8-bit binary words (sequences of eight 0’s and 1’s)?

2. What is the coefficient of x^3 y^4 in the expansion of (x + y)^7?

3. A child has six blocks, three of which are red and three of which are green. How many patterns can she make by placing them all in a line?

4. A deck of 52 cards is shuffled thoroughly. What is the probability that the four aces are all next to each other? (Hint: Imagine that you have 52 slots lined up to place the four aces. How many different ways can you “choose” four slots for those aces? How many different ways do you get consecutive slots for those aces?)

3 Conditional probability and independence

Let A and B be two events where P(B) ≠ 0. Then the conditional probability of A given B can be defined as

P(A|B) = P(A ∩ B) / P(B).

The idea of “conditioning” is that “if we know that B has occurred, the sample space should become B rather than Ω.”

Example 1. Mr. and Mrs. Jones have two children. What is the conditional probability that their children are both boys, given that they have at least one son?
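Example 1 can be answered by enumerating the equally likely sample space {BB, BG, GB, GG}; the sketch below computes P(A ∩ B)/P(B) directly from the counts.

    from itertools import product

    # Equally likely outcomes: (older child, younger child)
    omega = list(product("BG", repeat=2))           # ('B','B'), ('B','G'), ('G','B'), ('G','G')

    both_boys = [w for w in omega if w == ("B", "B")]
    at_least_one_son = [w for w in omega if "B" in w]

    # P(both boys | at least one son) = |A ∩ B| / |B| for equally likely outcomes,
    # and here the event "both boys" is contained in "at least one son".
    print(len(both_boys) / len(at_least_one_son))   # 1/3 ≈ 0.333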

3.1 Bayes rule

Suppose that P (A|B) and P (B) are known. Then we can use this to find

P (A ∩ B) = P (A|B)P (B).

Example 2. Celine estimates that her chance to receive an A grade would be 1/2 in a French course, and 2/3 in a chemistry course. If she decides whether to take a French course or a chemistry course this semester based on the flip of a coin, what is the probability that she gets an A in chemistry?

Law of total probability. Let A and B be two events. Then we can write the probability P (A) as

P (A) = P (A ∩ B) + P (A ∩ Bc) = P (A|B)P (B) + P (A|Bc)P (Bc).


In general, suppose that we have a sequence B1, B2, . . . , Bn of mutually disjoint events satisfying ⋃_{i=1}^{n} Bi = Ω, where “mutual disjointness” means that Bi ∩ Bj = ∅ for all i ≠ j. (The events B1, B2, . . . , Bn are called “a partition of Ω.”) Then for any event A we have

P(A) = ∑_{i=1}^{n} P(A ∩ Bi) = ∑_{i=1}^{n} P(A|Bi) P(Bi).

Example 3. A study shows that an accident-prone person will have an accident in a one-year period with probability 0.4, whereas this probability is only 0.2 for a non-accident-prone person. Suppose that we assume that 30 percent of all the new policyholders of an insurance company are accident prone. Then what is the probability that a new policyholder will have an accident within a year of purchasing a policy?
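Example 3 is a direct application of the law of total probability with B = {accident prone} and B^c = {not accident prone}; the numbers below are taken from the example itself.

    p_accident_given_prone = 0.4        # from Example 3
    p_accident_given_not_prone = 0.2
    p_prone = 0.3

    # P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
    p_accident = (p_accident_given_prone * p_prone
                  + p_accident_given_not_prone * (1 - p_prone))
    print(p_accident)                   # 0.26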

Bayes’ rule. Let A and B1, . . . , Bn be events such that the Bi’s are mutually disjoint, ⋃_{i=1}^{n} Bi = Ω and P(Bi) > 0 for all i. Then

P(Bj|A) = P(A ∩ Bj) / P(A) = P(A|Bj) P(Bj) / ∑_{i=1}^{n} P(A|Bi) P(Bi).

3.2 Concept of independence

Intuitively we would like to say that A and B are independent if knowing about one event gives us no information about the other. That is, P(A|B) = P(A) and P(B|A) = P(B). We say A and B are independent if

P (A ∩ B) = P (A)P (B).

This definition is symmetric in A and B, and allows P(A) and/or P(B) to be 0. Furthermore, a collection of events A1, A2, . . . , An is said to be mutually independent if it satisfies

P(Ai1 ∩ · · · ∩ Aim) = P(Ai1) P(Ai2) · · · P(Aim)

for any subcollection Ai1, Ai2, . . . , Aim.

3.3 Exercises

1. Urn A has three red balls and two white balls, and urn B has two red balls and five white balls. A fair coin is tossed; if it lands heads up, a ball is drawn from urn A, and otherwise, a ball is drawn from urn B.

(a) What is the probability that a red ball is drawn?

(b) If a red ball is drawn, what is the probability that the coin landed heads up?

2. An urn contains three red and two white balls. A ball is drawn, and then it and another ball of the same color are placed back in the urn. Finally, a second ball is drawn.

(a) What is the probability that the second ball drawn is white?

(b) If the second ball drawn is white, what is the probability that the first ball drawn was red?

3. Two dice are rolled, and the sum of the face values is six. What is the probability that at least one of the dice came up a three?

4. A couple has two children. What is the probability that both are girls given that the oldest is a girl? What is the probability that both are girls given that one of them is a girl?

5. What is the probability that the following system works if each unit fails independently with probability p (see the figure below)?


4 Random variables

A numerically valued function X of an outcome ω from a sample space Ω

X : Ω → R : ω → X(ω)

is called a random variable (r.v.), and usually determined by an experiment. We conventionally denote random variables by uppercase letters X, Y, Z, U, V, . . . from the end of the alphabet.

Discrete random variables. A discrete random variable is a random variable that can take values on a finite set {a1, a2, . . . , an} of real numbers (usually integers), or on a countably infinite set {a1, a2, a3, . . .}. A statement such as “X = ai” is an event since

{ω : X(ω) = ai}

is a subset of the sample space Ω. Thus, we can consider the probability of the event X = ai, denoted by P(X = ai). The function

p(ai) := P(X = ai)

over the possible values of X, say a1, a2, . . ., is called a frequency function, or a probability mass function. The frequency function p must satisfy

∑_i p(ai) = 1,

where the sum is over the possible values of X. This will completely describe the probabilistic nature of the random variable X. Another useful function is the cumulative distribution function (c.d.f.), and it is defined by

F(x) = P(X ≤ x), −∞ < x < ∞.

The c.d.f. of a discrete r.v. is a nondecreasing step function. It jumps wherever p(x) > 0, and the jump at ai is p(ai). C.d.f.’s are usually denoted by uppercase letters, while frequency functions are usually denoted by lowercase letters.

4.1 Continuous random variables

A continuous random variable is a random variable whose possible values are real numbers such as 78.6, 5.7, 10.24, and so on. Examples of continuous random variables include temperature, height, the diameter of a metal cylinder, etc. In what follows, a random variable means a “continuous” random variable, unless it is specifically said to be discrete.

The probability distribution of a random variable X specifies how its values are distributed over the real numbers. This is completely characterized by the cumulative distribution function (cdf). The cdf

F(t) := P(X ≤ t)

represents the probability that the random variable X is less than or equal to t. Then we often say that “the random variable X is distributed as F(t).” The cdf F(t) must satisfy the following properties:


1. The cdf F(t) is a nonnegative function [that is, F(t) ≥ 0].

2. The cdf F(t) increases monotonically [that is, if s < t, then F(s) ≤ F(t)].

3. The cdf F(t) must tend to zero as t goes to negative infinity “−∞”, and must tend to one as t goes to positive infinity “+∞” [that is, lim_{t→−∞} F(t) = 0 and lim_{t→+∞} F(t) = 1].

4.2 Probability density function

It is often the case that the probability that the random variable X takes a value in a particular range is given by the area under a curve over that range of values. This curve is called the probability density function (pdf) of the random variable X, denoted by f(x). Thus, the probability that “a ≤ X ≤ b” can be expressed by

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

Furthermore, we can find the following relationship between cdf and pdf:

F(t) = P(X ≤ t) = ∫_{−∞}^{t} f(x) dx.

This implies that (i) the cdf F is a continuous function, and (ii) the pdf f(x), if f is continuous at x, can be obtained as the derivative

f(x) = dF(x)/dx.

Example 1. A random variable X is called a uniform random variable on [a, b], when X takes any real number in the interval [a, b] equally likely. Then the pdf of X is given by

f(x) = 1/(b − a) if a ≤ x ≤ b; 0 otherwise [that is, if x < a or b < x].

Compute the cdf F (x) for the uniform random variable X .
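For Example 1, integrating the pdf gives F(x) = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b. A minimal sketch of both functions, with arbitrary endpoints a = 2 and b = 5:

    def uniform_pdf(x, a, b):
        return 1 / (b - a) if a <= x <= b else 0.0

    def uniform_cdf(x, a, b):
        # F(x) = integral of the pdf up to x
        if x < a:
            return 0.0
        if x > b:
            return 1.0
        return (x - a) / (b - a)

    a, b = 2.0, 5.0                      # arbitrary illustration
    print(uniform_pdf(3.0, a, b))        # 1/3
    print(uniform_cdf(3.0, a, b))        # 1/3: P(X <= 3) for X uniform on [2, 5]
    print(uniform_cdf(6.0, a, b))        # 1.0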

4.3 Quantile function

Given a cdf F(x), we can define the quantile function Q(u) = inf{x : F(x) ≥ u} for u ∈ (0, 1). In particular, if F is continuous and strictly increasing, then we have Q(u) = F^{−1}(u). When F(x) is the cdf of a random variable X, and F(x) is continuous, we can find

xp = Q(p) ⇔ P(X ≤ xp) = p

for every p ∈ (0, 1). The value x1/2 = Q(1/2) is called the median of F, and the values x1/4 = Q(1/4) and x3/4 = Q(3/4) are called the lower quartile and the upper quartile, respectively.

Example 2. Compute the quantile function Q(u) for a uniform random variable X on [a, b].
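For Example 2, solving F(x) = u on [a, b] gives Q(u) = a + u(b − a). The sketch below (same arbitrary a = 2, b = 5 as above) checks that Q inverts the uniform cdf.

    def uniform_quantile(u, a, b):
        # Q(u) = inf{x : F(x) >= u} = a + u*(b - a) for 0 < u < 1
        return a + u * (b - a)

    a, b = 2.0, 5.0
    median = uniform_quantile(0.5, a, b)     # 3.5, the median of the uniform distribution
    print(median)
    print((median - a) / (b - a))            # F(Q(0.5)) = 0.5, so Q inverts the cdf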

4.4 Joint distributions

When two discrete random variables X and Y are obtained at the same experiment, we can define their joint frequency function by

p(ai, bj) = P (X = ai, Y = bj)


where a1, a2, . . . and b1, b2, . . . are the possible values of X and Y, respectively. The marginal frequency function of X, denoted by pX, can be calculated by

pX(ai) = P(X = ai) = ∑_j p(ai, bj),

where the sum is over the possible values b1, b2, . . . of Y. Similarly, the marginal frequency function pY(bj) = ∑_i p(ai, bj) of Y is given by summing over the possible values a1, a2, . . . of X.

Joint distributions of two continuous random variables. Consider a pair (X, Y) of continuous random variables. A joint density function f(x, y) is a nonnegative function satisfying

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1,

and is used to compute probabilities constructed from the random variables X and Y simultaneously. For example, we can calculate the probability

P(u1 ≤ X ≤ u2, v1 ≤ Y ≤ v2) = ∫_{u1}^{u2} ∫_{v1}^{v2} f(x, y) dy dx.

In particular,

F(u, v) = P(X ≤ u, Y ≤ v) = ∫_{−∞}^{u} ∫_{−∞}^{v} f(x, y) dy dx

is called the joint distribution function.

Given the joint density function f(x, y), the distribution for each of X and Y is called the marginal distribution, and the marginal density functions of X and Y, denoted by fX(x) and fY(y), are given respectively by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy and fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

Example 3. Consider the following joint density function of random variables X and Y:

f(x, y) = λ^2 e^{−λy} if 0 ≤ x ≤ y; 0 otherwise.   (1)

(a) Find P (X ≤ 1, Y ≤ 1). (b) Find the marginal density functions fX(x) and fY (y).
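A hedged sketch of Example 3(b) that leans on the sympy library (an assumption of convenience; the notes themselves use no software). Because the density is supported on 0 ≤ x ≤ y, the marginals are fX(x) = ∫_x^∞ f(x, y) dy and fY(y) = ∫_0^y f(x, y) dx.

    import sympy as sp

    lam = sp.symbols("lam", positive=True)
    x, y = sp.symbols("x y", nonnegative=True)

    f = lam**2 * sp.exp(-lam * y)            # joint density on the region 0 <= x <= y

    f_X = sp.integrate(f, (y, x, sp.oo))     # integrate out y over y >= x
    f_Y = sp.integrate(f, (x, 0, y))         # integrate out x over 0 <= x <= y

    print(sp.simplify(f_X))                  # expected: lam*exp(-lam*x)
    print(sp.simplify(f_Y))                  # expected: lam**2*y*exp(-lam*y)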

4.5 Conditional density functions

Suppose that two random variables X and Y have a joint density function f(x, y). If fX(x) > 0, then we can define the conditional density function fY|X(y|x) given X = x by

fY|X(y|x) = f(x, y) / fX(x).

Similarly we can define the conditional density function fX|Y(x|y) given Y = y by

fX|Y(x|y) = f(x, y) / fY(y)

if fY(y) > 0. Then, clearly we have the following relation

f(x, y) = fY|X(y|x) fX(x) = fX|Y(x|y) fY(y).

Example 4. Find the conditional density functions fX|Y(x|y) and fY|X(y|x) for the joint density function in Example 3 (see 4.4).


4.6 Independent random variables

Let X and Y be discrete random variables with joint frequency function p(x, y). Then X and Y are said to be independent if

p(x, y) = pX(x)pY (y)

for all possible values of (x, y).

In the same manner, if the joint density function f(x, y) for continuous random variables X and Y is expressed in the form

f(x, y) = fX(x)fY (y) for all x, y,

then X and Y are independent.

4.7 Exercises

1. Suppose that X is a discrete random variable with P(X = 0) = .25, P(X = 1) = .125, P(X = 2) = .125, and P(X = 3) = .5. Graph the frequency function and the cumulative distribution function of X.

2. An experiment consists of throwing a fair coin four times. (1) List the sample space Ω. (2) For each discrete random variable X as defined in (a)–(d), describe the event “X = 0” as a subset of Ω and compute P(X = 0). (3) For each discrete random variable X as defined in (a)–(d), find the possible values of X and compute the frequency function.

(a) X is the number of heads before the first tail.

(b) X is the number of heads following the first tail.

(c) X is the number of heads minus the number of tails.

(d) X is the number of tails times the number of heads.

3. Let F(x) = 1 − exp(−αx^β) for x ≥ 0, α > 0 and β > 0, and F(x) = 0 for x < 0. Show that F is a cdf, and find the corresponding density.

4. Sketch the pdf and cdf of a random variable X that is uniform on [−1, 1].

5. Suppose that X has the density function f(x) = cx^2 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise.

(a) Find c.

(b) Find the cdf.

(c) What is P (.1 ≤ X < .5)?

6. The joint frequency function of two discrete random variables, X and Y , is given by

            X
          1     2     3     4
    Y  1  .10   .05   .02   .02
       2  .05   .20   .05   .02
       3  .02   .05   .20   .04
       4  .02   .02   .04   .10

(a) Find the marginal functions of X and Y .

(b) Are X and Y independent?

(c) Find a joint frequency function whose marginals X and Y are given by the same marginal functions in (a) but are independent.


7. Suppose that random variables X and Y have the joint density

f(x, y) = (6/7)(x + y)^2 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1; 0 otherwise.

(a) Find the marginal density functions of X and Y .

(b) Find the conditional density functions of X given Y = y and of Y given X = x.

Optional Problems.

1. Let f(x) = (1 + αx)/2 for −1 ≤ x ≤ 1 and f(x) = 0 otherwise, where −1 ≤ α ≤ 1. Show that f is a density, and find the corresponding cdf. Find the quartiles and the median of the distribution in terms of α.

2. The Cauchy cumulative distribution function is

F(x) = 1/2 + (1/π) tan^{−1}(x),   −∞ < x < ∞.

(a) Show that this is a cdf.

(b) Find the density function.

(c) Find x such that P (X > x) = .1.

5 Expectation

Let X be a discrete random variable whose possible values are a1, a2, . . ., and let p(x) be the frequency function of X. Then the expectation (expected value or mean) of the random variable X is given by

E[X] = ∑_i ai p(ai).

We often denote the expected value E[X ] of X by µ or µX .

Expectation of discrete random variables. For a function g, we can define the expectation of a function of a discrete random variable X by

E[g(X)] = ∑_i g(ai) p(ai).

Suppose that we have two random variables X and Y, and that p(x, y) is their joint frequency function. Then the expectation of a function g(X, Y) of the two random variables X and Y is defined by

E[g(X, Y)] = ∑_{i,j} g(ai, bj) p(ai, bj),

where the sum is over all the possible values of (X, Y).

Example 1. A random variable X takes on values 0, 1, 2 with the respective probabilities P(X = 0) = 0.2, P(X = 1) = 0.5, P(X = 2) = 0.3. Compute E[X] and E[X^2].
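Example 1 worked out directly from the definition E[g(X)] = ∑ g(ai) p(ai):

    values = [0, 1, 2]
    probs = [0.2, 0.5, 0.3]         # the frequency function from Example 1

    E_X = sum(a * p for a, p in zip(values, probs))        # 0*0.2 + 1*0.5 + 2*0.3
    E_X2 = sum(a**2 * p for a, p in zip(values, probs))    # 0*0.2 + 1*0.5 + 4*0.3

    print(E_X)     # 1.1
    print(E_X2)    # 1.7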


5.1 Expectation of continuous random variables

Let X be a continuous random variable, and let f be the pdf of X. Then the expectation of the random variable X is given by

E[X] = ∫_{−∞}^{∞} x f(x) dx.   (2)

For a function g, we define the expectation E[g(X)] of the function g(X) of the random variable X by

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,

if g(x) is integrable with respect to f(x) dx, that is, ∫_{−∞}^{∞} |g(x)| f(x) dx < ∞. Furthermore, if we have two random variables X and Y with joint density function f(x, y), then the expectation of the function g(X, Y) of the two random variables becomes

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy,

if g is integrable. For example, now we can define E[X^2], E[XY], E[X + Y] and so on.

Example 2. Let X be a uniform random variable on [a, b]. Compute E[X] and E[X^2].

5.2 Properties of expectation

One can think of the expectation E(X) as “an operation on a random variable X” which returns the average value for X. It is also useful to observe that this is a “linear operator” satisfying the following properties:

1. Let a be a constant, and let X be a random variable having the pdf f(x). Then we can show that

E[a + X] = ∫_{−∞}^{∞} (a + x) f(x) dx = a + ∫_{−∞}^{∞} x f(x) dx = a + E[X].

2. Let a and b be scalars, and let X and Y be random variables having the joint density function f(x, y) and the respective marginal density functions fX(x) and fY(y). Then we can show that

E[aX + bY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (ax + by) f(x, y) dx dy = a ∫_{−∞}^{∞} x fX(x) dx + b ∫_{−∞}^{∞} y fY(y) dy = aE[X] + bE[Y].

3. Let a and b1, . . . , bn be scalars, and let X1, . . . , Xn be random variables. By applying property 1 and then applying property 2 repeatedly, we can obtain

E[a + ∑_{i=1}^{n} bi Xi] = a + E[∑_{i=1}^{n} bi Xi] = a + b1 E[X1] + E[∑_{i=2}^{n} bi Xi] = · · · = a + ∑_{i=1}^{n} bi E[Xi].

5.3 Expectation with independent random variables

Suppose that X and Y are independent random variables. Then the joint density function f(x, y) of X and Y can be expressed as

f(x, y) = fX(x)fY (y).


And the expectation of a function of the form g1(X) × g2(Y) is given by

E[g1(X) g2(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1(x) g2(y) f(x, y) dx dy = [∫_{−∞}^{∞} g1(x) fX(x) dx] × [∫_{−∞}^{∞} g2(y) fY(y) dy] = E[g1(X)] × E[g2(Y)].

5.4 Variance and covariance

The variance of a random variable X, denoted by Var(X) or σ^2, is the expected value of “the squared difference between the random variable and its expected value E(X).” Thus, it is given by

Var(X) := E[(X − E(X))^2] = E[X^2] − (E[X])^2.

The square root √Var(X) of the variance Var(X) is called the standard error (SE) (or standard deviation (SD)) of the random variable X.

Now suppose that we have two random variables X and Y. Then the covariance of two random variables X and Y can be defined as

Cov(X, Y) := E((X − µx)(Y − µy)) = E(XY) − E(X) × E(Y),

where µx = E(X) and µy = E(Y). The covariance Cov(X, Y) measures the strength of the dependence of the two random variables.

Example 3. Find Var(X) for a uniform random variable X on [a, b].

5.5 Properties of variance and covariance

(a) If X and Y are independent, then Cov(X, Y ) = 0 by observing that E[XY ] = E[X ] · E[Y ].

(b) In contrast to the expectation, the variance is not a linear operator. For two random variables X and Y, we have

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). (3)

However, if X and Y are independent, by observing that Cov(X, Y) = 0 in (3), we have

Var(X + Y) = Var(X) + Var(Y). (4)

In general, when we have a sequence of independent random variables X1, . . . , Xn, the property (4) is extended to

Var(X1 + · · · + Xn) = Var(X1) + · · · + Var(Xn).

Variance and covariance under linear transformation. Let a and b be scalars (that is, real-valued constants), and let X be a random variable. Then the variance of aX + b is given by

Var(aX + b) = a^2 Var(X).

Now let a1, a2, b1 and b2 be scalars, and let X and Y be random variables. Then similarly the covariance of a1X + b1 and a2Y + b2 can be given by

Cov(a1X + b1, a2Y + b2) = a1 a2 Cov(X, Y).


5.6 Moment generating function

Let X be a random variable. Then the moment generating function (mgf) of X is given by

M(t) = E[e^{tX}].

The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as

M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx.

5.7 Properties of moment generating function

(a) The most significant property of moment generating function is that “the moment generating function uniquely determines the distribution.”

(b) Let a and b be constants, and let MX(t) be the mgf of a random variable X. Then the mgf of the random variable Y = a + bX can be given as follows.

MY(t) = E[e^{tY}] = E[e^{t(a+bX)}] = e^{at} E[e^{(bt)X}] = e^{at} MX(bt).

(c) Let X and Y be independent random variables having the respective mgf’s MX(t) and MY(t). Recall that E[g1(X)g2(Y)] = E[g1(X)] · E[g2(Y)] for functions g1 and g2. We can obtain the mgf MZ(t) of the sum Z = X + Y of random variables as follows.

MZ(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] · E[e^{tY}] = MX(t) · MY(t).

(d) When t = 0, it clearly follows that M(0) = 1. Now by differentiating M(t) r times, we obtain

M^{(r)}(t) = (d^r/dt^r) E[e^{tX}] = E[(d^r/dt^r) e^{tX}] = E[X^r e^{tX}].

In particular, when t = 0, M^{(r)}(0) generates the r-th moment of X as follows.

M^{(r)}(0) = E[X^r], r = 1, 2, 3, . . .

Example 4. Find the mgf M(t) for a uniform random variable X on [a, b]. And derive the derivative M′(0) at t = 0 by using the definition of derivative and L’Hospital’s rule.
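A sketch of Example 4 using the sympy library (an assumption of convenience, not something the notes use): the mgf of the uniform distribution is M(t) = (e^{bt} − e^{at}) / ((b − a)t) for t ≠ 0, and the limit of M′(t) as t → 0 recovers E[X] = (a + b)/2, which is the step the example asks to justify with L’Hospital’s rule.

    import sympy as sp

    a, b, t = sp.symbols("a b t", real=True)

    # mgf of a uniform random variable on [a, b], valid for t != 0
    M = (sp.exp(b * t) - sp.exp(a * t)) / ((b - a) * t)

    M_prime = sp.diff(M, t)
    print(sp.limit(M_prime, t, 0))   # expected: a/2 + b/2, i.e. M'(0) = E[X]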

5.8 Exercises

1. If n men throw their hats into a pile and each man takes a hat at random, what is the expected number of matches?
Hint: Suppose Y is the number of matches. How can you express the random variable Y by using indicator random variables X1, . . . , Xn? What is the event for the indicator random variable Xi?

2. Let X be a random variable with the pdf

f(x) = 2x if 0 ≤ x ≤ 1; 0 otherwise.

Then, (a) find E[X], and (b) find E[X^2] and Var(X).


3. Suppose that E[X] = µ and Var(X) = σ^2. Let Z = (X − µ)/σ. Show that E[Z] = 0 and Var(Z) = 1.

4. Recall that the expectation is a “linear operator.” By using the definitions of variance and covariance in terms of the expectation, show that

(a) Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ), and

(b) Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y ).

5. If X and Y are independent random variables with equal variances, find Cov(X + Y, X − Y ).

6. If X and Y are independent random variables and Z = Y − X, find expressions for the covariance of X and Z in terms of the variances of X and Y.

7. Let X and Y be independent random variables, and let α and β be scalars. Find an expression for the mgf of Z = αX + βY in terms of the mgf’s of X and Y.

Optional Problems.

1. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that each gunner selects an aircraft to shoot at independently of the other gunners, and that each gunner hits the selected aircraft with probability p. Find the expected number of aircraft hit by the gunners.
Hint: For each i = 1, . . . , n, let Xi be the indicator random variable of the event that the i-th aircraft was hit. Explain how you can verify P(Xi = 0) = (1 − p/n)^m by yourself, and use it to find the expectation of interest.

6 Binomial and Related Distributions

An experiment involving a trial that results in one of two possible outcomes, say “success” or “failure,” and that is repeatedly performed, is called a binomial experiment. Tossing a coin is an example of such a trial where head or tail are the two possible outcomes.

6.1 Relation between Bernoulli and binomial random variable

A binomial random variable can be expressed in terms of n Bernoulli random variables. If X1, X2, . . . , Xn are independent Bernoulli random variables with success probability p, then the sum of those random variables

Y = ∑_{i=1}^{n} Xi

is distributed as the binomial distribution with parameter (n, p).

Sum of independent binomial random variables. If X and Y are independent binomial random variables with respective parameters (n, p) and (m, p), then the sum X + Y is distributed as the binomial distribution with parameter (n + m, p).

Rough explanation. Observe that we can express X = ∑_{i=1}^{n} Zi and Y = ∑_{i=n+1}^{n+m} Zi in terms of independent Bernoulli random variables Zi’s with success probability p. Then the resulting sum X + Y = ∑_{i=1}^{n+m} Zi must be a binomial random variable with parameter (n + m, p).


6.2 Expectation and variance of binomial distribution

Let X be a Bernoulli random variable with success probability p. Then the expectation of X becomes

E[X] = 0 × (1 − p) + 1 × p = p.

By using E[X] = p, we can compute the variance as follows:

Var(X) = (0 − p)^2 × (1 − p) + (1 − p)^2 × p = p(1 − p).

Now let Y be a binomial random variable with parameter (n, p). Recall that Y can be expressed as the sum ∑_{i=1}^{n} Xi of independent Bernoulli random variables X1, . . . , Xn with success probability p. By using the properties of expectation and variance, we obtain

E[Y] = E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi] = np;

Var(Y) = Var(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} Var(Xi) = np(1 − p).

Moment generating functions. Let X be a Bernoulli random variable with success probability p. Then the mgf MX(t) is given by

MX(t) = e^{(0)t} × (1 − p) + e^{(1)t} × p = (1 − p) + pe^t.

Then we can derive the mgf of a binomial random variable Y = ∑_{i=1}^{n} Xi from the mgf of a Bernoulli random variable by applying the mgf property repeatedly.

MY(t) = MX1(t) × MX2(t) × · · · × MXn(t) = [(1 − p) + pe^t]^n.
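A small simulation sketch (the values n = 10, p = 0.3 and the number of repetitions are arbitrary) that builds binomial draws as sums of independent Bernoulli variables and compares the sample mean and variance with np and np(1 − p).

    import random

    random.seed(0)
    n, p, reps = 10, 0.3, 100_000           # arbitrary illustration

    def binomial_draw(n, p):
        # sum of n independent Bernoulli(p) random variables
        return sum(1 for _ in range(n) if random.random() < p)

    samples = [binomial_draw(n, p) for _ in range(reps)]
    mean = sum(samples) / reps
    var = sum((s - mean) ** 2 for s in samples) / reps

    print(mean, n * p)                      # sample mean vs. np = 3.0
    print(var, n * p * (1 - p))             # sample variance vs. np(1-p) = 2.1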

6.3 Geometric distribution

For 0 < a < 1, we can establish the following formulas.

∑_{k=1}^{n} a^{k−1} = (1 − a^n)/(1 − a);   ∑_{k=1}^{∞} a^{k−1} = 1/(1 − a);   (5)

∑_{k=1}^{∞} k a^{k−1} = d/dx (∑_{k=0}^{∞} x^k) |_{x=a} = d/dx (1/(1 − x)) |_{x=a} = 1/(1 − a)^2;   (6)

∑_{k=1}^{∞} k^2 a^{k−1} = d/dx (∑_{k=1}^{∞} k x^k) |_{x=a} = d/dx (x/(1 − x)^2) |_{x=a} = 1/(1 − a)^2 + 2a/(1 − a)^3.

Geometric random variables. In sampling independent Bernoulli trials, each with probability of success p, the frequency function of the trial number of the first success is

p(k) = (1 − p)^{k−1} p,   k = 1, 2, . . . .

This is called a geometric distribution.

By using geometric series formula (5) we can immediately see the following:

∑_{k=1}^{∞} p(k) = p ∑_{k=1}^{∞} (1 − p)^{k−1} = p / (1 − (1 − p)) = 1;

M(t) = ∑_{k=1}^{∞} e^{kt} p(k) = pe^t ∑_{k=1}^{∞} [(1 − p)e^t]^{k−1} = pe^t / (1 − (1 − p)e^t).


Example 3. Consider a roulette wheel consisting of 38 numbers—1 through 36, 0, and double 0. If Mr. Smith always bets that the outcome will be one of the numbers 1 through 12, what is the probability that (a) his first win will occur on his fourth bet; (b) he will lose his first 5 bets?

Expectation and variance of geometric distributions. Let X be a geometric random variable with success probability p. To calculate the expectation E[X] and the variance Var(X), we apply geometric series formula (6) to obtain

E[X] = ∑_{k=1}^{∞} k p(k) = p ∑_{k=1}^{∞} k(1 − p)^{k−1} = 1/p;

Var(X) = E[X^2] − (E[X])^2 = ∑_{k=1}^{∞} k^2 p(k) − (1/p)^2 = p (1/p^2 + 2(1 − p)/p^3) − 1/p^2 = (1 − p)/p^2.
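The expectation and variance formulas can be checked numerically by truncating the series (p = 0.25 is an arbitrary choice, and the tail beyond k = 1000 is negligible for this p):

    p = 0.25                                                 # arbitrary illustration
    pmf = [(1 - p) ** (k - 1) * p for k in range(1, 1001)]   # geometric frequency function

    E_X = sum(k * pk for k, pk in enumerate(pmf, start=1))
    E_X2 = sum(k**2 * pk for k, pk in enumerate(pmf, start=1))

    print(E_X, 1 / p)                                        # ~4.0 vs 1/p
    print(E_X2 - E_X**2, (1 - p) / p**2)                     # ~12.0 vs (1-p)/p^2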

6.4 Negative binomial distribution

Given independent Bernoulli trials with probability of success p, the frequency function of the number of trials until the r-th success is

p(k) = C(k − 1, r − 1) p^r (1 − p)^{k−r},   k = r, r + 1, . . . .

This is called a negative binomial distribution with parameter (r, p).

The geometric distribution is a special case of the negative binomial distribution when r = 1. Moreover, if X1, . . . , Xr are independent and identically distributed (iid) geometric random variables with parameter p, then the sum

Y = ∑_{i=1}^{r} Xi   (7)

becomes a negative binomial random variable with parameter (r, p).

Expectation, variance and mgf of negative binomial distribution. By using the sum of iid geometric rv’s (see Equation 7) we can compute the expectation, the variance, and the mgf of a negative binomial random variable Y.

E[Y] = E[∑_{i=1}^{r} Xi] = ∑_{i=1}^{r} E[Xi] = r/p;

Var(Y) = Var(∑_{i=1}^{r} Xi) = ∑_{i=1}^{r} Var(Xi) = r(1 − p)/p^2;

MY(t) = MX1(t) × MX2(t) × · · · × MXr(t) = [pe^t / (1 − (1 − p)e^t)]^r.

Example 4. What is the average number of times one must throw a die until the outcome “1” has occurred 4 times?

6.5 Exercises

1. Appending three extra bits to a 4-bit word in a particular way (a Hamming code) allows detection and correction of up to one error in any of the bits. (a) If each bit has probability .05 of being changed during communication, and the bits are changed independently of each other, what is the probability that the word is correctly received (that is, 0 or 1 bit is in error)? (b) How does this probability compare to the probability that the word will be transmitted correctly with no check bits, in which case all four bits would have to be transmitted correctly for the word to be correct?


2. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?

3. Two boys play basketball in the following way. They take turns shooting and stop when a basket is made. Player A goes first and has probability p1 of making a basket on any throw. Player B, who shoots second, has probability p2 of making a basket. The outcomes of the successive trials are assumed to be independent.

(a) Find the frequency function for the total number of attempts.

(b) What is the probability that player A wins?

4. Suppose that in a sequence of independent Bernoulli trials, each with probability of success p, the number of failures up to the first success is counted. (a) What is the frequency function for this random variable? (b) Find the frequency function for the number of failures up to the r-th success.

Hint: In (b) let X be the number of failures up to the r-th success, and let Y be the number of trials until the r-th success. What is the relation between the random variables X and Y?

5. Find an expression for the cumulative distribution function of a geometric random variable.

6. In a sequence of independent trials with probability p of success, what is the probability that there are exactly r successes before the k-th failure?

Optional Problems.

1. (Banach Match Problem) A pipe smoker carries one box of matches in his left pocket and one in his right. Initially, each box contains n matches. If he needs a match, the smoker is equally likely to choose either pocket. What is the frequency function for the number of matches in the other box when he first discovers that one box is empty?

Hint: Let Ak denote the event that the pipe smoker first discovers that the right-hand box is empty and that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the time. Then use the Ak’s and Bk’s to express the frequency function of interest.

7 Poisson Distribution

The number e = lim_{n→∞} (1 + 1/n)^n = 2.7182 · · · is known as the base of the natural logarithm (or, called Napier’s number), and is associated with the Taylor series

e^x = ∑_{k=0}^{∞} x^k / k!.   (8)

The Poisson distribution with parameter λ > 0 has the frequency function

p(k) = e^{−λ} λ^k / k!,   k = 0, 1, 2, . . .

By applying Taylor series (8) we can immediately see the following:

∑_{k=0}^{∞} p(k) = e^{−λ} ∑_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1;

M(t) = ∑_{k=0}^{∞} e^{kt} p(k) = e^{−λ} ∑_{k=0}^{∞} (e^t λ)^k / k! = e^{−λ} e^{λ e^t} = e^{λ(e^t − 1)}.


Example 1. Suppose that the number of typographical errors per page has a Poisson distribution with parameter λ = 0.5. Find the probability that there is at least one error on a single page.

7.1 Poisson approximation to binomial distribution

The Poisson distribution may be used as an approximation for a binomial distribution with parameter (n, p) when n is large and p is small enough so that np is a moderate number λ. Let λ = np. Then the binomial frequency function becomes

p(k) = [n! / (k!(n − k)!)] (λ/n)^k (1 − λ/n)^{n−k} = (λ^k/k!) · [n! / ((n − k)! n^k)] · (1 − λ/n)^n · (1 − λ/n)^{−k} → (λ^k/k!) · 1 · e^{−λ} · 1 = e^{−λ} λ^k / k!   as n → ∞.

Example 2. Suppose that a microchip is defective with probability 0.02. Find the probability that a sample of 100 microchips will contain at most 1 defective microchip.
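Example 2 compared numerically: the exact binomial probability of at most one defective among n = 100 chips with p = 0.02, against the Poisson approximation with λ = np = 2.

    from math import comb, exp, factorial

    n, p = 100, 0.02
    lam = n * p                                    # λ = 2

    binom_at_most_1 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (0, 1))
    poisson_at_most_1 = sum(exp(-lam) * lam**k / factorial(k) for k in (0, 1))

    print(binom_at_most_1)      # ≈ 0.403 (exact binomial value)
    print(poisson_at_most_1)    # ≈ 0.406, the Poisson approximation e^{-2}(1 + 2)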

7.2 Expectation and variance

Let Xn, n = 1, 2, . . ., be a sequence of binomial random variables with parameter (n, λ/n). Then we regard the limit of the sequence as a Poisson random variable Y, and use the limit to find E[Y] and Var(Y). Thus, we obtain

E[Xn] = λ → E[Y] = λ as n → ∞;

Var(Xn) = λ(1 − λ/n) → Var(Y) = λ as n → ∞.

Sum of independent Poisson random variables. The sum of independent Poisson random variables is a Poisson random variable: If X and Y are independent Poisson random variables with respective parameters λ1 and λ2, then Z = X + Y is distributed as the Poisson distribution with parameter λ1 + λ2.

Rough explanation. Suppose that Xn and Yn are independent binomial random variables with respective parameters (kn, pn) and (ln, pn) for each n = 1, 2, . . .. Then the sum Xn + Yn of the random variables has the binomial distribution with parameter (kn + ln, pn). By letting kn pn → λ1 and ln pn → λ2 with kn, ln → ∞ and pn → 0, the respective limits of Xn and Yn are Poisson random variables X and Y with respective parameters λ1 and λ2. Moreover, the limit of Xn + Yn is the sum of the random variables X and Y, and has a Poisson distribution with parameter λ1 + λ2.

7.3 Exercises

1. Use the mgf to show that the expectation and the variance of a Poisson random variable are the same and equal to the parameter λ.

2. The probability of being dealt a royal straight flush (ace, king, queen, jack, and ten of the same suit) in poker is about 1.54 × 10^{−6}. Suppose that an avid poker player sees 100 hands a week, 52 weeks a year, for 20 years.

(a) What is the probability that she never sees a royal straight flush dealt?

(b) What is the probability that she sees at least two royal straight flushes dealt?


3. Professor Rice was told that he has only 1 chance in 10,000 of being trapped in a much-maligned elevator in the mathematics building. Suppose he goes to work 5 days a week, 52 weeks a year, for 10 years, and always rides the elevator up to his office when he first arrives. What is the probability that he will never be trapped? That he will be trapped once? Twice? Assume that the outcomes on all the days are mutually independent.

4. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members of the population are affected independently, find the probability of k cases in a population of 100,000 for k = 0, 1, 2.

5. Suppose that in a city the number of suicides can be approximated by a Poisson random variable with λ = 0.33 per month.

(a) What is the distribution for the number of suicides per year? What is the average number of suicides in one year?

(b) What is the probability of two suicides in one month?

Optional Problems.

1. Let X be a negative binomial random variable with parameter (r, p).

(a) When r = 2 and p = 0.2, find the probability P (X ≤ 4).

(b) Let Y be a binomial random variable with parameter (n, p). When n = 4 and p = 0.2, find the probability P(Y ≥ 2).

(c) By comparing the results above, what can you find about the relationship between P(X ≤ 4) and P(Y ≥ 2)? Generalize the relationship to the one between P(X ≤ n) and P(Y ≥ 2), and find the probability P(X ≤ 10) when r = 2 and p = 0.2.

(d) By using the Poisson approximation, find the probability P (X ≤ 1000) when r = 2 and p = 0.001.

8 Solutions to Exercises

8.1 Binomial and related distributions

1. The number X of error bits has a binomial distribution with n = 4 and p = 0.05.

(a) P(X ≤ 1) = C(4, 0)(0.95)^4 + C(4, 1)(0.05)(0.95)^3 ≈ 0.986.

(b) P(X = 0) = C(4, 0)(0.95)^4 ≈ 0.815.

2. “9 heads in 10 tosses” [C(10, 9)(1/2)^{10} ≈ 9.77 × 10^{−3}] is more likely than “18 heads in 20 tosses” [C(20, 18)(1/2)^{20} ≈ 1.81 × 10^{−4}].

3. Let X be the total number of attempts.

(a) P(X = n) = (1 − p1)^{k−1} (1 − p2)^{k−1} p1 if n = 2k − 1; (1 − p1)^k (1 − p2)^{k−1} p2 if n = 2k.

(b) P(Player A wins) = ∑_{k=1}^{∞} P(X = 2k − 1) = p1 ∑_{k=1}^{∞} [(1 − p1)(1 − p2)]^{k−1} = p1 / (p1 + p2 − p1 p2).


4. Let X be the number of failures up to the r-th success, and let Y be the number of trials until the r-th success. Then Y has a negative binomial distribution, and has the relation X = Y − r.

(a) Here r = 1 (that is, Y has a geometric distribution), and P(X = k) = P(Y = k + 1) = (1 − p)^k p.

(b) P(X = k) = P(Y = k + r) = C(k + r − 1, r − 1) p^r (1 − p)^k.

5. Let X be a geometric random variable.

P(X ≤ k) = ∑_{i=1}^{k} p(1 − p)^{i−1} = 1 − (1 − p)^k.

6. Let X be the number of trials up to the k-th failure. Then X has a negative binomial distribution with “success probability” (1 − p).

P(Exactly r successes before the k-th failure) = P(X = r + k) = C(k + r − 1, k − 1)(1 − p)^k p^r.

Optional Problem (Banach Match Problem). Let X be the number of matches in the other box when he first discovers that one box is empty. Furthermore, let Ak denote the event that the pipe smoker first discovers that the right-hand box is empty and that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the time. Then we can observe that P(Ak) = P(Bk) and P(X = k) = P(Ak) + P(Bk) = 2P(Ak). Now in order for Ak to happen he has to make exactly (2n + 1 − k) attempts, choosing the right-hand box for the (n + 1)st time on the last attempt. Thus, P(Ak) = C(2n − k, n)(1/2)^{2n+1−k}, and therefore we obtain P(X = k) = C(2n − k, n)(1/2)^{2n−k}.

8.2 Poisson distribution

1. We first calculate M′(t) = λ e^t e^{λ(e^t − 1)} and M′′(t) = (λ e^t)^2 e^{λ(e^t − 1)} + λ e^t e^{λ(e^t − 1)}. Then we obtain E[X] = M′(0) = λ and E[X^2] = M′′(0) = λ^2 + λ. Finally we get Var(X) = E[X^2] − (E[X])^2 = λ.

2. The number X of royal straight flushes has approximately a Poisson distribution with λ = 1.54 × 10^{−6} × (100 × 52 × 20) ≈ 0.16.

(a) P(X = 0) = e^{−λ} ≈ 0.852.

(b) P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − (e^{−λ} + λ e^{−λ}) ≈ 0.0115.

3. The number X of misfortunes for Professor Rice being trapped has approximately a Poisson distribution with λ = (1/10000) × (5 × 52 × 10) ≈ 0.26. Then P(X = 0) = e^{−λ} ≈ 0.77, P(X = 1) = λ e^{−λ} ≈ 0.20, and P(X = 2) = (λ^2/2) e^{−λ} ≈ 0.026.

4. The number X of cases of this disease has approximately a Poisson distribution with λ = (1/1000) × 100000 = 100. Then it has the frequency function p(k) ≈ e^{−100} 100^k / k!. And we obtain p(0) ≈ 3 × 10^{−44}, p(1) ≈ 4 × 10^{−42}, and p(2) ≈ 2 × 10^{−40}.

5. Let Xi be the number of suicides in the i-th month, and let Y be the number of suicides in a year. Then Y = X1 + · · · + X12 has a Poisson distribution with parameter 12 × 0.33 = 3.96. Thus, (a) E[Y] = 3.96 and (b) P(X1 = 2) = (0.33^2/2) e^{−0.33} ≈ 0.039.

Optional Problem.

(a) P(X ≤ 4) = ∑_{k=2}^{4} C(k − 1, 1)(0.2)^2 (0.8)^{k−2} = 0.1808.


(b) P(Y ≥ 2) = 1 − ∑_{k=0}^{1} C(4, k)(0.2)^k (0.8)^{4−k} = 0.1808.

(c) By using P(X ≤ n) = P(Y ≥ 2), P(X ≤ 10) = P(Y ≥ 2) = 1 − ∑_{k=0}^{1} C(10, k)(0.2)^k (0.8)^{10−k} ≈ 0.6242.

(d) By using P(X ≤ n) = P(Y ≥ 2) and the Poisson approximation with λ = np = 1, P(X ≤ 1000) = P(Y ≥ 2) ≈ 1 − e^{−λ} − λ e^{−λ} ≈ 0.2642.

9 Summary Diagram

The following summary lists the distributions treated above; in the original lecture note they are drawn as boxes connected by arrows labeled (1)–(7), which correspond to the numbered relations listed after the summary.

Bernoulli trial (the probability p of success and the probability q := 1 − p of failure):
P(X = 1) = p and P(X = 0) = q := 1 − p;  E[X] = p, Var(X) = pq.

Geometric(p):
P(X = i) = q^{i−1} p, i = 1, 2, . . .;  E[X] = 1/p, Var(X) = q/p^2.

Negative binomial(r, p):
P(X = i) = C(i − 1, r − 1) p^r q^{i−r}, i = r, r + 1, . . .;  E[X] = r/p, Var(X) = rq/p^2.

Binomial(n, p):
P(X = i) = C(n, i) p^i q^{n−i}, i = 0, . . . , n;  E[X] = np, Var(X) = npq.

Poisson(λ):
P(X = i) = e^{−λ} λ^i / i!, i = 0, 1, . . .;  E[X] = λ, Var(X) = λ.

1. X is the number of Bernoulli trials until the first success occurs.

2. X is the number of Bernoulli trials until the r-th success occurs. If X1, · · · , Xr are independent geometric random variables with parameter p, then X = ∑_{k=1}^{r} Xk becomes a negative binomial random variable with parameter (r, p).

3. r = 1.

4. X is the number of successes in n Bernoulli trials. If X1, · · · , Xn are independent Bernoulli trials, then X = ∑_{k=1}^{n} Xk becomes a binomial random variable with parameter (n, p).


5. If X1 and X2 are independent binomial random variables with respective parameters (n1, p) and (n2, p), then X = X1 + X2 is also a binomial random variable with parameter (n1 + n2, p).

6. Poisson approximation is used for the binomial distribution by letting n → ∞ and p → 0 while λ = np.

7. If X1 and X2 are independent Poisson random variables with respective parameters λ1 and λ2, then X = X1 + X2 is also a Poisson random variable with parameter λ1 + λ2.

10 Exponential and Gamma Distributions

10.1 Exponential distribution

The exponential density is defined by

f(x) = λ e^{−λx} if x ≥ 0; 0 if x < 0.

Then, the cdf is computed as

F(t) = 1 − e^{−λt} if t ≥ 0; 0 if t < 0.

Memoryless property. The function S(t) = 1 − F(t) = P(X > t) is called the survival function. When F(t) is an exponential distribution, we have S(t) = e^{−λt} for t ≥ 0. Furthermore, we can find that

P(X > t + s | X > s) = P(X > t + s) / P(X > s) = S(t + s) / S(s) = S(t) = P(X > t),   (9)

which is referred to as the memoryless property of the exponential distribution.
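A quick numerical illustration of the memoryless property (9), with arbitrary values λ = 0.5, s = 2 and t = 3: the conditional survival probability equals the unconditional one.

    from math import exp, isclose

    lam, s, t = 0.5, 2.0, 3.0                     # arbitrary illustration

    def survival(u):
        # S(u) = P(X > u) = exp(-lam*u) for an exponential random variable
        return exp(-lam * u)

    conditional = survival(t + s) / survival(s)   # P(X > t + s | X > s)
    print(conditional, survival(t))               # both exp(-1.5) ≈ 0.2231
    print(isclose(conditional, survival(t)))      # True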

10.2 Gamma distribution

The gamma function, denoted by Γ(α), is defined as

Γ(α) =

∫ ∞

0

uα−1e−u du, α > 0.

Then the gamma density is defined as

f(x) =λα

Γ(α)xα−1e−λx x ≥ 0

which depends on two parameters α > 0 and λ > 0. We call the parameter α a shape parameter, because changingα changes the shape of the density. We call the parameter λ a scale parameter, because changing λ merely rescalesthe density without changing its shape. This is equivalent to changing the units of measurement (feet to meters,or seconds to minutes). Suppose that the parameter α is an integer n. Then we have Γ(n) = (n − 1)!. Inparticular, the gamma distribution with α = 1 becomes an exponential distribution (with parameter λ).

Chi-square distribution. The gamma distribution with α = n/2 and λ = 1/2 is called the chi-square distribution with n degrees of freedom.

Expectation, variance and mgf of a gamma random variable. Let X be a gamma random variable with parameter (α, λ). For t < λ, the mgf M_X(t) of X can be computed as follows.

M_X(t) = ∫_0^∞ e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx
       = (λ^α/(λ − t)^α)(1/Γ(α)) ∫_0^∞ ((λ − t)x)^{α−1} e^{−(λ−t)x} (λ − t) dx
       = (λ/(λ − t))^α (1/Γ(α)) ∫_0^∞ u^{α−1} e^{−u} du = (λ/(λ − t))^α.

Then we can find the first and second moments

E[X] = M′_X(0) = α/λ   and   E[X^2] = M″_X(0) = α(α + 1)/λ^2.

Thus, we can obtain Var(X) = α(α + 1)/λ^2 − α^2/λ^2 = α/λ^2.
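The moment calculation above can also be reproduced symbolically. The following sympy sketch (an illustration, not part of the notes) differentiates the gamma mgf (λ/(λ − t))^α at t = 0.

import sympy as sp

t = sp.symbols('t')
lam, alpha = sp.symbols('lambda alpha', positive=True)
M = (lam / (lam - t))**alpha           # mgf of the gamma(alpha, lambda) distribution

EX = sp.diff(M, t, 1).subs(t, 0)       # M'(0), should simplify to alpha/lambda
EX2 = sp.diff(M, t, 2).subs(t, 0)      # M''(0), should simplify to alpha*(alpha+1)/lambda**2
print(sp.simplify(EX))
print(sp.simplify(EX2 - EX**2))        # variance, should simplify to alpha/lambda**2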

10.3 Exercises

1. Let X be a gamma random variable with parameter (α, λ), and let Y = cX with c > 0.

(a) Show that

P(Y ≤ t) = ∫_{−∞}^{t} f(x) dx,

where

f(x) = ((λ/c)^α/Γ(α)) x^{α−1} e^{−(λ/c)x}.

(b) Conclude that Y is a gamma random variable with parameter (α, λ/c). Thus, only λ is affected by the transformation Y = cX, which justifies calling λ a scale parameter.

2. Suppose that the lifetime of an electronic component follows an exponential distribution with λ = 0.2.

(a) Find the probability that the lifetime is less than 10.

(b) Find the probability that the lifetime is between 5 and 15.

3. X is an exponential random variable, and P (X < 1) = 0.05. What is λ?

4. The lifetime in months of each component in a microchip is exponentially distributed with parameter µ = 1/10000.

(a) For a particular component, we want to find the probability that the component breaks down before 5 months, and the probability that the component breaks down before 50 months. Use the approximation e^{−x} ≈ 1 − x for a small number x to calculate these probabilities.

(b) Suppose that the microchip contains 1000 of such components and is operational until three of these components break down. The lifetimes of these components are independent and exponentially distributed with parameter µ = 1/10000. By using the Poisson approximation, find the probability that the microchip operates functionally for 5 months, and the probability that the microchip operates functionally for 50 months.

5. The length of a phone call can be represented by an exponential distribution. Answer the following questions.

(a) A report says that 50% of phone calls end within 20 minutes. Assuming that this report is true, calculate the parameter λ for the exponential distribution.

(b) Suppose that someone arrived immediately ahead of you at a public telephone booth. Using the result of (a), find the probability that you have to wait between 10 and 20 minutes.

Optional Problems.

1. Suppose that a non-negative random variable X has the memoryless property

P (X > t + s | X > s) = P (X > t).

By completing the following questions, we will show that X must be an exponential random variable.

(a) Let S(t) be the survival function of X. Show that the memoryless property implies that S(s + t) = S(s)S(t) for s, t ≥ 0.

(b) Let κ = S(1). Show that κ > 0.

(c) Show that S(1/n) = κ^{1/n} and S(m/n) = κ^{m/n}.

Therefore, we can obtain S(t) = κ^t for any t ≥ 0. By letting λ = − ln κ, we can write S(t) = e^{−λt}.

2. The gamma function is a generalized factorial function.

(a) Show that Γ(1) = 1.

(b) Show that Γ(α + 1) = αΓ(α). Hint: Use integration by parts.

(c) Conclude that Γ(n) = (n − 1)! for n = 1, 2, 3, . . ..

(d) Use the formula Γ(1/2) = √π to show that if n is an odd integer then we have

Γ(n/2) = √π (n − 1)! / (2^{n−1} ((n−1)/2)!).

11 Normal Distributions

A normal distribution is represented by a family of distributions which have the same general shape, sometimes described as "bell shaped." The normal distribution has the pdf

f(x) := (1/(σ√(2π))) exp[−(x − µ)^2/(2σ^2)],   (10)

which depends upon two parameters µ and σ^2. Here π = 3.14159 . . . is the famous "pi" (the ratio of the circumference of a circle to its diameter), and exp(u) is the exponential function e^u with the base e = 2.71828 . . . of the natural logarithm.

The following integrals are well known:

(1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x−µ)^2/(2σ^2)} dx = 1,   (11)

(1/(σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−µ)^2/(2σ^2)} dx = µ,

(1/(σ√(2π))) ∫_{−∞}^{∞} (x − µ)^2 e^{−(x−µ)^2/(2σ^2)} dx = σ^2.

The integration (11) guarantees that the density function (10) always represents a "probability density" no matter what values of the parameters µ and σ^2 are chosen. We say that a random variable X is normally distributed with parameter (µ, σ^2) when X has the density function (10). The parameter µ, called a mean (or location parameter), provides the center of the density, and the density function f(x) is symmetric around µ. The parameter σ is a standard deviation (or a scale parameter); small values of σ lead to high peaks but sharp drops, while larger values of σ lead to flatter densities. The shorthand notation

X ∼ N(µ, σ^2)

is often used to express that X is a normal random variable with parameter (µ, σ^2).

11.1 Linear transform of normal random variable

One of the important properties of the normal distribution is that if X is a normal random variable with parameter (µ, σ^2) [that is, the pdf of X is given by the density function (10)], then Y = aX + b is also a normal random variable with parameter (aµ + b, (aσ)^2). In particular,

(X − µ)/σ   (12)

becomes a normal random variable with parameter (0, 1), called the z-score. Then the normal density φ(x) with parameter (0, 1) becomes

φ(x) := (1/√(2π)) e^{−x^2/2},

which is known as the standard normal density. We define the cdf Φ by

Φ(t) := ∫_{−∞}^{t} φ(x) dx.

11.2 Moment generating function

Let X be a standard normal random variable. Then we can compute the mgf of X as follows.

M_X(t) = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2 + t^2/2} dx
       = e^{t^2/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2} dx = exp(t^2/2).

Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf M(t) of Y can be given by

M(t) = e^{µt} M_X(σt) = exp(σ^2 t^2/2 + µt).

Then we can find the first and second moments

E[Y] = M′(0) = µ   and   E[Y^2] = M″(0) = µ^2 + σ^2.

Thus, we can obtain Var(Y) = (µ^2 + σ^2) − µ^2 = σ^2.

11.3 Calculating probability

Suppose that a tomato plant height X is normally distributed with parameter (µ, σ^2). Then what is the probability that the tomato plant height is between a and b?

How to calculate. The integration

P(a ≤ X ≤ b) = ∫_a^b (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)} dx

seems too complicated. But if we consider the random variable (X − µ)/σ, then the event a ≤ X ≤ b is equivalent to the event

(a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ.

In terms of probability, this means that

P(a ≤ X ≤ b) = P((a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ).   (13)

And it is easy to calculate

a′ = (a − µ)/σ   and   b′ = (b − µ)/σ.

Since the z-score (12) is distributed as the standard normal distribution, the above probability becomes

P(a ≤ X ≤ b) = P(a′ ≤ (X − µ)/σ ≤ b′) = ∫_{a′}^{b′} φ(x) dx = Φ(b′) − Φ(a′).   (14)

Then we will be able to look up the values of Φ(a′) and Φ(b′) in the table of the standard normal distribution. For example, if a′ = −0.19 and b′ = 0.29, then Φ(−0.19) = 0.4247 and Φ(0.29) = 0.6141, and the probability of interest becomes 0.1894, or approximately 0.19.

Suppose that a random variable X is normally distributed with mean µ and standard deviation σ. Then we can obtain

1. P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ);

2. P(X ≤ b) = Φ((b − µ)/σ);

3. P(a ≤ X) = 1 − Φ((a − µ)/σ).
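A standard normal table can be replaced by a short computation. The following Python sketch (an assumed helper, not part of the notes) evaluates Φ through the error function and reproduces the example with a′ = −0.19 and b′ = 0.29.

from math import erf, sqrt

def Phi(z):                       # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_between(a, b, mu, sigma):          # formula 1 above
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# With mu = 0, sigma = 1 the bounds are already z-scores:
print(round(prob_between(-0.19, 0.29, 0.0, 1.0), 4))   # about 0.1894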

11.4 Exercises

1. Let X be a normal random variable with parameter (µ, σ^2), and let Y = aX + b with a ≠ 0.

(a) Show that

P(Y ≤ t) = ∫_{−∞}^{t} f(x) dx,

where

f(x) = (1/(|a|σ√(2π))) exp[−(x − aµ − b)^2/(2(aσ)^2)].

(b) Conclude that Y is a normal random variable with parameter (aµ + b, (aσ)2).

2. Let X be a normal random variable with parameter (µ, σ^2), and let Y = e^X. Find the density function f of Y by forming

P(Y ≤ t) = ∫_0^t f(x) dx.

This density function f is called the lognormal density since ln Y is normally distributed.

3. Show that the standard normal density integrates to unity via (a) and (b).

(a) Show that

∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x^2+y^2)/2} dx dy = ∫_0^∞ ∫_0^{2π} e^{−r^2/2} r dθ dr = 2π.

(b) Show that

(1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x−µ)^2/(2σ^2)} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−y^2/2} dy = 1.

4. Suppose that in a certain population, individuals' heights are approximately normally distributed with parameters µ = 70 in and σ = 3 in.

(a) What proportion of the population is over 6 ft tall?

(b) What is the distribution of heights if they are expressed in centimeters?

5. Let X be a normal random variable with µ = 5 and σ = 10. Find (a) P(X > 10), (b) P(−20 < X < 15), and (c) the value of x such that P(X > x) = 0.05.

6. If X ∼ N(µ, σ^2), show that P(|X − µ| ≤ 0.675σ) = 0.5.

7. If X ∼ N(µ, σ^2), find the value of c in terms of σ such that P(µ − c ≤ X ≤ µ + c) = 0.95.

12 Solutions to exercises

12.1 Exponential and gamma distributions

1. (a) P(Y ≤ t) = P(cX ≤ t) = P(X ≤ t/c) = ∫_{−∞}^{t/c} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = ∫_{−∞}^{t} ((λ/c)^α/Γ(α)) y^{α−1} e^{−(λ/c)y} dy, by substituting x = y/c. (b) The integrand f(y) is a gamma density with parameter (α, λ/c).

2. Let X be the lifetime of the electronic component.

(a) P(X ≤ 10) = 1 − e^{−(0.2)(10)} ≈ 0.8647.

(b) P(5 ≤ X ≤ 15) = P(X ≤ 15) − P(X ≤ 5) = e^{−(0.2)(5)} − e^{−(0.2)(15)} ≈ 0.3181.

3. By solving P(X < 1) = 1 − e^{−λ} = 0.05, we obtain λ = − ln 0.95 ≈ 0.0513.

4. (a) Let p1 be the probability that the component breaks down before 5 months, and let p2 be the probability that the component breaks down before 50 months. Then,

p1 = 1 − e^{−5µ} ≈ 1 − (1 − 0.0005) = 0.0005;
p2 = 1 − e^{−50µ} ≈ 1 − (1 − 0.005) = 0.005.

(b) Let X1 be the number of components breaking down before 5 months, and let X2 be the number of components breaking down before 50 months. Then Xi has the binomial distribution with parameter (pi, 1000) for i = 1, 2. By using the Poisson approximation with λ1 = 1000 p1 = 0.5 and λ2 = 1000 p2 = 5, we can obtain

P(X1 ≤ 2) = e^{−λ1} (1 + λ1 + λ1^2/2) ≈ 0.986;
P(X2 ≤ 2) = e^{−λ2} (1 + λ2 + λ2^2/2) ≈ 0.125.

5. Let X be the length of a phone call in minutes.

(a) By solving P(X ≤ 20) = 1 − e^{−20λ} = 0.5, we obtain λ = −(ln 0.5)/20 ≈ 0.0347.

(b) P(10 ≤ X ≤ 20) = (1 − e^{−20λ}) − (1 − e^{−10λ}) ≈ 0.207.

Optional Problems.

1. (a) Since P(X > t + s | X > s) = P(X > t + s)/P(X > s) = S(t + s)/S(s), we obtain S(t + s)/S(s) = S(t), that is, S(t + s) = S(s)S(t).

(b) Let κ = S(1) = S(1/2 + 1/2) = [S(1/2)]^2 > 0.

(c) Since κ = S(1) = S(1/n + · · · + 1/n) = [S(1/n)]^n, we have S(1/n) = κ^{1/n}. Furthermore, S(m/n) = S(1/n + · · · + 1/n) = [S(1/n)]^m = κ^{m/n}.

2. (a) Γ(1) = ∫_0^∞ e^{−u} du = 1.

(b) Γ(α + 1) = ∫_0^∞ u^α e^{−u} du = [−u^α e^{−u}]_0^∞ + α ∫_0^∞ u^{α−1} e^{−u} du = αΓ(α).

(c) Γ(n) = (n − 1)Γ(n − 1) = · · · = (n − 1)!.

(d) Γ(n/2) = (n/2 − 1)(n/2 − 2) · · · (1/2) Γ(1/2) = (n − 2)(n − 4) · · · (1) Γ(1/2)/2^{(n−1)/2} = √π (n − 1)! / (2^{n−1} ((n−1)/2)!).

12.2 Normal distributions

1. (a) Assume a > 0. Then

P(Y ≤ t) = P(aX + b ≤ t) = P(X ≤ (t − b)/a) = ∫_{−∞}^{(t−b)/a} (1/(σ√(2π))) exp[−(x − µ)^2/(2σ^2)] dx
         = ∫_{−∞}^{t} (1/(aσ√(2π))) exp[−(y − aµ − b)^2/(2(aσ)^2)] dy

by substituting x = (y − b)/a. When a < 0, we can similarly obtain

P(Y ≤ t) = ∫_{−∞}^{t} (1/(|a|σ√(2π))) exp[−(y − aµ − b)^2/(2(aσ)^2)] dy.

(b) The integrand f(y) is a normal density with parameter (aµ + b, (aσ)^2).

2. For t > 0 we obtain

P(Y ≤ t) = P(e^X ≤ t) = P(X ≤ ln t) = ∫_{−∞}^{ln t} (1/(σ√(2π))) exp[−(x − µ)^2/(2σ^2)] dx
         = ∫_0^t (1/(yσ√(2π))) exp[−(ln y − µ)^2/(2σ^2)] dy

by substituting x = ln y. Thus, the density is given by f(y) = (1/(yσ√(2π))) exp[−(ln y − µ)^2/(2σ^2)].

3. (a) ∫_0^∞ ∫_0^{2π} e^{−r^2/2} r dθ dr = 2π ∫_0^∞ r e^{−r^2/2} dr = 2π ∫_0^∞ e^{−u} du = 2π.

(b) Substitute x = σy + µ.

4. Let X be the height in inches, normally distributed with µ = 70 in and σ = 3 in.

(a) P(X ≥ (6)(12)) ≈ 1 − Φ(0.67) ≈ 0.25.

(b) (2.54)X is the height in centimeters, and is normally distributed with µ = (2.54)(70) = 177.8 cm and σ = (2.54)(3) = 7.62 cm.

5. (a) P (X > 10) = 1 − Φ(0.5) ≈ 0.3085. (b) P (−20 < X < 15) = Φ(1) − Φ(−2.5) ≈ 0.8351.

(c) Observe that P (X > x) = 1−Φ((x−5)/10) = 0.05. Since Φ(1.64) ≈ 0.95, we can find (x−5)/10 ≈ 1.64,and therefore, x ≈ 21.4.

6. P (|X − µ| ≤ 0.675σ) = P (−0.675σ ≤ X − µ ≤ 0.675σ) = P (−0.675 ≤ (X − µ)/σ ≤ 0.675) = Φ(0.675) −Φ(−0.675) ≈ 0.5.

7. Observing that P (µ − c ≤ X ≤ µ + c) = P (|(X − µ)/σ| ≤ c/σ) = 0.95, we obtain P ((X − µ)/σ ≤ c/σ) =Φ(c/σ) = 0.975. Since Φ(1.96) ≈ 0.975, we can find c/σ ≈ 1.96, and therefore, c = 1.96σ.

13 Summary Diagram

Normal(µ, σ^2)
f(x) = (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)},   −∞ < x < ∞
E[X] = µ, Var(X) = σ^2

Exponential(λ)
f(x) = λe^{−λx},   x ≥ 0
E[X] = 1/λ, Var(X) = 1/λ^2

Gamma(α, λ)
f(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx},   x ≥ 0   [†]
E[X] = α/λ, Var(X) = α/λ^2

Chi-square (with m degrees of freedom)
f(x) = (1/2) e^{−x/2} (x/2)^{m/2−1} / Γ(m/2),   x ≥ 0
E[X] = m, Var(X) = 2m

[†] Γ(n) = (n − 1)! when n is a positive integer.

[Diagram: the distributions above are connected by arrows (1)–(7), which correspond to the following relationships.]

1. If X1 and X2 are independent normal random variables with respective parameters (µ1, σ1^2) and (µ2, σ2^2), then X = X1 + X2 is a normal random variable with parameter (µ1 + µ2, σ1^2 + σ2^2).

2. If X1, · · · , Xm are iid normal random variables with parameter (µ, σ^2), then X = Σ_{k=1}^{m} ((Xk − µ)/σ)^2 is a chi-square (χ^2) random variable with m degrees of freedom.

3. α = m/2 and λ = 1/2.

4. If X1, . . . , Xn are independent exponential random variables with respective parameters λ1, · · · , λn, then X = min(X1, . . . , Xn) is an exponential random variable with parameter Σ_{i=1}^{n} λi.

5. If X1, · · · , Xn are iid exponential random variables with parameter λ, then X = Σ_{k=1}^{n} Xk is a gamma random variable with parameter (α = n, λ).

6. α = 1.

7. If X1 and X2 are independent random variables having gamma distributions with respective parameters (α1, λ) and (α2, λ), then X = X1 + X2 is a gamma random variable with parameter (α1 + α2, λ).

14 Bivariate Normal Distribution

14.1 Review of matrix algebra.

Matrices are usually denoted by capital letters, A, B, C, . . ., while scalars (real numbers) are denoted by lowercase letters, a, b, c, . . ..

(Here [a b; c d] denotes the 2-by-2 matrix with first row (a, b) and second row (c, d).)

1. Scalar multiplication: k [a b; c d] = [ka kb; kc kd].

2. Product of two matrices: [a b; c d][e f; g h] = [ae + bg  af + bh; ce + dg  cf + dh].

3. Determinant of a matrix: det [a b; c d] = ad − bc.

4. The inverse of a matrix: if A = [a b; c d] and det A ≠ 0, then the inverse A^{−1} of A is given by

A^{−1} = (1/det A) [d −b; −c a] = (1/(ad − bc)) [d −b; −c a],

and satisfies AA^{−1} = A^{−1}A = [1 0; 0 1].

5. The transpose of a matrix: [a b; c d]^T = [a c; b d].

14.2 Quadratic forms

If a 2-by-2 matrix A satisfies A^T = A, then it is of the form [a b; b c], and is said to be symmetric. A 2-by-1 matrix x = [x; y] is called a column vector, usually denoted by boldface letters x, y, . . ., and the transpose x^T = [x y] is called the row vector. Then, the bivariate function

Q(x, y) = x^T A x = [x y][a b; b c][x; y] = ax^2 + 2bxy + cy^2   (15)

is called a quadratic form.

If ac − b^2 ≠ 0 and ac ≠ 0, then the symmetric matrix can be decomposed into the following forms, known as Cholesky decompositions:

[a b; b c] = [√a  0; b/√a  √((ac − b^2)/a)] [√a  b/√a; 0  √((ac − b^2)/a)]   (16)

           = [√((ac − b^2)/c)  b/√c; 0  √c] [√((ac − b^2)/c)  0; b/√c  √c]   (17)

Corresponding to the Cholesky decompositions of the matrix A, the quadratic form Q(x, y) can also be decomposed as follows:

Q(x, y) = (√a x + (b/√a) y)^2 + (√((ac − b^2)/a) y)^2   (18)

        = (√((ac − b^2)/c) x)^2 + ((b/√c) x + √c y)^2   (19)
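The decomposition (16) can be verified numerically for a concrete symmetric matrix. The following Python sketch uses the assumed example a = 4, b = 1, c = 3 (an arbitrary positive definite choice, not from the notes).

import numpy as np

a, b, c = 4.0, 1.0, 3.0
A = np.array([[a, b], [b, c]])

# The lower-triangular factor written out as in (16):
L = np.array([[np.sqrt(a),      0.0],
              [b / np.sqrt(a),  np.sqrt((a * c - b**2) / a)]])

print(np.allclose(L @ L.T, A))                 # True: L L^T recovers A
print(np.allclose(L, np.linalg.cholesky(A)))   # matches numpy's Cholesky factor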

14.3 Bivariate normal density

The function

f(x, y) = (1/(2πσxσy√(1 − ρ^2))) exp(−(1/2) Q(x, y))   (20)

with the quadratic form

Q(x, y) = (1/(1 − ρ^2)) [((x − µx)/σx)^2 + ((y − µy)/σy)^2 − 2ρ (x − µx)(y − µy)/(σxσy)]   (21)

gives the joint density function of a bivariate normal distribution. Note that the parameters σx^2, σy^2, and ρ must satisfy σx^2 > 0, σy^2 > 0, and −1 < ρ < 1. By defining the 2-by-2 symmetric matrix (also known as the covariance matrix) and the two column vectors

Σ = [σx^2  ρσxσy; ρσxσy  σy^2],   x = [x; y]   and   µ = [µx; µy],

the quadratic form can be expressed as

Q(x, y) = (x − µ)^T Σ^{−1} (x − µ) = [x − µx  y − µy] [σx^2  ρσxσy; ρσxσy  σy^2]^{−1} [x − µx; y − µy].   (22)

Suppose that two random variables X and Y have the bivariate normal distribution (20). Then we can find the following properties.

1. Their marginal distributions fX(x) and fY(y) become

fX(x) = (1/(√(2π)σx)) exp[−(1/2)((x − µx)/σx)^2]   and   fY(y) = (1/(√(2π)σy)) exp[−(1/2)((y − µy)/σy)^2].

Thus, X and Y are normally distributed with respective parameters (µx, σx^2) and (µy, σy^2).

2. If ρ = 0, then the quadratic form (21) becomes

Q(x, y) = ((x − µx)/σx)^2 + ((y − µy)/σy)^2,

and consequently we have f(x, y) = fX(x) fY(y). Thus, X and Y are independent if and only if ρ = 0.

3. If X and Y are not independent (that is, ρ ≠ 0), we can compute the conditional density function fY|X(y|x) given X = x as

fY|X(y|x) = (1/(√(2π)σy√(1 − ρ^2))) exp[−(1/2)((y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ^2)))^2],

which is the normal density function with parameter (µy + ρ(σy/σx)(x − µx), σy^2(1 − ρ^2)). Similarly, the conditional density function fX|Y(x|y) given Y = y becomes

fX|Y(x|y) = (1/(√(2π)σx√(1 − ρ^2))) exp[−(1/2)((x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ^2)))^2],

which is the normal density function with parameter (µx + ρ(σx/σy)(y − µy), σx^2(1 − ρ^2)).

14.4 Covariance of bivariate normal random variables

Let (X, Y) be bivariate normal random variables with parameters (µx, µy, σx^2, σy^2, ρ). Recall that f(x, y) = f(x) f(y|x), and that f(y|x) is the normal density with mean µy + ρ(σy/σx)(x − µx) and variance σy^2(1 − ρ^2). First we can compute

∫_{−∞}^{∞} (y − µy) f(y|x) dy = ρ(σy/σx)(x − µx).

Then we can apply it to obtain

Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)(y − µy) f(x, y) dy dx
          = ∫_{−∞}^{∞} (x − µx) [∫_{−∞}^{∞} (y − µy) f(y|x) dy] f(x) dx
          = ρ(σy/σx) ∫_{−∞}^{∞} (x − µx)^2 f(x) dx = ρ(σy/σx) σx^2 = ρσxσy.

The correlation coefficient of X and Y is defined by

ρ = Cov(X, Y)/√(Var(X) Var(Y)).

It implies that the parameter ρ of the bivariate normal distribution represents the correlation coefficient of X and Y.
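The identity Cov(X, Y) = ρσxσy can be checked by simulation. The following Python sketch uses arbitrary assumed parameter values (not from the notes); only the comparison at the end matters.

import numpy as np

mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -2.0, 2.0, 3.0, 0.6
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2             ]])

rng = np.random.default_rng(0)
X, Y = rng.multivariate_normal([mu_x, mu_y], Sigma, size=200_000).T

print(np.cov(X, Y)[0, 1])          # sample covariance, close to 3.6
print(rho * sigma_x * sigma_y)     # 3.6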

14.5 Exercises

1. Verify the Cholesky decompositions by multiplying the matrices in each of (16) and (17).

2. Show that the right-hand side of (18) equals (15) by calculating

Q(x, y) = [x y] [√a  0; b/√a  √((ac − b^2)/a)] [√a  b/√a; 0  √((ac − b^2)/a)] [x; y].

Similarly show that the right-hand side of (19) equals (15).

3. Show that (21) is equal to (22).

4. Show that (22) has the following decomposition

Q(x, y) = ((x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ^2)))^2 + ((y − µy)/σy)^2   (E1)

        = ((x − µx)/σx)^2 + ((y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ^2)))^2   (E2)

by completing the following:

(a) Find the Cholesky decompositions of Σ−1.

(b) Verify (E1) and (E2) by finding the Cholesky decompositions of Σ−1.

5. Compute fX(x) by completing the following:

(a) Show that ∫_{−∞}^{∞} f(x, y) dy becomes

(1/(√(2π)σx)) exp[−(1/2)((x − µx)/σx)^2] × (1/(√(2π)σy√(1 − ρ^2))) ∫_{−∞}^{∞} exp[−(1/2)((y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ^2)))^2] dy.

(b) Show that

(1/(√(2π)σy√(1 − ρ^2))) ∫_{−∞}^{∞} exp[−(1/2)((y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ^2)))^2] dy = (1/√(2π)) ∫_{−∞}^{∞} exp[−(1/2)y^2] dy = 1.

Now find fY|X(y|x) given X = x.

15 Conditional Expectation

Let f(x, y) be a joint density function for random variables X and Y, and let fY|X(y|x) be the conditional density function of Y given X = x (see Lecture Note No. ?? for definition). Then the conditional expectation of Y given X = x, denoted by E(Y|X = x), is defined as the function h(x) of x:

E(Y|X = x) := h(x) = ∫_{−∞}^{∞} y fY|X(y|x) dy.

In particular, h(X) is a function of the random variable X, and is called the conditional expectation of Y given X, denoted by E(Y|X). Similarly, we can define the conditional expectation for a function g(Y) of the random variable Y by

E(g(Y)|X = x) := ∫_{−∞}^{∞} g(y) fY|X(y|x) dy.

Since E(g(Y)|X = x) is a function, say h(x), of x, we can define E(g(Y)|X) as the function h(X) of the random variable X.

15.1 Properties of conditional expectation

(a) Linearity. Let a and b be constants. By the definition of conditional expectation, it clearly follows that E(aY + b|X = x) = aE(Y|X = x) + b. Consequently,

E(aY + b|X) = aE(Y|X) + b.

(b) Law of total expectation. Since E(g(Y)|X) is a function h(X) of the random variable X, we can consider "the expectation E[E(g(Y)|X)] of the conditional expectation E(g(Y)|X)," and compute it as follows.

E[E(g(Y)|X)] = ∫_{−∞}^{∞} h(x) fX(x) dx = ∫_{−∞}^{∞} [∫_{−∞}^{∞} g(y) fY|X(y|x) dy] fX(x) dx
             = ∫_{−∞}^{∞} [∫_{−∞}^{∞} fY|X(y|x) fX(x) dx] g(y) dy = ∫_{−∞}^{∞} g(y) fY(y) dy = E[g(Y)].

Thus, we have E[E(g(Y)|X)] = E[g(Y)].

(c) Conditional variance formula. The conditional variance, denoted by Var(Y|X = x), can be defined as

Var(Y|X = x) := E((Y − E(Y|X = x))^2 | X = x) = E(Y^2|X = x) − (E(Y|X = x))^2,

where the second equality can be obtained from the linearity property in (a). Since Var(Y|X = x) is a function, say h(x), of x, we can define Var(Y|X) as the function h(X) of the random variable X. Now compute "the variance Var(E(Y|X)) of the conditional expectation E(Y|X)" and "the expectation E[Var(Y|X)] of the conditional variance Var(Y|X)" as follows.

Var(E(Y|X)) = E[(E(Y|X))^2] − (E[E(Y|X)])^2 = E[(E(Y|X))^2] − (E[Y])^2   (1)

E[Var(Y|X)] = E[E(Y^2|X)] − E[(E(Y|X))^2] = E[Y^2] − E[(E(Y|X))^2]   (2)

By adding (1) and (2) together, we can obtain

Var(E(Y|X)) + E[Var(Y|X)] = E[Y^2] − (E[Y])^2 = Var(Y).

(d) In calculating the conditional expectation E(g1(X)g2(Y)|X = x) given X = x, we can treat g1(X) as the constant g1(x); thus, we have E(g1(X)g2(Y)|X = x) = g1(x)E(g2(Y)|X = x). This justifies

E(g1(X)g2(Y)|X) = g1(X)E(g2(Y)|X).

15.2 Prediction

Suppose that we have two random variables X and Y. Having observed X = x, one may attempt to "predict" the value of Y as a function g(x) of x. The function g(X) of the random variable X can be constructed for this purpose, and is called a predictor of Y. The criterion for the "best" predictor g(X) is to find a function g minimizing E[(Y − g(X))^2]. We can compute E[(Y − g(X))^2] by applying the properties (a)–(d) in Lecture Summary No. 21.

E[(Y − g(X))^2] = E[(Y − E(Y|X) + E(Y|X) − g(X))^2]
= E[(Y − E(Y|X))^2 + 2(Y − E(Y|X))(E(Y|X) − g(X)) + (E(Y|X) − g(X))^2]
= E[E((Y − E(Y|X))^2|X)] + E[2(E(Y|X) − g(X)) E((Y − E(Y|X))|X)] + E[(E(Y|X) − g(X))^2]
= E[Var(Y|X)] + E[(E(Y|X) − g(X))^2],

which is minimized when g(x) = E(Y|X = x). Thus, we can find g(X) = E(Y|X) as the best predictor of Y.

The best predictor for bivariate normal distributions. When X and Y have the bivariate normal distribution with parameter (µx, µy, σx^2, σy^2, ρ), we can find the best predictor

E(Y|X) = µy + ρ(σy/σx)(X − µx).
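The claim that E(Y|X) is the best predictor can be illustrated by simulation. The following Python sketch (assumed parameter values, not from the notes) compares the mean squared error of E(Y|X) with that of a predictor that ignores X.

import numpy as np

mu_x, mu_y, sigma_x, sigma_y, rho = 0.0, 0.0, 1.0, 2.0, 0.7
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2             ]])

rng = np.random.default_rng(1)
X, Y = rng.multivariate_normal([mu_x, mu_y], Sigma, size=100_000).T

best = mu_y + rho * (sigma_y / sigma_x) * (X - mu_x)   # E(Y|X)
naive = np.full_like(Y, mu_y)                          # predictor ignoring X

print(np.mean((Y - best) ** 2))    # about sigma_y^2*(1 - rho^2) = 2.04
print(np.mean((Y - naive) ** 2))   # about sigma_y^2 = 4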

15.3 The best linear predictor

Except for the case of the bivariate normal distribution, the conditional expectation E(Y|X) could be a complicated nonlinear function of X. So we can instead consider the best linear predictor

g(X) = α + βX

of Y by minimizing E[(Y − (α + βX))^2]. Since

E[(Y − (α + βX))^2] = E[Y^2] − 2αµy − 2βE[XY] + α^2 + 2αβµx + β^2 E[X^2],

where µx = E[X] and µy = E[Y], we can obtain the desired values α and β by solving

∂/∂α E[(Y − (α + βX))^2] = −2µy + 2α + 2βµx = 0
∂/∂β E[(Y − (α + βX))^2] = −2E[XY] + 2αµx + 2βE[X^2] = 0.

The solution can be expressed by using σx^2 = Var(X), σy^2 = Var(Y), and ρ = Cov(X, Y)/(σxσy):

β = (E[XY] − E[X]E[Y])/(E[X^2] − (E[X])^2) = ρ(σy/σx),
α = µy − βµx = µy − ρ(σy/σx)µx.

In summary, the best linear predictor g(X) becomes

g(X) = µy + ρ(σy/σx)(X − µx).
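The coefficients α and β can also be estimated from data by replacing the moments with sample averages. The following Python sketch simulates data from an assumed linear model (not from the notes) and recovers α and β from the formulas above.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(1.0, 2.0, size=50_000)
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.0, size=X.size)   # assumed data-generating model

mu_x, mu_y = X.mean(), Y.mean()
beta = ((X * Y).mean() - mu_x * mu_y) / ((X**2).mean() - mu_x**2)
alpha = mu_y - beta * mu_x

print(round(alpha, 3), round(beta, 3))   # close to 3.0 and 0.5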

15.4 Exercises

1. Suppose that the joint density function f(x, y) of X and Y is given by

f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.

(a) Find the covariance and the correlation of X and Y .

(b) Find the conditional expectations E(X |Y = y) and E(Y |X = x).

(c) Find the best linear predictor of Y in terms of X .

(d) Find the best predictor of Y .

2. Suppose that we have the random variables X and Y having the joint density

f(x, y) = (6/7)(x + y)^2 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.

(a) Find the covariance and correlation of X and Y .

(b) Find the conditional expectation E(Y |X = x).

(c) Find the best linear predictor of Y .

3. Let X be the age of the pregnant woman, and let Y be the age of the prospective father at a certain hospital. Suppose that the distribution of X and Y can be approximated by a bivariate normal distribution with parameters µx = 28, µy = 32, σx = 6, σy = 8 and ρ = 0.8.

(a) Find the probability that the prospective father is over 30.

(b) When the pregnant woman is 31 years old, what is the best guess of the prospective father’s age?

(c) When the pregnant woman is 31 years old, find the probability that the prospective father is over 30.

16 Transformations of Variables

16.1 Transformations of a random variable

For an interval A = [a, b] on the real line with a < b, the integration of a function f(x) over A is formally written as

∫_A f(x) dx = ∫_a^b f(x) dx.   (23)

Let h(u) be a differentiable and strictly monotonic function (that is, either strictly increasing or strictly decreasing) on the real line. Then there is the inverse function g = h^{−1}, so that we can define the inverse image h^{−1}(A) of A = [a, b] by

h^{−1}(A) = g(A) = {g(x) : a ≤ x ≤ b}.

The change of variable in the integral (23) provides the formula

∫_A f(x) dx = ∫_{g(A)} f(h(u)) |d/du h(u)| du.   (24)

Theorem A. Let X be a random variable with pdf fX(x). Suppose that a function g(x) is differentiable and strictly monotonic. Then the random variable Y = g(X) has the pdf

fY(y) = fX(g^{−1}(y)) |d/dy g^{−1}(y)|.   (25)

A sketch of proof. Since g(x) is strictly monotonic, we can find the inverse function h = g^{−1}, and apply the formula (24). Thus, by letting A = (−∞, t], we can rewrite the cdf of Y as

FY(t) = P(Y ≤ t) = P(Y ∈ A) = P(g(X) ∈ A) = P(X ∈ h(A))
      = ∫_{h(A)} fX(x) dx = ∫_{g(h(A))} fX(h(y)) |d/dy h(y)| dy = ∫_A fX(g^{−1}(y)) |d/dy g^{−1}(y)| dy.

By comparing it with the form ∫_{−∞}^{t} fY(y) dy, we can obtain (25).

Linear transform of normal random variable. Let X be a normal random variable with parameter (µ, σ^2). Then we can define the linear transform Y = aX + b with real values a ≠ 0 and b. To compute the density function fY(y) of the random variable Y, we can apply Theorem A and obtain

fY(y) = (1/(|a|σ√(2π))) exp[−(y − (aµ + b))^2/(2(aσ)^2)].

This is the normal density function with parameter (aµ + b, (aσ)2).
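Theorem A can be checked by simulation for any strictly monotonic transform. The following Python sketch uses the assumed example X ~ Exponential(λ) with Y = g(X) = √X, so that formula (25) gives fY(y) = 2λy e^{−λy^2} for y ≥ 0 (the value λ = 1.5 is arbitrary).

import numpy as np

lam = 1.5
rng = np.random.default_rng(3)
Y = np.sqrt(rng.exponential(scale=1.0 / lam, size=200_000))

# Compare a histogram estimate of the density with the formula from (25).
hist, edges = np.histogram(Y, bins=60, range=(0.0, 2.5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
formula = lam * np.exp(-lam * centers**2) * 2 * centers

print(np.max(np.abs(hist - formula)))   # should be small (Monte Carlo and binning error only)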

Example. Let X be a uniform random variable on [0, 1]. Suppose that X takes nonnegative values [andtherefore, f(x) = 0 for x ≤ 0]. Find the pdf fY (y) for the random variable Y = Xn.

16.2 Transformations of bivariate random variables

For a rectangle A = {(x, y) : a1 ≤ x ≤ b1, a2 ≤ y ≤ b2} in the plane, the integration of a function f(x, y) over A is formally written as

∫∫_A f(x, y) dx dy = ∫_{a2}^{b2} ∫_{a1}^{b1} f(x, y) dx dy.   (26)

Suppose that a transformation (g1(x, y), g2(x, y)) is differentiable and has the inverse transformation (h1(u, v), h2(u, v)) satisfying

x = h1(u, v), y = h2(u, v)   if and only if   u = g1(x, y), v = g2(x, y).   (27)

Then we can state the change of variable in the double integral (26) as follows: (i) define the inverse image of A by

h^{−1}(A) = g(A) = {(g1(x, y), g2(x, y)) : (x, y) ∈ A};

(ii) calculate the Jacobian

Jh(u, v) = det [∂h1(u, v)/∂u  ∂h1(u, v)/∂v; ∂h2(u, v)/∂u  ∂h2(u, v)/∂v];   (28)

(iii) finally, if it does not vanish for all u, v, then we can obtain

∫∫_A f(x, y) dx dy = ∫∫_{g(A)} f(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.   (29)

Theorem B. Suppose that (a) random variables X and Y have a joint density function fX,Y(x, y), (b) a transformation (g1(x, y), g2(x, y)) is differentiable, and (c) its inverse transformation (h1(u, v), h2(u, v)) satisfies (27). Then the random variables U = g1(X, Y) and V = g2(X, Y) have the joint density function

fU,V(u, v) = fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)|.   (30)

A sketch of proof. Define A = {(x, y) : x ≤ s, y ≤ t} and the image of A under the inverse transformation (h1(u, v), h2(u, v)) by

h(A) = {(h1(u, v), h2(u, v)) : (u, v) ∈ A}.

Then we can rewrite the joint distribution function FU,V of (U, V) by applying the formula (29).

FU,V(s, t) = P((U, V) ∈ A) = P((g1(X, Y), g2(X, Y)) ∈ A) = P((X, Y) ∈ h(A))
           = ∫∫_{h(A)} fX,Y(x, y) dx dy = ∫∫_{g(h(A))} fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv
           = ∫∫_A fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.

By comparing it with the form ∫_{−∞}^{t} ∫_{−∞}^{s} fU,V(u, v) du dv, we can obtain (30).
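The Jacobian (28) can be computed symbolically. The following sympy sketch (an illustration, not part of the notes) uses the assumed inverse transformation h1(u, v) = uv, h2(u, v) = u − uv, which corresponds to u = x + y and v = x/(x + y).

import sympy as sp

u, v = sp.symbols('u v')
h = sp.Matrix([u * v, u - u * v])      # (h1, h2) as a column vector

J = h.jacobian([u, v])                 # matrix of partial derivatives as in (28)
print(J)                               # Matrix([[v, u], [1 - v, -u]])
print(sp.simplify(J.det()))            # -u, so |Jh(u, v)| = |u|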

16.3 Exercises

1. Suppose that (X1, X2) has a bivariate normal distribution with mean vector µ = [µx; µy] and covariance matrix Σ = [σx^2  ρσxσy; ρσxσy  σy^2]. Consider the linear transformation

Y1 = g1(X1, X2) = a1 X1 + b1;
Y2 = g2(X1, X2) = a2 X2 + b2.

(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).

(b) Find the joint density function fY1,Y2(u, v) for (Y1, Y2).

(c) Conclude that the joint density function fY1,Y2(u, v) is the bivariate normal density with mean vector µ = [a1µx + b1; a2µy + b2] and covariance matrix Σ = [(a1σx)^2  ρ(a1σx)(a2σy); ρ(a1σx)(a2σy)  (a2σy)^2].

2. Let X1 and X2 be iid standard normal random variables. Consider the linear transformation

Y1 = g1(X1, X2) = a11 X1 + a12 X2 + b1;
Y2 = g2(X1, X2) = a21 X1 + a22 X2 + b2,

where a11 a22 − a12 a21 ≠ 0. Find the joint density function fY1,Y2(u, v) for (Y1, Y2), and conclude that it is the bivariate normal density with mean vector µ = [b1; b2] and covariance matrix Σ = [σx^2  ρσxσy; ρσxσy  σy^2], where σx^2 = a11^2 + a12^2, σy^2 = a21^2 + a22^2, and ρ = (a11 a21 + a12 a22)/√((a11^2 + a12^2)(a21^2 + a22^2)).

3. Suppose that X and Y are independent random variables with respective pdf's fX(x) and fY(y). Consider the transformation

U = g1(X, Y) = X + Y;
V = g2(X, Y) = X/(X + Y).

(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).

(b) Find the joint density function fX,Y(x, y) for (X, Y).

(c) Now let fX(x) and fY(y) be the gamma density functions with respective parameters (α1, λ) and (α2, λ). Then show that U = X + Y has the gamma distribution with parameter (α1 + α2, λ) and that V = X/(X + Y) has the density function

fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^{α1−1}(1 − v)^{α2−1}.   (31)

The density function (31) is called the beta distribution with parameter (α1, α2).

4. Let X and Y be random variables. The joint density function of X and Y is given by

f(x, y) = x e^{−(x+y)} if 0 ≤ x and 0 ≤ y, and f(x, y) = 0 otherwise.

(a) Are X and Y independent? Describe the marginal distributions for X and Y with specific parameters.

(b) Find the distribution for the random variable U = X + Y .

5. Consider an investment for three years in which you plan to invest $500 on average at the beginning of every year. The fund yields a fixed annual interest of 10%; thus, the fund invested at the beginning of the first year yields 33.1% at the end of the term (that is, at the end of the third year), the fund of the second year yields 21%, and the fund of the third year yields 10%.

(a) How much do you expect to have in your investment account at the end of term?

(b) From your past spending behavior, you assume that the annual investment you decide is independentevery year, and that its standard deviation is $200. (That is, the standard deviation for the amountof money you invest each year is $200.) Find the standard deviation for the total amount of moneyyou have in the account at the end of term.

(c) Furthermore, suppose that your annual investment is normally distributed. (That is, the amount ofmoney you invest each year is normally distributed.) Then what is the probability that you have morethan $2000 in the account at the end of term?

6. Let X be an exponential random variable with parameter λ, and define

Y = X^n,

where n is a positive integer.

(a) Compute the pdf fY(y) for Y.

(b) Find c so that

f(x) = c x^n e^{−λx} if x ≥ 0, and f(x) = 0 otherwise

is a pdf. Hint: You should know the name of the pdf f(x).

(c) By using (b), find the expectation E[Y].

7. Let X and Y be independent normal random variables with mean zero and the respective variances σx^2 and σy^2. Then consider the change of variables via

U = a11 X + a12 Y,
V = a21 X + a22 Y,   (32)

where D = a11 a22 − a12 a21 ≠ 0.

(a) Find the variances σu^2 and σv^2 of U and V, respectively.

(b) Find the inverse transformation (h1(u, v), h2(u, v)) so that X and Y can be expressed in terms of U and V via

X = h1(U, V);
Y = h2(U, V).

And compute the Jacobian Jh(u, v).

(c) Find ρ so that the joint density function fU,V(u, v) of U and V can be expressed as

fU,V(u, v) = (1/(2πσuσv√(1 − ρ^2))) exp(−(1/2) Q(u, v)).

(d) Express Q(u, v) by using σu, σv, and ρ to conclude that the joint density is the bivariate normal density with parameters µu = µv = 0, σu, σv, and ρ.

(e) Let X′ = X + µx and Y′ = Y + µy. Then consider the change of variables via

U′ = a11 X′ + a12 Y′,
V′ = a21 X′ + a22 Y′.

Express U′ and V′ in terms of U and V, and describe the joint distribution of U′ and V′ with specific parameters.

(f) Now suppose that X and Y are independent normal random variables with the respective parameters (µx, σx^2) and (µy, σy^2). Then what is the joint distribution of U and V given by (32)?

8. To complete this problem, you may use the following theorem:

Theorem C. Let X and Y be independent normal random variables with the respective parameters (µx, σx^2) and (µy, σy^2). If a11 a22 − a12 a21 ≠ 0, then the random variables

U = a11 X + a12 Y,
V = a21 X + a22 Y

have the bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu^2 = a11^2 σx^2 + a12^2 σy^2, σv^2 = a21^2 σx^2 + a22^2 σy^2, and

ρ = (a11 a21 σx^2 + a12 a22 σy^2)/√((a11^2 σx^2 + a12^2 σy^2)(a21^2 σx^2 + a22^2 σy^2)).

Now suppose that a real-valued datum X is transmitted through a noisy channel from location A to location B, and that the value U received at location B is given by

U = X + Y,

where Y is a normal random variable with mean zero and variance σy^2. Furthermore, we assume that X is normally distributed with mean µ and variance σx^2, and is independent of Y.

(a) Find the expectation E[U ] and the variance Var(U) of the received value U at location B.

(b) Describe the joint distribution for X and U with specific parameters.

(c) Provided that U = t is observed at location B, find the best predictor g(t) of X .

(d) Suppose that we have µ = 1, σx = 1 and σy = 0.5. When U = 1.3 was observed, find the bestpredictor of X given U = 1.3.

17 Solutions to exercises

17.1 Conditional expectation

1. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = (1/4) − (1/3)(2/3) = 1/36, and ρ = (1/36)/√((1/18)(1/18)) = 1/2.

(b) Since f(x|y) = 1/y for 0 ≤ x ≤ y and f(y|x) = 1/(1 − x) for x ≤ y ≤ 1, we obtain E(X|Y = y) = y/2 and E(Y|X = x) = (x + 1)/2.

(c) The best linear predictor g(X) of Y is given by g(X) = µy + ρ(σy/σx)(X − µx) = (1/2)X + 1/2.

(d) The best predictor of Y is E(Y|X) = (X + 1)/2. [This is the same as (c). Why?]

2. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = 17/42 − (9/14)^2 = −5/588, and ρ = (−5/588)/(199/2940) = −25/199.

(b) E(Y|X = x) = (6x^2 + 8x + 3)/(4(3x^2 + 3x + 1)).

(c) The best linear predictor g(X) of Y is given by g(X) = −(25/199)X + 144/199.

3. (a) P(Y ≥ 30) = P((Y − 32)/8 ≥ −1/4) = 1 − Φ(−0.25) ≈ 0.60.

(b) E(Y | X = 31) = 32 + (0.8)(8/6)(31 − 28) = 35.2.

(c) Since the conditional density fY|X(y | x) is the normal density with mean µ = 35.2 and variance σ^2 = (8)^2(1 − (0.8)^2) = 23.04, we can obtain

P(Y ≥ 30 | X = 31) ≈ P((Y − 35.2)/4.8 ≥ −1.08 | X = 31) = 1 − Φ(−1.08) ≈ 0.86.

17.2 Transformations of variables

1. (a) h1(u, v) = (u − b1)/a1 and h2(u, v) = (v − b2)/a2. Thus, Jh(u, v) = det [1/a1  0; 0  1/a2] = 1/(a1 a2).

(b) fY1,Y2(u, v) = fX1,X2((u − b1)/a1, (v − b2)/a2) · |1/(a1 a2)| = (1/(2π(|a1|σx)(|a2|σy)√(1 − ρ^2))) exp(−(1/2) Q(u, v)), where

Q(u, v) = (1/(1 − ρ^2)) [((u − (a1µx + b1))/(a1σx))^2 + ((v − (a2µy + b2))/(a2σy))^2 − 2ρ (u − (a1µx + b1))(v − (a2µy + b2))/((a1σx)(a2σy))].

(c) From (b), fY1,Y2(u, v) is a bivariate normal density with parameters µu = a1µx + b1, µv = a2µy + b2, σu = |a1|σx, σv = |a2|σy, and ρ if a1 a2 > 0 (or −ρ if a1 a2 < 0).

2. (a) h1(u, v) = (a22(u − b1) − a12(v − b2))/(a11 a22 − a12 a21) and h2(u, v) = (−a21(u − b1) + a11(v − b2))/(a11 a22 − a12 a21). Thus, Jh(u, v) = 1/(a11 a22 − a12 a21).

(b) Note that fX1,X2(x, y) = (1/(2π)) exp(−(1/2)(x^2 + y^2)). Then we obtain

fY1,Y2(u, v) = fX1,X2(h1(u, v), h2(u, v)) · |1/(a11 a22 − a12 a21)| = (1/(2π|a11 a22 − a12 a21|)) exp(−(1/2) Q(u, v)),

where

Q(u, v) = [(u − b1)  (v − b2)] [a11^2 + a12^2  a11 a21 + a12 a22; a11 a21 + a12 a22  a21^2 + a22^2]^{−1} [u − b1; v − b2].

(c) By comparing fY1,Y2(u, v) with a bivariate normal density, we can find that fY1,Y2(u, v) is a bivariate normal density with parameters µu = b1, µv = b2, σu^2 = a11^2 + a12^2, σv^2 = a21^2 + a22^2, and ρ = (a11 a21 + a12 a22)/√((a11^2 + a12^2)(a21^2 + a22^2)).

3. (a) h1(u, v) = uv and h2(u, v) = u − uv. Thus, Jh(u, v) = det [v  u; 1 − v  −u] = −u.

(b) fX,Y(x, y) = fX(x) fY(y).

(c) We can obtain

fU,V(u, v) = |−u| × fX(uv) × fY(u − uv)
           = u × (1/Γ(α1)) λ^{α1} e^{−λ(uv)} (uv)^{α1−1} × (1/Γ(α2)) λ^{α2} e^{−λ(u−uv)} (u − uv)^{α2−1}
           = fU(u) fV(v),

where fU(u) = (1/Γ(α1 + α2)) λ^{α1+α2} e^{−λu} u^{α1+α2−1} and fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^{α1−1}(1 − v)^{α2−1}. Thus, fU(u) is a gamma density with parameter (α1 + α2, λ), and fV(v) is a beta density with parameter (α1, α2).

4. (a) Observe that f(x, y) = (x e^{−x}) · (e^{−y}), and that x e^{−x}, x ≥ 0, is the gamma density with parameter (2, 1) and e^{−y}, y ≥ 0, is the exponential density with λ = 1. Thus, we can obtain

fX(x) = ∫_0^∞ f(x, y) dy = x e^{−x} ∫_0^∞ e^{−y} dy = x e^{−x},   x ≥ 0;
fY(y) = ∫_0^∞ f(x, y) dx = e^{−y} ∫_0^∞ x e^{−x} dx = e^{−y},   y ≥ 0.

Since f(x, y) = fX(x) fY(y), X and Y are independent.

(b) Notice that the exponential density fY(y) with λ = 1 can be seen as the gamma distribution with parameter (1, 1). Since the sum of independent gamma random variables again has a gamma distribution, U has the gamma distribution with parameter (3, 1).

5. Let X1 be the amount of money you invest in the first year, let X2 be the amount in the second year, and let X3 be the amount in the third year. Then the total amount of money you have at the end of the term, say Y, becomes

Y = 1.331 X1 + 1.21 X2 + 1.1 X3.

(a) E(1.331 X1 + 1.21 X2 + 1.1 X3) = 1.331 E(X1) + 1.21 E(X2) + 1.1 E(X3) = 3.641 × 500 = 1820.5.

(b) Var(1.331 X1 + 1.21 X2 + 1.1 X3) = (1.331)^2 Var(X1) + (1.21)^2 Var(X2) + (1.1)^2 Var(X3) = 4.446 × 200^2 ≈ 177800. Thus, the standard deviation is √Var(Y) ≈ √177800 ≈ 421.7.

(c) P(Y ≥ 2000) = P((Y − 1820.5)/421.7 ≥ (2000 − 1820.5)/421.7) ≈ 1 − Φ(0.43) ≈ 0.334.

6. (a) By using the change-of-variable formula with g(x) = x^n, we can obtain

fY(y) = fX(g^{−1}(y)) |d/dy g^{−1}(y)| = fX(y^{1/n}) · (1/n) y^{−(n−1)/n} = (λ/n) y^{−(n−1)/n} exp(−λ y^{1/n}),   y ≥ 0.

(b) Since the pdf f(x) is of the form of a gamma density with parameter (n + 1, λ), we must have f(x) = (λ^{n+1}/Γ(n + 1)) x^n e^{−λx}, x ≥ 0. Therefore, c = λ^{n+1}/Γ(n + 1) (= λ^{n+1}/n!, since n is an integer).

(c) Since f(x) in (b) is a pdf, we have ∫_0^∞ f(x) dx = ∫_0^∞ (λ^{n+1}/Γ(n + 1)) x^n e^{−λx} dx = 1. Thus, we can obtain

E[Y] = E[X^n] = ∫_0^∞ x^n (λ e^{−λx}) dx = Γ(n + 1)/λ^n (= n!/λ^n, since n is an integer).

7. (a) σu^2 = Var(U) = a11^2 σx^2 + a12^2 σy^2 and σv^2 = Var(V) = a21^2 σx^2 + a22^2 σy^2.

(b) h1(u, v) = (1/D)(a22 u − a12 v) and h2(u, v) = (1/D)(−a21 u + a11 v). And we have Jh(u, v) = 1/D.

(c) ρ = (a11 a21 σx^2 + a12 a22 σy^2)/√((a11^2 σx^2 + a12^2 σy^2)(a21^2 σx^2 + a22^2 σy^2)).

(d) Q(u, v) can be expressed as

Q(u, v) = (1/(1 − ρ^2)) [(u/σu)^2 + (v/σv)^2 − 2ρ uv/(σuσv)].

Thus, the joint density function fU,V(u, v) is the bivariate normal density with parameters µu = µv = 0, together with σu, σv, and ρ as calculated in (a) and (c).

(e) U′ = U + a11µx + a12µy and V′ = V + a21µx + a22µy. Then U′ and V′ have the bivariate normal density with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu, σv, and ρ.

(f) By (d) and (e), (U, V) has a bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu^2 = a11^2 σx^2 + a12^2 σy^2, σv^2 = a21^2 σx^2 + a22^2 σy^2, and

ρ = (a11 a21 σx^2 + a12 a22 σy^2)/√((a11^2 σx^2 + a12^2 σy^2)(a21^2 σx^2 + a22^2 σy^2)).

8. (a) E[U] = E[X] + E[Y] = µ and Var(U) = Var(X) + Var(Y) = σx^2 + σy^2.

(b) By using Theorem C, we can find that X and U have the bivariate normal distribution with means µx = µ and µu = µ, variances σx^2 and σu^2 = σx^2 + σy^2, and ρ = σx/√(σx^2 + σy^2).

(c) E(X | U = t) = µx + ρ(σx/σu)(t − µu) = µ + (σx^2/(σx^2 + σy^2))(t − µ).

(d) E(X | U = 1.3) = 1 + ((1)^2/((1)^2 + (0.5)^2))(1.3 − 1) = 1.24.

18 Order statistics

18.1 Extrema

Let X1, . . . , Xn be independent and identically distributed random variables (iid random variables) with the common cdf F(x) and pdf f(x). Then we can define the random variable V = min(X1, . . . , Xn), the minimum of X1, . . . , Xn. To find the distribution of V, consider the survival function P(V > x) of V and calculate as follows.

P(V > x) = P(X1 > x, . . . , Xn > x) = P(X1 > x) × · · · × P(Xn > x) = [1 − F(x)]^n.

Thus, we can obtain the cdf FV(x) and the pdf fV(x) of V:

FV(x) = 1 − [1 − F(x)]^n   and   fV(x) = d/dx FV(x) = n f(x)[1 − F(x)]^{n−1}.

In particular, if X1, . . . , Xn are independent exponential random variables with the common parameter λ, then the minimum min(X1, . . . , Xn) has the density (nλ)e^{−(nλ)x}, which is again an exponential density function with parameter nλ.

18.2 Order statistics

Let X1, . . . , Xn be iid random variables with the common cdf F(x) and pdf f(x). When we sort X1, . . . , Xn as

X(1) < X(2) < · · · < X(n),

the random variable X(k) is called the k-th order statistic. Then we have the following theorem.

Theorem. The pdf of the k-th order statistic X(k) is given by

fk(x) = (n!/((k − 1)!(n − k)!)) f(x) F^{k−1}(x)[1 − F(x)]^{n−k}.   (33)

Probability of a small interval: the proof of the theorem. If f is continuous at x and δ > 0 is small, then the probability that a random variable X falls into the small interval [x, x + δ] is approximated by δf(x). In differential notation, we can write

P(x ≤ X ≤ x + dx) = f(x) dx.

Now consider the k-th order statistic X(k) and the event that x ≤ X(k) ≤ x + dx. This event occurs when (k − 1) of the Xi's are less than x and (n − k) of the Xi's are greater than x ≈ x + dx. Since there are n!/((k − 1)! 1! (n − k)!) such arrangements of the Xi's, we can obtain the probability of this event as

fk(x) dx = (n!/((k − 1)!(n − k)!)) (F(x))^{k−1} (f(x) dx)[1 − F(x)]^{n−k},

which implies (33).

18.3 Beta distribution

When X1, . . . , Xn are iid uniform random variables on [0, 1], the pdf of the k-th order statistic X(k) becomes

fk(x) = (n!/((k − 1)!(n − k)!)) x^{k−1}(1 − x)^{n−k},   0 ≤ x ≤ 1,

which is known as the beta distribution with parameters α = k and β = n − k + 1.
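The beta distribution of an order statistic can be checked by simulation. The following Python sketch (assumed values n = 5 and k = 2, not from the notes) compares the average of the k-th smallest of n uniform draws with the beta mean k/(n + 1).

import random

random.seed(0)
n, k, trials = 5, 2, 100_000

total = 0.0
for _ in range(trials):
    sample = sorted(random.random() for _ in range(n))
    total += sample[k - 1]          # the k-th smallest value

print(total / trials)               # close to k/(n+1) = 2/6 = 0.333...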

18.4 Exercises

1. Suppose that a system's components are connected in series and have lifetimes that are independent exponential random variables X1, . . . , Xn with respective parameters λ1, . . . , λn.

(a) Let V be the lifetime of the system. Show that V = min(X1, . . . , Xn).

(b) Show that the random variable V is exponentially distributed with parameter Σ_{i=1}^{n} λi.

2. Each component Ai of the following system has a lifetime that is an iid exponential random variable Xi with parameter λ.

[Diagram: a series-parallel system of six components with three parallel branches (A1, A2), (A3, A4), and (A5, A6), each branch consisting of two components in series.]

(a) Let V be the lifetime of the system. Show that V = max(min(X1, X2), min(X3, X4), min(X5, X6)).

(b) Find the cdf and the pdf of the random variable V.

3. A card contains n chips and has an error-correcting mechanism such that the card still functions if a singlechip fails but does not function if two or more chips fail. If each chip has an independent and exponentiallydistributed lifetime with parameter λ, find the pdf of the card’s lifetime.

4. Suppose that a queue has n servers and that the length of time to complete a job is an exponential random variable. If a job is at the top of the queue and will be handled by the next available server, what is the distribution of the waiting time until service? What is the distribution of the waiting time until service of the next job in the queue?

5. Let U and V be iid uniform random variables on [0, 1], and let X = min(U, V) and Y = max(U, V).

(a) We have discussed that the probability P(x ≤ X ≤ x + dx) is approximated by f(x) dx. Similarly, we can find the following formula in the case of a joint distribution.

P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = f(x, y) dx dy.

In particular, we can obtain

P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) = dx dy.

Now by showing that

P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) + P(x ≤ V ≤ x + dx, y ≤ U ≤ y + dy),

justify that the joint density function f(x, y) of X and Y is given by

f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.

(b) Find the conditional expectations E(X |Y = y) and E(Y |X = x).

19 MGF techniques

19.1 Moment generating function

Let X be a random variable. Then the moment generating function (mgf) of X is given by

M(t) = E[e^{tX}].

The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as

M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx.

The most significant property of the moment generating function is that the moment generating function uniquely determines the distribution.

Suppose that X is a standard normal random variable. Then we can compute the mgf of X as follows.

M_X(t) = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2 + t^2/2} dx = e^{t^2/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2} dx = exp(t^2/2).

Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf of Y can be given by

M_Y(t) = e^{µt} M_X(σt) = exp(σ^2 t^2/2 + µt).

Similarly, suppose now that X is a gamma random variable with parameter (α, λ). For t < λ, the mgf M_X(t) of X can be computed as follows.

M_X(t) = ∫_0^∞ e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx
       = (λ^α/(λ − t)^α)(1/Γ(α)) ∫_0^∞ ((λ − t)x)^{α−1} e^{−(λ−t)x} (λ − t) dx
       = (λ/(λ − t))^α (1/Γ(α)) ∫_0^∞ u^{α−1} e^{−u} du = (λ/(λ − t))^α.

19.2 Joint moment generating function

Let X and Y be random variables having a joint density function f(x, y). Then we can define the joint moment generating function M(s, t) of X and Y by

M(s, t) = E[e^{sX+tY}] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} f(x, y) dx dy.

If X and Y are independent, then the joint mgf M(s, t) becomes

M(s, t) = E[e^{sX+tY}] = E[e^{sX}] · E[e^{tY}] = M_X(s) · M_Y(t).

Moreover, it is known that "if the joint mgf M(s, t) of X and Y is of the form M1(s) · M2(t), then X and Y are independent and have the respective mgf's M1(t) and M2(t)."

19.3 Sum of independent random variables

Let X and Y be independent random variables having the respective probability density functions fX(x) and fY(y). Then the cumulative distribution function FZ(z) of the random variable Z = X + Y can be given as follows.

FZ(z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} [∫_{−∞}^{z−x} fX(x) fY(y) dy] dx = ∫_{−∞}^{∞} fX(x) FY(z − x) dx,

where FY is the cdf of Y. By differentiating FZ(z), we can obtain the pdf fZ(z) of Z as

fZ(z) = d/dz FZ(z) = ∫_{−∞}^{∞} fX(x) (d/dz FY(z − x)) dx = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.

The form of integration ∫_{−∞}^{∞} f(x) g(z − x) dx is called the convolution. Thus, the pdf fZ(z) is given by the convolution of the pdf's fX(x) and fY(y). However, the use of the moment generating function makes it easier to "find the distribution of the sum of independent random variables."

Let X and Y be independent normal random variables with the respective parameters (µx, σx^2) and (µy, σy^2). Then the sum Z = X + Y has the mgf

M_Z(t) = M_X(t) · M_Y(t) = exp(σx^2 t^2/2 + µx t) · exp(σy^2 t^2/2 + µy t) = exp((σx^2 + σy^2) t^2/2 + (µx + µy) t),

which is the mgf of the normal distribution with parameter (µx + µy, σx^2 + σy^2). By the uniqueness property of the mgf, we can find that Z is a normal random variable with parameter (µx + µy, σx^2 + σy^2).

Let X and Y be independent gamma random variables with the respective parameters (α1, λ) and (α2, λ). Then the sum Z = X + Y has the mgf

M_Z(t) = M_X(t) · M_Y(t) = (λ/(λ − t))^{α1} · (λ/(λ − t))^{α2} = (λ/(λ − t))^{α1+α2},

which is the mgf of the gamma distribution with parameter (α1 + α2, λ). Thus, Z is a gamma random variable with parameter (α1 + α2, λ).
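The closure of the gamma family under addition can be checked by simulation. The following Python sketch uses assumed parameters α1 = 2, α2 = 3, λ = 0.5 (not from the notes) and compares the sample mean and variance of X + Y with those of a gamma(α1 + α2, λ) distribution.

import numpy as np

alpha1, alpha2, lam = 2.0, 3.0, 0.5
rng = np.random.default_rng(4)

X = rng.gamma(shape=alpha1, scale=1.0 / lam, size=200_000)
Y = rng.gamma(shape=alpha2, scale=1.0 / lam, size=200_000)
Z = X + Y

print(Z.mean(), (alpha1 + alpha2) / lam)        # both about 10
print(Z.var(), (alpha1 + alpha2) / lam**2)      # both about 20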

20 Chi-square, t-, and F -distributions

20.1 Chi-square distribution

The gamma distribution

f(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} e^{−x/2},   x ≥ 0,

with parameter (n/2, 1/2) is called the chi-square distribution with n degrees of freedom.

(a) Let X be a standard normal random variable, and let

Y = X^2

be the square of X. Then we can express the cdf FY of Y as FY(t) = P(X^2 ≤ t) = P(−√t ≤ X ≤ √t) = Φ(√t) − Φ(−√t) in terms of the standard normal cdf Φ. By differentiating the cdf, we can obtain the pdf fY as follows.

fY(t) = d/dt FY(t) = (1/√(2π)) t^{−1/2} e^{−t/2} = (1/(2^{1/2} Γ(1/2))) t^{−1/2} e^{−t/2},

where we use Γ(1/2) = √π. Thus, the square X^2 of X is a chi-square random variable with 1 degree of freedom.

(b) Let X1, . . . , Xn be iid standard normal random variables. Since the Xi^2's have the gamma distribution with parameter (1/2, 1/2), the sum

Y = Σ_{i=1}^{n} Xi^2

has the gamma distribution with parameter (n/2, 1/2). That is, the sum Y has the chi-square distribution with n degrees of freedom.
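The construction in (b) can be checked by simulation. The following Python sketch (with the arbitrary choice n = 7) compares the sample mean and variance of Σ Xi^2 with the chi-square values n and 2n.

import numpy as np

n = 7
rng = np.random.default_rng(5)
Y = (rng.standard_normal((200_000, n)) ** 2).sum(axis=1)

print(Y.mean(), n)        # both about 7
print(Y.var(), 2 * n)     # both about 14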

20.2 Student’s t-distribution

Let X be a chi-square random variable with n degrees of freedom, and let Y be a standard normal random variable. Suppose that X and Y are independent. Then the distribution of the quotient

Z = Y/√(X/n)

is called the Student's t-distribution with n degrees of freedom. The pdf fZ(z) is given by

fZ(z) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z^2/n)^{−(n+1)/2},   −∞ < z < ∞.

20.3 F -distribution

Let U and V be independent chi-square random variables with r1 and r2 degrees of freedom, respectively. Then the distribution of the quotient

Z = (U/r1)/(V/r2)

is called the F-distribution with (r1, r2) degrees of freedom. The pdf fZ(z) is given by

fZ(z) = (Γ((r1 + r2)/2)/(Γ(r1/2)Γ(r2/2))) (r1/r2)^{r1/2} z^{r1/2−1} (1 + (r1/r2)z)^{−(r1+r2)/2},   0 < z < ∞.

20.4 Quotient of two random variables

Let X and Y be independent random variables having the respective pdf's fX(x) and fY(y). Then the cdf FZ(z) of the quotient

Z = Y/X

can be computed as follows.

FZ(z) = P(Y/X ≤ z) = P(Y ≥ zX, X < 0) + P(Y ≤ zX, X > 0)
      = ∫_{−∞}^{0} [∫_{xz}^{∞} fY(y) dy] fX(x) dx + ∫_0^∞ [∫_{−∞}^{xz} fY(y) dy] fX(x) dx.

By differentiating, we can obtain

fZ(z) = d/dz FZ(z) = ∫_{−∞}^{0} [−x fY(xz)] fX(x) dx + ∫_0^∞ [x fY(xz)] fX(x) dx = ∫_{−∞}^{∞} |x| fY(xz) fX(x) dx.

Let X be a chi-square random variable with n degrees of freedom. Then the pdf of the random variable U = √(X/n) is given by

fU(x) = d/dx P(√(X/n) ≤ x) = d/dx FX(nx^2) = 2nx fX(nx^2) = (n^{n/2}/(2^{n/2−1} Γ(n/2))) x^{n−1} e^{−nx^2/2}

for x ≥ 0; otherwise, fU(x) = 0. Suppose that Y is a standard normal random variable and independent of X. Then the quotient

Z = Y/√(X/n)

has a t-distribution with n degrees of freedom. The pdf fZ(z) can be computed as follows.

fZ(z) = ∫_{−∞}^{∞} |x| fY(xz) fU(x) dx = (n^{n/2}/(2^{(n−1)/2} √π Γ(n/2))) ∫_0^∞ x^n e^{−(z^2+n)x^2/2} dx
      = (1/(√(nπ) Γ(n/2))) (1 + z^2/n)^{−(n+1)/2} ∫_0^∞ t^{(n−1)/2} e^{−t} dt = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z^2/n)^{−(n+1)/2}

for −∞ < z < ∞.

Let U and V be independent chi-square random variables with r_1 and r_2 degrees of freedom, respectively. Then the quotient

Z = \frac{U/r_1}{V/r_2}

has the F-distribution with (r_1, r_2) degrees of freedom. By putting Y = U/r_1 and X = V/r_2, we can compute the pdf f_Z(z) as follows:

f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_Y(xz) f_X(x)\, dx = \frac{r_1^{r_1/2} r_2^{r_2/2}}{2^{(r_1+r_2)/2}\,\Gamma(r_1/2)\Gamma(r_2/2)}\, z^{r_1/2-1} \int_{0}^{\infty} x^{(r_1+r_2)/2-1} e^{-(r_1 z + r_2)x/2}\, dx
= \frac{r_1^{r_1/2} r_2^{r_2/2}}{\Gamma(r_1/2)\Gamma(r_2/2)} (r_1 z + r_2)^{-(r_1+r_2)/2}\, z^{r_1/2-1} \int_{0}^{\infty} t^{(r_1+r_2)/2-1} e^{-t}\, dt
= \frac{\Gamma((r_1+r_2)/2)}{\Gamma(r_1/2)\Gamma(r_2/2)} (r_1/r_2)^{r_1/2}\, z^{r_1/2-1} \left( 1 + \frac{r_1}{r_2} z \right)^{-(r_1+r_2)/2}, \quad 0 < z < \infty,

where the second equality uses the substitution t = (r_1 z + r_2)x/2.

21 Sampling distributions

21.1 Population and random sample

The recorded data values x_1, . . . , x_n are typically considered as the observed values of iid random variables X_1, . . . , X_n having a common probability distribution f(x). To judge the quality of the data, it is useful to envisage a population from which the sample is drawn. A random sample is chosen at random from the population to ensure that the sample is representative of the population.

Let X_1, . . . , X_n be iid normal random variables with parameter (µ, σ²). Then the random variable

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

is called the sample mean, and the random variable

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

is called the sample variance. They have the following properties.


1. E[X̄] = µ and Var(X̄) = σ²/n.

2. The sample variance S² can be rewritten as

S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \right).

In particular, we can obtain

E[S^2] = \frac{1}{n-1} \left( \sum_{i=1}^{n} E[(X_i - \mu)^2] - n E[(\bar{X} - \mu)^2] \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} \operatorname{Var}(X_i) - n \operatorname{Var}(\bar{X}) \right) = \sigma^2.

3. X̄ has the normal distribution with parameter (µ, σ²/n), and X̄ and S² are independent. Furthermore,

W_{n-1} = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2

has the chi-square distribution with (n − 1) degrees of freedom.

4. The random variable

T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}}

has the t-distribution with (n − 1) degrees of freedom.

Rough explanation of Property 3. Assuming that X̄ and S² are independent (which is not so easy to prove), we can show that the random variable W_{n−1} has the chi-square distribution with (n − 1) degrees of freedom. By introducing

V = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 \quad\text{and}\quad W_1 = \frac{n}{\sigma^2} (\bar{X} - \mu)^2,

we can write V = W_{n−1} + W_1. Since W_{n−1} and W_1 are independent, the mgf M_V(t) of V can be expressed as

M_V(t) = M_{W_{n-1}}(t) \times M_{W_1}(t),

where M_{W_{n−1}}(t) and M_{W_1}(t) are the mgf's of W_{n−1} and W_1, respectively. Observe that V has the chi-square distribution with n degrees of freedom and that W_1 has the chi-square distribution with 1 degree of freedom. Thus, we can obtain

M_{W_{n-1}}(t) = \frac{M_V(t)}{M_{W_1}(t)} = \frac{(1-2t)^{-n/2}}{(1-2t)^{-1/2}} = (1-2t)^{-(n-1)/2},

which is the mgf of the chi-square distribution with (n − 1) degrees of freedom.
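The following Monte Carlo sketch (sample size, parameters, and replicate count are arbitrary illustrative choices of mine) checks two of these claims numerically: that E[S²] = σ² and that (n − 1)S²/σ² behaves like a chi-square variable with (n − 1) degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n, m = 10.0, 2.0, 8, 100_000     # illustrative parameter choices

samples = rng.normal(mu, sigma, size=(m, n))
s2 = samples.var(axis=1, ddof=1)            # sample variance S^2 for each replicate

print(s2.mean(), sigma**2)                  # E[S^2] should be close to sigma^2

w = (n - 1) * s2 / sigma**2                 # should follow chi-square with n - 1 df
print(stats.kstest(w, stats.chi2(df=n - 1).cdf))
```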

21.2 Sampling distributions under normal assumption

Suppose that the data X_1, . . . , X_n are iid normal random variables with parameter (µ, σ²).

1. The sample mean X̄ is normally distributed with parameter (µ, σ²/n), and the random variable

\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)   (34)

has the standard normal distribution.


2. The random variable

\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}

is distributed as the chi-square distribution with (n − 1) degrees of freedom, and is independent of the random variable (34).

3. The random variable

\frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} \sim t_{n-1}

has the t-distribution with (n − 1) degrees of freedom, where S = √(S²) is called the sample standard deviation.

21.3 Critical points

Given a random variable X, the critical point for level α is defined as the value c satisfying P(X > c) = α.

1. Suppose that a random variable X has the chi-square distribution with n degrees of freedom. Then the critical point of the chi-square distribution, denoted by χ_{α,n}, is defined by

P(X > \chi_{\alpha,n}) = \alpha.

2. Suppose that a random variable Z has the t-distribution with m degrees of freedom. Then the critical point is defined as the value t_{α,m} satisfying

P(Z > t_{\alpha,m}) = \alpha.

Since the t-distribution is symmetric, the critical point t_{α,m} can be equivalently given by

P(Z < -t_{\alpha,m}) = \alpha.

Together we can obtain the expression

P(-t_{\alpha/2,m} \le Z \le t_{\alpha/2,m}) = 1 - P(Z < -t_{\alpha/2,m}) - P(Z > t_{\alpha/2,m}) = 1 - \alpha.
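In practice these critical points are read from tables or computed numerically. A small sketch using scipy.stats, assuming an arbitrary level α = 0.05 and illustrative degrees of freedom:

```python
from scipy import stats

alpha, n, m = 0.05, 10, 15        # level and degrees of freedom (illustrative choices)

# Chi-square critical point: P(X > c) = alpha, i.e. c is the (1 - alpha) quantile.
chi2_crit = stats.chi2(df=n).ppf(1 - alpha)

# t critical points for the one-sided and two-sided statements above.
t_crit = stats.t(df=m).ppf(1 - alpha)             # P(Z > t_crit) = alpha
t_crit_two = stats.t(df=m).ppf(1 - alpha / 2)     # P(-t_crit_two <= Z <= t_crit_two) = 1 - alpha

print(chi2_crit, t_crit, t_crit_two)
```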

21.4 Exercises

1. Consider a sample X_1, . . . , X_n of normally distributed random variables with variance σ² = 5. Suppose that n = 31.

(a) What is the value of c for which P(S² ≤ c) = 0.90?

(b) What is the value of c for which P(S² ≤ c) = 0.95?

2. Consider a sample X_1, . . . , X_n of normally distributed random variables with mean µ. Suppose that n = 16.

(a) What is the value of c for which P(|4(X̄ − µ)/S| ≤ c) = 0.95?

(b) What is the value of c for which P(|4(X̄ − µ)/S| ≤ c) = 0.99?

3. Suppose that X_1, . . . , X_10 are iid normal random variables with parameters µ = 0 and σ² > 0.

(a) Let Z = X_1 + X_2 + X_3 + X_4. Then what is the distribution for Z?


(b) Let W = X_5² + X_6² + X_7² + X_8² + X_9² + X_10². Then what is the distribution for W/σ²?

(c) Describe the distribution for the random variable (Z/2)/√(W/6).

(d) Find the value c for which P(Z/√W ≤ c) = 0.95.


22 Law of large numbers

22.1 Chebyshev’s inequality

We can define the indicator function I_A of a subset A of the real line as

I_A(x) = \begin{cases} 1 & x \in A; \\ 0 & \text{otherwise.} \end{cases}

Let X be a random variable having the pdf f(x). Then we can express the probability

P(X \in A) = \int_{A} f(x)\, dx = \int_{-\infty}^{\infty} I_A(x) f(x)\, dx.

Now suppose that X has the mean µ = E[X] and the variance σ² = Var(X), and that we choose the particular subset

A = \{ x : |x - \mu| > t \}.

Then we can observe that

I_A(x) \le \frac{1}{t^2} (x - \mu)^2 \quad\text{for all } x,

which can be used to derive

P(|X - \mu| > t) = P(X \in A) = \int_{-\infty}^{\infty} I_A(x) f(x)\, dx \le \frac{1}{t^2} \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx = \frac{1}{t^2} \operatorname{Var}(X) = \frac{\sigma^2}{t^2}.

The inequality P(|X − µ| > t) ≤ σ²/t² is called Chebyshev's inequality.
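As a rough numerical illustration (the exponential distribution and the thresholds are arbitrary choices of mine), one can compare an exact tail probability with the Chebyshev bound σ²/t²:

```python
from scipy import stats

# An exponential random variable with rate 0.3: mean 1/0.3, variance 1/0.3**2.
dist = stats.expon(scale=1 / 0.3)
mu, sigma2 = dist.mean(), dist.var()

for t in (2.0, 5.0, 10.0):
    exact = dist.sf(mu + t) + dist.cdf(mu - t)    # exact P(|X - mu| > t)
    bound = sigma2 / t**2                         # Chebyshev bound
    print(t, exact, bound)                        # the exact tail never exceeds the bound
```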

22.2 Weak law of large numbers

Let Y_1, Y_2, . . . be a sequence of random variables. Then we can define the event A_n = {|Y_n − α| > ε} and the probability P(A_n) = P(|Y_n − α| > ε). We say that the sequence Y_n converges to α in probability if we have

\lim_{n \to \infty} P(|Y_n - \alpha| > \varepsilon) = 0

for any choice of ε > 0.

Now let X_1, X_2, . . . be a sequence of iid random variables with mean µ and variance σ². Then for each n = 1, 2, . . ., the sample mean

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

has the mean µ and the variance σ²/n. Given a fixed ε > 0, we can apply Chebyshev's inequality to obtain

P(|\bar{X}_n - \mu| > \varepsilon) \le \frac{\sigma^2/n}{\varepsilon^2} \to 0 \quad\text{as } n \to \infty.

Thus, we have established that

"the sample mean X̄_n converges to µ in probability,"

which is called the weak law of large numbers. A stronger result, "the sample mean X̄_n converges to µ almost surely," was established by Kolmogorov, and is called the strong law of large numbers.
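A short simulation sketch of this convergence (the Uniform(0, 1) summands, the tolerance ε, and the sample sizes are my own illustrative choices): the fraction of sample means farther than ε from µ shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, eps, m = 0.5, 0.05, 10_000        # mean of Uniform(0, 1), tolerance, replicates

for n in (10, 100, 1000):
    xbar = rng.random((m, n)).mean(axis=1)       # m sample means, each from n observations
    freq = np.mean(np.abs(xbar - mu) > eps)      # empirical P(|Xbar_n - mu| > eps)
    bound = (1 / 12) / (n * eps**2)              # Chebyshev bound with sigma^2 = 1/12
    print(n, freq, bound)                        # the frequency decreases toward 0
```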

23 Central limit theorems

23.1 Convergence in distribution

Let X_1, X_2, . . . be a sequence of random variables having the cdf's F_1, F_2, . . ., and let X be a random variable having the cdf F. Then we say that the sequence X_n converges to X in distribution (in short, X_n converges to F) if

\lim_{n \to \infty} F_n(x) = F(x) \quad\text{for every } x \text{ at which } F(x) \text{ is continuous.}   (35)

Note that the convergence in (35) is completely characterized in terms of the distributions F_1, F_2, . . . and F. Recall that the distributions F_1(x), F_2(x), . . . and F(x) are uniquely determined by the respective moment generating functions, say M_1(t), M_2(t), . . . and M(t). Furthermore, we have an "equivalent" version of the convergence in terms of the m.g.f.'s:

Continuity Theorem. If

\lim_{n \to \infty} M_n(t) = M(t) \quad\text{for every } t \text{ (around 0),}   (36)

then the corresponding distributions F_n converge to F in the sense of (35).

Alternative criterion for convergence in distribution. Let Y_1, Y_2, . . . be a sequence of random variables, and let Y be a random variable. Then the sequence Y_n converges to Y in distribution if and only if

\lim_{n \to \infty} E[g(Y_n)] = E[g(Y)]

for every bounded continuous function g.

Slutsky's theorem. Let Z_1, Z_2, . . . and W_1, W_2, . . . be two sequences of random variables, and let c be a constant value. Suppose that the sequence Z_n converges to Z in distribution, and that the sequence W_n converges to c in probability. Then

1. The sequence Z_n + W_n converges to Z + c in distribution.

2. The sequence Z_n W_n converges to cZ in distribution.

3. The sequence Z_n/W_n converges to Z/c in distribution, provided c ≠ 0.

23.2 Central Limit Theorem

Let X_1, X_2, . . . be a sequence of iid random variables having the common distribution F with mean 0 and variance 1. Then

Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i, \quad n = 1, 2, \ldots,

converges to the standard normal distribution.

A sketch of proof. Let M be the m.g.f. for the distribution F, and let M_n be the m.g.f. for the random variable Z_n for n = 1, 2, . . .. Since the X_i's are iid random variables, we have

M_n(t) = E\left[ e^{t Z_n} \right] = E\left[ e^{\frac{t}{\sqrt{n}}(X_1 + \cdots + X_n)} \right] = E\left[ e^{\frac{t}{\sqrt{n}} X_1} \right] \times \cdots \times E\left[ e^{\frac{t}{\sqrt{n}} X_n} \right] = \left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^n.

Observing that M'(0) = E[X_1] = 0 and M''(0) = E[X_1²] = 1, we can apply a Taylor expansion to M(t/√n) around 0:

M\!\left( \frac{t}{\sqrt{n}} \right) = M(0) + \frac{t}{\sqrt{n}} M'(0) + \frac{t^2}{2n} M''(0) + \varepsilon_n(t) = 1 + \frac{t^2}{2n} + \varepsilon_n(t).

Recall that \lim_{n \to \infty} \left( 1 + \frac{a}{n} \right)^n = e^{a}; in a similar manner, we can obtain

\lim_{n \to \infty} M_n(t) = \lim_{n \to \infty} \left[ 1 + \frac{t^2/2}{n} + \varepsilon_n(t) \right]^n = e^{t^2/2},

which is the m.g.f. for the standard normal distribution.
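A brief simulation sketch of the theorem (I use standardized Uniform(0, 1) summands as an arbitrary example of a distribution with mean 0 and variance 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, m = 30, 100_000

# Standardized Uniform(0, 1) summands: mean 0, variance 1.
x = (rng.random((m, n)) - 0.5) / np.sqrt(1 / 12)
z = x.sum(axis=1) / np.sqrt(n)                   # Z_n = (X_1 + ... + X_n) / sqrt(n)

# The empirical cdf of Z_n should be close to the standard normal cdf.
for c in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(z <= c), stats.norm.cdf(c))
```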

23.3 Normal approximation

Let X_1, X_2, . . . be a sequence of iid random variables having the common distribution F with mean µ and variance σ². Then

Z_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}, \quad n = 1, 2, \ldots,

converges to the standard normal distribution.

Let X_1, X_2, . . . , X_n be finitely many iid random variables with mean µ and variance σ². If the size n is adequately large, then the distribution of the sum

Y = \sum_{i=1}^{n} X_i

can be approximated by the normal distribution with parameter (nµ, nσ²). A general rule for "adequately large" n is about n ≥ 30, but the approximation is often good for much smaller n. Similarly, the sample mean

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

has approximately the normal distribution with parameter (µ, σ²/n). The following normal approximations can then be applied to probability computations.

Random variable                    Normal approximation    Probability
Y = X_1 + · · · + X_n              N(nµ, nσ²)              P(a ≤ Y ≤ b) ≈ Φ((b − nµ)/(σ√n)) − Φ((a − nµ)/(σ√n))
X̄ = (X_1 + · · · + X_n)/n          N(µ, σ²/n)              P(a ≤ X̄ ≤ b) ≈ Φ((b − µ)/(σ/√n)) − Φ((a − µ)/(σ/√n))

Example. An insurance company has 10,000 automobile policyholders. The expected yearly claim per policyholder is $240 with a standard deviation of $800. Find approximately the probability that the total yearly claim exceeds $2.7 million. Can you say that such an event is statistically highly unlikely?
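One possible worked sketch of this example (my own calculation, following the table above, so the numbers should be checked): the total claim Y is approximately N(nµ, nσ²) with n = 10,000, µ = 240, and σ = 800, so P(Y > 2,700,000) ≈ 1 − Φ(3.75).

```python
from scipy import stats

n, mu, sigma = 10_000, 240.0, 800.0
total_mean = n * mu                     # 2,400,000
total_sd = sigma * n**0.5               # 800 * 100 = 80,000

# Normal approximation: P(Y > 2.7 million) with Y ~ N(total_mean, total_sd^2)
z = (2_700_000 - total_mean) / total_sd           # = 3.75
print(1 - stats.norm.cdf(z))                      # roughly 9e-5, so yes, highly unlikely
```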


23.4 Normal approximation to binomial distribution

Suppose that X_1, . . . , X_n are iid Bernoulli random variables with mean p = E(X_i) and variance p(1 − p) = Var(X_i). If the size n is adequately large, then the distribution of the sum

Y = \sum_{i=1}^{n} X_i

can be approximated by the normal distribution with parameter (np, np(1 − p)). Thus, the normal distribution N(np, np(1 − p)) approximates the binomial distribution B(n, p). A general rule for "adequately large" n is to satisfy np ≥ 5 and n(1 − p) ≥ 5.

Let Y be a binomial random variable with parameter (n, p), and let Z be a normal random variable with parameter (np, np(1 − p)). Then the distribution of Y can be approximated by that of Z. Since Z is a continuous random variable, the approximation improves when the following continuity correction is applied:

P(i \le Y \le j) \approx P(i - 0.5 \le Z \le j + 0.5) = \Phi\!\left( \frac{j + 0.5 - np}{\sqrt{np(1-p)}} \right) - \Phi\!\left( \frac{i - 0.5 - np}{\sqrt{np(1-p)}} \right).
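A short numerical comparison (the binomial parameters and the range [i, j] are arbitrary illustrative choices of mine) of the exact binomial probability with the continuity-corrected normal approximation:

```python
from math import sqrt
from scipy import stats

n, p, i, j = 40, 0.3, 8, 15                  # illustrative choices (np >= 5 and n(1-p) >= 5)
mean, sd = n * p, sqrt(n * p * (1 - p))

exact = stats.binom(n, p).cdf(j) - stats.binom(n, p).cdf(i - 1)    # exact P(i <= Y <= j)
approx = stats.norm.cdf((j + 0.5 - mean) / sd) - stats.norm.cdf((i - 0.5 - mean) / sd)
print(exact, approx)                         # the two values should be close
```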

23.5 Exercises

1. The actual voltage of a new 1.5-volt battery has the probability density function

f(x) = 5, \quad 1.4 \le x \le 1.6.

Estimate the probability that the sum of the voltages from 120 new batteries lies between 170 and 190 volts.

2. The germination time in days of a newly planted seed has the probability density function

f(x) = 0.3 e^{-0.3x}, \quad x \ge 0.

If the germination times of different seeds are independent of one another, estimate the probability that the average germination time of 2000 seeds is between 3.1 and 3.4 days.

3. Calculate the following probabilities by using normal approximation with continuity correction.

(a) Let X be a binomial random variable with n = 10 and p = 0.7. Find P (X ≥ 8).

(b) Let X be a binomial random variable with n = 15 and p = 0.3. Find P (2 ≤ X ≤ 7).

(c) Let X be a binomial random variable with n = 9 and p = 0.4. Find P (X ≤ 4).

(d) Let X be a binomial random variable with n = 14 and p = 0.6. Find P (8 ≤ X ≤ 11).
