
STAT 515, Statistics I

JIANYU CHEN

DEPARTMENT OF MATHEMATICS AND STATISTICS

UNIVERSITY OF MASSACHUSETTS AMHERST

EMAIL: [email protected]

Acknowledgement

Some contents of the lecture notes are based on Spring 2016 course notes by Prof. Yao Li and Dr. Zheng Wei.

Contents

1 Introduction
2 Probability
  2.1 Set Notation
  2.2 Probabilistic Model - Discrete Case
  2.3 Calculating the Probability of an Event - The Sample-Point Method
  2.4 Tools for Counting Sample Points
    2.4.1 Multiplication Principle
    2.4.2 Permutation
    2.4.3 Combination
  2.5 Conditional Probability and Independence
    2.5.1 Conditional Probability
    2.5.2 Independence of Events
    2.5.3 Multiplicative Law of Probability
  2.6 Calculating the Probability of an Event - The Event-Composition Method
  2.7 The Law of Total Probability and Bayes' Rule
3 Discrete Random Variables
  3.1 Definition of Random Variables
  3.2 Probability Distribution of a Discrete Random Variable
  3.3 The Expected Value of a Random Variable, or a Function of a Random Variable
  3.4 Well-known Discrete Probability Distributions
    3.4.1 Uniform Distribution
    3.4.2 Binomial Distribution
    3.4.3 Geometric Distribution
    3.4.4 Hypergeometric Distribution
    3.4.5 Poisson Distribution
  3.5 Tchebysheff's Theorem
4 Continuous Random Variables
  4.1 The Characteristics of a Continuous Random Variable
  4.2 Well-known Continuous Probability Distributions
    4.2.1 Uniform Distribution on an Interval
    4.2.2 Gaussian/Normal Distribution
    4.2.3 Gamma Distribution
  4.3 Tchebysheff's Theorem (Revisited)
5 Multivariate Probability Distributions
  5.1 Bivariate Probability Distributions
    5.1.1 Discrete Case
    5.1.2 Continuous Case
  5.2 Marginal and Conditional Probability Distributions
    5.2.1 Discrete Case
    5.2.2 Continuous Case
  5.3 Independence of Random Variables
  5.4 Expected Value of a Function of Random Variables
  5.5 Covariance of Random Variables
  5.6 Conditional Expectations
6 Functions of Random Variables
  6.1 The Method of Distribution Functions
  6.2 The Method of Transformations
  6.3 The Method of Moment-generating Functions
7 Sampling Distributions
  7.1 Normally Distributed Sample Mean
  7.2 The Central Limit Theorem
  7.3 The Normal Approximation to the Binomial Distribution


1 Introduction

Statistics is a theory of information, with inference making as its objective.

Data ⇒ Probability measurement ⇒ Inference/Prediction

Example 1.1 Based on the latest presidential election polls, who is going to be the next US president?

Example 1.2 Roll a six-sided die and observe the number on the upper face. One can ask the following questions:

(1) What is the probability of getting a 1? A 2? A 5?

(2) What is the probability of getting an odd number?

(3) What is the probability of getting a 1, if the observed number is known to be odd?

(4) What is the mean value of the observed number?

2 Probability

2.1 Set Notation

• Set: a collection of elements, denoted by capital letters, like A, B, etc.

• A = {a1, a2, . . . }: the elements of A are a1, a2, . . . .

• ∅: the empty set - a set that contains no elements;

• S: the universal/full set - a set of all elements under consideration;

• A ⊂ B: A is a subset of B, that is, every element in A is also in B.

• A ∪ B: union of A and B, that is,

A ∪ B = {x ∈ S : x ∈ A or x ∈ B}.

We say that A and B are exhaustive, if A ∪ B = S .

• A ∩ B: intersection of A and B, that is,

A ∩ B = {x ∈ S : x ∈ A and x ∈ B}.

We say that A and B are mutually exclusive or mutually disjoint, if A ∩ B = ∅.


• A^c or S\A (the overline notation is also common): the complement of A, that is,

A^c = {x ∈ S : x ∉ A}.

We also denote the complement of A in B by

B\A = B ∩ A^c = {x ∈ B : x ∉ A}.

We can visualize all these operations via Venn diagrams.

Exercise 2.1 S = {0, 1, · · · , 9}. A = {1, 2, 3, 4, 5}. B = {3, 5}. C = {5, 7, 9}.

(1) Find A ∪ C , A ∩ C , B ∪ C and B ∩ C .

(2) Find A^c, C^c, A\B, B\C.

Proposition 2.2

• Distributive law:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

• DeMorgan's law: (A ∩ B)^c = A^c ∪ B^c,   (A ∪ B)^c = A^c ∩ B^c

Proof By definition. Exercise.

2.2 Probabilistic Model - Discrete Case

Definition 2.3 An experiment is the process by which an observation is made.

Definition 2.4 An event is an outcome of an experiment.

• A simple event is an event that cannot be decomposed, and happens in only one way. It is usually denoted by E with some subscript.

• A compound event is an event that can be decomposed into other events, and can happen in more than one distinct way.

Definition 2.5 We refer to a distinct outcome as a sample point. A sample space S associated with an experiment is the set consisting of all possible sample points.

Mathematically, events can be represented by sets.


Example 2.6 Below are some events associated with a single toss of a six-sided die:

• E1 : Observe a 1. ⇔ E1 = {1}.

• E2 : Observe a 2. ⇔ E2 = {2}.

• E3 : Observe a 3. ⇔ E3 = {3}.

• E4 : Observe a 4. ⇔ E4 = {4}.

• E5 : Observe a 5. ⇔ E5 = {5}.

• E6 : Observe a 6. ⇔ E6 = {6}.

• A: Observe an odd number. ⇔ A = {1, 3, 5}.

• B: Observe a number less than 5. ⇔ B = {1, 2, 3, 4}.

• C : Observe a 2 or a 3. ⇔ C = {2, 3}.

• S: All possible observations. ⇔ S = {1, 2, 3, 4, 5, 6}.

It is easy to see that

(1) A simple event corresponds to a sample point, that is, a simple event can be represented by a set consisting of exactly one sample point. Therefore, two distinct simple events are mutually exclusive.

(2) Any compound event can be written as a union of simple events; for instance, in the above example, A = E1 ∪ E3 ∪ E5.

(3) Any event A is a subset of the sample space S .

Definition 2.7 A discrete sample space is one that contains finitely many or countably many sample points.

Definition 2.8 (Probability) Let S be a sample space associated with some experiment. To every event A in S (A is a subset of S), we assign a number, P(A), called the probability of A, so that the following axioms hold:

Axiom 1: P(A) ≥ 0;

Axiom 2: P(S) = 1;

Axiom 3: If A1, A2, A3, . . . form a sequence of mutually exclusive events in S (that is, Ai ∩ Aj = ∅ if i ≠ j), then

P(A1 ∪ A2 ∪ A3 ∪ . . . ) = ∑_{i=1}^∞ P(Ai).


Example 2.9 Let Ei be the event of observing i in a single toss of a fair die; then we assign P(Ei) = 1/6. For an arbitrary event A, we can decompose A = E_{i1} ∪ E_{i2} ∪ . . . and then assign P(A) by Axiom 3. For instance, if A is the event of observing an odd number, then A = E1 ∪ E3 ∪ E5, and hence

P(A) = P(E1) + P(E3) + P(E5) = 1/6 + 1/6 + 1/6 = 1/2.

Under this assignment, Axiom 2, that is, P(S) = 1, is satisfied.

Exercise 2.10 Every person's blood type is A, B, AB, or O. In addition, each individual either has the Rhesus (Rh) factor (+) or does not (−). A medical technician records a person's blood type and Rh factor.

(1) List the sample space for this experiment.

(2) Of the volunteers in a blood center, 1 in 3 have O+, 1 in 15 have O−, 1 in 3 have A+, and 1 in 16 have A−. What is the probability that a randomly chosen volunteer has (a) type O blood? (b) neither type A nor type O blood?

Proposition 2.11 (Basic Properties of Probability)

(1) Law of Complement: For any event A, P(A^c) = 1 − P(A). Consequently,

(a) P(∅) = 0; (b) for any event A, P(A) ≤ 1.

(2) For any events A and B,

P(B) = P(B ∩ A) + P(B\A).

In particular, if A ⊂ B, then P(A) ≤ P(B).

(3) Additive Law: For any events A and B,

P(A ∪ B) = P(A) + P(B)− P(A ∩ B).

Proof Exercise.

Exercise 2.12 If A and B are exhaustive, P(A) = 0.7 and P(B) = 0.9, find P(A ∩ B), P(A\B), P(B\A) and P(A^c ∪ B^c).

Solution Since A ∪ B = S , then P(A ∪ B) = P(S) = 1.

P(A ∩ B) = P(A) + P(B)− P(A ∪ B) = 0.7 + 0.9− 1 = 0.6;

P(A\B) = P(A)− P(A ∩ B) = 0.7− 0.6 = 0.1;


P(B\A) = P(B)− P(B ∩ A) = 0.9− 0.6 = 0.3;

P(A^c ∪ B^c) = P((A ∩ B)^c) = 1 − P(A ∩ B) = 1 − 0.6 = 0.4.

Exercise 2.13 For any two events A and B, prove that

P(A ∩ B) ≥ 1 − P(A^c) − P(B^c).

2.3 Calculating the Probability of an Event - The Sample-Point Method

As shown in Example 2.9, once the probability is assigned to each simple event in a discrete sample space, we can determine the probability of any event. More precisely,

Method 2.14 (Sample-Point Method)

(1) Define the experiment and determine the sample space S .

(2) Assign reasonable probabilities to the sample points in S such that P(Ei) ≥ 0 and ∑_i P(Ei) = 1.

(3) Any event A is a union of simple events, that is, A = E_{i1} ∪ E_{i2} ∪ . . . ; then

P(A) = ∑_j P(E_{ij}).

Example 2.15 A balanced coin is tossed three times. Calculate the probability that exactly two of the three tosses result in heads.

Solution Denote the head by H, and the tail by T. Then the three-time tossing of a balanced coin has 8 sample points, or equivalently, 8 simple events:

E1 = {HHH}, E2 = {HHT}, E3 = {HTH}, E4 = {HTT},

E5 = {THH}, E6 = {THT}, E7 = {TTH}, E8 = {TTT}.

Since the coin is balanced, each simple event is equiprobable, that is, P(Ei) = 1/8. Let A denote the event that exactly two of the three tosses result in heads, that is,

A = {HHT, HTH, THH} = E2 ∪ E3 ∪ E5,

and hence

P(A) = P(E2) + P(E3) + P(E5) = 1/8 + 1/8 + 1/8 = 3/8.
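This sample-point count can also be checked by brute force. Below is a minimal R sketch (object names are illustrative) that enumerates the 8 equally likely outcomes and computes the proportion with exactly two heads:

outcomes <- expand.grid(toss1 = 0:1, toss2 = 0:1, toss3 = 0:1)   # 1 = head, 0 = tail
heads <- rowSums(outcomes)    # number of heads in each of the 8 outcomes
mean(heads == 2)              # 3/8 = 0.375, matching the sample-point method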


Remark 2.16 The sample-point method is very effective when every simple event has equal chance. In such a case, P(Ei) = 1/N, where N is the sample size (the total number of sample points). For an event A that contains exactly N_A sample points, we must have P(A) = N_A / N.

Exercise 2.17 A vehicle arriving at an intersection can turn right, turn left or go straight. An experiment consists of observing two vehicles moving through the intersection. Assuming equally likely sample points, what is the probability that at least one vehicle turns? What is the probability that at most one vehicle turns?

2.4 Tools for Counting Sample Points

2.4.1 Multiplication Principle

Tool 2.18 (Multiplication Principle) With m elements a1, . . . , am and n elements b1, . . . , bn, there are mn = m × n pairs of elements of the product type (ai, bj).

More generally, with m_i elements a^i_1, . . . , a^i_{m_i}, for i = 1, 2, . . . , k, there are m_1 m_2 . . . m_k k-tuples of elements of the product type (a^1_{j_1}, a^2_{j_2}, . . . , a^k_{j_k}).

Example 2.19 An experiment involves flipping a coin and rolling a 6-sided die simultaneously. Then each sample point is of the form (∗, i), where ∗ is either H or T, and i ranges from 1 to 6. So there are N = 2 × 6 = 12 sample points in total.

If we further assume that both the coin and the die are fair, then all sample points have equal chance. Let A be the event that we flip a head and roll an even number, that is,

A = {(∗, i) : ∗ = H, i = 2, 4, 6}.

Then N_A = 1 × 3 = 3, and so P(A) = N_A / N = 3/12 = 0.25.

Example 2.20 Flip a coin 10 times. Then there are N = 2^10 sample points in total.

Exercise 2.21 How many subsets are there for a set of 10 elements? (Hint: This is in some sense the same as flipping a coin 10 times. A subset A is formed by determining whether the i-th element belongs to A.)


2.4.2 Permutation

Recall that n! = 1 × 2 × ⋯ × n, and 0! = 1 by convention.

Definition 2.22 (Permutation)

• A permutation of n distinct objects is an ordered arrangement of them. Choosing objects one by one, we find that there are

n× (n− 1)× (n− 2)× · · · × 1 = n!

permutations of n distinct objects in total.

• Let 1 ≤ r ≤ n. The number of permutations of r objects chosen from n distinct ones is given by

n × (n − 1) × ⋯ × (n − r + 1) = n! / (n − r)! =: P^n_r.

Note that P^n_0 = 1 and P^n_n = n!.

Example 2.23 Randomly choose a president, a vice president and a secretary in a 50-person club. Since each job is distinct, there should be

P^50_3 = 50 · 49 · 48 = 117,600

possible choices.

Example 2.24 (Birthday Problem) There are n people in a class. Ignoring leap years, twins, and seasonal and weekday variation, assume that the 365 possible birthdays are equally likely. What is the probability that at least two people in the classroom have the same birthday?

On the one hand, each sample point is a list of n dates out of 365, so the sample size is N = 365^n by the multiplication principle. On the other hand, let A be the event that at least two people have the same birthday; then the complement A^c is the event that all n people have distinct birthdays. In other words, A^c contains all permutations of n dates from 365 dates, and thus N_{A^c} = P^365_n. So

P(A) = 1 − P(A^c) = 1 − N_{A^c} / N = 1 − P^365_n / 365^n = 1 − 365! / (365^n (365 − n)!).

However, this formula is not easy to evaluate. Alternatively, we write

P(A) = 1 − (365 × 364 × ⋯ × (365 − n + 1)) / 365^n = 1 − ∏_{k=0}^{n−1} (1 − k/365).


Using the Taylor expansion of e^x, we get

e^x = 1 + x + x^2/2 + x^3/3! + ⋯ ≈ 1 + x, for x close to 0.

Thus, taking x = −k/365 in each factor, we can approximate P(A) as

P(A) ≈ 1 − ∏_{k=0}^{n−1} e^(−k/365) = 1 − e^(−(1/365) ∑_{k=0}^{n−1} k) = 1 − e^(−n(n−1)/730) ≈ 1 − e^(−n^2/730).

Thus, we have

• if there are n = 23 people, then P(A) ≈ 50.7% > 50%;

• if there are n = 40 people, then P(A) ≈ 89.1%;

• if there are n = 50 people, then P(A) ≈ 97%.

See more details in https://en.wikipedia.org/wiki/Birthday_problem .
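The exact product formula is also easy to evaluate numerically. A small R sketch (the function name p_birthday is illustrative; base R also ships a similar pbirthday()):

p_birthday <- function(n) 1 - prod(1 - (0:(n - 1)) / 365)   # exact P(at least one shared birthday)
p_birthday(23)          # about 0.507
p_birthday(50)          # about 0.970
1 - exp(-23^2 / 730)    # the rough approximation above, about 0.52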

2.4.3 Combination

Definition 2.25 (Combination)

• The number of combinations of r objects chosen from n distinct ones is given by

C^n_r = (n choose r) := P^n_r / r! = n! / (r!(n − r)!).

The difference between permutation and combination is that permutation cares about the ordering of the r objects, while combination does not.

• In general, the number of ways of partitioning n distinct objects into k distinct groups containing n1, n2, . . . , nk objects, where ∑_{i=1}^k ni = n, is given by the multinomial coefficient

(n choose n1 n2 . . . nk) = n! / (n1! n2! . . . nk!).¹

Example 2.26 There are 3 basketballs, 2 footballs and 5 volleyballs. Align all these balls in a row; how many possible arrangements are there?

Suppose that the 10 positions in the row are labelled; then we just need to determine which group the ball in each position belongs to. In other words, we partition 10 objects into 3 distinct groups containing 3, 2, 5 objects, so there are

(10 choose 3 2 5) = 10! / (3! 2! 5!) = 2,520 arrangements.

¹ To understand this formula, take all n! permutations of the n objects and divide them in order into k distinct groups containing n1, n2, . . . , nk objects. This covers every partition with the permutations inside each subgroup repeated, that is, the same combination in the i-th group is counted ni! times.


Exercise 2.27 (Binomial and Multinomial Expansion)

(x + y)^n = ∑_{i=0}^n (n choose i) x^i y^(n−i),

(x1 + x2 + ⋯ + xk)^n = ∑_{n1+n2+⋯+nk=n} (n choose n1 n2 . . . nk) x1^n1 x2^n2 ⋯ xk^nk.

Use the above expansions to show that

∑_{i=0}^n (n choose i) = 2^n,   ∑_{n1+n2+⋯+nk=n} (n choose n1 n2 . . . nk) = k^n.

Example 2.28 A balanced coin is tossed 100 times. What is the probability that exactly 50 of the 100 tosses result in heads?

There are 2^100 sample points in total. The number of ways to get exactly 50 heads in the 100 tosses is (100 choose 50), so the probability is

(100 choose 50) / 2^100 ≈ 0.0796.
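This number is easy to verify in R with the base choose() function; equivalently, dbinom() (the binomial pdf of Section 3.4.2) gives the same value:

choose(100, 50) / 2^100              # sample-point computation, about 0.0796
dbinom(50, size = 100, prob = 0.5)   # same probability via the binomial pdf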

Example 2.29 Lottery: From the numbers 1, 2, . . . , 44, a person may pick any six numbers for a ticket. The winning numbers are then randomly chosen from the 44 numbers. How many possible tickets are there in the following situations?

(1) Ordered and with replacement: 44^6;

(2) Ordered and without replacement: P^44_6;

(3) Unordered and without replacement: C^44_6 = (44 choose 6) = 7,059,052.

2.5 Conditional Probability and Independence

2.5.1 Conditional Probability

The probability of an event will sometimes depend on whether we know that other events have occurred.

Example 2.30 The (unconditional) probability of a 1 in the toss of a fair die is 1/6. If we know that an odd number is obtained, then the number on the upper face of the die must be 1, 3, or 5, and thus the conditional probability of a 1 given that information will instead be 1/3.


Definition 2.31 The conditional probability of an event A, given that an event B has occurred, is equal to

P(A|B) = P(A ∩ B) / P(B),

provided P(B) > 0. The symbol P(A|B) is read "probability of A given B". Note that the occurrence of the event B allows us to update the probability of A.

Example 2.32 Suppose a fair die is tossed once. The probability of a 1, given that an odd number was obtained, is then

P(E1|A) = P(E1 ∩ A) / P(A) = P(E1) / P(E1 ∪ E3 ∪ E5) = P(E1) / (P(E1) + P(E3) + P(E5)) = (1/6) / (1/6 + 1/6 + 1/6) = 1/3.

The probability of a 2, given that an odd number was obtained, is then

P(E2|A) = P(E2 ∩ A) / P(A) = P(∅) / P(A) = 0.

Remark 2.33

• if two events A and B are mutually exclusive, that is, A ∩ B = ∅, then P(A|B) = P(B|A) = 0;

• if A ⊂ B, then P(B|A) = 1, that is, B definitely happens given that A happens.

Example 2.34 If two events A and B are such that P(A) = 0.5, P(B) = 0.4, and P(A ∩ B) = 0.3, then

P(A|B) = P(A ∩ B) / P(B) = 0.3/0.4 = 0.75,   P(A|A ∩ B) = 1,

P(A|A ∪ B) = P(A) / P(A ∪ B) = P(A) / (P(A) + P(B) − P(A ∩ B)) = 0.5 / (0.5 + 0.4 − 0.3) = 5/6.

Example 2.35 A certain population of employees is taking job competency exams. The following table shows the percentage passing or failing the exam, listed according to sex.

         Male    Female    Total
Pass      24%      40%       64%
Fail      16%      20%       36%
Total     40%      60%      100%


Choose an employee randomly.

• Given that the chosen employee is female, the probability that she passes is

P(Pass|Female) = P(Pass ∩ Female) / P(Female) = 40% / 60% = 2/3.

• Given that the chosen employee has passed the exam, the probability that the employee is male is

P(Male|Pass) = P(Male ∩ Pass) / P(Pass) = 24% / 64% = 3/8.

2.5.2 Independence of Events

If the probability of an event A is unaffected by the occurrence or nonoccurrence of another event B, then we say that the events A and B are independent.

Definition 2.36 Two events A and B are said to be independent if

P(A ∩ B) = P(A)P(B),

or equivalently, if P(A|B) = P(A), or if P(B|A) = P(B).

Otherwise, the events are said to be dependent.

Example 2.37 Consider the following events in the toss of a die:

A: Observe an odd number —— A = {1, 3, 5};

B: Observe an even number —— B = {2, 4, 6};

C: Observe a 1 or 2 —— C = {1, 2}.

Note that P(A) = P(B) = 1/2, and P(C) = 1/3.

• A and B are dependent (in fact, mutually exclusive), since

P(A ∩ B) = P(∅) = 0,   P(A)P(B) = (1/2) · (1/2) = 1/4.

• A and C are independent, since

P(A ∩ C) = P({1}) = 1/6,   P(A)P(C) = (1/2) · (1/3) = 1/6.

Proposition 2.38

(1) If two events A and B are independent, then the following pairs of events are also independent: (a) A^c and B; (b) A and B^c; (c) A^c and B^c.


(2) The events A and A^c are dependent if 0 < P(A) < 1.

(3) If an event A is self-independent, that is, A is independent of itself, then P(A) = 0 or 1.

Proof

(1) P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(B^c). Similar arguments apply to the other pairs.

(2) P(A ∩ A^c) = P(∅) = 0, but P(A)P(A^c) = P(A)[1 − P(A)] > 0.

(3) If A is self-independent, then P(A) = P(A ∩ A) = P(A)^2, so P(A) = 0 or 1.

Definition 2.39 A collection of events A1, A2, . . . , An is said to be mutually independent if for any distinct indices i1, i2, . . . , ik ∈ {1, . . . , n}, we have

P(A_{i1} ∩ ⋯ ∩ A_{ik}) = P(A_{i1}) ⋯ P(A_{ik}).

There are (2^n − n − 1) identities to check for the mutual independence of n events, which is usually too complicated when n is large. Nevertheless, in most applications, you can use common sense to determine that some events are mutually independent.

2.5.3 Multiplicative Law of Probability

Theorem 2.40 (Multiplicative Law of Probability) The probability of the intersection of two events A and B is

P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B).

If A and B are independent, then

P(A ∩ B) = P(A)P(B).

More generally, the probability of the intersection of n events A1,A2,A3, . . . ,An is

P(A1∩A2∩A3∩· · ·∩An) = P(A1)P(A2|A1)P(A3|A1∩A2) . . .P(An|A1∩A2∩· · ·∩An−1).

Example 2.41 Suppose that A and B are two independent events with P(A) = 0.5 and P(B) = 0.4; then

• P(A ∩ B) = P(A)P(B) = 0.5× 0.4 = 0.2;


• P(A ∪ B) = P(A) + P(B)− P(A ∩ B) = 0.5 + 0.4− 0.2 = 0.7;

• P(A|B) = P(A ∩ B) / P(B) = 0.2/0.4 = 0.5.

Exercise 2.42 A smoke detector system uses two devices, A and B. If smoke is present, the probability that it will be detected by device A is 0.95; by device B, 0.90; by both, 0.88.

(a) If smoke is present, find the probability that it will be detected by at least one of the devices.

(b) Given that device A detects the smoke, what is the probability that device B does not?

2.6 Calculating the Probability of an Event - The Event-Composition Method

Method 2.43 (Event-Composition Method)

(1) Define the experiment and determine the sample space S .

(2) Express the event A as a composition of two or more events with known probability, using unions, intersections, and complements.

(3) Compute P(A) by the Additive Law, the Multiplicative Law, the Law of Complements, conditional probability, independence, etc.

Example 2.44 The odds are two to one that, when Alice and Bob play tennis, Alice wins. Suppose that Alice and Bob play two matches, and the result of the first match does not affect the second. What is the probability that Alice wins at least one match?

Denote by Ai the event that Alice wins the i-th match, i = 1, 2; then P(Ai) = 2/3. Note that A1 and A2 are independent, and so the probability that Alice wins at least one match is

P(A1 ∪ A2) = P(A1) + P(A2)− P(A1 ∩ A2)

= P(A1) + P(A2)− P(A1)P(A2)

= 2/3 + 2/3− 2/3 · 2/3 = 8/9.

Example 2.45 It is known that a patient with a certain disease will respond to treatment with probability equal to 0.6. If five patients with the disease are treated and respond independently, find the probability that at least one will respond.


Let A be the event that at least one patient will respond, and Bi be the event that the i-th patient will not respond, i = 1, 2, . . . , 5. Then

P(Bi) = 1 − P(Bi^c) = 1 − 0.6 = 0.4.

Note that

A^c = B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5.

Since B1, B2, . . . , B5 are mutually independent,

P(A) = 1 − P(A^c) = 1 − P(B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5)

= 1 − P(B1)P(B2) . . . P(B5)

= 1 − 0.4^5 ≈ 0.9898.

Example 2.46 Observation of a waiting line at a local bank indicates that the probability that a new arrival will be a VIP member is p = 1/5. Find the probability that the r-th client is the first VIP. (Assume that the statuses of arriving clients represent independent events.)

Let Ei be the event that the i-th client is a VIP, and Ar be the event that the r-th client is the first VIP; then

Ar = E1^c ∩ E2^c ∩ ⋯ ∩ E_{r−1}^c ∩ Er.

Note that P(Ei) = p and P(Ei^c) = 1 − p, and E1^c, E2^c, . . . , E_{r−1}^c, Er are mutually independent. Thus,

P(Ar) = P(E1^c) ⋯ P(E_{r−1}^c) P(Er) = (1 − p)^(r−1) p = 0.2 · 0.8^(r−1).

2.7 The Law of Total Probability and Bayes’ Rule

The event-composition approach is sometimes facilitated by viewing the sample space S as a union of mutually exclusive subsets.

Definition 2.47 If a collection of events B1,B2, . . . ,Bk is such that

• B1, B2, . . . , Bk are mutually exclusive, that is, Bi ∩ Bj = ∅ if i ≠ j;

• B1,B2, . . . ,Bk are exhaustive, that is, B1 ∪ B2 ∪ · · · ∪ Bk = S ,

then the collection of events {B1,B2, . . . ,Bk} is called a partition of sample space S .

Given a partition {B1,B2, . . . ,Bk} of S , any event A can be decomposed as

(2–1) A = (A ∩ B1) ∪ (A ∩ B2) ∪ · · · ∪ (A ∩ Bk),

and note that A ∩ B1, . . . ,A ∩ Bk are mutually exclusive.


Theorem 2.48 (Law of Total Probability) Given a partition {B1, B2, . . . , Bk} of S such that P(Bi) > 0 for all i = 1, . . . , k, any event A has probability

P(A) = ∑_{i=1}^k P(Bi) P(A|Bi).

Proof This follows directly from (2–1), Axiom 3 in the definition of probability, and the definition of conditional probability.

As a direct consequence, we have the so-called Bayes’ rule as a corollary.

Corollary 2.49 (Bayes' Rule) Given a partition {B1, B2, . . . , Bk} of S such that P(Bi) > 0 for all i = 1, . . . , k, and an event A with P(A) > 0, we have

P(Bj|A) = P(Bj) P(A|Bj) / ∑_{i=1}^k P(Bi) P(A|Bi).

Example 2.50 A population of voters contains 40% Republicans and 60% Democrats. It is reported that 30% of the Republicans and 70% of the Democrats favor an election issue.

(1) If a voter is randomly chosen, what is the probability that he or she will favor this election issue?

(2) If a randomly chosen voter is found to favor the issue, what is the probability that this person is a Democrat?

Let R or D be the event that a voter is Republican or Democrat respectively, and let F be the event that the voter favors the issue. Note that {R, D} is a partition of the sample space S.

(1) By the Law of Total Probability,

P(F) = P(R)P(F|R)+P(D)P(F|D) = 40%×30%+60%×70% = 0.12+0.42 = 0.54.

(2) By Bayes' Rule, the probability that a voter is a Democrat given that he or she favors the issue is

P(D|F) = P(D)P(F|D) / (P(R)P(F|R) + P(D)P(F|D)) = (60% × 70%) / (40% × 30% + 60% × 70%) = 0.42/0.54 = 7/9.
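Both computations are one-liners in R; the variable names below are illustrative:

p_R <- 0.40; p_D <- 0.60           # prior probabilities of Republican / Democrat
p_F_R <- 0.30; p_F_D <- 0.70       # P(favor | R) and P(favor | D)
p_F <- p_R * p_F_R + p_D * p_F_D   # law of total probability: 0.54
p_D * p_F_D / p_F                  # Bayes' rule: 7/9, about 0.778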


Exercise 2.51 A package, say P1, of 24 balls contains 8 green, 8 white, and 8 purple balls. Another package, say P2, of 24 balls contains 6 green, 6 white and 12 purple balls. One of the two packages is selected at random.

(1) If 3 balls are selected from this package (with replacement) and all 3 are purple, compute the conditional probability that package P2 was selected.

(2) If 3 balls are selected from this package (with replacement) and they are 1 green, 1 white, and 1 purple, compute the conditional probability that package P2 was selected.

3 Discrete Random Variables

3.1 Definition of Random Variables

We start with the following intuitive example.

Example 3.1 (Opinion Poll) In an opinion poll, we decide to ask 100 people whether they agree or disagree with a certain bill.

If we record a "1" for agree and a "0" for disagree, then the sample space S for this experiment has 2^100 sample points, each an ordered string of 1's and 0's of length 100. It is tedious to list all sample points.

Suppose our interest is the number of people who agree out of 100. If we define the variable Y to be the number of 1's recorded out of 100, then the new sample space for Y is the set SY = {0, 1, . . . , 100}.

It frequently occurs that we are mainly interested in some function of the outcomes as opposed to the outcome itself. In the example above, what we do is define a new variable Y, the quantity of interest.

Definition 3.2 A random variable (RV) Y is a function from S into R, where R is the set of real numbers. Notationally, we write Y : S → R, which assigns each sample point s ∈ S to a real number y = Y(s).

We usually use capital letters from late in the alphabet, like X, Y, Z, W, to denote a random variable, and corresponding letters in lower case, like x, y, z, w, to denote a value that the random variable may assume. Also, we denote

(Y = y) = {s ∈ S : Y(s) = y},


that is, (Y = y) is the set of all sample points assigned to the value y by the RV Y. Similarly, we denote

(Y > y) = {s ∈ S : Y(s) > y},

that is, (Y > y) is the set of all sample points assigned to a value greater than y by the RV Y.

Example 3.3 (Opinion Poll) Recall that

S = {s = a1a2 . . . a100 : ai = 0 or 1},

and Y is the random variable that represents the number of people who agree out of 100, that is,

Y(s) = Y(a1 a2 . . . a100) = a1 + a2 + ⋯ + a100 = ∑_{i=1}^{100} ai.

Suppose that the minimum number needed to pass the bill is 60; then we need to consider the chance of the event (Y ≥ 60).

If f : R → R is a function from R to itself, and Y is a random variable over some sample space S, then we can define a new random variable Z = f(Y) given by Z(s) = f ◦ Y(s) = f(Y(s)).

Example 3.4 At a 100-person club dinner, each person must order exactly one of the two choices: beef or fish. Let Y be the number of people who choose beef.

Assume that the price is $15 for the beef dish, and $18 for the fish. Then the total bill is a random variable Z given by

Z = 15Y + 18(100 − Y).

Definition 3.5 A RV Y is said to be discrete if R(Y), the range of Y - the set of all possible values for Y - is either finite or countably infinite.

For instance, the random variables in all the above examples have finite ranges, hence are discrete.

3.2 Probability Distribution of a Discrete Random Variable

Definition 3.6 (Probability distribution function) Given a discrete random variable Y, the function given by

pY(y) := P(Y = y),


for all possible values y of Y , is called the probability distribution function (pdf) of Y .

If Y is the only random variable under consideration, we shall drop the subscript and simply denote the pdf by p(y).

Example 3.7 A supervisor in a manufacturing plant has three men and three women working for him. He wants to choose two workers randomly for a special job. Let Y denote the number of women in his selection. Find the probability distribution for Y and represent it by a formula, a table, and a graph.

The total number of choices is (6 choose 2), and the number of choices with y women workers, where y = 0, 1, 2, is given by (3 choose y)(3 choose 2−y). Thus,

p(y) = P(Y = y) = (3 choose y)(3 choose 2−y) / (6 choose 2) = [3! / (y!(3−y)!)] · [3! / ((2−y)!(1+y)!)] / [6! / (2!4!)] = 36 / (15 · y!(1+y)!(2−y)!(3−y)!),

so

p(0) = 3/15 = 0.2,   p(1) = 9/15 = 0.6,   p(2) = 3/15 = 0.2.

The probability distribution can then be represented by the following table:

y     p(y)
0     0.2
1     0.6
2     0.2
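A quick numerical check in R with the base choose() function (dhyper() would give the same values, since Y is in fact hypergeometric; see Section 3.4.4):

y <- 0:2
p <- choose(3, y) * choose(3, 2 - y) / choose(6, 2)
p          # 0.2 0.6 0.2
sum(p)     # the probabilities sum to 1, as required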

Proposition 3.8 Given a discrete random variable Y with probability distribution function p(y),

(1) 0 ≤ p(y) ≤ 1;

(2) ∑_y p(y) = 1, where the summation ranges over all possible values y of Y;

(3) for any set B ⊂ R, P(Y ∈ B) = ∑_{y∈B} p(y).

Proof This is a direct consequence of the three axioms in the definition of probability, and the fact that the events (Y = y1), (Y = y2), . . . , (Y = yk) are mutually exclusive for all distinct values y1, y2, . . . , yk.

Exercise 3.9 Let p(y) = cy^2, y = 1, 2, 3, 4, be the pdf of some discrete random variable; find the value of c.


Definition 3.10 (Cumulative distribution function) Given a discrete random variable Y, the function given by

FY(a) := P(Y ≤ a) = ∑_{y≤a} pY(y), for any a ∈ R,

is called the cumulative distribution function (cdf) for Y .

If Y is the only random variable under consideration, we simply denote the cdf by F(a).

Example 3.11 Toss two fair coins simultaneously, and record head and tail by 0 and 1 respectively. Let Y denote the sum of the observed numbers. Find the pdf and cdf for Y.

The sample points are 00, 01, 10, 11, each of which has chance 1/4. Then the possible values for Y are y = 0, 1, 2, and the pdf of Y is given by

p(0) = P(Y = 0) = P({00}) = 1/4,
p(1) = P(Y = 1) = P({01, 10}) = 1/2,
p(2) = P(Y = 2) = P({11}) = 1/4.

The cdf of Y is then given by

F(a) = P(Y ≤ a) =
  0,                    a < 0,
  p(0) = 1/4,           0 ≤ a < 1,
  p(0) + p(1) = 3/4,    1 ≤ a < 2,
  1,                    a ≥ 2.

3.3 The Expected Value of a Random Variable, or a Function of a Random Variable

In this subsection, we always assume that Y is a discrete random variable with probability distribution function p(y).

Definition 3.12 (Expected value of a random variable) The expected value (expectation) of Y is defined to be

E(Y) = ∑_y y p(y).


Theorem 3.13 (Expected value of a function of a random variable) Let f : R → R be a function. Then the expected value of Z = f(Y) is given by

E(Z) = E(f(Y)) = ∑_y f(y) p(y).

Definition 3.14 (Variance and standard deviation) The variance of Y is defined to be

V(Y) = E[(Y − µ)^2] = ∑_y (y − µ)^2 p(y),

where µ = E(Y). The standard deviation is then given by √V(Y).

Example 3.15 Roll a fair die, and let X be the number observed. Find E(X) and V(X).

Note that p(x) = P(X = x) = 1/6 for x = 1, 2, . . . , 6; then

µ = E(X) = ∑_x x p(x) = (1 + 2 + ⋯ + 6)/6 = 3.5;

V(X) = ∑_x (x − µ)^2 p(x) = (1/6) ∑_x (x − 3.5)^2 = 35/12.

Remark 3.16 Recall that in a sample with outcomes y1, y2, . . . , yn, the sample mean is given by

ȳ = (y1 + y2 + ⋯ + yn)/n = (1/n) ∑_{i=1}^n yi = ∑_{i=1}^n yi · (1/n).

The last expression can be viewed as the expected value of a random variable whose outcomes are y1, y2, . . . , yn, each taken with equal probability 1/n. Similarly, one can see that V(Y) is a good theoretical counterpart of the sample variance.

Example 3.17 In a gambling game a person draws a single card from an ordinary 52-card playing deck. A person is paid $50 for drawing a jack, a queen or a king, and $150 for drawing an ace. A person who draws any other card pays $40. If a person plays this game, what is the expected gain?

Let Y be the gain from drawing a card; then it has pdf

p(50) = 12/52,   p(150) = 4/52,   p(−40) = 36/52.

Then the expected gain is

E(Y) = ∑_y y p(y) = 50 · (12/52) + 150 · (4/52) + (−40) · (36/52) = −240/52 ≈ −4.615.
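The expected gain can be computed directly in R (gain and prob are illustrative names):

gain <- c(50, 150, -40)       # payoffs for face card, ace, any other card
prob <- c(12, 4, 36) / 52     # corresponding probabilities
sum(gain * prob)              # expected gain per play, about -4.62 dollars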


Theorem 3.18 (Properties of the expected value)

(1) E(c) = c for any constant c ∈ R (viewed as a constant random variable);

(2) Linearity: let Y1, Y2, . . . , Yk be random variables over the same sample space S, and c1, c2, . . . , ck ∈ R; then

E(c1Y1 + c2Y2 + ⋯ + ckYk) = c1E(Y1) + c2E(Y2) + ⋯ + ckE(Yk).

(3) The variance can be computed by

V(Y) = E(Y^2) − [E(Y)]^2.

Example 3.19 Let Y be a random variable such that p(y) = y/10, y = 1, 2, 3, 4. Find E(Y), E(Y^2), E[Y(5 − Y)] and V(Y).

E(Y) = ∑_y y p(y) = 1 · (1/10) + 2 · (2/10) + 3 · (3/10) + 4 · (4/10) = 3;

E(Y^2) = ∑_y y^2 p(y) = 1^2 · (1/10) + 2^2 · (2/10) + 3^2 · (3/10) + 4^2 · (4/10) = 10;

E[Y(5 − Y)] = E(5Y − Y^2) = 5E(Y) − E(Y^2) = 5 × 3 − 10 = 5;

V(Y) = E(Y^2) − [E(Y)]^2 = 10 − 3^2 = 1.

Remark 3.20 If Y has expected value µ and variance σ^2, then we can renormalize Y to a new random variable Z = (Y − µ)/σ with expected value 0 and variance 1. Indeed,

E(Z) = E((Y − µ)/σ) = E(Y − µ)/σ = (E(Y) − µ)/σ = 0;

V(Z) = E[(Z − E(Z))^2] = E(Z^2) = E((Y − µ)^2/σ^2) = E[(Y − µ)^2]/σ^2 = 1.

Example 3.21 The manager of a stockroom in a factory has constructed the following probability distribution for the daily demand (number of times used) for a particular tool.

y       0     1     2
p(y)   0.1   0.5   0.4


It costs the factory $10 each time the tool is used. Find the mean and variance of the daily cost for use of the tool.

Let Y be the daily demand for the tool; then the daily cost is 10Y. Then

E(10Y) = 10E(Y) = 10 ∑_y y p(y) = 10(0 · 0.1 + 1 · 0.5 + 2 · 0.4) = 13.

V(10Y) = E((10Y)^2) − [E(10Y)]^2 = 100 ∑_y y^2 p(y) − 13^2 = 100(0^2 · 0.1 + 1^2 · 0.5 + 2^2 · 0.4) − 169 = 210 − 169 = 41.

3.4 Well-known Discrete Probability Distributions

In practice, many experiments generate random variables with the same types of probability distribution. Note that a type of probability distribution may have some unknown constant(s) that determine its specific form, called parameters.

We are interested in the expected value, variance and standard deviation of these prototypes.

3.4.1 Uniform Distribution

Definition 3.22 A random variable Y is said to have a discrete uniform distribution on R(Y) = {y1, y2, . . . , yn} if p(yi) = 1/n for each i.

Note that the parameter here is the size n of R(Y) - the set of outcomes for Y .

Example 3.23 Let Y be the number observed in a fair die toss; then Y has a discrete uniform distribution on {1, 2, 3, 4, 5, 6} with the pdf p(y) = 1/6 for every y.

Then the expected value and variance are

E(Y) = ∑_y y p(y) = (1 + 2 + ⋯ + 6)/6 = 3.5;

V(Y) = E(Y^2) − (E(Y))^2 = (1^2 + 2^2 + ⋯ + 6^2)/6 − 3.5^2 = 35/12.


Example 3.24 Let Y have a discrete uniform distribution on {1, 2, 3, . . . , n}; find E(Y) and V(Y).

Note that p(y) = 1/n, so

E(Y) = (1 + 2 + ⋯ + n)/n = (n(n + 1)/2)/n = (n + 1)/2,

E(Y^2) = (1^2 + 2^2 + ⋯ + n^2)/n = (n(n + 1)(2n + 1)/6)/n = (n + 1)(2n + 1)/6,

and so

V(Y) = E(Y^2) − (E(Y))^2 = (n + 1)(2n + 1)/6 − ((n + 1)/2)^2 = (n^2 − 1)/12.

3.4.2 Binomial Distribution

A binomial experiment is described as follows:

(1) The experiment consists of n identical and independent trials.

(2) Each trial results in one of two outcomes: success (S) or failure (F);

(3) The probability of success in each single trial is equal to p ∈ [0, 1], and that of failure is equal to q = 1 − p;

(4) The random variable under consideration is Y, the number of successes observed during the n trials.

A sample point in a binomial experiment with n trials is a string of length n with entries being either "S" or "F". The random variable Y is the number of S's in a sample string.

Example 3.25 Toss a coin 100 times. We call it a success if we toss a head, and a failure if a tail. Assume that in a single toss, the chance of getting a head is 2/3 (and hence the chance of a tail is 1 − 2/3 = 1/3).

It is not hard to check this is a binomial experiment with 100 trials and p = 2/3.

Definition 3.26 (Binomial Distribution) A random variable Y is said to have a binomial distribution based on n trials and success probability p ∈ [0, 1] if and only if

p(y) = (n choose y) p^y (1 − p)^(n−y),   y = 0, 1, 2, . . . , n.

We denote Y ∼ b(n, p).


Remark 3.27 Note that p(y) corresponds to the y-th term of the binomial expansion, that is,

(p + q)^n = ∑_{y=0}^n (n choose y) p^y q^(n−y),

where q = 1− p.

Theorem 3.28 Let Y be the number of successes observed in a binomial experiment with n trials and single-trial success probability p; then Y ∼ b(n, p).

Proof (Y = y) represents the set of sample points with y successes and (n − y) failures, each of which has probability p^y (1 − p)^(n−y). There are (n choose y) sample points of this type, so

p(y) = P(Y = y) = (n choose y) p^y (1 − p)^(n−y).

Example 3.29 Suppose that a lot of 5000 electrical fuses contains 5% defectives. If a sample of 20 fuses is tested, find

(a) the probability of observing at least one defective;

(b) the probability of observing exactly four defective;

(c) the probability of observing at least four defective;

(d) the probability of observing at least two but at most five defective.

We call a defective fuse a success; then we are dealing with a binomial experiment with 20 trials and success probability p = 0.05. Let Y be the number of successes, that is, the number of defective fuses observed; then Y ∼ b(20, 0.05). Note that the outcomes of Y are {0, 1, 2, . . . , 20}.

(a) The event is to observe at least one defective, that is, (Y ≥ 1). We have that

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − p(0) = 1 − (20 choose 0) 0.05^0 · 0.95^20 = 1 − 0.95^20 ≈ 0.6415.

(b) The event is to observe exactly 4 defectives, that is, (Y = 4). We have that

P(Y = 4) = p(4) = (20 choose 4) 0.05^4 · 0.95^16 ≈ 0.0133.

Here I am using the software R to compute binomial probabilities. More precisely, you can input the command "dbinom(y, n, p)" to find p(y). For instance, here I input dbinom(4, 20, .05).


(c) The event is to observe at least four defective, that is, (Y ≥ 4). We have that

P(Y ≥ 4) = 1− P(Y ≤ 3) = 1− FY (3) = 1− .984 = 0.016.

Recall that here FY(a) = P(Y ≤ a) is the cumulative distribution function (cdf) of Y. We use the R command "pbinom(a, n, p)" to find FY(a) for a binomial random variable Y.

(d) The event is to observe at least two but at most five defective, that is,

(2 ≤ Y ≤ 5) = (Y ≤ 5)\(Y ≤ 1).

We have that

P(2 ≤ Y ≤ 5) = P(Y ≤ 5)−P(Y ≤ 1) = FY (5)−FY (1) = 0.9997−0.7358 = 0.2639.
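For reference, the four parts of this example collected as R calls:

1 - dbinom(0, 20, 0.05)                      # (a) P(Y >= 1), about 0.6415
dbinom(4, 20, 0.05)                          # (b) P(Y = 4), about 0.0133
1 - pbinom(3, 20, 0.05)                      # (c) P(Y >= 4), about 0.016
pbinom(5, 20, 0.05) - pbinom(1, 20, 0.05)    # (d) P(2 <= Y <= 5), about 0.2639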

Theorem 3.30 Let Y ∼ b(n, p). Then

µ = E(Y) = np,   σ^2 = V(Y) = np(1 − p).

Proof We postpone the proof until we introduce moment-generating functions. One may also prove this using elementary combinatorial arguments.

Example 3.31 An unfair coin has three-to-one odds for head versus tail. Toss this coin 10 times; find the expected value and variance of Y, the number of heads observed.

Here n = 10 and p = 3/4, then

E(Y) = 10 · 3/4 = 7.5, V(Y) = 10 · 3/4(1− 3/4) = 1.875.

3.4.3 Geometric Distribution

A geometric experiment is described as follows:

(1) The experiment consists of identical and independent trials, but the number of trials is not fixed in advance;

(2) Each trial results in one of two outcomes: success (S) or failure (F);

(3) The probability of success in each single trial is equal to p ∈ [0, 1], and that of failure is equal to q = 1 − p;

(4) The random variable under consideration is Y, the number of the trial on which the first success occurs.


There are infinitely many sample points in this experiment:

S, FS, FFS, FFFS, FFFFS, . . .

Example 3.32 Toss a fair coin until a head is observed. Then this is a geometric experiment with success probability p = 1/2.

Definition 3.33 (Geometric Distribution) A random variable Y is said to have a geometric distribution with single-trial success probability p ∈ [0, 1] if and only if

p(y) = p(1 − p)^(y−1),   y = 1, 2, 3, . . . .

We denote Y ∼ Geo(p).

Clearly, if Y is the number of the trial on which the first success occurs in a geometric experiment, then Y ∼ Geo(p).

Theorem 3.34 Let Y ∼ Geo(p); then

µ = E(Y) = 1/p,   σ^2 = V(Y) = (1 − p)/p^2.

Proof Postponed.

Example 3.35 Suppose that 30% of the applicants for a job have a master's degree. Applicants are interviewed sequentially and are selected at random from the pool. Find

(a) the probability that the first applicant with a master's degree is found on the fifth interview;

(b) the probability of meeting the first applicant with a master's degree within 10 interviews;

(c) the expected number of interviews needed to meet the first applicant with a master's degree.

Let Y be the number of the interview on which the first applicant with a master's degree is found. Then Y ∼ Geo(0.3).

(a) The event is (Y = 5); then

P(Y = 5) = p(5) = 0.3 · (1 − 0.3)^(5−1) ≈ 0.0720.

One can also use the R command "dgeom(y − 1, p)" to compute p(y). For example, here we input dgeom(4, 0.3) to calculate p(5).

(b) The event is (Y ≤ 10); then

P(Y ≤ 10) = ∑_{y=1}^{10} p(y) = FY(10) ≈ 0.9718.


One can use the R command "pgeom(y − 1, p)" to compute FY(y). For example, here we input pgeom(9, 0.3) to calculate FY(10).

(c) E(Y) = 1/0.3 ≈ 3.33, so on average we expect to meet the first applicant with a master's degree around the 3rd or 4th interview.
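The three answers in R (recall that dgeom/pgeom count the failures before the first success, hence the y − 1 shift used above):

dgeom(5 - 1, 0.3)     # (a) P(Y = 5), about 0.072
pgeom(10 - 1, 0.3)    # (b) P(Y <= 10), about 0.972
1 / 0.3               # (c) E(Y), about 3.33 interviews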

Remark 3.36 In fact, we can compute the cumulative distribution function as follows. If Y ∼ Geo(p), then for any positive integer a,

P(Y > a) = ∑_{y>a} p(y) = p ∑_{y=a+1}^∞ (1 − p)^(y−1) = p(1 − p)^a / (1 − (1 − p)) = (1 − p)^a,

and hence

FY(a) = P(Y ≤ a) = 1 − P(Y > a) = 1 − (1 − p)^a.

Also, we have the following memoryless property for a geometric distribution:

P(Y > a + b | Y > a) = P(Y > a + b) / P(Y > a) = (1 − p)^(a+b) / (1 − p)^a = (1 − p)^b = P(Y > b).

This means that failures in the first a trials neither increase nor decrease the chance of success in the next b trials.

3.4.4 Hypergeometric Distribution

A hypergeometric experiment is described as follows:

(1) In a population of N elements, there are two distinct types: r successes (S) and (N − r) failures (F);

(2) A sample of size n is randomly selected without replacement from this population;

(3) The random variable under consideration is Y, the number of successes in the sample.

Clearly, Y in the hypergeometric experiment has the following probability distribution.

Definition 3.37 (Hypergeometric Distribution) A random variable Y is said to have a hypergeometric distribution with population size N, success size r and sampling size n, or briefly, with parameters (N, r, n), if and only if

p(y) = (r choose y)(N−r choose n−y) / (N choose n),

for all integers y such that 0 ≤ y ≤ r and 0 ≤ n − y ≤ N − r. We denote Y ∼ Hyp(N, r, n).


Example 3.38 A bowl contains 100 chips, of which 40 are white and the rest are green. Randomly select 70 chips from the bowl without replacement. Let Y be the number of white chips chosen.

(a) Find the probability distribution function of Y.

(b) What is the probability of choosing exactly 25 white chips?

(c) What is the probability of choosing at most 30 white chips?

(a) N = 100, r = 40 and n = 70. If y is an outcome of Y, we have 0 ≤ y ≤ 40 and 0 ≤ 70 − y ≤ 100 − 40, so 10 ≤ y ≤ 40. Then

p(y) = (40 choose y)(60 choose 70−y) / (100 choose 70),   10 ≤ y ≤ 40.

(b) The event is (Y = 25), so

P(Y = 25) = p(25) = (40 choose 25)(60 choose 45) / (100 choose 70) ≈ 0.0728.

We can use the R command "dhyper(y, r, N − r, n)" to compute p(y); for example, here we input dhyper(25, 40, 60, 70) to compute p(25).

(c) The event is (Y ≤ 30), so

P(Y ≤ 30) = ∑_{10≤y≤30} p(y) = FY(30) ≈ 0.8676.

We use the R command "phyper(a, r, N − r, n)" to compute FY(a); for example, here we input phyper(30, 40, 60, 70) to compute FY(30).
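Collected in one place, the R calls for this example:

dhyper(25, 40, 60, 70)            # (b) P(Y = 25), about 0.0728
phyper(30, 40, 60, 70)            # (c) P(Y <= 30), about 0.8676
sum(dhyper(10:40, 40, 60, 70))    # the pdf from (a) sums to 1 over 10 <= y <= 40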

Theorem 3.39 Let Y ∼ Hyp(N, r, n); then

E(Y) = rn/N,   V(Y) = n · (r/N) · ((N − r)/N) · ((N − n)/(N − 1)).

Proof Postponed.

The hypergeometric distribution Hyp(N, r, n) tends to a binomial distribution b(n, p) with p = r/N as N → ∞, that is,

lim_{N→∞} (r choose y)(N−r choose n−y) / (N choose n) = (n choose y) p^y (1 − p)^(n−y).


A transparent explanation is as follows. When the population size N is very large compared to the sampling size n, the probability of picking a success on any draw is almost p = r/N. Indeed, after m sample picks, the success rate may become

(r − j)/(N − j) = (pN − j)/(N − j) = (p − j/N)/(1 − j/N) ≈ p,   j = 0, 1, . . . , m, as N → ∞.

In other words, when the population size is large, it hardly matters whether we sample with or without replacement. One can also check that

E(Y) = rn/N = np,   V(Y) = np(1 − p) · (1 − n/N)/(1 − 1/N) → np(1 − p).

Conversely, we might use a binomial distribution b(n, p) to approximate a hypergeometric distribution Hyp(N, pN, n) for large N and fixed n.

3.4.5 Poisson Distribution

The Poisson distribution is a good model for counting the number of times that events of a certain type occur in a given time period or in a fixed region of space, assuming that

• the probability of an event in an interval is proportional to the length of the interval;

• two events cannot occur at the same time, and the occurrence of a first event does not affect the occurrence of a second event.

Example 3.40 The number of clients arriving at a local bank between 9 and 11 AM.

Definition 3.41 (Poisson Distribution) A random variable Y is said to have a Poisson distribution with rate parameter λ > 0 if and only if

p(y) = (λ^y / y!) e^(−λ),   y = 0, 1, 2, 3, . . . .

We denote Y ∼ Poisson(λ).

We first check that p(y) defines a legitimate pdf. Recall that the exponential function has the following Taylor expansion:

e^x = ∑_{k=0}^∞ x^k / k!, for all x ∈ R,

and therefore

∑_y p(y) = ∑_{y=0}^∞ (λ^y / y!) e^(−λ) = e^(−λ) ∑_{y=0}^∞ λ^y / y! = e^(−λ) e^λ = 1.


Theorem 3.42 Let Y ∼ Poisson(λ); then

µ = E(Y) = λ,   σ^2 = V(Y) = λ.

Proof Postponed.

This result tells us that the parameter λ is in fact the expected value, or average, of Y.

Example 3.43 Suppose the average daily number of automobile accidents is 3; use the Poisson model to find

(a) the probability that there will be 5 accidents tomorrow;

(b) the probability that there will be fewer than 3 accidents tomorrow.

(a) Let Y be the number of accidents tomorrow; then Y can be modeled by a Poisson distribution with λ = 3, and hence

P(Y = 5) = p(5) = (3^5 / 5!) e^(−3) ≈ 0.1008.

One can use the R command "dpois(y, λ)" to compute p(y); for example, here we input "dpois(5, 3)" to get p(5).

(b) The event is (Y ≤ 2); then

P(Y ≤ 2) = ∑_{y=0}^2 p(y) = FY(2) ≈ 0.4232.

One can use the R command "ppois(a, λ)" to compute FY(a); for example, here we input "ppois(2, 3)" to get FY(2).

Example 3.44 The number of typing errors made by a typist has a Poisson distribution with an average of four errors per page. If more than four errors appear on a given page, the typist must retype the whole page. What is the probability that a certain page does not have to be retyped?

Let Y be the number of errors on a page, which can be modeled by a Poisson distribution with λ = 4. Then the probability that a certain page does not have to be retyped is

P(Y ≤ 4) = ∑_{y=0}^4 (4^y / y!) e^(−4) = FY(4) ≈ 0.6288.

Here we use "ppois(4, 4)" to compute FY (4).
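The Poisson probabilities from the last two examples as R calls:

dpois(5, 3)    # Example 3.43(a): P(Y = 5), about 0.1008
ppois(2, 3)    # Example 3.43(b): P(Y <= 2), about 0.4232
ppois(4, 4)    # Example 3.44: P(page need not be retyped), about 0.6288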


Example 3.45 If Y ∼ Poisson(λ), and 3P(Y = 1) = P(Y = 2), find P(Y = 3).

Since 3p(1) = p(2), we get

3 · (λ^1 / 1!) e^(−λ) = (λ^2 / 2!) e^(−λ)  =⇒  λ = 6.

Then

P(Y = 3) = (6^3 / 3!) e^(−6) = 36 e^(−6).

Example 3.46 Given a Poisson random variable Y with standard deviation 2, and denoting its mean (expected value) by µ, find P(µ − σ < Y < µ + σ).

Since λ = µ = σ^2 = 2^2 = 4, we have p(y) = (4^y / y!) e^(−4), y = 0, 1, 2, . . . . Then

P(µ − σ < Y < µ + σ) = P(2 < Y < 6) = p(3) + p(4) + p(5) = FY(5) − FY(2) ≈ 0.5470.

Remark 3.47 One can show that a binomial distribution b(n, p) tends to a Poisson distribution Poisson(λ) with λ = np as n → ∞ (with λ = np held fixed, so that p → 0), that is,

lim_{n→∞} (n choose y) p^y (1 − p)^(n−y) = (λ^y / y!) e^(−λ).

This means that we can approximate a binomial distribution by a Poisson distribution when n is large, while λ = np is fixed.
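A quick numerical illustration of this limit in R (the values n = 1000 and p = 0.003 are an arbitrary choice with np = 3):

n <- 1000; p <- 0.003
dbinom(0:5, n, p)      # binomial probabilities for y = 0, ..., 5
dpois(0:5, n * p)      # Poisson(3) probabilities, nearly identical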

3.5 Tchebysheff’s Theorem

Given a random variable Y with mean/expectation µ and variance σ^2 (without knowing the probability distribution p(y)), we would like to estimate the probability that Y falls in the interval (µ − kσ, µ + kσ) for any k > 0, that is,

P(µ− kσ < Y < µ+ kσ) = P(|Y − µ| < kσ) =?

In general, we can obtain a lower bound for P(|Y−µ| < kσ) by the following theorem.

Theorem 3.48 (Tchebysheff’s Theorem) Let Y be a random variable with mean µand finite variance σ2 <∞. Then for any k > 0,

P(|Y − µ| < kσ) ≥ 1− 1k2 , or P(|Y − µ| ≥ kσ) ≤ 1

k2 .

This theorem is true for both discrete and continuous random variables. Here wepresent the proof (optional) in the discrete case.


Proof Let p(y) be the pdf of Y. Then

σ² = V(Y) = E(Y − µ)² = ∑_y (y − µ)² p(y)
   ≥ ∑_{y: |y−µ| ≥ kσ} (y − µ)² p(y)
   ≥ ∑_{y: |y−µ| ≥ kσ} (kσ)² p(y)
   = k²σ² ∑_{y: |y−µ| ≥ kσ} p(y)
   = k²σ² P(|Y − µ| ≥ kσ).

Dividing both sides by k²σ² gives P(|Y − µ| ≥ kσ) ≤ 1/k².

Example 3.49 The number of customers per day at a sales counter, Y, has been observed for a long period of time and found to have mean 20 and standard deviation 2. The probability distribution of Y is not known. What can be said about the probability that, tomorrow, Y will be greater than 16 but less than 24?

Note that µ = 20 and σ = 2, so by Tchebysheff's theorem,

P(16 < Y < 24) = P(µ − 2σ < Y < µ + 2σ) = P(|Y − µ| < 2σ) ≥ 1 − 1/2² = 0.75.

4 Continuous Random Variables

NOT all random variables of interest are discrete.

Example 4.1 Let Y be the daily rainfall in London. The amount of rainfall could take on any value between 0 and 5 inches, that is, Y can take any value in the interval (0, 5).

Example 4.2 Let Y be the lifespan of a light bulb (measured in hours). Theoretically, Y can take any value in (0, ∞).

Definition 4.3 A random variable Y is said to be continuous if Range(Y) is an interval.

Remark 4.4 For a continuous random variable Y, its range Range(Y) - the set of all possible outcomes - is an interval and thus uncountably infinite. For a typical y ∈ Range(Y), the probability P(Y = y) is zero, so the probability distribution function is no longer a useful description.


4.1 The Characteristics of a Continuous Random Variable

Definition 4.5 Let Y be a random variable. The cumulative distribution function (cdf) of Y, denoted by F_Y(y), is such that F_Y(y) = P(Y ≤ y) for −∞ < y < ∞.

If Y is the only random variable under consideration, we simply write F(y).

Example 4.6 Let Y ∼ b(1, 0.5), that is, the outcome of flipping a fair coin once. Find the cdf F(y).

Recall that Y is discrete with pdf p(0) = 0.5 and p(1) = 0.5. Then

F(y) = ∑_{z ≤ y} p(z) =
  0,   y < 0,
  0.5, 0 ≤ y < 1,
  1,   y ≥ 1.

In this example, F(y) is a piecewise constant function.

Moreover, F(y) is nondecreasing with lim_{y→−∞} F(y) = 0 and lim_{y→∞} F(y) = 1, which is a general fact.

Theorem 4.7 Let F(y) be the cdf of a random variable Y.

(1) F(−∞) := lim_{y→−∞} F(y) = 0, and F(∞) := lim_{y→∞} F(y) = 1;

(2) F(y) is a nondecreasing function of y, that is, if y1 < y2, then F(y1) ≤ F(y2).

Remark 4.8 Classical calculus tells us that any discontinuity point of F(y), if there is one, must be of jump type.

We can now use the cumulative distribution function F(y) to describe a continuous random variable.

Definition 4.9 A random variable Y is continuous if and only if its cdf F(y) is continuous for all −∞ < y < ∞.

Remark 4.10 We can now characterize random variables as follows:

• Y is discrete ⟺ F(y) is piecewise constant with jump discontinuities. Further, at each jump point y0, P(Y = y0) > 0;

• Y is continuous ⟺ F(y) is continuous. Further, P(Y = y) = 0 for all y;


• Y is a mixture of discrete and continuous parts ⟺ F(y) is not piecewise constant but has jump discontinuities.

Definition 4.11 Let F(y) be the cdf of a continuous random variable Y. Then

f(y) := dF(y)/dy = F′(y),

whenever the derivative exists, is called the probability density function (pdf) for Y.

Remark 4.12

• Classical calculus shows that F′(y) actually exists for "almost every" y.

• By the fundamental theorem of calculus and the fact that F(−∞) = 0,

F(y) = ∫_{−∞}^{y} f(t) dt.

• We use "pdf" for short to denote both the probability distribution function of a discrete RV and the probability density function of a continuous RV. We shall see that they serve essentially the same role in calculating probabilities.

Example 4.13 Let Y be a continuous RV with cdf

F(y) =
  0, y < 0,
  y, 0 ≤ y ≤ 1,
  1, y > 1.

Find its pdf f(y).

To compute f(y) = F′(y), we consider the subintervals separately:

• if y < 0, then f(y) = F′(y) = (0)′ = 0;
• if 0 < y < 1, then f(y) = F′(y) = (y)′ = 1;
• if y > 1, then f(y) = F′(y) = (1)′ = 0;
• at y = 0, the left-hand derivative is 0 while the right-hand derivative is 1, so f(0) is undefined; similarly at y = 1.

To sum up,

f(y) =
  0, y < 0 or y > 1,
  1, 0 < y < 1,
  undefined, y = 0 or 1.


Example 4.14 Let Y be a continuous RV with pdf

f(y) = 3y² for 0 ≤ y ≤ 1, and f(y) = 0 otherwise.

Find its cdf F(y).

The cdf is given by F(y) = ∫_{−∞}^{y} f(z) dz, but again we need to consider different intervals for the integral:

• if y < 0, then F(y) = ∫_{−∞}^{y} 0 dz = 0;
• if 0 ≤ y ≤ 1, then F(y) = ∫_{−∞}^{0} 0 dz + ∫_{0}^{y} 3z² dz = 0 + z³|_0^y = y³;
• if y > 1, then F(y) = ∫_{−∞}^{0} 0 dz + ∫_{0}^{1} 3z² dz + ∫_{1}^{y} 0 dz = 0 + z³|_0^1 + 0 = 1.

To sum up,

F(y) =
  0, y < 0,
  y³, 0 ≤ y ≤ 1,
  1, y > 1.

Theorem 4.15 Let f(y) be the pdf of a continuous random variable. Then

(1) f(y) ≥ 0 for all −∞ < y < ∞;

(2) Normalization: ∫_{−∞}^{∞} f(y) dy = 1;

(3) P(Y ∈ B) = ∫_B f(y) dy for any B ⊂ R. In particular,

P(a ≤ Y ≤ b) = ∫_a^b f(y) dy.

Example 4.16 Given

f(y) = cy² for 0 ≤ y ≤ 2, and f(y) = 0 otherwise.

(1) Find the value of c such that f(y) is the probability density function of a continuous random variable.

(2) Find P(Y = 1) and P(1 < Y < 2).

Use the normalization condition:

1 = ∫_{−∞}^{∞} f(y) dy = ∫_0^2 cy² dy = (cy³/3)|_0^2 = 8c/3,

so c = 3/8.

P(Y = 1) = 0 since Y is continuous.

P(1 < Y < 2) = ∫_1^2 f(y) dy = ∫_1^2 (3/8) y² dy = (y³/8)|_1^2 = (2³ − 1³)/8 = 7/8.


Definition 4.17 Given a continuous random variable Y with pdf f(y), its expected value is

E(Y) = ∫_{−∞}^{∞} y f(y) dy,

provided that the integral exists.

Similar to the discrete case, we have

Theorem 4.18

(1) Expected value of a function of Y: let g: R → R, then

E(g(Y)) = ∫_{−∞}^{∞} g(y) f(y) dy.

(2) E(c) = c, where c is a constant viewed as a constant random variable;

(3) Linearity: E(c1 Y1 + · · · + ck Yk) = c1 E(Y1) + · · · + ck E(Yk), where Y1, . . . , Yk are continuous random variables and c1, . . . , ck are real numbers.

(4) Let µ = E(Y); then the variance of Y is

V(Y) = E[(Y − E(Y))²] = ∫_{−∞}^{∞} (y − µ)² f(y) dy

or

V(Y) = E(Y²) − [E(Y)]² = ∫_{−∞}^{∞} y² f(y) dy − µ².

Example 4.19 Given a continuous random variable Y with pdf

f(y) = (3/8) y² for 0 ≤ y ≤ 2, and f(y) = 0 otherwise.

Find E(Y) and V(Y).

E(Y) = ∫_{−∞}^{∞} y f(y) dy = ∫_0^2 y · (3/8) y² dy = (3y⁴/32)|_0^2 = 1.5,

V(Y) = E(Y²) − E(Y)² = ∫_0^2 y² · (3/8) y² dy − 1.5² = (3y⁵/40)|_0^2 − 2.25 = 0.15.


4.2 Well-known Continuous Probability Distributions

4.2.1 Uniform Distribution on an Interval

Definition 4.20 Given −∞ < θ1 < θ2 < ∞, a random variable Y is said to have a continuous uniform probability distribution on the interval [θ1, θ2] if its probability density function is

f(y) = 1/(θ2 − θ1) for θ1 ≤ y ≤ θ2, and f(y) = 0 elsewhere.

We denote Y ∼ U(θ1, θ2).

Example 4.21 Suppose that a bus always arrives at a particular stop between 8:00 and 8:10 am, and that the probability that the bus will arrive in any given subinterval of time is proportional only to the length of the subinterval. For instance, the bus is as likely to arrive between 8:00 and 8:02 as it is to arrive between 8:06 and 8:08.

We denote 8:00 by θ1 = 0 and 8:10 by θ2 = 10, and let Y be the time at which the bus arrives. Then Y ∼ U(0, 10).

Theorem 4.22 Let Y ∼ U(θ1, θ2), then

E(Y) = (θ1 + θ2)/2, V(Y) = (θ2 − θ1)²/12.

Proof

µ = E(Y) = ∫_{−∞}^{∞} y f(y) dy = ∫_{θ1}^{θ2} y/(θ2 − θ1) dy = y²/(2(θ2 − θ1))|_{θ1}^{θ2} = (θ2² − θ1²)/(2(θ2 − θ1)) = (θ1 + θ2)/2,

V(Y) = E(Y²) − (E(Y))² = ∫_{θ1}^{θ2} y²/(θ2 − θ1) dy − ((θ1 + θ2)/2)²
     = (θ2³ − θ1³)/(3(θ2 − θ1)) − (θ1² + 2θ1θ2 + θ2²)/4
     = (θ2 − θ1)²/12.

In the above bus waiting problem, the expected time at which the bus arrives is

E(Y) = (0 + 10)/2 = 5, that is, 8:05 am, and the variance is (10 − 0)²/12 = 25/3.


Example 4.23 The cycle time for trucks hauling concrete to a highway construction site is uniformly distributed over the interval 50 to 70 minutes. What is the probability that the cycle time exceeds 65 minutes if it is known that the cycle time exceeds 55 minutes?

Let Y be the cycle time; then Y ∼ U(50, 70), so f(y) = 1/20 for 50 ≤ y ≤ 70 and f(y) = 0 elsewhere. Hence

P(Y > 65 | Y > 55) = P(Y > 65)/P(Y > 55) = (∫_{65}^{70} (1/20) dy) / (∫_{55}^{70} (1/20) dy) = (70 − 65)/(70 − 55) = 1/3.

Example 4.24 The failure of a circuit board interrupts work that utilizes a computing system until a new board is delivered. The delivery time, Y, is uniformly distributed on the interval one to five days. The total cost of a board failure and interruption is C = 10 + 3Y². Find the expected cost.

Y ∼ U(1, 5), so E(Y) = (1 + 5)/2 = 3 and V(Y) = (5 − 1)²/12 = 4/3, and

E(C) = E(10 + 3Y²) = 10 + 3E(Y²) = 10 + 3[V(Y) + (E(Y))²] = 10 + 3(4/3 + 3²) = 41.

4.2.2 Gaussian/Normal Distribution

The most widely used continuous probability distribution is the Gaussian distribution, or normal distribution, whose density function has the familiar "bell" shape.

Definition 4.25 A random variable Y is said to have a normal probability distribution if for σ > 0 and −∞ < µ < ∞, the probability density function of Y is

(4–1) f(y) = f_Y(y) = 1/(σ√(2π)) · e^{−(y−µ)²/(2σ²)}, −∞ < y < ∞.

We denote Y ∼ N(µ, σ²). We shall see that E(Y) = µ and V(Y) = σ².

In particular, a random variable Z ∼ N(0, 1) is said to have the standard normal distribution, and its pdf is given by

(4–2) f(z) = f_Z(z) = 1/√(2π) · e^{−z²/2}, −∞ < z < ∞.

Example 4.26 The final exam scores in a class are expected to have mean µ = 75 and standard deviation σ = 10. That is, letting Y be the score of a randomly chosen student, Y ∼ N(75, 10²).


Remark 4.27

(1) On the one hand, despite its usefulness, it is unfortunate that we cannot compute the cdf

F(y) = P(Y ≤ y) = ∫_{−∞}^{y} 1/(σ√(2π)) · e^{−(t−µ)²/(2σ²)} dt

by any exact closed-form formula. On the other hand, if Y ∼ N(µ, σ²), then the density function f(y) is symmetric about µ, which allows us to simplify. For instance, for any a ≥ 0,

F(µ − a) = P(Y ≤ µ − a) = P(Y ≥ µ + a) = 1 − F(µ + a),

in particular, F(µ) = P(Y ≥ µ) = P(Y ≤ µ) = 0.5. Also,

F(µ + a) − F(µ − a) = P(µ − a ≤ Y ≤ µ + a) = 2P(µ ≤ Y ≤ µ + a).

(2) 68-95-99.7 Rule: Given Y ∼ N(µ, σ²), we can estimate

P(µ − kσ < Y < µ + kσ) ≈ 68% for k = 1, 95% for k = 2, 99.7% for k = 3.

Example 4.28 The GPAs of a large population of college students are approximately normally distributed with mean 2.4 and standard deviation 0.8.

(1) What fraction of the students will possess a GPA in excess of 3.2?

(2) If students with GPA less than 1.6 will drop out of college, what percentage of students will drop out?

(3) Suppose that three students are randomly selected. What is the probability that all three will possess GPAs in excess of 2.4?

Solution: Let Y be the GPA of a randomly chosen student, then Y ∼ N(µ, σ²) = N(2.4, 0.8²).

(1) P(Y > 3.2) = P(Y > µ + σ) = [1 − P(µ − σ ≤ Y ≤ µ + σ)]/2 ≈ (1 − 68%)/2 = 16%.

(2) P(Y < 1.6) = P(Y < µ − σ) = P(Y > µ + σ) ≈ 16%.


(3) Let Y1, Y2, Y3 be the GPAs of the three randomly chosen students. Then the events (Y1 > 2.4), (Y2 > 2.4), (Y3 > 2.4) are mutually independent, and each has probability 1/2 since µ = 2.4. Thus,

P((Y1 > 2.4) ∩ (Y2 > 2.4) ∩ (Y3 > 2.4)) = P(Y1 > 2.4) P(Y2 > 2.4) P(Y3 > 2.4) = (1/2)³ = 1/8.

The following theorem asserts that we can renormalize a normal random variable Y ∼ N(µ, σ²) to a standard normal Z = (Y − µ)/σ ∼ N(0, 1), which can greatly simplify our computations.

Theorem 4.29 Y ∼ N(µ, σ²) ⟺ (Y − µ)/σ ∼ N(0, 1).

The proof is postponed.

The expected value and variance of the normal distribution are as follows.

Theorem 4.30 Let Y ∼ N(µ, σ2), then E(Y) = µ and V(Y) = σ2.

The proof is postponed.

4.2.3 Gamma Distribution

The gamma probability distribution is widely used in engineering, science, and business to model continuous variables that are always positive and have skewed (non-symmetric) distributions, for instance, the lengths of time between malfunctions of aircraft engines, or the lengths of time between arrivals at a supermarket checkout queue.

Definition 4.31 A random variable Y is said to have a gamma distribution with parameters α > 0 and β > 0 if the density function of Y is

f(y) = y^{α−1} e^{−y/β} / (β^α Γ(α)) for 0 ≤ y < ∞, and f(y) = 0 elsewhere,

where Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy is called the gamma function. We denote Y ∼ gamma(α, β).


α is called the shape parameter, determining the shape or concavity of f(y), while β is called the scale parameter.

Here are some properties of the gamma function:

• Γ(n) = (n − 1)! for every positive integer n; in particular, Γ(1) = 0! = 1.

• Γ(α) = (α − 1)Γ(α − 1) for all α > 1.

Theorem 4.32 Let Y ∼ gamma(α, β), then

E(Y) = αβ, V(Y) = αβ².

The proof is postponed.

Example 4.33 Four-week summer rainfall totals in a section of the Midwest United States have approximately a gamma distribution with α = 1.6 and β = 2.0.

(1) Find the mean and variance of the four-week rainfall totals.

(2) What is the probability that the four-week rainfall total exceeds 4 inches?

Solution: Y ∼ gamma(1.6, 2.0), so

E(Y) = 1.6 × 2.0 = 3.2, V(Y) = 1.6 × 2.0² = 6.4,

and

P(Y > 4) = ∫_4^∞ y^{0.6} e^{−y/2} / (2^{1.6} Γ(1.6)) dy ≈ 0.2896 (using a calculator).
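Instead of a calculator, one may use the standard R function pgamma for the gamma cdf (a sketch with the same parameters α = 1.6, β = 2.0; in R the scale argument plays the role of β):

1 - pgamma(4, shape = 1.6, scale = 2.0)   # P(Y > 4), about 0.2896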

There are two special cases of gamma random variables: one is the chi-square distribution, the other is the exponential distribution.

Definition 4.34 Let ν be a positive integer. A random variable Y is said to have a chi-square distribution with ν degrees of freedom if Y ∼ gamma(α, β) with parameters α = ν/2 and β = 2. We denote Y ∼ χ²(ν). Consequently,

E(Y) = ν, V(Y) = 2ν.

Definition 4.35 A random variable Y is said to have an exponential distribution with parameter β > 0 if Y ∼ gamma(α, β) with α = 1. Consequently, the pdf of an exponential distribution is

f(y) = (1/β) e^{−y/β} for 0 ≤ y < ∞, and f(y) = 0 elsewhere.


We denote Y ∼ exp(β). Note that the cumulative distribution function is

F(y) = ∫_{−∞}^{y} f(t) dt = 1 − e^{−y/β} for 0 ≤ y < ∞, and F(y) = 0 for y < 0,

and

E(Y) = β, V(Y) = β².

Note that if Y ∼ exp(β), then for any y ≥ 0,

P(Y > y) = 1 − F(y) = e^{−y/β}.

The exponential distribution is usually used to model the lifetime of electronic components.

Example 4.36 A light bulb burns out after a random time (in hours) that follows an exponential distribution. The average lifetime of the light bulb is 100 hours.

(1) What is the probability that the light bulb will not burn out before 20 hours?

(2) What is the probability that the lifetime of the light bulb is between 20 hours and 40 hours?

Solution: Let Y be the lifetime of the bulb, in other words, the time at which it burns out. Then β = E(Y) = 100, so Y ∼ exp(100) and

(1) P(Y > 20) = 1 − F(20) = e^{−20/100} = e^{−0.2} ≈ 0.8187.

(2) P(20 ≤ Y ≤ 40) = ∫_{20}^{40} (1/100) e^{−y/100} dy = −e^{−y/100}|_{20}^{40} = e^{−0.2} − e^{−0.4} ≈ 0.1484.

Example 4.37 The length of time Y necessary to complete a key operation in the construction of houses has an exponential distribution with mean 10 hours. The formula C = 100 + 4Y + 3Y² relates the cost C of completing this operation to Y. Find the expected cost.

Since Y ∼ exp(10), E(Y) = 10 and V(Y) = 10² = 100, so

E(C) = 100 + 4E(Y) + 3E(Y²) = 100 + 4 × 10 + 3[V(Y) + (E(Y))²] = 140 + 3[100 + 10²] = 740.


4.3 Tchebysheff’s Theorem (Revisited)

Theorem 4.38 (Tchebysheff's Theorem) Let Y be a random variable with mean µ and finite variance σ² < ∞. Then for any k > 0,

P(|Y − µ| < kσ) ≥ 1 − 1/k², or P(|Y − µ| ≥ kσ) ≤ 1/k².

The proof in the continuous case is almost the same as in the discrete case: just replace ∑ by ∫, and the probability distribution p(y) by the density f(y).

Example 4.39 A machine used to fill cereal boxes dispenses, on average, µ ounces per box. The manufacturer wants the actual ounces dispensed, Y, to be within 1 ounce of µ at least 75% of the time. What is the largest value of σ, the standard deviation of Y, that can be tolerated if the manufacturer's objectives are to be met?

By Tchebysheff's theorem, with kσ = 1,

P(|Y − µ| < 1) = P(|Y − µ| < kσ) ≥ 1 − 1/k² ≥ 75%,

so k ≥ 2, and hence σ = 1/k ≤ 1/2.

5 Multivariate Probability Distributions

Suppose that Y1, Y2, . . . , Yn denote the outcomes of n successive trials of an experiment. The intersection of the n events is

(Y1 = y1) ∩ (Y2 = y2) ∩ · · · ∩ (Yn = yn),

which we will denote as

(Y1 = y1, Y2 = y2, . . . , Yn = yn)

or, more compactly, as

(Y1, Y2, . . . , Yn) = (y1, y2, . . . , yn).

The random variables Y1, Y2, . . . , Yn might be related in some way, or might be completely unrelated. Calculation of the probability of this intersection is essential in making inferences about the population from which the sample was drawn and is a major reason for studying multivariate probability distributions.

Example 5.1 In a physical exam of a single person, let Y1 be the age, Y2 the height, Y3 the weight, Y4 the blood type, and Y5 the pulse rate. The doctor will be interested in the personal data (Y1, Y2, Y3, Y4, Y5). A sample measurement would be something like

(y1, y2, y3, y4, y5) = (30 yrs, 6 ft, 140 lbs, type O, 65/min).


5.1 Bivariate Probability Distributions

Many random variables can be defined over the same sample space.

Example 5.2 Toss a pair of dice. The sample space contains 36 sample points:

(i, j): i = 1, 2, . . . , 6, j = 1, 2, . . . , 6.

Let Y1 be the number on die 1 and Y2 the number on die 2. The probability of (Y1 = y1, Y2 = y2) for any possible values y1 and y2 is then given by

p(y1, y2) = P(Y1 = y1, Y2 = y2) = P(Y1 = y1) P(Y2 = y2) = (1/6) × (1/6) = 1/36.

Definition 5.3 For any two random variables Y1 and Y2, the joint/bivariate cumulative distribution function (cdf) of Y1 and Y2 is given by

F(y1, y2) = P(Y1 ≤ y1, Y2 ≤ y2), −∞ < y1, y2 < ∞.

Here are some basic properties of the bivariate cdf F(y1, y2):

• F(−∞, −∞) = F(y1, −∞) = F(−∞, y2) = 0, and F(∞, ∞) = 1.

• Given −∞ ≤ ai < bi ≤ ∞, i = 1, 2,

P(a1 < Y1 ≤ b1, a2 < Y2 ≤ b2) = F(b1, b2) + F(a1, a2) − F(a1, b2) − F(b1, a2) ≥ 0.

To memorize this formula, consider the rectangle R = (a1, b1] × (a2, b2]; the identity can then be rephrased as

P((Y1, Y2) ∈ R) = ∑ F(antidiagonal corners) − ∑ F(diagonal corners).

5.1.1 Discrete Case

Definition 5.4 For any two discrete random variables Y1 and Y2, the joint/bivariate probability distribution function (pdf) of Y1 and Y2 is given by

p(y1, y2) = P(Y1 = y1, Y2 = y2), −∞ < y1, y2 < ∞.

Theorem 5.5 Let Y1 and Y2 be discrete random variables with joint probability distribution function p(y1, y2), then

(1) p(y1, y2) ≥ 0 for all y1, y2;

(2) ∑_{y1, y2} p(y1, y2) = 1;


(3) For any A ⊂ R × R, the probability of the event ((Y1, Y2) ∈ A) is given by

P((Y1, Y2) ∈ A) = ∑_{(y1, y2) ∈ A} p(y1, y2).

In particular,

F(b1, b2) = P(Y1 ≤ b1, Y2 ≤ b2) = ∑_{y1 ≤ b1, y2 ≤ b2} p(y1, y2).

Example 5.6 Roll a fair die and flip a fair coin, independently. Let Y1 be the number on the die and let Y2 be the number of heads. What is the probability that there is one head?

Solution: The pdf of the bivariate random variable (Y1, Y2) is given by

p(y1, y2) = P(Y1 = y1) P(Y2 = y2) = (1/6) × (1/2) = 1/12,

for all y1 = 1, 2, . . . , 6 and y2 = 0, 1. The probability that there is one head is

P(Y2 = 1) = ∑_{y1=1}^{6} p(y1, 1) = p(1, 1) + p(2, 1) + · · · + p(6, 1) = 6 × (1/12) = 1/2.

Example 5.7 Roll a fair die and let Y1 be the number on the die. Then flip a fair coin Y1 times, and let Y2 be the number of heads. What is the probability that there is one head?

Solution: Note that if Y1 = y1, then Y2 ∼ B(y1, 1/2). Then the pdf of the bivariate random variable (Y1, Y2) is given by

p(y1, y2) = P(Y1 = y1) P(Y2 = y2 | Y1 = y1) = (1/6) × \binom{y1}{y2} (1/2)^{y2} (1/2)^{y1−y2} = \binom{y1}{y2} / (6 · 2^{y1})

for all y1 = 1, 2, . . . , 6 and 0 ≤ y2 ≤ y1. The probability that there is one head is

P(Y2 = 1) = ∑_{y1=1}^{6} p(y1, 1) = ∑_{y1=1}^{6} \binom{y1}{1} / (6 · 2^{y1}) = ∑_{y1=1}^{6} y1 / (6 · 2^{y1}) = 5/16.


5.1.2 Continuous Case

Definition 5.8 Two random variables Y1 and Y2 are said to be jointly continuous if their joint cumulative distribution function F(y1, y2) is continuous in both arguments.

A function f(y1, y2) is called the joint probability density function (pdf) of the jointly continuous random variables Y1 and Y2 if

F(y1, y2) = ∫_{−∞}^{y2} ∫_{−∞}^{y1} f(z1, z2) dz1 dz2, −∞ < y1, y2 < ∞.

Theorem 5.9 Let Y1 and Y2 be jointly continuous random variables with joint probability density function f(y1, y2), then

(1) f(y1, y2) ≥ 0 for all y1, y2;

(2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(y1, y2) dy1 dy2 = 1;

(3) For any A ⊂ R², the probability of the event ((Y1, Y2) ∈ A) is given by

P((Y1, Y2) ∈ A) = ∫∫_A f(y1, y2) dy1 dy2.

In particular,

P(a1 ≤ Y1 ≤ b1, a2 ≤ Y2 ≤ b2) = ∫_{a2}^{b2} ∫_{a1}^{b1} f(y1, y2) dy1 dy2.

Example 5.10 Suppose that the bivariate pdf of Y1 and Y2 is uniform, that is,

f(y1, y2) = 1 for 0 ≤ y1 ≤ 1, 0 ≤ y2 ≤ 1, and f(y1, y2) = 0 elsewhere.

Find P(0.1 ≤ Y1 ≤ 0.3, 0 ≤ Y2 ≤ 0.5).

P(0.1 ≤ Y1 ≤ 0.3, 0 ≤ Y2 ≤ 0.5) = ∫_0^{0.5} ∫_{0.1}^{0.3} f(y1, y2) dy1 dy2 = ∫_0^{0.5} ∫_{0.1}^{0.3} 1 dy1 dy2 = ∫_0^{0.5} dy2 · ∫_{0.1}^{0.3} dy1 = 0.5 × (0.3 − 0.1) = 0.1.


Example 5.11 Gasoline is to be stocked in a bulk tank once at the beginning of each week and then sold to individual customers. Let Y1 denote the proportion of the capacity of the bulk tank that is available after the tank is stocked at the beginning of the week. Let Y2 denote the proportion of the capacity of the bulk tank that is sold during the week. Suppose that the joint density function of Y1 and Y2 is given by

f(y1, y2) = 3y1 for 0 ≤ y2 ≤ y1 ≤ 1, and f(y1, y2) = 0 elsewhere.

Find the probability that less than one-half of the tank will be stocked and more than one-quarter of the tank will be sold.

Find the probability that less than one-half of the tank will be stocked and more thanone-quarter of the tank will be sold.

Solution: We have

P(0 ≤ Y1 ≤ 0.5,Y2 > 0.25) =

∫∫0≤y2≤y1≤1,0≤y1≤0.5,

y2>0.25

f (y1, y2)dy1dy2

=

∫∫0.25<y2≤y1≤0.5

3y1dy1dy2

=

∫ 0.5

0.25dy2

(∫ 0.5

y2

3y1dy1

)

=

∫ 0.5

0.25

(32

y21

∣∣∣∣0.5y2

)dy2

=

∫ 0.5

0.25

32

(14− y2

2

)dy2

=

∫ 0.5

0.25

(38− 3

2y2

2

)dy2

=

(38

y2 −12

y32

)∣∣∣∣0.50.25

=38

(12− 1

4)− 1

2((

12

)3 − (14

)3) =5

128.

5.2 Marginal and Conditional Probability Distributions

Given the joint distribution of two random variables Y1 and Y2, how can we determine the individual distribution of Y1 or Y2?


5.2.1 Discrete Case

In this subsection, we always let Y1 and Y2 be jointly discrete random variables with bivariate probability distribution function (pdf) p(y1, y2).

Definition 5.12 The marginal probability distribution functions of Y1 and Y2, respectively, are given by

p1(y1) = ∑_{all y2} p(y1, y2), p2(y2) = ∑_{all y1} p(y1, y2).

Furthermore, the conditional discrete probability function of Y1 given Y2 is

p(y1|y2) = p(y1, y2)/p2(y2), provided that p2(y2) > 0.

Similarly, we can define the conditional probability function of Y2 given Y1.

Note that p(y1|y2) = P(Y1 = y1 | Y2 = y2) = P(Y1 = y1, Y2 = y2)/P(Y2 = y2).

Example 5.13 From a group of three Republicans, two Democrats, and one independent, a committee of two people is to be randomly selected. Let Y1 denote the number of Republicans and Y2 the number of Democrats on the committee. Find (a) the bivariate probability function of Y1 and Y2; (b) the marginal probability functions of Y1 and Y2; (c) the conditional probability function of Y2 given Y1.

Solution: (a) The bivariate pdf is

p(y1, y2) = P(Y1 = y1, Y2 = y2) = \binom{3}{y1} \binom{2}{y2} \binom{1}{2−y1−y2} / \binom{6}{2}.

For this formula to make sense, we must have

0 ≤ y1 ≤ 3, 0 ≤ y2 ≤ 2, 0 ≤ 2 − y1 − y2 ≤ 1,

and hence the possible outcomes for (y1, y2) are

(0, 1), (1, 0), (1, 1), (0, 2), (2, 0).

By direct computation, we have

p(0, 1) = 2/15, p(1, 0) = 3/15, p(1, 1) = 6/15, p(0, 2) = 1/15, p(2, 0) = 3/15.


(b) The marginal probability function of Y1 is given by

p1(0) = p(0, 1) + p(0, 2) = 2/15 + 1/15 = 1/5,
p1(1) = p(1, 0) + p(1, 1) = 3/15 + 6/15 = 3/5,
p1(2) = p(2, 0) = 3/15 = 1/5.

The marginal probability function of Y2 is given by

p2(0) = p(1, 0) + p(2, 0) = 3/15 + 3/15 = 2/5,
p2(1) = p(0, 1) + p(1, 1) = 2/15 + 6/15 = 8/15,
p2(2) = p(0, 2) = 1/15.

(c) By the formula p(y2|y1) = p(y1, y2)/p1(y1), we get

p(0|0) = 0/(1/5) = 0,        p(0|1) = (3/15)/(3/5) = 1/3,  p(0|2) = (3/15)/(1/5) = 1,
p(1|0) = (2/15)/(1/5) = 2/3, p(1|1) = (6/15)/(3/5) = 2/3,  p(1|2) = 0/(1/5) = 0,
p(2|0) = (1/15)/(1/5) = 1/3, p(2|1) = 0/(3/5) = 0,         p(2|2) = 0/(1/5) = 0.

5.2.2 Continuous Case

In this subsection, we always let Y1 and Y2 be jointly continuous random variables with joint probability density function f(y1, y2).

Definition 5.14 The marginal probability density functions (pdf) of Y1 and Y2, respectively, are given by

f1(y1) = ∫_{−∞}^{∞} f(y1, y2) dy2, f2(y2) = ∫_{−∞}^{∞} f(y1, y2) dy1.

Furthermore, the conditional density of Y1 given Y2 = y2 is given by

f(y1|y2) = f(y1, y2)/f2(y2)


provided f2(y2) > 0. Note that if f2(y2) = 0, then f (y1|y2) is undefined.

Similarly, we can define the conditional density of Y2 given Y1 = y1 .

Example 5.15 Suppose that X and Y have bivariate density given by

f(x, y) = e^{−x} for 0 ≤ y ≤ x < ∞, and f(x, y) = 0 elsewhere.

Find

(a) the marginal density functions of X and Y;

(b) the conditional density of Y given X = x;

(c) the conditional density of X given Y = y.

Solution: (a) The marginal density functions of X and Y are

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^x e^{−x} dy = x e^{−x} for x ≥ 0, and f_X(x) = 0 elsewhere;

f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_y^∞ e^{−x} dx = e^{−y} for y ≥ 0, and f_Y(y) = 0 elsewhere.

(b) The conditional density of Y given X = x (for x > 0) is

f(y|x) = f(x, y)/f_X(x) = e^{−x}/(x e^{−x}) = 1/x for 0 < y ≤ x, and 0 elsewhere.

(c) The conditional density of X given Y = y (for y ≥ 0) is

f(x|y) = f(x, y)/f_Y(y) = e^{−x}/e^{−y} = e^{y−x} for x ≥ y, and 0 elsewhere.

5.3 Independence of Random Variables

Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B).

We can define independence of random variables via the cumulative distribution function (cdf).


Definition 5.16 Suppose that Y1 and Y2 have cdfs F1(y1) and F2(y2), respectively, and (Y1, Y2) has the bivariate cdf F(y1, y2). Y1 and Y2 are said to be independent if

(5–1) F(y1, y2) = F1(y1) F2(y2)

for any −∞ < y1, y2 < ∞. Otherwise, we say that Y1 and Y2 are dependent.

We can check independence via the pdf: the probability distribution function in the discrete case, and the probability density function in the continuous case.

Theorem 5.17 Y1 and Y2 are independent if and only if

• Discrete case: p(y1, y2) = p1(y1) p2(y2) for all y1, y2;

• Continuous case: f(y1, y2) = f1(y1) f2(y2) for all y1, y2.

Example 5.18 Let Y1, Y2 be the random variables in Example 5.13. Then Y1 and Y2 are dependent since

0 = p(0, 0) ≠ p1(0) p2(0) = (1/5) · (2/5) = 2/25.

Example 5.19 Let X and Y be as in Example 5.15. Then X and Y are dependent since

f(x, y) = e^{−x} ≠ x e^{−x} · e^{−y} = f_X(x) f_Y(y)

for 0 ≤ y ≤ x.

5.4 Expected Value of a Function of Random Variables

Given a random variable Y, recall that the expected value of g(Y) is given by

• Discrete: E[g(Y)] = ∑_y g(y) p(y);

• Continuous: E[g(Y)] = ∫_{−∞}^{∞} g(y) f(y) dy.

Definition 5.20 Given random variables Y1, Y2, . . . , Yn with joint pdf, the expected value of g(Y1, Y2, . . . , Yn) is given by

• Discrete: E[g(Y1, . . . , Yn)] = ∑_{(y1, . . . , yn)} g(y1, . . . , yn) p(y1, . . . , yn);

• Continuous: E[g(Y1, . . . , Yn)] = ∫_{yn} · · · ∫_{y1} g(y1, . . . , yn) f(y1, . . . , yn) dy1 · · · dyn.


We shall mostly focus on the case n = 2. This definition actually generalizes the previous ones. In particular, if Y1 and Y2 have bivariate pdf f(y1, y2), we can compute E(Y1) by

E(Y1) = ∫∫ y1 f(y1, y2) dy1 dy2

directly, without knowing the marginal pdf of Y1.

Example 5.21 Let X and Y be given by Example 5.15. Find E(X), E(Y), E(XY), V(X).

Solution:

E(X) = ∫_0^∞ ∫_0^x x e^{−x} dy dx = ∫_0^∞ x² e^{−x} dx = Γ(3) = 2! = 2,

E(Y) = ∫_0^∞ ∫_0^x y e^{−x} dy dx = ∫_0^∞ (1/2) x² e^{−x} dx = 1,

E(XY) = ∫_0^∞ ∫_0^x xy e^{−x} dy dx = (1/2) ∫_0^∞ x³ e^{−x} dx = (1/2) Γ(4) = (1/2) · 3! = 3,

E(X²) = ∫_0^∞ ∫_0^x x² e^{−x} dy dx = ∫_0^∞ x³ e^{−x} dx = Γ(4) = 3! = 6,

thus V(X) = E(X²) − [E(X)]² = 6 − 2² = 2.
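A Monte Carlo sketch in R: the marginal of X is x e^{−x} (a gamma density with shape 2 and rate 1), and given X = x, Y is uniform on (0, x), so one can simulate from the joint density and compare with the exact values (rgamma and runif are standard R functions; the code is illustrative only):

set.seed(1)
n <- 200000
x <- rgamma(n, shape = 2, rate = 1)    # marginal density x*exp(-x)
y <- runif(n, min = 0, max = x)        # conditional density 1/x on (0, x)
c(mean(x), mean(y), mean(x * y), var(x))   # approximately 2, 1, 3, 2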

To simplify computations of expected values, we recall some basic properties:

Theorem 5.22

(1) E(c) = c;

(2) E(c1 Y1 + c2 Y2 + · · · + ck Yk) = c1 E(Y1) + c2 E(Y2) + · · · + ck E(Yk);

(3) if Y1 and Y2 are independent, then E(Y1 Y2) = E(Y1) E(Y2). More generally, E[g(Y1) h(Y2)] = E[g(Y1)] E[h(Y2)] for any real-valued functions g, h: R → R.

Proof The only new property here is (3): since Y1 and Y2 are independent, f(y1, y2) = f1(y1) f2(y2), and hence

E[g(Y1) h(Y2)] = ∫∫ g(y1) h(y2) f(y1, y2) dy1 dy2 = ∫∫ g(y1) h(y2) f1(y1) f2(y2) dy1 dy2
              = ∫ g(y1) f1(y1) dy1 · ∫ h(y2) f2(y2) dy2 = E[g(Y1)] E[h(Y2)].


Example 5.23 Suppose that a radioactive particle is randomly located with coordinates (Y1, Y2) in the unit square. A reasonable model for the joint density function of Y1 and Y2 is

f(y1, y2) = 1 for 0 ≤ y1 ≤ 1, 0 ≤ y2 ≤ 1, and f(y1, y2) = 0 elsewhere.

(a) E(Y1) = ∫∫ y1 f(y1, y2) dy1 dy2 = ∫_0^1 ∫_0^1 y1 dy1 dy2 = ∫_0^1 y1 dy1 · ∫_0^1 dy2 = (1/2) · 1 = 1/2.

(b) E(Y1 − Y2): by the same formula as in (a) and the symmetry between Y1 and Y2, we have E(Y2) = 1/2 as well, so E(Y1 − Y2) = E(Y1) − E(Y2) = 0.

(c) E(Y1 Y2): it is not hard to show that Y1 and Y2 are independent, so E(Y1 Y2) = E(Y1) E(Y2) = (1/2) · (1/2) = 1/4.

(d) E(Y1² + Y2²):

E(Y1²) = ∫∫ y1² f(y1, y2) dy1 dy2 = ∫_0^1 ∫_0^1 y1² dy1 dy2 = ∫_0^1 y1² dy1 · ∫_0^1 dy2 = 1/3.

By symmetry, E(Y2²) = 1/3 as well, so E(Y1² + Y2²) = E(Y1²) + E(Y2²) = 2/3.

(e) V(Y1 Y2):

V(Y1 Y2) = E(Y1² Y2²) − [E(Y1 Y2)]² = E(Y1²) E(Y2²) − [E(Y1 Y2)]² = (1/3)(1/3) − (1/4)² = 7/144.

5.5 Covariance of Random Variables

In real life, two random variables Y1 and Y2 of interest are often dependent. To measure the dependency between Y1 and Y2, we need the notion of covariance.

Definition 5.24 Let µ1 = E(Y1) and µ2 = E(Y2); then the covariance of Y1 and Y2 is given by

Cov(Y1, Y2) = E[(Y1 − µ1)(Y2 − µ2)].

Remark 5.25

(1) Cov(Y1, Y2) can be positive, negative, or zero;

(2) V(Y) = Cov(Y, Y) ≥ 0;

(3) Cov(Y1, Y2) = E(Y1 Y2) − E(Y1) E(Y2).


(4) If Y1 and Y2 are independent, then Cov(Y1, Y2) = 0. Indeed,

Cov(Y1, Y2) = E(Y1 Y2) − E(Y1) E(Y2) = E(Y1) E(Y2) − E(Y1) E(Y2) = 0.

However, the converse need not be true: Cov(Y1, Y2) = 0 only implies that Y1 and Y2 have no linear relation.

(5) By the Cauchy–Schwarz inequality, we have

|Cov(Y1, Y2)| ≤ √V(Y1) √V(Y2).

Moreover, equality holds only when Y1 and Y2 are linearly dependent, i.e., k1(Y1 − µ1) + k2(Y2 − µ2) = 0 for some constants k1 and k2.

(6) The correlation coefficient is given by

ρ = ρ(Y1, Y2) := Cov(Y1, Y2) / (√V(Y1) √V(Y2)),

provided both V(Y1) and V(Y2) are positive. Then

• |ρ| ≤ 1 (by Cauchy–Schwarz);
• ρ = 0 means that Y1 and Y2 have no linear relation;
• ρ = ±1 means that Y1 and Y2 have a perfect linear relation, that is, Y2 − µ2 = k(Y1 − µ1) almost surely.

Example 5.26 Let Y1 and Y2 have joint density given by

f(y1, y2) = 2y1 for 0 ≤ y1 ≤ 1, 0 ≤ y2 ≤ 1, and f(y1, y2) = 0 elsewhere.

Find Cov(Y1, Y2) and ρ(Y1, Y2).

Solution: It is easy to compute the marginal pdfs:

f1(y1) = 2y1 for 0 ≤ y1 ≤ 1 (and 0 elsewhere), f2(y2) = 1 for 0 ≤ y2 ≤ 1 (and 0 elsewhere),

and thus f(y1, y2) = f1(y1) f2(y2), which means that Y1 and Y2 must be independent. Hence Cov(Y1, Y2) = 0 and ρ(Y1, Y2) = 0.

Example 5.27 Let Y1 and Y2 have joint density given by

f(y1, y2) = 2 for 0 ≤ y1 ≤ y2 ≤ 1, and f(y1, y2) = 0 elsewhere.


Find Cov(Y1, Y2).

Cov(Y1, Y2) = E(Y1 Y2) − E(Y1) E(Y2)
  = ∫∫ y1 y2 f(y1, y2) dy1 dy2 − ∫∫ y1 f(y1, y2) dy1 dy2 · ∫∫ y2 f(y1, y2) dy1 dy2
  = ∫_0^1 ∫_0^{y2} 2 y1 y2 dy1 dy2 − ∫_0^1 ∫_0^{y2} 2 y1 dy1 dy2 · ∫_0^1 ∫_0^{y2} 2 y2 dy1 dy2
  = 1/4 − (1/3)(2/3) = 1/36.

By straightforward computation, we deduce a formula for the variance of linear combinations of random variables.

Theorem 5.28 Consider the random variables

U = ∑_{i=1}^{n} ai Yi = a1 Y1 + · · · + an Yn, and V = ∑_{j=1}^{m} bj Zj = b1 Z1 + · · · + bm Zm.

Then

Cov(U, V) = ∑_{i=1}^{n} ∑_{j=1}^{m} ai bj Cov(Yi, Zj).

In particular, we have the variance

V(U) = ∑_{i=1}^{n} ai² V(Yi) + 2 ∑_{1 ≤ i < j ≤ n} ai aj Cov(Yi, Yj).

Furthermore, if Y1, Y2, . . . , Yn are independent, then

V(U) = ∑_{i=1}^{n} ai² V(Yi).

An even simpler example is

V(aY + b) = a² V(Y),

since V(b) = 0 and Cov(Y, b) = 0.

Example 5.29 Let Y1, Y2, . . . , Yn be independent random variables with E(Yi) = µ and V(Yi) = σ². (These variables may denote the outcomes of n independent trials of an experiment.) Define the sample mean variable

Ȳ = (Y1 + Y2 + · · · + Yn)/n.


Then E(Ȳ) = µ and V(Ȳ) = σ²/n.

Indeed,

E(Ȳ) = (E(Y1) + E(Y2) + · · · + E(Yn))/n = µ,

and since Y1, Y2, . . . , Yn are independent,

V(Ȳ) = ∑_{i=1}^{n} (1/n)² V(Yi) = (1/n²) ∑_{i=1}^{n} σ² = σ²/n.

This example will be helpful when we introduce the central limit theorem (CLT).

5.6 Conditional Expectations

A random variable Y1 often relies on outcomes of another random variable Y2 .

Example 5.30 A quality control plan for an assembly line involves sampling n = 10 finished items per day and counting Y, the number of defectives. If P denotes the probability of observing a defective, then Y has a binomial distribution b(n, P), assuming that a large number of items are produced by the line. But P varies from day to day and is assumed to have a uniform distribution on the interval from 0 to 1/4, that is, P ∼ U(0, 1/4).

Here the number Y of defectives depends on the outcome of the defective-rate variable P. The conditional distribution of Y given P = 1/8 is the binomial distribution b(10, 1/8), that is,

P(Y = y | P = 1/8) = \binom{10}{y} (1/8)^y (7/8)^{10−y}, y = 0, 1, . . . , 10.

We recall the conditional distribution of Y1 given that Y2 = y2:

• Discrete: the conditional probability distribution function of Y1 given that Y2 = y2 is

p(y1|y2) = p(y1, y2)/p2(y2), provided p2(y2) = P(Y2 = y2) > 0;

• Continuous: the conditional probability density function of Y1 given that Y2 = y2 is

f(y1|y2) = f(y1, y2)/f2(y2), provided f2(y2) > 0.


Definition 5.31 The conditional expectation of g(Y1) given that Y2 = y2 is

• Discrete: E(g(Y1) | Y2 = y2) = ∑_{y1} g(y1) p(y1|y2);

• Continuous: E(g(Y1) | Y2 = y2) = ∫_{−∞}^{∞} g(y1) f(y1|y2) dy1.

Example 5.32 The random variables Y1 and Y2 have the joint density function

f(y1, y2) = 1/2 for 0 ≤ y1 ≤ y2 ≤ 2, and f(y1, y2) = 0 elsewhere.

Find the conditional expectation of Y1 given that Y2 = 1.5.

f2(y2) = ∫ f(y1, y2) dy1 = ∫_0^{y2} (1/2) dy1 = y2/2 for 0 ≤ y2 ≤ 2, and f2(y2) = 0 elsewhere.

Then

f(y1|y2) = f(y1, y2)/f2(y2) = 1/y2 for 0 ≤ y1 ≤ y2 ≤ 2; it equals 0 for 0 ≤ y2 ≤ 2 with y1 < 0 or y1 > y2, and is undefined elsewhere.

Given that Y2 = 1.5, we have

E(Y1 | Y2 = 1.5) = ∫ y1 f(y1|1.5) dy1 = ∫_0^{1.5} (y1/1.5) dy1 = 0.75.

If we allow the outcome Y2 = y2 to vary, then y2 ↦ E(Y1 | Y2 = y2) can be viewed as a function, or random variable, denoted by E(Y1|Y2). In Example 5.32, E(Y1|Y2) = Y2/2, with outcomes in [0, 1].

Theorem 5.33 E(E(Y1|Y2)) = E(Y1).

The proof of this theorem is routine. The idea of the formula is the following: taking the average of the conditional average of Y1 given Y2 gives exactly the average of Y1.

In Example 5.30, the expected value of Y is then

E(Y) = E(E(Y|P)) = E(nP) = E(10P) = 10 E(P) = 10 · (0 + 1/4)/2 = 1.25.


6 Functions of Random Variables

Given random variables Y1, Y2, . . . , Yn and a real-valued function g: Rⁿ → R, we obtain a new random variable

U = g(Y1, Y2, . . . , Yn).

A very important random variable of this type is the sample mean variable

U = (Y1 + Y2 + · · · + Yn)/n = (1/n) ∑_{i=1}^{n} Yi.

Sometimes, we are really interested in the actual distribution of U instead of only knowing the expected value E(U) and the variance V(U). There are typically three standard methods.

6.1 The Method of Distribution Functions

We compute the cumulative distribution function of U by

F_U(u) = P(U ≤ u) = P(g(Y1, Y2, . . . , Yn) ≤ u) =: P(A),

where A is the event given by

A = {(y1, y2, . . . , yn) : g(y1, y2, . . . , yn) ≤ u} ⊂ Rⁿ.

If Y1, Y2, . . . , Yn have joint density function f(y1, y2, . . . , yn), then

F_U(u) = P(A) = ∫ · · · ∫_A f(y1, . . . , yn) dy1 · · · dyn,

and the density function of U is given by f_U(u) = F′_U(u).

Example 6.1 Suppose that Y has density function

f(y) = 2y for 0 ≤ y ≤ 1, and f(y) = 0 elsewhere.

Find the probability density function of U = e^Y.

Solution: The cdf of U = e^Y is given by

F_U(u) = P(U ≤ u) = P(e^Y ≤ u) = P(Y ≤ ln u) = ∫_{(−∞, ln u]} f(y) dy = ∫_{(−∞, ln u] ∩ [0, 1]} 2y dy.


Since

(−∞, ln u] ∩ [0, 1] = ∅ for u < 1, [0, ln u] for 1 ≤ u ≤ e, and [0, 1] for u > e,

we get

F_U(u) = 0 for u < 1; ∫_0^{ln u} 2x dx = (ln u)² for 1 ≤ u ≤ e; ∫_0^1 2x dx = 1 for u > e.

Thus,

f_U(u) = F′_U(u) = (2 ln u)/u for 1 ≤ u ≤ e, and 0 otherwise.

Exercise 6.2 The joint density function of X and Y is given by

f(x, y) = e^{−x} for 0 ≤ y ≤ x, and f(x, y) = 0 elsewhere.

Find the pdf of U = 2X − Y.

6.2 The Method of Transformations

Let Y be a random variable and U = g(Y). Assume that g: R → R is invertible, that is, g is strictly increasing or strictly decreasing. In this case the inverse h = g⁻¹ exists, and we set Y = h(U).

Theorem 6.3 f_U(u) = f_Y(h(u)) |h′(u)|.

In Example 6.1, since U = e^Y, we have Y = ln U, that is, h(u) = ln u, and hence h′(u) = 1/u. By Theorem 6.3,

f_U(u) = f_Y(h(u)) |h′(u)| = f_Y(ln u) · |1/u| = (2 ln u)/|u| if 0 ≤ ln u ≤ 1, and 0 otherwise


= (2 ln u)/u for 1 ≤ u ≤ e, and 0 otherwise,

in agreement with the earlier computation.

Suppose now that X and Y have joint pdf f_{(X,Y)}(x, y), and U = g(X, Y). Assume that for any fixed Y = y (viewed as a parameter), the function u = g(x, y), regarded as a function of the variable x, is invertible, with inverse x = h(u, y). Then the bivariate pdf of (U, Y) is given by the following.

Theorem 6.4 f_{(U,Y)}(u, y) = f_{(X,Y)}(h(u, y), y) |∂h(u, y)/∂u|.

Thus, we can further obtain the pdf of U by

f_U(u) = ∫_{−∞}^{∞} f_{(U,Y)}(u, y) dy.

In Exercise 6.2, U = 2X − Y, that is, u = g(x, y) = 2x − y. For any fixed Y = y, this function is invertible, with inverse x = h(u, y) = (u + y)/2. Thus |∂h(u, y)/∂u| = |1/2| = 1/2. Hence the joint pdf of (U, Y) is

f_{(U,Y)}(u, y) = f_{(X,Y)}((u + y)/2, y) · (1/2) = (1/2) e^{−(u+y)/2} for 0 ≤ y ≤ (u + y)/2, i.e., for 0 ≤ y ≤ u, and 0 elsewhere,

and hence

f_U(u) = ∫_{−∞}^{∞} f_{(U,Y)}(u, y) dy = ∫_0^u (1/2) e^{−(u+y)/2} dy = e^{−u/2} − e^{−u} for u ≥ 0, and 0 elsewhere.

6.3 The Method of Moment-generating Functions

Definition 6.5 The k-th moment of a random variable Y, where k = 0, 1, 2, . . . , is defined to be

µ_k = E(Y^k),

as long as this expected value exists.

The moment-generating function (mgf) of Y is defined to be

m_Y(t) = E(e^{tY}),

as long as m_Y(t) exists.


If there is only one random variable under consideration, we simply write m(t) instead of m_Y(t). Note that automatically µ_0 = E(1) = 1 and m(0) = 1. The theorem below reveals the connection between the moments and the mgf.

Theorem 6.6 E(Y^k) = m^{(k)}(0).

Proof Recall that

e^x = ∑_{k=0}^{∞} x^k / k! for all x ∈ R,

so the mgf m(t) has the Taylor expansion

m(t) = E(e^{tY}) = E(∑_{k=0}^{∞} t^k Y^k / k!) = ∑_{k=0}^{∞} (E(Y^k)/k!) t^k = ∑_{k=0}^{∞} (µ_k/k!) t^k.

Therefore, E(Y^k) = µ_k = m^{(k)}(0).

Example 6.7 A random variable Y has pdf (probability density function)

f(y) = e^{−y} for y > 0, and f(y) = 0 elsewhere.

Compute (1) the mgf m_Y(t); (2) E(e^{−2Y}); (3) E(Y³).

Solution: (1) Near t = 0 we may assume t < 1; then

m_Y(t) = E(e^{tY}) = ∫_{−∞}^{∞} e^{ty} f(y) dy = ∫_0^∞ e^{ty} e^{−y} dy = ∫_0^∞ e^{(t−1)y} dy = (1/(t − 1)) e^{(t−1)y}|_0^∞ = 1/(1 − t).

(2) E(e^{−2Y}) = m_Y(−2) = 1/3.

(3) Take derivatives of m_Y(t):

m′_Y(t) = (1 − t)^{−2}, m″_Y(t) = 2(1 − t)^{−3}, m‴_Y(t) = 6(1 − t)^{−4}.

So E(Y) = m′_Y(0) = 1, E(Y²) = m″_Y(0) = 2, hence V(Y) = E(Y²) − [E(Y)]² = 2 − 1² = 1, and E(Y³) = m‴_Y(0) = 6.
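Both answers can be checked by direct numerical integration in R (integrate is a standard routine; an illustrative sketch):

f <- function(y) exp(-y)                                   # the density on (0, Inf)
integrate(function(y) exp(-2 * y) * f(y), 0, Inf)$value    # E(e^{-2Y}) = 1/3
integrate(function(y) y^3 * f(y), 0, Inf)$value            # E(Y^3) = 6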

Now we are going to use the above theorem to compute the expectation and variance of the previously introduced well-known distributions. Here is a summary of moment-generating functions (mgf) for well-known distributions:


Theorem 6.8

(1) Discrete distributions:

(a) Binomial B(n, p): m(t) = [pe^t + (1 − p)]ⁿ;

(b) Geometric Geo(p): m(t) = pe^t / (1 − (1 − p)e^t);

(c) Poisson Poisson(λ): m(t) = exp[λ(e^t − 1)].

(2) Continuous distributions:

(d) Uniform U(θ1, θ2): m(t) = (e^{tθ2} − e^{tθ1}) / (t(θ2 − θ1));

(e) Normal N(µ, σ²): m(t) = exp(µt + σ²t²/2);

(f) Gamma Gamma(α, β): m(t) = (1 − βt)^{−α}. In particular,

(i) exponential exp(β) = Gamma(1, β) has m(t) = (1 − βt)^{−1};

(ii) chi-square χ²(ν) = Gamma(ν/2, 2) has m(t) = (1 − 2t)^{−ν/2}.

Proof For the discrete distributions:

(a) If Y ∼ B(n, p), then p(y) = \binom{n}{y} p^y (1 − p)^{n−y} for y = 0, 1, 2, . . . , n, and

m(t) = E(e^{tY}) = ∑_{y=0}^{n} e^{ty} p(y) = ∑_{y=0}^{n} \binom{n}{y} (e^t p)^y (1 − p)^{n−y} = [e^t p + (1 − p)]ⁿ.

(b) If Y ∼ Geo(p), then p(y) = p(1 − p)^{y−1}, y = 1, 2, . . . , and

m(t) = E(e^{tY}) = ∑_{y=1}^{∞} e^{ty} p(y) = (p/(1 − p)) ∑_{y=1}^{∞} [e^t(1 − p)]^y = (p/(1 − p)) · e^t(1 − p)/(1 − e^t(1 − p)) = pe^t / (1 − e^t(1 − p)).

(c) If Y ∼ Poisson(λ), then p(y) = (λ^y/y!) e^{−λ}, y = 0, 1, 2, . . . , and

m(t) = E(e^{tY}) = ∑_{y=0}^{∞} e^{ty} p(y) = e^{−λ} ∑_{y=0}^{∞} (e^t λ)^y / y! = e^{−λ} e^{e^t λ} = e^{λ(e^t − 1)}.

For the continuous distributions:


(d) If Y ∼ U(θ1, θ2), then f(y) = 1/(θ2 − θ1) for θ1 ≤ y ≤ θ2 and 0 elsewhere, so for t ≠ 0,

m(t) = E(e^{tY}) = ∫_{−∞}^{∞} e^{ty} f(y) dy = ∫_{θ1}^{θ2} e^{ty}/(θ2 − θ1) dy = e^{ty}/(t(θ2 − θ1))|_{θ1}^{θ2} = (e^{tθ2} − e^{tθ1}) / (t(θ2 − θ1)).

(e) If Y ∼ N(µ, σ²), then f(y) = 1/(σ√(2π)) e^{−(y−µ)²/(2σ²)}, and

m(t) = E(e^{tY}) = ∫_{−∞}^{∞} e^{ty} f(y) dy = ∫_{−∞}^{∞} e^{ty} · 1/(σ√(2π)) e^{−(y−µ)²/(2σ²)} dy

(substituting z = (y − µ)/σ)

= ∫_{−∞}^{∞} e^{t(µ+σz)} · 1/√(2π) e^{−z²/2} dz
= e^{µt} ∫_{−∞}^{∞} 1/√(2π) e^{−z²/2 + σtz} dz
= e^{µt} ∫_{−∞}^{∞} 1/√(2π) e^{−(z−σt)²/2 + σ²t²/2} dz

(substituting w = z − σt)

= e^{µt} e^{σ²t²/2} ∫_{−∞}^{∞} 1/√(2π) e^{−w²/2} dw = exp(µt + σ²t²/2).

(f) Gamma(α, β) is omitted.

Remark 6.9 With the formulas for the mgf, we can compute expected values and variances for well-known distributions.

(1) If Y ∼ B(n, p), then m(t) = [e^t p + (1 − p)]ⁿ, so

m′(t) = n[e^t p + (1 − p)]^{n−1} e^t p ⟹ E(Y) = m′(0) = np,

m″(t) = n(n − 1)[e^t p + (1 − p)]^{n−2} (e^t p)² + n[e^t p + (1 − p)]^{n−1} e^t p

⟹ E(Y²) = m″(0) = n(n − 1)p² + np,

⟹ V(Y) = E(Y²) − (E(Y))² = n(n − 1)p² + np − (np)² = np − np² = np(1 − p).


(2) If Y ∼ Geo(p), then m(t) = pe^t / (1 − e^t(1 − p)) and

m′(t) = (pe^t[1 − e^t(1 − p)] − pe^t[−e^t(1 − p)]) / [1 − e^t(1 − p)]² = pe^t / [1 − e^t(1 − p)]²,

m″(t) = (pe^t[1 − e^t(1 − p)]² − pe^t · 2[1 − e^t(1 − p)](−e^t)(1 − p)) / [1 − e^t(1 − p)]⁴ = (pe^t + pe^{2t}(1 − p)) / [1 − e^t(1 − p)]³.

Thus,

E(Y) = m′(0) = 1/p, E(Y²) = m″(0) = (2 − p)/p², V(Y) = E(Y²) − (E(Y))² = (1 − p)/p².

(3) If Y ∼ Poisson(λ), then m(t) = e^{λ(e^t − 1)} and

m′(t) = e^{λ(e^t − 1)} λe^t ⟹ E(Y) = m′(0) = λ,

m″(t) = e^{λ(e^t − 1)} (λe^t)² + e^{λ(e^t − 1)} λe^t = e^{λ(e^t − 1)} [λ²e^{2t} + λe^t]

⟹ E(Y²) = m″(0) = λ² + λ, V(Y) = E(Y²) − (E(Y))² = λ² + λ − λ² = λ.

(4) If Y ∼ N(µ, σ²), then m(t) = exp(µt + σ²t²/2), and

m′(t) = exp(µt + σ²t²/2) · (µ + σ²t) = m(t)(µ + σ²t),

m″(t) = m′(t)(µ + σ²t) + m(t) · σ².

Then E(Y) = m′(0) = m(0) · µ = µ and

E(Y²) = m″(0) = m′(0) · µ + m(0) · σ² = µ² + σ²,

so V(Y) = E(Y²) − [E(Y)]² = µ² + σ² − µ² = σ².

Theorem 6.10 (Uniqueness Theorem)

X and Y have the same distribution ⟺ m_X(t) = m_Y(t).

Example 6.11 Find the corresponding distributions for the following mgf's:

(1) m(t) = ((e^t + 1)/2)^{10};

(2) m(t) = e^t / (3 − 2e^t);

(3) m(t) = e^{−1 + (1−t)²};

(4) m(t) = 1/(1 − 3t)²;

(5) m(t) = e^{e^t − 1};


(6) m(t) = (e^{2t} − e^t)/t.

Solutions: Rewrite these functions to match the mgf's of well-known distributions.

(1) m(t) = ((1/2)e^t + (1 − 1/2))^{10} — B(10, 1/2);

(2) m(t) = (1/3)e^t / (1 − (1 − 1/3)e^t) — Geo(1/3);

(3) m(t) = exp(−2t + (1/2)(√2)²t²) — N(−2, (√2)²);

(4) m(t) = (1 − 3t)^{−2} — Gamma(2, 3);

(5) m(t) = exp(1 · (e^t − 1)) — Poisson(1);

(6) m(t) = (e^{2t} − e^{1·t}) / (t(2 − 1)) — U(1, 2).

Using the uniqueness theorem, we can show that certain distributions are stable under scaling and shifting.

Theorem 6.12 If Y = aX + b, then m_Y(t) = e^{bt} m_X(at).

Proof m_Y(t) = E(e^{tY}) = E(e^{t(aX+b)}) = E(e^{bt} e^{(at)X}) = e^{bt} E(e^{(at)X}) = e^{bt} m_X(at).

Example 6.13 If X follows a normal distribution, i.e., X ∼ N(µ, σ²), then Y = aX + b also follows a normal distribution; in fact, Y ∼ N(aµ + b, (aσ)²), since

m_Y(t) = e^{bt} m_X(at) = e^{bt} e^{µat + σ²(at)²/2} = e^{(aµ+b)t + (aσ)²t²/2}.

The method of moment-generating functions is also effective in dealing with a random variable generated as a sum of many independent random variables.

Theorem 6.14 If Y1, Y2, . . . , Yn are independent random variables and U = Y1 + Y2 + · · · + Yn, then

m_U(t) = m_{Y1}(t) × · · · × m_{Yn}(t) = ∏_{i=1}^{n} m_{Yi}(t).

Proof By independence,

m_U(t) = E(e^{tU}) = E(e^{t(Y1 + · · · + Yn)}) = E(e^{tY1}) · · · E(e^{tYn}) = ∏_{i=1}^{n} m_{Yi}(t).


Example 6.15 Let Y1, . . . , Yn be independent random variables such that

(1) each Yi ∼ B(1, p); then U = Y1 + · · · + Yn ∼ B(n, p), since

m_U(t) = m_{Y1}(t) × · · · × m_{Yn}(t) = [pe^t + (1 − p)]ⁿ.

(2) each Yi ∼ N(µi, σi²); then U = ∑_{i=1}^{n} ai Yi ∼ N(µ, σ²) with

µ = ∑_{i=1}^{n} ai µi and σ² = ∑_{i=1}^{n} ai² σi²,

since

m_U(t) = ∏_{i=1}^{n} m_{ai Yi}(t) = ∏_{i=1}^{n} m_{Yi}(ai t) = ∏_{i=1}^{n} exp{µi(ai t) + (1/2)σi²(ai t)²}
       = exp{∑_{i=1}^{n} [(µi ai)t + (1/2)(σi² ai²)t²]} = exp(µt + σ²t²/2).

7 Sampling Distributions

Definition 7.1 A random sample is a collection of random variables Y1, Y2, . . . , Yn that are independent and identically distributed (iid).

From now on, we shall only consider iid samples and study the sample mean variable

Ȳn = (Y1 + · · · + Yn)/n = ∑_{i=1}^{n} (1/n) Yi.

(1) If µ = E(Yi) and σ² = V(Yi), then E(Ȳn) = µ and V(Ȳn) = σ²/n, since

E(Ȳn) = ∑_{i=1}^{n} (1/n) E(Yi) = µ, by the linearity of E(·);

V(Ȳn) = ∑_{i=1}^{n} (1/n)² V(Yi) = σ²/n, since the Yi are independent (Theorem 5.28).

(2) The mgf of Ȳn is given by

m_{Ȳn}(t) = E(e^{t(Y1 + · · · + Yn)/n}) = E(e^{(t/n)Y1}) · · · E(e^{(t/n)Yn}) = [m_{Y1}(t/n)]ⁿ.


7.1 Normally Distributed Sample Mean

Theorem 7.2 If Y1, Y2, . . . , Yn are i.i.d. and each Yi ∼ N(µ, σ²), then Ȳn ∼ N(µ, σn²), where σn = σ/√n.

Proof

m_{Ȳn}(t) = [m_{Y1}(t/n)]ⁿ = [exp(µ(t/n) + (1/2)σ²(t/n)²)]ⁿ = exp(µt + (1/2)(σ²/n)t²),

which, by the uniqueness theorem, implies that Ȳn ∼ N(µ, σn²).

This theorem shows that the sample mean of iid normal samples is still normal, with the same expected value, while the standard deviation is divided by √n.

Example 7.3 Independent random samples of n = 100 observations each are drawn from two normal populations. The parameters of these populations are:

• Population 1: µ1 = 300 and σ1 = 60;

• Population 2: µ2 = 290 and σ2 = 80.

Find

(1) the probability that the sample mean of Population 1 is between 294 and 306;

(2) the probability that the sample mean of Population 1 is greater than the sample mean of Population 2 by more than 20.

Solution. (1) Let {Xi}_{1 ≤ i ≤ n} be the sample from Population 1; then

X̄n ∼ N(µ1, σ1²/n) = N(300, 60²/100) = N(300, 6²).

So the probability that the sample mean of Population 1 is between 294 and 306 is

P(294 ≤ X̄n ≤ 306) = P(300 − 6 ≤ X̄n ≤ 300 + 6) ≈ 68%,

by the 68-95-99.7 rule of the normal distribution.

(2) Let {Yi}_{1 ≤ i ≤ n} be the sample from Population 2; then

Ȳn ∼ N(µ2, σ2²/n) = N(290, 80²/100) = N(290, 8²),

and hence

X̄n − Ȳn ∼ N(µ1 − µ2, 1² · σ1²/n + (−1)² · σ2²/n) = N(300 − 290, 6² + 8²) = N(10, 10²).

Therefore,

P(X̄n − Ȳn ≥ 20) = ∫_{20}^{∞} 1/(10√(2π)) e^{−(z−10)²/200} dz ≈ 16%.


Example 7.4 A bottling machine can be regulated so that it discharges an average of µ ounces per bottle. It has been observed that the amount of fill dispensed by the machine is normally distributed with σ = 1.0 ounce. A sample of n = 9 filled bottles is randomly selected from the output of the machine on a given day (all bottled with the same machine setting), and the ounces of fill are measured for each.

(1) Find the probability that the sample mean will be within 1/3 ounce of the true mean µ for the chosen machine setting.

(2) How many observations should be included in the sample if we want the sample mean to be within 1/3 ounce of µ with probability 95%? (Hint: if Y ∼ N(µ, σ²), then P(|Y − µ| ≤ 2σ) ≈ 95%.)

Solution. (1) We have normally distributed random samples Y1, . . . , Yn of size n = 9, with unknown mean µ and variance σ² = 1². Then

Ȳ9 ∼ N(µ, σ²/n) = N(µ, 1²/9) = N(µ, (1/3)²).

Then the target probability is

P(|Ȳ9 − µ| ≤ 1/3) ≈ 68%.

(2) Note that for arbitrary n, the standard deviation of Ȳn is σ/√n = 1/√n. To require that

P(|Ȳn − µ| ≤ 1/3) ≈ 95%,

we need 1/3 = 2 · 1/√n, that is, n = 36 at least.

7.2 The Central Limit Theorem

In general, even if we do not have any information on the distribution of the Yi's but only the expected value µ and variance σ², we still have the following.

Theorem 7.5 (The Central Limit Theorem (CLT)) Let Y1, . . . , Yn be i.i.d. random variables with µ = E(Y1) and V(Y1) = σ². Then

Un := (Ȳn − µ)/(σ/√n) ⇒ N(0, 1²), as n → ∞.

Here ⇒ means that the distribution of Un converges to the standard normal distribution.


Proof We prove the special case µ = 0 and σ = 1 (the general case can be proven in a similar fashion). By Taylor expansion and Theorem 6.6, for t close to 0,

m_{Y1}(t) = m_{Y1}(0) + m′_{Y1}(0) t + (1/2) m″_{Y1}(0) t² + O(t³) = 1 + µt + (1/2)σ²t² + O(t³) ≈ 1 + t²/2.

Note that Un = √n Ȳn in this case, and by the mgf formula for Ȳn, its mgf is given by

m_{Un}(t) = E(e^{t√n Ȳn}) = m_{Ȳn}(t√n) = [m_{Y1}(t/√n)]ⁿ ≈ [1 + t²/(2n)]ⁿ = {[1 + t²/(2n)]^{2n/t²}}^{t²/2} → e^{t²/2}, as n → ∞.

In other words, the moment generating function of Un converges to that of N(0, 1), which implies that Un ≈ N(0, 1²) for large n.

In applications of the CLT, we use the approximation

Un := (Ȳn − µ)/(σ/√n) ≈ N(0, 1²)

when n is sufficiently large.

Example 7.6 Achievement test scores of all high school seniors in a state have mean 60 and variance 64. What is the probability that a random sample of n = 100 students from one large high school has a mean score between 57.6 and 62.4?

Solution. Let Yi be the score of the i-th student; then µ = E(Yi) = 60 and σ² = V(Yi) = 64. In practice, n = 100 is large enough to apply the CLT. Then

U100 = (Ȳ100 − 60)/(√64/√100) = (5/4)(Ȳ100 − 60) ≈ N(0, 1²),

and hence

P(57.6 ≤ Ȳ100 ≤ 62.4) = P((5/4)(57.6 − 60) ≤ U100 ≤ (5/4)(62.4 − 60)) = P(−3 ≤ U100 ≤ 3) = ∫_{−3}^{3} (1/√(2π)) e^{−y²/2} dy ≈ 99.73%.

This means that the probability in question is quite high.


7.3 The Normal Approximation to the Binomial Distribution

The normal approximation to the binomial distribution is an important consequence of the Central Limit Theorem, which turns out to be very useful in real life.

Theorem 7.7 (Binomial approximation) If Y ∼ B(n, p), then

Un := (Y − np)/√(np(1 − p)) ≈ N(0, 1²).

Proof In a binomial experiment with n trials and success rate p, let Xi be the outcome of the i-th trial, so that

P(Xi = 1) = p and P(Xi = 0) = 1 − p.

Therefore, µ = E(Xi) = p and σ² = V(Xi) = p(1 − p).

The number of successes is denoted by Y, which obeys the binomial distribution Y ∼ B(n, p). Clearly,

Y = X1 + X2 + · · · + Xn = n X̄n,

where X̄n is the sample mean of X1, . . . , Xn. By the central limit theorem,

Un = (X̄n − µ)/(σ/√n) = (Y/n − p)/(√(p(1 − p))/√n) = (Y − np)/√(np(1 − p)) ≈ N(0, 1²).

This theorem means that we can approximate the binomial distribution of Y (which is discrete) by the normal distribution of Un (which is continuous) when n is large.

Example 7.8 An airline company is considering a new policy of booking as many as 400 persons on an airplane that can seat only 378. Past studies have revealed that only 90% of the booked passengers actually arrive for the flight. Find the probability that not enough seats are available if 400 persons are booked.

Remark: Let Y be the number of booked passengers who actually arrive; then Y ∼ B(400, 0.9). The probability that there are not enough seats is

P(Y > 378) = ∑_{k=379}^{400} P(Y = k) = ∑_{k=379}^{400} \binom{400}{k} 0.9^k · 0.1^{400−k}.

Calculating this sum by hand is impractical.


Solution. We use the normal approximation:

U400 = (Y − 400 · 0.9)/√(400 · 0.9 · (1 − 0.9)) = (Y − 360)/6 ≈ N(0, 1²),

so

P(Y > 378) = P(U400 > (378 − 360)/6) = P(U400 > 3) = ∫_3^∞ (1/√(2π)) e^{−y²/2} dy ≈ 0.13%.

Example 7.9 The time to process an order at the service counter of a pharmacy is exponentially distributed with mean 5 minutes. Suppose that 100 customers visit the counter in a day. Use the CLT to estimate the following.

(1) What is the probability that the total service time of the 100 customers does not exceed 10 hours?

(2) What is the probability that at least half of the 100 customers need to wait more than 3.47 minutes? (Note: 3.47 ≈ 5 ln 2.)

Solution. (1) Let Xi be the service time of the i-th customer for i = 1, . . . , 100. Note that X1, . . . , X100 are independent, and each Xi ∼ exp(5), so E(Xi) = 5 and V(Xi) = 5². Also, the total service time is given by the sum X1 + · · · + X100 = 100 X̄100. By the CLT,

U = (X̄100 − 5)/(5/√100) = 2 X̄100 − 10 ≈ N(0, 1²).

Therefore, the probability that the total service time of the 100 customers does not exceed 10 hours (= 600 min) is

P(100 X̄100 ≤ 600) = P(X̄100 ≤ 6) = P(U ≤ 2 × 6 − 10) = P(U ≤ 2) = ∫_{−∞}^{2} (1/√(2π)) e^{−u²/2} du ≈ 97.72%.

(2) Let Y be the number of customers who need to wait more than 3.47 minutes; then Y ∼ B(n, p), where n = 100 and

p = P(Xi ≥ 3.47) = ∫_{3.47}^{∞} (1/5) e^{−x/5} dx = −e^{−x/5}|_{3.47}^{∞} = e^{−3.47/5} ≈ e^{−ln 2} = 1/2.


Using the CLT for the binomial distribution, we have

U = (Y − 100 × 1/2)/√(100 × (1/2)(1 − 1/2)) = (Y − 50)/5 ≈ N(0, 1²).

Therefore, the probability that at least half of the 100 customers need to wait more than 3.47 minutes is about

P(Y ≥ 50) = P(U ≥ (50 − 50)/5) = P(U ≥ 0) = 50%.
