
  • MTH 453: Basic Random Processes

    Lecture 1: Basic Probability Review

    References:

    Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics by Anirban DasGupta.
    Probability With Applications and R by Robert Dobrow.
    Introduction to Stochastic Processes with R by Robert Dobrow.

    Outline:

    1) Experiments, Sample Spaces and Random Variables

    2) A First Look into Simulation

    3) Integer-Valued and Discrete Random Variables

    4) Continuous Random Variables

    5) Expectation, Moments and Simulation of Arbitrary Continuous RVs.

    6) Joint Distributions and Independence of Random Variables

    7) Properties of the most Common Distributions (will be taught/reviewed on the go)

    8) Normal Approximations and Central Limit Theorem (will be taught/reviewed on the go)

    Part 1. Experiments, Sample Spaces and Random Variables

    Treatment of probability theory starts with the consideration of a sample space. The sample space is the set of all possible outcomes of some experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say X, are {HH, HT, TH, TT}, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call

    Ω = {HH, HT, TH, TT}

    the sample space of the experiment X.

    In general, a sample space is a general set Ω, finite or infinite. An easy example where the sample space Ω is infinite is to toss a coin until the first time heads shows up and record the number of the trial at which the first head appeared. In this case, the sample space Ω is the countably infinite set

    Ω = {1, 2, 3, . . .}

    Sample spaces can also be uncountably infinite; for example, consider the experiment of measuring the time until a light bulb burns out. The sample space of this experiment is Ω = R+. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted by ω. The first task is to define events and to explain the meaning of the probability of an event.


  • Definition 1.1. Let Ω be the sample space of an experiment X. Then any subset A of Ω, including the empty set ∅ and the entire sample space Ω, is called an event. Events may contain even one single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events, but we want to assign probabilities in a way that is logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space. We can only define probabilities for subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space Ω is infinite, the events that we would normally want to think about will be members of such an appropriate σ-field.

    Here is a definition of what counts as a legitimate probability on events.

    Definition 1.2. Given a sample space Ω, a probability or a probability measure on Ω is a function P on subsets of Ω such that

    (a) P[A] ≥ 0 for all A ⊆ Ω. (In particular, A can be a singleton {ω}, ω ∈ Ω.)

    (b) P[Ω] = ∑_{ω∈Ω} P[ω] = 1

    (c) Given disjoint subsets A1, A2, . . . of Ω,

    P[⋃_{i=1}^{∞} Ai] = ∑_{i=1}^{∞} P[Ai]

    In particular, considering an event A and all the possible elements in A,

    P[A] = ∑_{ω∈A} P[ω]

    You may not be familiar with some of the notation in this definition. The symbol ∈ means “is an element of”. So ω ∈ Ω means ω is an element of Ω. We are also using a generalized Σ-notation. The notation ∑_{ω∈Ω} means that the sum is over all ω that are elements of the sample space, that is, all outcomes in the sample space. In the case of a finite sample space Ω = {ω1, . . . , ωk}, the equation in (b) becomes

    P[Ω] = ∑_{ω∈Ω} P[ω] = ∑_{n=1}^{k} P[ωn] = P[ω1] + P[ω2] + . . . + P[ωk] = 1

    Definition 1.3. Let Ω be a finite sample space consisting of N sample points. We say that the sample points are equally likely, or that the probability distribution is uniform, if P(ω) = 1/N for each sample point ω.

    An immediate consequence, due to the additivity axiom, is the following useful formula.


  • Proposition 1.4. Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event and suppose A contains n distinct sample points. Then

    P(A) = (Number of sample points favorable to A) / (Total number of sample points) = n/N

    Example 1.5. Roll a pair of dice. Find the sample space, identify the event that the sum of the two dice is equal to 7, and compute its probability.

    Solution: The random experiment is rolling two dice. Keeping track of the roll of each die gives the sample space

    Ω = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), . . . , (6, 5), (6, 6)}.

    The event is A = {Sum is 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}, and its probability, using Proposition 1.4, is 6/36 = 1/6.
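    The count in this example can be verified by enumerating the 36 ordered outcomes in R (an illustrative sketch, not part of the original text; the variable names are my own):

```r
# All 36 equally likely ordered rolls of two dice.
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
A <- rolls$die1 + rolls$die2 == 7   # indicator of the event {sum is 7}
sum(A) / nrow(rolls)                # 6/36 = 1/6
```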

    Homework Problem 1. Yolanda and Zach are running for president of the student association. Ten students will be voting. Identify

    (i) the sample space and

    (ii) the event that Yolanda beats Zach by at least 7 votes.

    Homework Problem 2. Joe will continue to flip a coin until heads appears. Identify the sample space and the event that it will take Joe at least three coin flips to get a head.

    Example 1.6. A college has six majors: biology, geology, physics, dance, art, and music. The numbers of students taking these majors are 45, 30, 15, 10, 10, and 35, respectively. Choose a random student. What is the probability they are a science major?

    Solution: The random experiment is choosing a student. The sample space is

    Ω = {Bio,Geo, Phy,Dan,Art,Mus}.

    The probability function is given by the number of students in the major divided by the total number of students (45 + 30 + 15 + 10 + 10 + 35 = 145). That is,

    P[Bio] = 45/145 ≈ 0.31, P[Geo] = 30/145 ≈ 0.21, P[Phy] = 15/145 ≈ 0.10,

    P[Dan] = 10/145 ≈ 0.07, P[Art] = 10/145 ≈ 0.07, P[Mus] = 35/145 ≈ 0.24

    The event in question is

    A = {Science major} = {Bio, Geo, Phy} = {Bio} ∪ {Geo} ∪ {Phy}

    where the last equality holds because the events are disjoint (we are not considering “double majors”). Finally,

    P[A] = P[{Bio,Geo, Phy}] = P[{Bio}∪{Geo}∪{Phy}] = P[Bio]+P[Geo]+P[Phy] ≈ 0.31+0.21+0.1 = 0.62.
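    The arithmetic of Example 1.6 is easy to reproduce in R (a short illustrative sketch; the vector name and labels are my own):

```r
# Students per major; probabilities are counts over the total (145).
counts <- c(Bio = 45, Geo = 30, Phy = 15, Dan = 10, Art = 10, Mus = 35)
probs  <- counts / sum(counts)
round(probs, 2)                       # matches the values above
sum(probs[c("Bio", "Geo", "Phy")])    # P[Science major] = 90/145 ≈ 0.62
```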

    Homework Problem 3. In three coin tosses, what is the probability of getting at least two tails?


  • In order to compute some probabilities, we also need to know how probabilities behave or compare when two events are related in certain ways. The following proposition explains this more clearly:

    Proposition 1.7 (Properties of Probabilities). 1. If A implies B, that is, if A ⊆ B, then P[A] ≤ P[B]. Ex: A = the roll of a die is 2, B = the roll of a die is even.

    2. P[A does not occur] = P[Ac] = 1 − P[A]. Ex: A = the roll of a die is 2 or 4, Ac = the roll of a die is 1, 3, 5 or 6.

    3. For any events A and B,

    P[A or B] = P (A ∪B) = P (A) + P (B)− P (A ∩B).

    Ex: P[the roll is 2 or 3, or the roll is 2 or 6]
    = P[the roll is 2 or 3] + P[the roll is 2 or 6] − P[the roll is 2]
    = 1/3 + 1/3 − 1/6 = 1/2
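    The die computation in item 3 can be double-checked in R with set operations (an illustrative sketch, not from the original text):

```r
# A = {roll is 2 or 3}, B = {roll is 2 or 6}, on a fair six-sided die.
A <- c(2, 3); B <- c(2, 6)
length(union(A, B)) / 6                                  # P[A ∪ B] directly
length(A)/6 + length(B)/6 - length(intersect(A, B))/6    # by inclusion-exclusion
```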

    Homework Problem 4. In a city, 75% of the population have brown hair, 40% have brown eyes, and 25% have both brown hair and brown eyes. A person is chosen at random. What is the probability that they

    1. have brown eyes or brown hair?

    2. have neither brown eyes nor brown hair?

    Often the outcomes of a random experiment take on numerical values. For instance, we might be interested in how many heads occur in three coin tosses. Let X be the number of heads. Then X is equal to 0, 1, 2, or 3, depending on the outcome of the coin tosses. The object X is called a random variable. The possible values of X are 0, 1, 2, and 3.

    Definition 1.8. A random variable X is a (measurable) function from the sample space Ω to the real numbers R.

    You can simply think of a random variable as assigning numerical values to the outcomes of a random experiment.

    Random variables are enormously useful and allow us to use algebraic expressions, equalities, and inequalities when manipulating events. In many of the previous examples, we have been working with random variables without using the name, for example, the number of threes in rolls of a die, the number of votes received, the number of heads in repeated coin tosses, etc.

    Example 1.9. If we throw two dice, what is the probability that the sum of the dice is greater than or equal to four?

    Solution: We can, of course, find the probability by direct counting, but we will use random variables. Let Y be the sum of two dice rolls. Then Y is a random variable whose possible values are 2, 3, . . . , 12. The event that the sum is greater than or equal to 4 can be written as {Y ≥ 4}. Observe that the complementary event is {Y ≤ 3}, with

    P[Y ≤ 3] = P[Y = 2] + P[Y = 3] = P[{(1, 1)}] + P[{(1, 2), (2, 1)}] = 1/36 + 2/36 = 3/36 = 1/12

    By taking complements,

    P[Y ≥ 4] = 1 − P[Y ≤ 3] = 1 − 1/12 = 11/12


  • While writing my book [Stochastic Processes] I had an argument with Feller. He asserted that everyone said “random variable” and I asserted that everyone said “chance variable”. We obviously had to use the same name in our books, so we decided the issue by a [random] procedure. That is, we tossed for it and he won.

    -Joe Doob, quoted in Statistical Science

    Random variables are central objects in probability. The name can be confusing, since they are really neither “random” nor “variables” (like x in f(x)). A random variable is actually a function, a function whose domain is the sample space.

    A random variable assigns every outcome of the sample space a real number. Consider the three coins example, letting X be the number of heads in three coin tosses. Depending upon the outcome of the experiment, X takes on different values. To emphasize the dependency of X on the outcome ω, we can write X(ω), rather than just X. In particular,

    X(ω) =
      0, if ω = TTT
      1, if ω = HTT, THT or TTH
      2, if ω = HHT, HTH or THH
      3, if ω = HHH

    The probability of getting exactly two heads is written as P[X = 2], which is shorthand for P[{ω : X(ω) = 2}]. You may be unfamiliar with this last notation, used for describing sets. The notation {ω : Property} describes the set of all ω that satisfy some property. So {ω : X(ω) = 2} is the set of all ω with the property that X(ω) = 2. That is, the set of all outcomes that result in exactly two heads, which is {HHT, HTH, THH}. Similarly, the probability of getting at most one head in three coin tosses is P[X ≤ 1] = P[{ω : X(ω) ≤ 1}] = P[{TTT, HTT, THT, TTH}]. Because of simplicity and ease of notation, we typically use the shorthand X in writing random variables instead of the more verbose X(ω).
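    These three-coin probabilities can be checked by brute-force enumeration in R (an illustrative sketch, not from the original text; the variable names are my own):

```r
# Enumerate all 8 outcomes of three coin tosses (1 = heads, 0 = tails).
omega <- expand.grid(toss1 = 0:1, toss2 = 0:1, toss3 = 0:1)
X <- rowSums(omega)          # X(omega) = number of heads in the outcome
sum(X == 2) / nrow(omega)    # P[X = 2] = 3/8
sum(X <= 1) / nrow(omega)    # P[X <= 1] = 4/8
```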

    Part 2. A First Look into Simulation

    At this point it is important to review Appendix A from the textbook (Section 6 can be omitted for now). It will be handed out in class.

    Using random numbers on a computer to simulate probabilities is called the Monte Carlo method. Today, Monte Carlo tools are used extensively in statistics, physics, engineering, and across many disciplines. The name was coined in the 1940s by mathematicians John von Neumann and Stanislaw Ulam, working on the Manhattan Project. It was named after the famous Monte Carlo casino in Monaco.


  • The first thoughts and attempts I made to practice [the Monte Carlo method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than “abstract thinking” might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers, and I immediately thought of problems of neutron diffusion and other questions of mathematical physics, and more generally how to change processes described by certain differential equations into an equivalent form interpretable as a succession of random operations. Later [in 1946], I described the idea to John von Neumann, and we began to plan actual calculations.

    -Stanislaw Ulam, quoted in Eckhardt (1987)

    The Monte Carlo simulation approach is based on the relative frequency model for probabilities. Given a random experiment and some event A, the probability P(A) is estimated by repeating the random experiment many times and computing the proportion of times that A occurs. More formally, define a sequence X1, X2, . . ., where

    Xk =
      1, if A occurs on the k-th trial
      0, if A does not occur on the k-th trial,

    for k = 1, 2, . . .. Then

    (X1 + . . . + Xn) / n

    is the proportion of times that A occurs in n trials. For large n, the Monte Carlo method estimates P[A] (by using the LAW OF LARGE NUMBERS) as

    P[A] ≈ (X1 + . . . + Xn) / n
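    As a concrete sketch of this recipe, here is a Monte Carlo estimate of P[sum of two dice is 7] from Example 1.5, using the sample function described below; the trial count 100,000 is an arbitrary choice:

```r
# Each entry of xs is X_k, the indicator that the sum is 7 on trial k.
n  <- 100000
xs <- replicate(n, sum(sample(1:6, 2, replace = TRUE)) == 7)
mean(xs)   # Monte Carlo estimate of P[A]; the exact value is 1/6
```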

    Definition 1.10. Simulation is a method that uses an artificial process (programmed on the computer) to represent the outcomes of a real process (like tossing a coin), and thereby provides information about the probability of events. We use simulation to estimate probabilities in real problems and situations for which empirical or theoretical calculations are not easily available.

    In order to simulate random variables taking values on the integers we will use the function sample in R.

    > sample(a:b,n,replace=T)

    The command samples with replacement from {a, a + 1, a + 2, . . . , b − 1, b} n times, such that outcomes are equally likely (this can be modified by specifying the probabilities of each element in a vector of b − a + 1 elements). For a coin toss, we can sample using

    > sample(0:1,n,replace=T)

    Let 0 represent tails and 1 represent heads. The output is a sequence of n ones and zeros corresponding to heads and tails. The average, or mean, of the list is precisely the proportion of ones. To simulate P(Heads), type

    > mean(sample(0:1,n,replace=T))


  • Computer Exercise: Repeat the command several times with one million trials (use the up arrow key). These give repeated Monte Carlo estimates of the desired probability. Are the answers the same each time?

    Example 1.11. Obtain, by simulation, an approximate probability of getting three heads in three coin tosses. Can you increase the precision of your results?

    Solution:

    # Trial
    > trial <- sample(0:1, 3, replace=T)
    > if (sum(trial)==3) 1 else 0

    # Replication
    > n <- 10000
    > simlist <- numeric(n)
    > for (i in 1:n) {
        trial <- sample(0:1, 3, replace=T)
        simlist[i] <- if (sum(trial)==3) 1 else 0 }
    > mean(simlist)

  • Part 3. Integer-Valued and Discrete Random Variables

    Definition 1.12. A random variable X is called discrete if it can only take countably many values.

    Definition 1.13. For a random variable X that takes values in a countable set S, the probability mass function of X is the function

    fX(x) = P[X = x], for x ∈ S

    Probability mass functions are the central objects in discrete probability that allow one to compute probabilities. If we know the pmf of a random variable, we have, in a probabilistic sense, “complete knowledge” of the behavior of that random variable.

    The most common discrete distributions are the Bernoulli, the binomial, the discrete uniform, and the Poisson; R handles them as follows:

    R: Working with Probability Distributions

    R has several commands for working with probability distributions like the binomial distribution. These commands are prefixed with d, p, and r. They take a suffix that describes the distribution. The Bernoulli distribution is a particular case of the binomial with n = 1; the binomial suffix is “binom”, while the discrete uniform can be obtained with sample, and the Poisson distribution has suffix “pois”.

    For the binomial distribution, these commands are:

    • dbinom(k,n,p): Gives the value P[X = k], where X ∼binom(n, p).

    • pbinom(k,n,p): Gives the value P[X ≤ k], where X ∼binom(n, p).

    • rbinom(m,n,p): Simulates m Binomial(n, p) random variables.

    For the Poisson distribution, these commands are, analogously, dpois(x,lambda), ppois(x,lambda) and rpois(n,lambda).

    Example 1.14. Using R, compute exactly and approximate via simulation P[X > 5], where X ∼ Binom(100, 0.1).


  • Solution: For the exact probability, we use that P[X > 5] = 1 − P[X ≤ 5]. Thus the R command is

    > 1-pbinom(5,100,0.1)

    To simulate the probability P[X > 5] based on 10,000 repetitions, we do

    > n <- 10000
    > simlist <- rbinom(n, 100, 0.1)
    > sum(simlist > 5)/n

    Homework Problem 11. Every person in a group of 1000 people has a 1% chance of being infected by a virus. The process of being infected is independent from person to person. Using random variables, write expressions for the following probabilities and solve them with R.

    (a) The probability that exactly 10 people are infected.

    (b) The probability that at least 16 people are infected.

    (c) The probability that between 12 and 14 people are infected.

    (d) The probability that someone is infected.

    Homework Problem 12. In 1693, Samuel Pepys wrote a letter to Isaac Newton posing the following question. Which of the following three occurrences has the greatest chance of success?

    1. Six fair dice are tossed and at least one 6 appears.

    2. Twelve fair dice are tossed and at least two 6’s appear.

    3. Eighteen fair dice are tossed and at least three 6’s appear.

    Using R, answer Mr. Pepys’ question.

    Homework Problem 13. Suppose X has a Poisson distribution and P(X = 2) = 2P(X = 1). Find P(X = 3).

    Homework Problem 14. Poisson approximation of the binomial: Suppose X ∼ Binom(n, p). Write an R function compare(n,p,k) that computes (exactly) P[X = k] − P[Y = k], where Y ∼ Pois(np). The following is a known fact:

    Let X ∼ Binom(n, p) and Y ∼ Pois(λ). If n → ∞ and p → 0 in such a way that np → λ > 0, then for all k, P[X = k] → P[Y = k]. Thus, the Poisson distribution with parameter λ = np serves as a good approximation for the binomial distribution when n is large and p is small.

    Try your function on 6 numbers where you expect the Poisson probability to be a good approximation ofthe binomial. Also try it on 6 numbers where you expect the approximation to be poor.


  • Part 4. Continuous Random Variables

    Shooting an arrow at a target and picking a real number between 0 and 1 are examples of random experiments where the sample space is a continuum of values. Such sets have no gaps between elements; they are not discrete. The elements are uncountable and cannot be listed. We call such sample spaces continuous. The most common continuous sample spaces in one dimension are intervals such as (a, b), (−∞, c], [a,∞) and (−∞,∞).

    An example of a problem involving continuous random variables is the following: an archer shoots an arrow at a target. The target C is a circle of radius 1. The bullseye B is a smaller circle in the center of the target of radius 1/4. What is the probability P(B) of hitting the bullseye?

    Working with random variables in the continuous world requires some new mathematical tools, which are introduced as follows:

    Definition 1.15. A continuous random variable X is a random variable that takes values in a continuous set. If S is an uncountable subset of the real numbers, then {X ∈ S} is the event that X takes values in S. For instance, {X ∈ (a, b)} = {a < X < b}, and {X ∈ (−∞, c]} = {X ≤ c}.

    In the discrete setting, to compute P[X ∈ S] we add up values of the probability mass function. That is, P[X ∈ S] = ∑_{x∈S} P[X = x]. If X, however, is a continuous random variable, how could we sum over all the points of an uncountable set?

    In order to compute P[X ∈ S] we integrate the probability density function (pdf) over S. The probability density function plays the role of the pmf: it is the function used to compute probabilities. However, if they play the same role, why do they have different names? The full answer to this question is very deep and requires “measure theory”, but roughly speaking, the difference is that the pmf at a point x is the probability that the random variable assumes the value x, while the pdf at a point x is not! This is because the probability of a continuous random variable being equal to exactly one value is always 0! (Think about how you could choose one point in an uncountable set; it seems impossible, since you cannot even assign an ordering to it!) A more mathematical (but still not 100% correct) definition of the pdf is the following:

    Definition 1.16. Let X be a continuous random variable. A function f is a probability density function of X if

    1. f(x) ≥ 0, for all x ∈ R.

    2. ∫_{−∞}^{∞} f(x) dx = 1.

    3. For S ⊆ R, P[X ∈ S] = ∫_S f(x) dx


  • Taking this definition, notice what happens if we allow the set S in Definition 1.16 to be a singleton {a} (or a point a). Then

    P[X ∈ {a}] = P[X = a] = ∫_{a}^{a} f(x) dx = 0

    because the integral of any function over an interval of length 0 is 0. Thus, as explained before, for a continuous random variable, the probability of the random variable being exactly equal to a point is 0. However, to understand a little better what the pdf is telling us, we now take the set S in Definition 1.16 to be the interval (a − h, a + h), and what we get is that

    P[X ∈ (a − h, a + h)] = ∫_{a−h}^{a+h} f(x) dx

    Dividing both sides by 2h and taking the limit as h → 0, we get

    lim_{h→0} P[X ∈ (a − h, a + h)] / (2h) = lim_{h→0} (1/(2h)) ∫_{a−h}^{a+h} f(x) dx = f(a)

    Why is the last equality true? (Hint: the Fundamental Theorem of Calculus! In general, the result is called the Lebesgue Differentiation Theorem.)

    In other words, using differential notation, for very small ∆ the pdf of the random variable X tells us that

    f(a)∆ ≈ P[X ∈ (a − ∆/2, a + ∆/2)]   or   f(a)∆ ≈ P[X ∈ (a, a + ∆)]
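    This approximation is easy to check numerically in R, using the normal density and cdf commands that appear later in these notes (an illustrative sketch; the point a = 0.5 and ∆ = 0.001 are arbitrary choices):

```r
# f(a)*Delta should be close to P[X in (a, a + Delta)] for a standard normal X.
a <- 0.5; delta <- 0.001
dnorm(a) * delta               # f(a)*Delta
pnorm(a + delta) - pnorm(a)    # exact probability of the small interval
```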

    One way to connect and unify the treatment of discrete and continuous random variables is through the cumulative distribution function (cdf), which is defined for all random variables.

    Definition 1.17. Let X be a random variable. The cumulative distribution function of X is the function

    F (x) = P[X ≤ x],

    defined for all real numbers x.

    The cdf plays an important role for continuous random variables, in part because of its relationship to the density function. For a continuous random variable X,

    F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(t) dt

    and thus, by the fundamental theorem of calculus,

    F′(x) = f(x)
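    A quick numerical illustration of F′(x) = f(x) in R, using the exponential distribution commands described below (a sketch; λ = 2 and x = 0.7 are arbitrary choices):

```r
# A symmetric difference quotient of the cdf approximates the pdf.
lambda <- 2; x <- 0.7; h <- 1e-6
(pexp(x + h, lambda) - pexp(x - h, lambda)) / (2 * h)   # approximates F'(x)
dexp(x, lambda)                                          # f(x)
```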

    The four defining properties of a cumulative distribution function are the following:

    Proposition 1.18. A function F is a cumulative distribution function, that is, there exists a random variable X whose cdf is F, if it satisfies the following properties:

    1. lim_{x→−∞} F(x) = 0.

    2. lim_{x→∞} F(x) = 1.

    3. If x ≤ y, then F (x) ≤ F (y).

    4. F is right-continuous. That is, for all real numbers a,

    lim_{x→a+} F(x) = F(a).

    Homework Problem 15. A random variable X has density function f(x) = ce^x, for −2 < x < 2.

    (a) Find c.

    (b) Find P[X < −1].

    Homework Problem 16. The cumulative distribution function for a random variable X is

    F(x) =
      0,      if x ≤ 0
      sin(x), if 0 < x ≤ π/2
      1,      if x > π/2

    Find P[0.1 < X < 0.2].

    Next, we will mention some important continuous distributions:

    Definition 1.19. We say that X has a Normal distribution with parameters µ ∈ R and σ > 0 (denoted as X ∼ N(µ, σ²)) if the pdf of X is given by

    f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}

    for all real numbers x. The parameter µ is the mean and the parameter σ is the standard deviation (see below for definitions).

    Definition 1.20. We say that X has a Uniform distribution on the interval (a, b) (denoted as X ∼ U(a, b)) if the pdf of X is given by

    f(x) = (1/(b − a)) 1_{(a,b)}(x)

    for all real numbers x, where 1_{(a,b)} is the indicator function of the interval (a, b).

    Definition 1.21. We say that X has an Exponential distribution with rate parameter λ > 0 (denoted as X ∼ exp(λ)) if the pdf of X is given by

    f(x) = λe^{−λx},

    for all real numbers x > 0. The parameter λ is the rate parameter. Note: It is also common to define the exponential random variable in terms of the mean parameter θ. In that case, the pdf is given by

    f(x) = (1/θ) e^{−x/θ}

    Notice that the rate parameter is related to the mean parameter by λ = 1/θ.

    To find the density at a point x, the cdf at a point x, and to simulate n of these random variables in R, we use the commands:

                             N(µ, σ²)             U(a, b)          exp(λ)
    Density at x             dnorm(x,mu,sigma)    dunif(x,a,b)     dexp(x,lambda)
    CDF at x                 pnorm(x,mu,sigma)    punif(x,a,b)     pexp(x,lambda)
    Simulate n realizations  rnorm(n,mu,sigma)    runif(n,a,b)     rexp(n,lambda)
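    As a usage sketch of these commands (the event and parameters are arbitrary choices): for X ∼ exp(2), the probability P[1 < X ≤ 2] can be computed exactly with pexp and checked by simulating with rexp:

```r
pexp(2, 2) - pexp(1, 2)   # exact: e^(-2) - e^(-4)
x <- rexp(100000, 2)      # 100,000 simulated exp(2) random variables
mean(x > 1 & x <= 2)      # Monte Carlo estimate of the same probability
```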


  • Part 5. Expectation, Moments and Simulation of Arbitrary Continuous RVs.

    The expectation is a numerical measure that summarizes the typical, or average, behavior of a random variable. It is computed differently for discrete and continuous random variables. First, we will delve into the discrete case, and then we will proceed to the continuous case.

    Part 5.A: Expectation and Moments in the Discrete Case

    Definition 1.22. If X is a discrete random variable that takes values in a set S, the expectation E[X] is defined as

    E[X] = ∑_{x∈S} x pX(x),

    where pX(x) = P[X = x] is the pmf of X.

    The sum in the definition is over all values of X. If the sum is a divergent infinite series, we say that the expectation of X does not exist. The expectation is a weighted average of the values of X, where the weights are the corresponding probabilities of those values. The expectation places more weight on values that have greater probability.

    In the case when X is uniformly distributed on a finite set {x1, . . . , xn}, that is, all outcomes are equally likely,

    E[X] = ∑_{i=1}^{n} xi P[X = xi] = ∑_{i=1}^{n} xi/n = (x1 + x2 + . . . + xn)/n

    Thus, with equally likely outcomes, the expectation is just the regular average of the values. Other names for the expectation are mean and expected value.

    Example 1.23. In the (simplified) game of roulette, a ball rolls around a roulette wheel, landing on one of 38 numbers. Eighteen numbers are red; 18 are black; and two (the 0 and 00) are green. A bet of “red” costs $1 to play and pays off even money if the ball lands on red. Let X be a player’s winnings at roulette after one bet of red. What is the distribution of X? What is the expected value E[X]?

    Solution: The player either wins or loses $1. So X = 1 or −1, with P(X = 1) = 18/38 and P(X = −1) = 20/38. Once the distribution of X is found, the expected value can be computed:

    E[X] = (1)P(X = 1) + (−1)P(X = −1) = 18/38 − 20/38 = −2/38 ≈ −0.0526.

    The expected value of the game is about −5 cents.

    But what does E[X] = −0.0526 mean? We interpret E[X] as a long-run average. If you play roulette for a long time making many red bets, then the average of all your one-dollar wins and losses will be about −5 cents. What that also means is that if you play, say, 10,000 times, then your total loss will be about 5 × 10,000 = 50,000 cents, or about $500.

    Example 1.24. Using the function replicate in R, write code to simulate the roulette game and approximate its mean.


  • > simlist <- replicate(1000000, sample(c(1,-1), 1, prob=c(18/38, 20/38)))
    > mean(simlist)

    Now let X be uniformly distributed on {1, . . . , 100}, and estimate E[X²] by simulation:

    > mean(sample(1:100,1000000,replace=T)^2)

    A first thought may be that since E[X] = 101/2 = 50.5, then E[X²] = (101/2)² = 2550.25. We see that this is not correct. It is not true that E[X²] = (E[X])².

    Homework Problem 17. Suppose X has a Poisson distribution with parameter λ. Find E[1/(X + 1)].

    One of the most important properties of the expectation is that it is a linear operator. That is,


  • Proposition 1.28. For any random variables X and Y , it holds that

    E[aX + bY ] = aE[X] + bE[Y ]

    for any constants a and b.

    Given an event A, define a random variable 1{A} such that

    1{A} =
      1, if A occurs.
      0, if A does not occur.

    Therefore, 1{A} equals 1 with probability P[A], and 0 with probability P[Ac]. Such a random variable is called an indicator variable. An indicator is a Bernoulli random variable with p = P[A]. The expectation of an indicator variable is important enough to highlight.

    Proposition 1.29. For A an event and 1{A} the corresponding indicator r.v.,

    E[1{A}] = (1)P[A] + (0)P[Ac] = P[A].

    This is fairly simple, but nevertheless extremely useful and interesting, because it means that probabilities of events can be thought of as expectations of indicator random variables.

    Often, random variables involving counts can be analyzed by expressing the count as a sum of indicator variables. We illustrate this powerful technique in the next example.

    Example 1.30. Expectation of the binomial distribution. Using the fact that the sum of n independent Bernoulli (indicator) r.v.’s with parameter p is a Binomial(n, p) r.v., find the expectation of a Binomial(n, p) r.v.

    Solution: Let I1, . . . , In be a sequence of i.i.d. Bernoulli (indicator) random variables with success probability p. Let X = I1 + . . . + In. Then X has a binomial distribution with parameters n and p (sums of independent Bernoullis are Binomial; think about it!). By linearity of expectation,

    E[X] = E[∑_{k=1}^{n} Ik] = ∑_{k=1}^{n} E[Ik] = ∑_{k=1}^{n} (1 · p + 0 · (1 − p)) = np

    This result should be intuitive. For instance, if you roll 600 dice, you would expect 100 ones. The number of ones has a binomial distribution with n = 600 and p = 1/6. We emphasize the simplicity and elegance of the last derivation, a result of thinking probabilistically about the problem. Contrast this with the algebraic approach, using the definition of expectation and combinatorial coefficients.
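    A simulation sanity check of E[X] = np, in the spirit of the dice remark above (a sketch; the replication count 100,000 is an arbitrary choice):

```r
# The average of many Binomial(600, 1/6) draws should be close to np = 100.
mean(rbinom(100000, 600, 1/6))
```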

This technique of using indicators, together with the fact that the expectation of an indicator is the probability of the corresponding event, is remarkably useful, as the next example shows:

Example 1.31. At a graduation ceremony, a class of n seniors, upon hearing that they have graduated, throw their caps up into the air in celebration. Their caps fall back to the ground uniformly at random and each student picks up a cap. What is the expected number of students who get their original cap back (a “match”)?

Solution: Let X be the number of matches. Define

I_k = { 1, if the k-th student got their cap back; 0, if the k-th student did not get their cap back,


for k = 1, . . . , n. Then X = I1 + . . . + In. The expected number of matches is

E[X] = E[∑_{k=1}^n I_k] = ∑_{k=1}^n E[I_k] = ∑_{k=1}^n P[the k-th student got their cap back] = ∑_{k=1}^n 1/n = n(1/n) = 1

The reason is that the probability that the k-th student gets their cap back is 1/n, since there are n caps to choose from and only one belongs to the k-th student.

Remarkably, the expected number of matches is one, independent of the number of people, n. If everyone in China throws their hat up in the air, on average about one person will get their hat back.

While a computer algorithm to approximate the expectation using simulation is feasible, sometimes it is easier to rethink the problem to find a better solution. Another way to see this problem is to think of a permutation of the numbers 1, 2, . . . , n and count its fixed points, i.e., the numbers that stay in place after the permutation. In doing so you are counting exactly how many people got their own hat back.

Example 1.32. Write code to approximate the expectation by simulation, counting the fixed points of a random permutation of 50 hats.

    Solution:

> n <- 50
> mean(replicate(10000, sum(sample(n, n) == (1:n))))

Note: Let’s pause to reflect upon what we have actually done. We have simulated a random element from a sample space that contains 50! ≈ 3.04 × 10^64 elements, checked its number of fixed points, and then repeated the operation 10,000 times, finally computing the average number of fixed points, all of this in a second or two on your computer. It would not be physically possible to write down a table of all 50! permutations, their numbers of fixed points, or their corresponding probabilities. And yet by generating random elements and averaging over simulations the problem becomes computationally feasible. Of course, in this case we already know the exact answer and do not need simulation to find the expectation. However, that is not the case with many complex, real-life problems. In many cases, simulation is the only way to go.
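The same experiment, sketched in Python for readers following along without R (counting fixed points of a uniformly random permutation):

```python
import random

random.seed(1)
n, reps = 50, 10000

def matches(n):
    # Uniformly random permutation of 0..n-1; count the positions left fixed.
    perm = list(range(n))
    random.shuffle(perm)
    return sum(i == p for i, p in enumerate(perm))

estimate = sum(matches(n) for _ in range(reps)) / reps
print(estimate)  # close to the exact answer, 1
```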

Homework Problem 18 (St. Petersburg Paradox). Very interesting problem: You are offered the following game. Flip a coin until heads appears. If it takes n tosses, you will get paid $2^n. Thus if heads comes up on the first toss, you get paid $2. If it first comes up on the 10th toss, you get paid $1024. How much would you pay to play this game?

This problem, discovered by the eighteenth-century Swiss mathematician Daniel Bernoulli, is the St. Petersburg paradox. The “paradox” is that most people would not pay very much to play this game. And yet, theoretically, you should pay whatever you are asked to! Question: Why, in real life, should you not pay “whatever” for this game?

While the mean measures how the random variable behaves on average, there are other important quantities, called moments, that capture further aspects of the behaviour of the r.v.


Definition 1.33. The n-th moment of a random variable is defined as

μ_n := E[X^n].

Thus, for a discrete r.v. X with pmf f and support on the discrete set A, we compute the n-th moment as

μ_n = ∑_{x∈A} x^n f(x)

    Part 5.B: Expectation and Moments in the Continuous Case

Definition 1.34. For X a continuous random variable with density f , the expectation of X is computed as

E[X] = ∫_{−∞}^{∞} x f(x) dx

Also, analogously to the discrete case, we have that

Definition 1.35. For X a continuous random variable with density f and g a given function, the expectation of g(X) is computed as

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Example 1.36. A balloon has radius uniformly distributed on (0, 2). Find the expectation of the volume of the balloon exactly and by simulation.

Solution: Let V be the volume of the balloon. Then V = 4πr³/3, where r ∼ Unif(0, 2). Then,

E[V ] = E[(4π/3) r³] = (4π/3) E[r³] = (4π/3) ∫_0^2 r³ (1/2) dr = 8π/3

    A code that computes the expectation is

    > mean((4/3)*pi*runif(1000000,0,2)^3)

Definition 1.37. As recalled from above, the n-th moment of a random variable is defined as

μ_n := E[X^n].

Thus, for a continuous r.v. X with pdf f , we compute the n-th moment as

μ_n = ∫_{−∞}^{∞} x^n f(x) dx

Finally, a very important quantity is the variance, which can be regarded as the average squared error of the random variable around the mean μ. That is, if the variance of a random variable X is σ², then you expect that, on average, the squared error you make by guessing that the (random) value of X is μ is σ². The following definition provides a way to compute it:

Definition 1.38. Let X be a random variable. Then, the variance of X, denoted by V(X), is given by

V(X) := E[(X − E[X])²].

Sometimes the above formula is cumbersome to use, and one can prove the equivalent formula

V(X) = E[X²] − (E[X])² = μ₂ − μ₁²,

where μ_n is the n-th moment.
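As a quick sanity check of the two formulas, one can compute the variance of a fair die both ways (a small illustrative computation, written in Python):

```python
# X uniform on {1,...,6}: compare E[(X - mu)^2] with E[X^2] - E[X]^2.
vals = [1, 2, 3, 4, 5, 6]
mu = sum(vals) / 6                              # E[X] = 3.5
v_def = sum((v - mu) ** 2 for v in vals) / 6    # definition of the variance
v_alt = sum(v * v for v in vals) / 6 - mu ** 2  # shortcut formula
print(v_def, v_alt)  # both equal 35/12 ≈ 2.9167
```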


The following proposition establishes some basic properties of the expectation and the variance.

    Proposition 1.39. For constants a and b, and random variables X and Y ,

    • E[aX + b] = aE[X] + b

    • E[aX + bY ] = aE[X] + bE[Y ]

    • V [a] = 0

    • V [aX + b] = a2V [X]
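These rules are easy to verify on data. The sketch below (Python, illustrative) checks V[aX + b] = a²V[X] on a simulated sample; the identity holds exactly for sample variances too, up to floating-point rounding:

```python
import random

random.seed(1)
xs = [random.random() for _ in range(100000)]

def var(v):
    # Population-style sample variance: mean of squared deviations.
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

a, b = 3.0, 5.0
lhs = var([a * x + b for x in xs])  # V[aX + b]
rhs = a * a * var(xs)               # a^2 V[X]
print(lhs, rhs)
```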

Another important concept is that of the standard deviation.

    Definition 1.40. The standard deviation of X is the square root of the variance. That is,

    σX =√V [X]

Variance and standard deviation are always nonnegative. The greater the variability of outcomes, the larger the deviations from the mean, and the greater the two measures. If X is a constant, and hence has no variability, then X = E[X] = μ_X and we see from the variance formula that V[X] = 0. The converse is also true. That is, if V[X] = 0, then X is a constant (a.s.).

Example 1.41. A balloon has radius uniformly distributed on (0, 2). Find the expectation and standard deviation of the volume of the balloon theoretically and by simulation.

Solution: Let V be the volume. Then V = (4π/3)R³, where R ∼ Unif(0, 2). The expected volume is

E[V ] = E[(4π/3) R³] = (4π/3) ∫_0^2 r³ (1/2) dr = 8π/3 ≈ 8.378.

Also,

E[V²] = E[((4π/3) R³)²] = (16π²/9) ∫_0^2 r⁶ (1/2) dr = 1024π²/63,

giving

V[V ] = E[V²] − (E[V ])² = 1024π²/63 − (8π/3)² = 64π²/7,

with standard deviation σ_V = 8π/√7 ≈ 9.5. A sample R code could be

> volume <- (4/3)*pi*runif(1000000,0,2)^3
> mean(volume)
[1] 8.386274
> sd(volume)
[1] 9.498317

The next homework problem is challenging but provides a great example showing that not all random variables have a finite expectation.

    Homework Problem 19. Answer all the following:


1. Show that the function

f(x) = (1/π) · 1/(1 + x²), x ∈ ℝ

is a valid density function by verifying points 1 and 2 of Definition 1.16.

2. Let X be a random variable with pdf as above (such an r.v. is said to have the Cauchy distribution). Show that E[X] is not finite.

3. Let U ∼ Unif(−π/2, π/2). Show (by making an “obvious” change of variable in the integral) that E[X] = E[tan(U)].

4. Create a function that computes E[tan(U)] by simulation, run it 20 times, and report the results. What do you observe? Explain.

Finally, it is important for us to know how to simulate almost ANY continuous random variable provided that we have its density. For example, suppose we want to simulate observations from a random variable X, where X has density f(x) = 2x for 0 < x < 1, and 0 otherwise. How would you do it?

Proposition 1.42 (Inverse Transform Method). Suppose X is a continuous random variable with cumulative distribution function F , where F is invertible with inverse function F−1. Let U ∼ Unif(0, 1). Then the distribution of F−1(U) is equal to the distribution of X. In other words, to simulate X, first simulate U and output F−1(U).

Example 1.43. Explain how to simulate the r.v. X above with pdf f(x) = 2x for 0 < x < 1, and 0 otherwise.

Solution: First, the CDF of X is

F(x) = P[X ≤ x] = ∫_{−∞}^x f(u) du = ∫_0^x 2u du = x², for 0 < x < 1.

On the interval (0, 1) the function F(x) = x² is invertible and F−1(x) = √x. The inverse transform method says that if U ∼ Unif(0, 1), then F−1(U) = √U has the same distribution as X. Thus to simulate X, we can implement the code

> simlist <- sqrt(runif(1000))
> hist(simlist, prob=T, main="", xlab="")
> curve(2*x, 0, 1, add=T)

The proof of the inverse transform method is quick and easy. We need to show that F−1(U) has the same distribution as X. For x in the range of X,

    P(F−1(U) ≤ x) = P(U ≤ F (x)) = F (x) = P(X ≤ x),

    where we used the fact that the cdf of the uniform distribution on (0,1) is the identity function, F (x) = x.
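A quick numerical check of the method for this example (Python sketch): since F−1(u) = √u, the simulated mean should approach E[X] = ∫_0^1 x · 2x dx = 2/3.

```python
import random

random.seed(1)
# Inverse transform for f(x) = 2x on (0,1): F(x) = x^2, so F^{-1}(u) = sqrt(u).
samples = [random.random() ** 0.5 for _ in range(100000)]
mean_est = sum(samples) / len(samples)
print(mean_est)  # close to E[X] = 2/3
```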


Homework Problem 20. By using the inverse transform method, simulate X ∼ Exp(λ).

Homework Problem 21. Show that

f(x) = e^x e^{−e^x}, for all x,

is a probability density function. If X has such a density function, find the cdf of X.

Homework Problem 22. Let X ∼ Unif(a, b). Find a general expression for the k-th moment E[X^k].

Homework Problem 23. An isosceles right triangle has side length uniformly distributed on (0, 1). Find the expectation and variance of the length of the hypotenuse.

Homework Problem 24. Let X and Y be independent exponential random variables with parameter λ = 1. Estimate P(X/Y < 1) by simulation.

    Part 6. Joint Distributions and Independence of Random Variables

In the case of two random variables X and Y , a joint distribution specifies the values and probabilities for all pairs of outcomes. If X and Y are discrete r.v.'s, the joint probability mass function of X and Y is the function of two variables fX,Y (x, y) = P(X = x, Y = y).

Proposition 1.44. The joint pmf is a probability function. Therefore, it sums to 1. That is, if X takes values in a set S and Y takes values in a set T , then

∑_{x∈S} ∑_{y∈T} fX,Y (x, y) = ∑_{x∈S} ∑_{y∈T} P(X = x, Y = y) = 1

As in the discrete one-variable case, probabilities of events are obtained by summing over the individual outcomes contained in the event. For instance, for constants a < b and c < d,

P[a ≤ X ≤ b, c ≤ Y ≤ d] = ∑_{x=a}^b ∑_{y=c}^d fX,Y (x, y) = ∑_{x=a}^b ∑_{y=c}^d P(X = x, Y = y)

A joint probability mass function can be defined for any finite collection of discrete random variables X1, . . . , Xn defined on a common sample space. The joint pmf is the function of n variables P(X1 = x1, . . . , Xn = xn).

From the joint distribution of X and Y , one can obtain the univariate, or marginal, distribution of each variable. The marginal distribution of X is obtained from the joint distribution of X and Y by summing over the values of y. Similarly, the probability mass function of Y is obtained by summing the joint pmf over the values of x.

Definition 1.45. The marginal distribution of X is the individual distribution of X in the random vector (X, Y ). It is computed by

P[X = x] = ∑_{y∈T} P[X = x, Y = y]

and similarly for Y ,

P[Y = y] = ∑_{x∈S} P[X = x, Y = y]
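Computing marginals from a joint pmf is mechanical: sum out the other variable. A small sketch (Python, with a made-up joint pmf, not one from the text):

```python
# Hypothetical joint pmf stored as {(x, y): probability}.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p  # marginal of X: sum over y
    py[y] = py.get(y, 0.0) + p  # marginal of Y: sum over x

print(px, py)  # px ≈ {0: 0.3, 1: 0.7}, py ≈ {0: 0.4, 1: 0.6}
```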


Random variables can arise as functions of two or more random variables. Suppose f(x, y) is a real-valued function of two variables. Then Z = f(X, Y ) is a random variable which is a function of two random variables. For expectations of such random variables, there is a multivariate version of the equation in Proposition 1.26.

Proposition 1.46. If X and Y are random variables and φ(x, y) is a real function of two variables, then Z = φ(X, Y ) is a random variable. If fX,Y (x, y) is the joint pmf of (X, Y ), then

E[Z] = E[φ(X, Y )] = ∑_{x∈S} ∑_{y∈T} φ(x, y) fX,Y (x, y) = ∑_{x∈S} ∑_{y∈T} φ(x, y) P[X = x, Y = y]

    In the same way as in the discrete case, we can reformulate everything in terms of the continuous case:

Proposition 1.47. For continuous random variables X and Y defined on a common sample space, the joint density function fX,Y (x, y) of the random vector (X, Y ) has the following properties.

• f(x, y) ≥ 0 for all real numbers x and y.

• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

• For S ⊆ ℝ²,

P[(X, Y ) ∈ S] = ∫∫_S fX,Y (x, y) dx dy

For continuous random variables X1, . . . , Xn defined on a common sample space, the joint density function f(x1, . . . , xn) is defined similarly.

Definition 1.48. If X and Y have joint density function f , the joint cumulative distribution function of X and Y is

F(x, y) = P[X ≤ x, Y ≤ y] = ∫_{−∞}^x ∫_{−∞}^y f(s, t) dt ds

defined for all x and y. Differentiating with respect to both x and y gives

∂²F(x, y)/∂x∂y = f(x, y)

The joint density of X and Y captures all the “probabilistic information” about X and Y . In principle, it can be used to find any probability which involves these variables. From the joint density, the marginal densities are obtained by integrating out the extra variable (in the discrete case, we sum over the other variable).

Definition 1.49. The marginal density of a random variable X, given the joint density of X and Y , is

fX(x) = ∫_{−∞}^{∞} f(x, y) dy

Similarly, the marginal density of the random variable Y is given by

fY(y) = ∫_{−∞}^{∞} f(x, y) dx


Computing expectations of functions of two or more continuous random variables should offer no surprises, as we use the continuous form of the “law of the unconscious statistician.”

Proposition 1.50. If (X, Y ) have joint density f , and g(x, y) is a function of two variables, then

E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy.

In particular, the expected product is given by

E[XY ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy fX,Y (x, y) dx dy.

The next concept is arguably one of the most important concepts in probability. The following will be given as a definition, although it is important to note that it is not the formal definition but a consequence of it.

Definition 1.51. Two random variables X and Y are called independent if for any subsets A and B of ℝ,

P[X ∈ A, Y ∈ B] = P[X ∈ A]P[Y ∈ B]

From this definition, we can conclude that X and Y are independent if the joint probability (mass) function of X and Y is the product of the marginals. That is,

fX,Y (x, y) = P(X = x, Y = y) = P(X = x)P(Y = y) = fX(x)fY (y), for all x, y.

If X and Y are independent random variables, then knowledge of the value of X gives no information about the value of Y . It follows that if f and g are functions, then f(X) gives no information about g(Y ), and hence f(X) and g(Y ) are independent random variables.

Proposition 1.52. Suppose X and Y are independent random variables, and f and g are any functions. Then, the random variables f(X) and g(Y ) are also independent. Moreover,

    E[f(X)g(Y )] = E[f(X)]E[g(Y )]

    and, by choosing f(x) = g(x) = x, we get:

    E[XY ] = E[X]E[Y ]
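The product rule for independent random variables can be observed in simulation. A sketch (Python, illustrative) with X and Y independent Unif(0, 1), where E[XY ] should be near E[X]E[Y ] = 1/4:

```python
import random

random.seed(1)
reps = 100000
xs = [random.random() for _ in range(reps)]
ys = [random.random() for _ in range(reps)]

e_xy = sum(x * y for x, y in zip(xs, ys)) / reps  # estimate of E[XY]
e_x = sum(xs) / reps                              # estimate of E[X]
e_y = sum(ys) / reps                              # estimate of E[Y]
print(e_xy, e_x * e_y)  # both close to 1/4
```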

Example 1.53. Suppose the radius R and height H of a cone are independent and each uniformly distributed on {1, . . . , 10}. Find the expected volume of the cone and write code to simulate it.

Solution: The volume of a cone is given by the formula v(r, h) = πr²h/3. Let V be the volume of the cone. Then V = πR²H/3 and

E[V ] = E[(π/3) R²H] = (π/3) E[R²] E[H] = (π/3) (∑_{r=1}^{10} r²/10) (∑_{h=1}^{10} h/10) = (π/3)(77/2)(11/2) = 847π/12 ≈ 221.744

An R code that illustrates the example could be:

> simlist <- (pi/3)*sample(1:10,10000,replace=T)^2*sample(1:10,10000,replace=T)
> mean(simlist)
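The same simulation, sketched in Python, agrees with the exact value 847π/12 ≈ 221.74:

```python
import random
from math import pi

random.seed(1)
reps = 200000
# R and H independent, each uniform on {1,...,10}; V = (pi/3) R^2 H.
vols = [(pi / 3) * random.randint(1, 10) ** 2 * random.randint(1, 10)
        for _ in range(reps)]
estimate = sum(vols) / reps
print(estimate)  # close to 847*pi/12 ≈ 221.74
```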


Example 1.54. Suppose the joint density of X and Y is

fX,Y (x, y) = cxy, for 1 < x < 4 and 0 < y < 1.

Find c and compute P[2 < X < 3, Y > 1/4].

Solution:

To find c, we need to solve the equation

∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y (x, y) dx dy = 1,

which becomes

∫_0^1 ∫_1^4 cxy dx dy = 1.

Solving, we have that c = 4/15. Then, to find the probability,

P[2 < X < 3, Y > 1/4] = ∫_{1/4}^1 ∫_2^3 (4/15) xy dx dy = (4/15)(5/2)(15/32) = 5/16.
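The answer can be checked by simulation. One way (a Python sketch, using an observation not spelled out in the text: the density (4/15)xy factors into marginals (2/15)x on (1, 4) and 2y on (0, 1), so X and Y are independent and each can be sampled by the inverse transform method):

```python
import random

random.seed(1)
reps = 200000
hits = 0
for _ in range(reps):
    x = (1 + 15 * random.random()) ** 0.5  # F_X(x) = (x^2 - 1)/15 on (1,4)
    y = random.random() ** 0.5             # F_Y(y) = y^2 on (0,1)
    if 2 < x < 3 and y > 0.25:
        hits += 1
estimate = hits / reps
print(estimate)  # close to 5/16 = 0.3125
```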

Homework Problem 25. The joint pmf of X and Y is

fX,Y (x, y) = (x + 1)/12

for x = 0, 1 and y = 0, 1, 2, 3. Find the marginal distributions of X and Y . Describe their distributions qualitatively. That is, identify their distributions as one of the known distributions you have worked with (e.g., Bernoulli, binomial, Poisson, or uniform).

Homework Problem 26. Suppose X and Y are independent random variables. Does E[X/Y ] = E[X]/E[Y ]? Either prove it true or exhibit a counterexample.

Homework Problem 27. HARD PROBLEM: An elevator containing p passengers is at the ground floor of a building with n floors. On its way to the top of the building, the elevator will stop if a passenger needs to get off. Passengers get off at a particular floor with probability 1/n. Find the expected number of stops the elevator makes. (Hint: Use indicators, letting 1{k} = 1 if the k-th floor is a stop. Be careful: more than one passenger can get off at a floor.)

Homework Problem 28. Suppose X and Y are i.i.d. exponential random variables with λ = 1. Find the density of X/Y and use it to compute P(X/Y < 1). Also, estimate that probability by simulation.

Having looked at measures of variability for individual and independent random variables, we now consider measures of variability between dependent random variables. The covariance is a measure of the association between two random variables. For jointly distributed continuous random variables, the covariance and correlation are defined as:

Definition 1.55. Let X and Y be jointly distributed continuous random variables with joint density function f . Let µX = E[X] and µY = E[Y ]. The covariance of X and Y is

Cov(X, Y ) = E[(X − µX)(Y − µY )]

Moreover, the last quantity can also be computed as

Cov[X, Y ] = E[XY ] − E[X]E[Y ]

On the other hand, the (Pearson) correlation is given by

ρX,Y = Cov[X, Y ] / √(Var[X] Var[Y ])


For independent random variables, E[XY ] = E[X]E[Y ] and thus Cov[X, Y ] = 0. The covariance will be positive when large values of X are associated with large values of Y and small values of X are associated with small values of Y . In particular, for outcomes x and y, products of the form (x − µX)(y − µY ) in the covariance formula will tend to either both be positive or both be negative, both cases resulting in positive values.

On the other hand, if X and Y are inversely related, most terms (x − µX)(y − µY ) will be negative, since when X takes values above the mean, Y will tend to fall below the mean, and vice versa. In this case, the covariance between X and Y will be negative. Covariance is a measure of linear association between two variables. In a sense, the “less linear” the relationship, the closer the covariance is to 0. It is important to notice that the sign of the covariance indicates whether two random variables are positively or negatively associated. The magnitude of the covariance, however, can be difficult to interpret. The correlation is an alternative measure for this purpose, because it is standardized.

Proposition 1.56 (Properties of correlation).

1. −1 ≤ ρX,Y ≤ 1.

2. If Y = aX + b is a linear function of X for constants a ≠ 0 and b, then Corr(X, Y ) = ±1, with sign matching the sign of a.

Correlation is a common summary measure in statistics. Dividing the covariance by the standard deviations creates a “standardized” covariance, a unitless measure that takes values between −1 and 1. The correlation is exactly equal to ±1 if Y is a linear function of X. Random variables that have correlation, and covariance, equal to 0 are called uncorrelated.
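The ±1 property is easy to see on data: for Y = 2X + 1 the sample correlation equals 1 up to floating-point rounding (an illustrative Python sketch):

```python
import random

random.seed(1)
n = 50000
xs = [random.random() for _ in range(n)]
ys = [2 * x + 1 for x in xs]  # exact positive linear relation

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
rho = cov / (vx * vy) ** 0.5
print(rho)  # 1.0 up to rounding
```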

    Definition 1.57. We say random variables X and Y are uncorrelated if

    E[XY ] = E[X]E[Y ],

    that is, if Cov(X, Y ) = 0.

