
ENGR123: Engineering Mathematics with Logic and Statistics


Contents

1 Statistics in Engineering and Sampling methods
  1.1 Role of Statistics in Engineering
    1.1.1 Collecting data for analysis
    1.1.2 What should we measure?
  1.2 Population and Sample
    1.2.1 Why sample?
    1.2.2 Sample statistics and Population parameters
  1.3 Sampling methods
    1.3.1 Sampling notation
    1.3.2 Simple random sampling (SRS)
    1.3.3 Typical parameters
    1.3.4 Sample statistics
    1.3.5 Simple Random Sampling (SRS)
    1.3.6 Stratified sampling
    1.3.7 Cluster sampling
    1.3.8 Convenience sampling

2 Data, Summary Statistics and Graphics
  2.1 Data
    2.1.1 Data types
  2.2 Summary statistics
    2.2.1 Location
    2.2.2 Special percentiles
    2.2.3 Variability (Spread)
    2.2.4 The Mode and the Range
    2.2.5 Summary statistics for different data types
  2.3 Graphics for categorical data
  2.4 Graphics for measurement data
    2.4.1 Histogram
    2.4.2 Box plots
    2.4.3 Empirical Cumulative Distribution Function (ECDF)

3 Discrete Random Variables and Distributions
  3.1 Random Variables
  3.2 Probability Distribution
  3.3 Mean and Variance of discrete RVs
  3.4 Bernoulli Distribution
  3.5 Binomial Distribution
  3.6 Geometric Distribution
  3.7 Poisson distribution
  3.8 Poisson/Binomial overlap
  3.9 Recap

4 Continuous Random Variables and Distributions
  4.1 Continuous Random Variables
    4.1.1 Probability Distribution (PDF and CDF)
    4.1.2 Mean and Variance
  4.2 Exponential Distribution
    4.2.1 Examples of Exponential PDF
    4.2.2 Examples of Exponential CDF
    4.2.3 The exponential and Poisson distributions
    4.2.4 Memoryless Property
    4.2.5 Minimum of Several Independent Exponential Random Variables
    4.2.6 Counting Events in Small Time Intervals
  4.3 Gamma distribution
  4.4 Uniform Distribution
  4.5 Normal (or Gaussian) Distribution
    4.5.1 Standardising a Normal Random Variable

5 Confidence Intervals, Estimation and Samples
  5.1 How do we get statistical distributions?
  5.2 Model fitting cycle
  5.3 Point estimates and interval estimates
  5.4 Central Limit Theorem
  5.5 Sample mean
    5.5.1 Sample mean as a point estimate of the true (population) mean
    5.5.2 Properties of the sample mean
    5.5.3 Standardization of the sample mean
    5.5.4 Confidence Intervals (CIs) for the sample mean
    5.5.5 Interpretation and Verbalizing CIs
    5.5.6 Practicality of the CIs
    5.5.7 Minimum sample size required for a certain range of estimation error
  5.6 Sample proportion
    5.6.1 Sample proportion as a point estimate of the population proportion
    5.6.2 Properties of the sample proportion
    5.6.3 CIs of the true proportion p
  5.7 Comparison of two population means

Topic 1

Statistics in Engineering and Sampling methods

1.1 Role of Statistics in Engineering

The collection and analysis of data are fundamental to science and engineering. Engineers learn how to design important new products and processes by analysing data collected in scientific experiments. A major difficulty with scientific data is that they are often subject to random variation, or uncertainty. That is, when scientific measurements are repeated, they come out somewhat differently each time. This poses a problem: how can one draw conclusions from the results of an experiment when those results could have come out differently? To address this question, a knowledge of statistics is essential. Statistics is the field of study concerned with the collection, analysis, and interpretation of uncertain data. Generally speaking, the methods of statistics allow engineers to design valid experiments and to draw reliable conclusions from the data they produce (or obtain).

There is no strict rule for what an engineer must do. However, the engineering method can be summarized as in the following figure:

Figure 1.1: The engineering method


Usually, we follow these steps:

1. Develop a clear and concise description of the problem.

2. Identify, at least tentatively, the important factors that affect this problem or that may play a role in its solution.

3. Propose a model for the problem, using scientific or engineering knowledge of the phenomenon being studied. State any limitations or assumptions of the model.

4. Conduct appropriate experiments and collect data to test or validate the tentative model or conclusions made in steps 2 and 3.

5. Refine the model on the basis of the observed data.

6. Manipulate the model to assist in developing a solution to the problem.

7. Conduct an appropriate experiment to confirm that the proposed solution to the problem is both effective and efficient.

8. Draw conclusions or make recommendations based on the problem solution.

Consequently, engineers must know how to efficiently plan experiments, collect data, analyze and interpret the data, and understand how the observed data are related to the model they have proposed for the problem under study. As mentioned, the field of statistics deals with the collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes. Because many aspects of engineering practice involve working with data, some knowledge of statistics is clearly important to any engineer.


Example 1.1.1. Some areas in engineering and computing where statistics are used:

• Data mining is performed with the help of statistics to discover patterns within the data;

• Data compression uses statistical algorithms to compress data;

• Artificial intelligence: how should algorithms learn when their access to new data is random?

• Network traffic modelling: statistical models and analytical techniques are used to design algorithms which avoid network congestion;

• Robotics: how to control movement in the face of errored sensor inputs and imperfect control mechanisms;

• Search engines: Google's famous PageRank technique uses well-known concepts from multivariate statistics;

• ...

Example 1.1.2. Drones

In designing the software to control the motion of an unmanned aerial vehicle (UAV), the algorithms must be able to cope with multiple sources of randomness: the sensor inputs are errored (GPS, pressure, accelerometer, etc.) and the response of the mechanical components (rotors etc.) is imperfect. Also, the environment is random: random movements of air, wind, etc.

Hence we have a classic engineering design problem where the actual motion of the UAV must satisfy certain rules (safety, job-completion, etc.) in the face of considerable uncertainty.

The design problem will be solved using statistical models for the uncertainty and possibly statistical analysis to measure the performance of the algorithms. Note that even the performance of the UAV control system is likely to be measured statistically. For example, you cannot demand that the UAV will ALWAYS be within 1 metre of the desired target (why?). Instead, you would probably ask that the position is within 1 metre of the target with high probability, perhaps at least 99% of the time.

1.1.1 Collecting data for analysis...

Engineers need to know how to analyse data, but first you must obtain the data. However, the fact is that collecting data for analysis is often difficult. Collecting data is:

• Rarely simple

• Varies from case to case

• Almost always restricted by practical considerations

Here are a few examples for you to think about the possible problems in collecting data.

Example 1.1.3. How long does it take to download a course information page from the VUW Blackboard system using wireless access on the VUW campus?


Example 1.1.4. How much does rounding error affect fixed-point algorithms used in object tracking?

Example 1.1.5. Does a new GPS device deliver better results than the previous model?

1.1.2 What should we measure?

What are the sorts of data you might be interested in? There could be a single variable or many. We may measure

• One quantity, X, e.g., the power of a signal;

• Two quantities, (X, Y), e.g., the amplitude and phase of a radio signal;

• N quantities, (X1, X2, . . . , XN), e.g., when N = 3 this could be a 3D position in space.

If these quantities are random, we call them random variables and usually use capital letters to denote them. In order to be able to gather useful data about a situation, we need to know the variable(s) of interest and also where they come from. In other words, we must ask:

Question: What do we sample from? (i.e., what is the population of interest?)


1.2 Population and Sample

The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a relatively small sample chosen from it.

A population is the entire collection of objects or outcomes about which information is sought.

A sample is a subset of a population, containing the objects or outcomes that are actually observed.


Figure 1.2: Population and Sample

1.2.1 Why sample?

• Less time consuming than a census;


• Less costly to carry out than a census;

• It is possible to obtain statistical results of high precision based on samples.

[Diagram: a sample is drawn from the population by sampling; we investigate the sample using graphics, summary statistics and formal statistical procedures, and use inference to draw conclusions about the population.]

1.2.2 Sample statistics and Population parameters

• A numerical summary of a sample is called a statistic;

• A numerical summary of a population is called a parameter;

• Statistics are often used to estimate parameters.


1.3 Sampling methods

1.3.1 Sampling notation

In order to define a random sample, we need to define a random variable (r.v.). Recall the following concepts from ENGR121.

• An outcome is the result of an uncertain situation (a randomexperiment);

• An event is either a single outcome or a group of outcomes;

We use capital letters such as A, B, C, ... to denote events.

• A random variable is a quantity taking numeric values which is a function of the result of a random experiment.

A random variable is denoted by capital letters X, Y, Z, T, ..., while a particular value of a random variable is denoted by lower case letters such as x, y, ...

Example 1.3.1. The following are some random variables:

• Outcome of a coin toss (H=0, T=1);

• Lifetime of a power transformer;

• Number of English words in a randomly selected page of a paper;

• Number of faulty items in a random batch of chips;

• Number of requests for resources arriving at a server in a one-hour period;

• ...


1.3.2 Simple random sampling (SRS)

What is SRS?

A SRS is one in which each random variable X1, X2, . . . , Xn in the sample is independent of the other variables and is drawn from the same distribution.

Aim?

SRS and its variations are designed so that the samples are representative.

Purpose?

We are often interested in estimating a parameter of the population.


[Diagram: values calculated using population data are called parameters; values calculated using sample data are called statistics; sample statistics are used to make inferences about population parameters.]

1.3.3 Typical parameters

For a population, the following are typical parameters which represent some property of the population:

• Mean;

• Variance;

• Standard deviation;

• Median;

• Proportion;

• %ile or quantile.

Each of the typical parameters mentioned above has a sample counterpart.

1.3.4 Sample statistics

The sample mean is also called the "arithmetic mean", or more simply, the average. It is the sum of the numbers in the sample, divided by how many there are.


Let X1, . . . , Xn be a sample. The sample mean is defined as

X =X1 +X2 + · · ·+Xn

n. (1.3.1)

Meanwhile, the standard deviation or variance is a measure of variability or spread. The basic idea behind the standard deviation (or variance) is that when the spread is large, the values will tend to be far away from their mean, but when the spread is small, the values will tend to be close to their mean.

Let X1, . . . , Xn be a sample. The sample variance is defined as

s² = [(X1 − X̄)² + (X2 − X̄)² + · · · + (Xn − X̄)²] / (n − 1)   (1.3.2)
   = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)².

An equivalent formula, which can be easier to compute, is

s² = (1/(n − 1)) (∑_{i=1}^{n} Xi² − n X̄²).


While the sample variance is an important quantity, it has a drawback as a measure of spread: its units are not the same as the units of the sample values; instead, they are the squared units. That is why we introduce the sample standard deviation as the square root of the sample variance.

Let X1, . . . , Xn be a sample. The sample standard deviation is defined as

s = √[ (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)² ].

An equivalent formula, which can be easier to compute, is

s = √[ (1/(n − 1)) (∑_{i=1}^{n} Xi² − n X̄²) ].

The sample mean defined in (1.3.1) is an unbiased estimate of the population mean, and the sample variance defined in (1.3.2) is an unbiased estimate of the population variance. By saying "unbiased", we mean that the expected value of the sample mean equals the population mean. This can be interpreted as follows. Suppose that we have many samples from the same population. Each sample gives us a sample mean, and these sample means vary as the samples vary. Then, the average of those sample means can be shown to be the same as the population mean when the number of samples taken tends to infinity (recall the limits studied in ENGR121). The same argument is true for the sample variances. This "unbiasedness" is a desirable property for any statistic, and that is also the reason why the sample mean and the sample variance are so widely used.
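As a quick numerical check of these formulas, here is a minimal Python sketch (assuming numpy is available; the sample values are hypothetical). Note that ddof=1 selects the n − 1 denominator used above.

import numpy as np

# Hypothetical sample of n = 8 measurements
x = np.array([4.2, 3.9, 5.1, 4.8, 4.4, 5.0, 4.1, 4.6])

n = len(x)
xbar = x.sum() / n                       # sample mean, as in (1.3.1)
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance, as in (1.3.2)
s = np.sqrt(s2)                          # sample standard deviation

# numpy's built-ins agree when ddof=1 gives the n - 1 denominator
assert np.isclose(s2, x.var(ddof=1))
assert np.isclose(s, x.std(ddof=1))
print(xbar, s2, s)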

1.3.5 Simple Random Sampling (SRS)

We want to have a good sample in the sense that the sample is representative of the population. The simplest way to achieve this is via a SRS. If you sample from a finite population, there are two possibilities: sampling with replacement and sampling without replacement. Sampling with replacement is the easiest to explain, so we will focus on this. A SRS with replacement is simply one where every element in the population is equally likely to occur in each draw from the population.

Note that with any kind of random sampling, sampling variability is unavoidable. Let's see the following example.

Example 1.3.2. Suppose that we have a population of 6 elements {1, 2, 3, 4, 5, 6}, and suppose that we take a sample of 3 elements from this population (each element of the population can be repeatedly selected). Then the sample means can vary quite significantly, and they can differ from the true population mean, as shown in Figure 1.3 below.
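Before looking at Figure 1.3, here is a minimal simulation sketch of this example (assuming numpy): draw several samples of size 3 with replacement from {1, . . . , 6} and watch the sample means scatter around the true population mean 3.5.

import numpy as np

rng = np.random.default_rng(1)
population = np.arange(1, 7)     # {1, 2, 3, 4, 5, 6}, true mean 3.5
# ten samples of size 3, drawn with replacement (a SRS)
samples = rng.choice(population, size=(10, 3), replace=True)
print(samples.mean(axis=1))      # the sample means vary around 3.5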

Figure 1.3: Sample means can vary quite a lot.

A SRS is reasonable when there is little structure in the population or when little is known about the population. But suppose you are measuring the data rates supplied to households in Wellington. Some households may be connected with fibre and some with copper. A better sample would make sure that both copper connections and fibre connections were sampled. Note that a SRS might by chance only sample copper connections. This kind of scenario leads to stratified sampling.


1.3.6 Stratified sampling

The idea of stratified sampling is as follows:

1. Divide the population into subgroups (called strata) according to some common characteristics;

2. Select a SRS from each subgroup;

3. Combine the samples from the subgroups in a suitable way to obtain good estimates of the population parameters.

Example 1.3.3. We want to conduct a survey to estimate the mean height of the schoolchildren in a small primary school with 4 classes, each of about 30 children, covering 4 different age groups. We decide to measure the heights of a sample of 20 children for this purpose. How should we create the sample?

Solutions: It seems sensible to consider stratified sampling instead of a SRS. Maybe the strata could be age group, gender or both?


1.3.7 Cluster sampling

Cluster sampling is often motivated by administrative convenience: to ease sample specification through the existence of a list of clusters, to speed up access to the population, or to reduce sampling cost, etc.

Source: https://www.slideshare.net/SunilKumar148/sampling-and-sample-types


With cluster sampling, we usually follow two simple steps:

1. Divide the population into several clusters, each representative of the population;

2. Select a SRS of clusters.

Note: All individuals in selected clusters can be used, or, again, a sample can be selected from these clusters using another sampling method.

1.3.8 Convenience sampling

Convenience sampling is used for initial investigations, proof of concept, pilot studies, etc. Samples produced from convenience sampling are called samples of convenience, and since these samples have no rigorous probabilistic basis, statistical analysis based on them is very limited.

A sample of convenience is a sample that is not drawn by a well-defined random method.


Example 1.3.4. (The 1936 US election - A polling catastrophe)

In 1936, Roosevelt was up for re-election, facing Republican Alf Landon in the race for US president. At that time, the Literary Digest, an influential weekly magazine, had begun political polling and had correctly predicted the outcome of the previous five presidential elections. For this cycle, they polled a sample of over 2 million people based upon telephone and car registrations. The results they obtained were

Landon, 57.1%, Roosevelt, 42.9%

which predicted Landon would win in a landslide, projecting that Landon would take 370 of the 531 possible electoral votes.

Instead, the actual results gave a very different picture:

Roosevelt, 60.8%, Landon, 36.5%

What was the problem here? Was it convenience sampling? Was it something else?

Example 1.3.5. Engineering and computing vs social science

We are often in a good position in engineering and computing applications, as it may be possible to take many samples; e.g., the amount of data on mobile traffic you can obtain from even one switch in the network in one day is very large. In contrast, a social scientist administering a survey may take hours for each subject. Nevertheless, there are still many applications where sampling is lengthy and expensive. Hence, good sampling methods are very important.


Topic 2

Data, Summary Statistics and Graphics

2.1 Data

2.1.1 Data types

It is always important to identify the type of data. When a numerical quantity designating how much or how many is assigned to each item in a sample, the resulting set of values is called numerical or quantitative. When sample items are placed into categories and category names are assigned to them, the data are categorical or qualitative. These two types of data are classified further as shown in the following figure.

Figure 2.1: Types of data - or types of random variables

2.2 Summary statistics

Sample statistics are also called summary statistics. In this section, we will see further summary statistics.


2.2.1 Location

Measures of location include the Mean and Median, which are two ways of measuring the central tendency of the data.

• Mean = average of data; as defined in (1.3.1) in Section 1.3.4.

• Median = a point half way through the data (roughly speaking, 50% of the data are below the median).

The Mean and Median are two different ways of looking at the central tendency of the data. Note that they may or may not agree, and they are NOT necessarily at a peak in the data.

The key difference between Mean and Median is that

Median is robust to outliers.

Note that sometimes a sample may contain a few points that are much larger or smaller than the rest; such points are called outliers. Outliers may be false samples (resulting from data entry errors or measurement errors), or sometimes they represent the fact that a population may contain some extreme values.

If an outlier is found to result from an error, then it should be deleted. If a population truly contains outliers, they must be retained. The following example shows how differently the mean and median respond to an outlier.

Example 2.2.1.

• Data set 1 = {1, 1, 2, 2, 3}: Mean = 1.8, Median = 2.

• Data set 2 = {1, 1, 2, 2, 504}: Mean = 102, Median = 2.

Notice that Data set 1 does not have any outliers while Data set 2 has one outlier. The mean is dramatically affected by the outlier, but the median is unchanged.
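A quick check of this example in Python (assuming numpy):

import numpy as np

data1 = np.array([1, 1, 2, 2, 3])
data2 = np.array([1, 1, 2, 2, 504])       # the same data with one outlier

print(np.mean(data1), np.median(data1))   # 1.8  2.0
print(np.mean(data2), np.median(data2))   # 102.0  2.0: the mean moves, the median does not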

2.2.2 Special percentiles

We often use the 25%ile, 50%ile and 75%ile, each with their own name:

• 25%ile = Lower quartile (LQ);

⇒ 25% of the data lies below LQ.

• 50%ile = Median;

⇒ 50% of the data lies below the median; in other words, the median is the middle number.

• 75%ile = Upper quartile (UQ);

⇒ 75% of the data lies below UQ.

As we will see later, box plots use the quartiles and the median, and also the measure of the spread of the data known as

IQR = UQ − LQ,

where IQR stands for Inter-Quartile Range.
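In Python, the quartiles and the IQR can be computed with numpy's percentile function; a minimal sketch with hypothetical data (note that several quantile conventions exist, so different packages may give slightly different quartiles):

import numpy as np

data = np.array([1, 2, 2, 3, 5, 7, 8, 11])
lq, median, uq = np.percentile(data, [25, 50, 75])
iqr = uq - lq                     # range of the central 50% of the data
print(lq, median, uq, iqr)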


Note that many other %iles can be useful. For example, when a telecom provider quotes their peak data rate, they may well mean the 95%ile or even the 99%ile.

2.2.3 Variability (Spread)

Variability is another important concept requiring summary statistics which give us measurements of the spread of the data, including:

• Variance and standard deviation (std.dev), as defined in (1.3.2) in Section 1.3.4.

• IQR = UQ − LQ, which by definition is the range of the central 50% of the data.

Just like the Mean and Median, when comparing the Variance to the IQR, the key difference is that:

IQR is robust to outliers.

2.2.4 The Mode and the Range

The Mode and the Range are summary statistics that are of limited use but are occasionally seen. The sample mode is the most frequently occurring value in a sample (and only really makes sense if the data is discrete). The range is the difference between the largest and smallest values in a sample. It is a measure of spread, but it is rarely used, because it depends only on the two extreme values and provides no information about the rest of the sample.


2.2.5 Summary statistics for different data types

For different types of data, we use different types of summary statistics, which can be summarized as below:

1. Nominal:

X̄ (mean) s (std.dev) %ile (quantiles) p (proportion)

No No No Yes

2. Ordinal:

X̄ (mean) s (std.dev) %ile (quantiles) p (proportion)

? ? ? Yes

With categorical data, each sample item is assigned a category rather than a quantitative value. Therefore it is meaningless to compute the sample mean and sample variance. However, to work with categorical data, numerical summaries are still needed. The two most commonly used ones are frequencies and sample proportions (also known as relative frequencies). The frequency for a given category is simply the number of sample items that fall into that category. The sample proportion is the frequency divided by the sample size.

For ordinal data, categories are often assigned numerical values which reflect their ordered nature (1 for strongly agree, 2 for agree, 3 for neutral, etc.). So sometimes we may use the mean, variance and quantiles, but this is very problematic. Why?


3. Measurement (discrete + continuous):

X̄ (mean) s (std.dev) %ile (quantiles) p (proportion)

Yes Yes Yes ?

With measurement data, we use the mean, the variance and quantiles widely, if not always, to summarize the statistical information from a sample. Proportions might also be useful if we classify the sample values into several categories for some reason.

The next sections concern graphics. WARNING: never underestimate the importance of graphics.

2.3 Graphics for categorical data

For categorical data, we often use

• Tables

• Barplots (bar chart)

• Pie charts.

Example 2.3.1. Initial enrolment of ENGR123 (2017): the data is given in the table below. We can also use barplots and pie charts to represent the data. Many packages have the ability to draw these graphics: Python, Matlab, R, Excel, etc.


SWEN   COMP   CGRA   NWEN   PSYC
121    81     20     15     7

[Barplot: initial enrolment counts for SWEN, COMP, CGRA, NWEN and PSYC.]
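For instance, a minimal matplotlib sketch (using the counts from the table above) that draws the barplot:

import matplotlib.pyplot as plt

majors = ["SWEN", "COMP", "CGRA", "NWEN", "PSYC"]
counts = [121, 81, 20, 15, 7]

plt.bar(majors, counts)                    # one bar per category
plt.ylabel("Initial enrolment")
plt.title("ENGR123 initial enrolment (2017)")
plt.show()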


2.4 Graphics for measurement data

There are 2 popular choices:

• Histogram

• Boxplot

Some people also use dot plots for samples of smaller size, say ≤ 20.

2.4.1 Histogram

A histogram is a graphic that gives an idea of the shape of a sample, indicating regions where sample points are concentrated and regions where they are sparse. There are several types of histograms:

• Frequency histogram: height of bars = frequency

• Relative frequency histogram: height of bar = relative frequency

• Probability histogram: area of bars = relative frequency

Warnings: For the first two types, use EQUAL bin (class) widths.


To construct a histogram:

• Choose boundary points for the class intervals (choose the bins, often with equal bin widths);

• Compute the frequency (or relative frequency) for each class;

• Compute the density for each class, according to the formula

Density = Relative frequency / Class width.

This step is optional if the classes all have the same width.

• Draw a rectangle for each class. If the classes have the same width, the heights of the rectangles may be set equal to the frequencies (or relative frequencies or the densities); if the classes do not all have the same width (probability histogram), the heights of the rectangles must be set equal to the densities.
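A minimal matplotlib sketch of these steps, using hypothetical data: an equal-width frequency histogram, then a density histogram with unequal bins (density=True divides each relative frequency by the class width):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)   # hypothetical measurements

# Frequency histogram: 10 equal-width bins, bar heights = frequencies
plt.hist(data, bins=10)

# Probability histogram: unequal bin widths, bar heights = densities,
# so bar areas equal relative frequencies
plt.figure()
plt.hist(data, bins=[4, 8, 9, 10, 11, 12, 16], density=True)
plt.show()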


Question: Histogram - What to look for?

For histograms, we usually look for:

• Range

As mentioned before, the range is the difference between the largest and the smallest values of the sample.

• Symmetry

A histogram is perfectly symmetric if its right half is a mirror image of its left half. Histograms that are not symmetric are referred to as skewed. In practice, virtually no sample has a perfectly symmetric histogram; all exhibit some degree of skewness.

• Skewness

In a skewed histogram, one side, or tail, is longer than the other. A histogram with a long right-hand tail is said to be skewed to the right, or positively skewed. A histogram with a long left-hand tail is said to be skewed to the left, or negatively skewed.

• Number and location of modes

We use the term mode to refer to the most frequently occurring value in a sample. This term is also used in regard to histograms and other curves to refer to a peak, or local maximum. A histogram is unimodal if it has only one peak, or mode, and bimodal if it has two clearly distinct modes. In principle, a histogram can have more than two modes, but this does not happen often in practice.

• Outliers


Example 2.4.1. An indoor UAV flight was recorded to mm accuracy using camera vision techniques. The flight plan was to reach a certain position and then hover as close as possible to the target position. The first histogram shows the height of the UAV. Of course, the full data set also includes the rise of the UAV to the target position, and we are not so interested in that part of the flight. Hence, the second histogram shows the heights while in hover mode.


2.4.2 Box plots

A boxplot is a graphic that presents the median, the first (the lower) and the third (the upper) quartiles, and any outliers that are present in the sample. Boxplots are easy to understand and come in many forms; the following is the classic form:

[Diagram: a classic boxplot showing the Median inside the box, LQ and UQ as the box edges, the IQR as the box width, whiskers reaching to Min and Max, and outliers plotted as individual points.]

NOTE: in this figure, Max and Min are used to denote the maximum and minimum values that are not classified as outliers. In general though, the smallest value, whether classified as an outlier or not, is called the minimum value; similarly for the maximum value. For the purpose of drawing boxplots, any point that is

greater than UQ + 1.5 IQR, or

less than LQ − 1.5 IQR

is considered an outlier. Other definitions also exist.
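A minimal sketch (assuming numpy, with hypothetical data) of the 1.5 × IQR rule used above:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 25])   # 25 looks suspect
lq, uq = np.percentile(data, [25, 75])
iqr = uq - lq
lower, upper = lq - 1.5 * iqr, uq + 1.5 * iqr      # the outlier fences
print(data[(data < lower) | (data > upper)])       # points flagged as outliers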

Question: Box plots - What to look for?

Look for:

• Range, Median, LQ, UQ,

These can be seen clearly from the boxplot.


• Symmetry

If the median is roughly in the middle of the box, and the lengths of the two whiskers are roughly equal, then the data can be said to be symmetric; otherwise, it is said to be skewed.

• Skewness

If the median is closer to the LQ and/or the left whisker is shorter than the right whisker, then the data is skewed to the right; if the median is closer to the UQ and/or the right whisker is shorter than the left whisker, then the data is skewed to the left.

• Outliers

Outliers are easily seen in boxplots.

Example 2.4.2. Looking at the UAV data again, the boxplots clearly show a good degree of symmetry in height while hovering (see the hover mode boxplot) and the outliers due to the rise of the device (see the full flight boxplot).


2.4.3 Empirical Cumulative Distribution Function (ECDF)

Definition 2.4.3. Given a random sample X1, X2, . . . , Xn, the empirical cumulative distribution function (ECDF) is defined as

F(x) = (1/n) ∑_{i=1}^{n} 1{Xi ≤ x},

where 1{Xi ≤ x} = 1 if Xi ≤ x and 0 otherwise.

In words,

F (x) = Proportion of the sample ≤ x.

The ECDF is what you get when you plot F(x) versus x. The ECDF is usually seen in the form of a staircase, but in fact it has jumps at each value of the data, and we can either sketch those jumps or not. Obviously, the ECDF is a non-decreasing function.

The ECDF is an unbiased estimate of the cumulative distribution function (CDF; see Topics 3 and 4). You may try to prove this yourself.

Question: ECDF - What to look for?

With ECDF, we look for:

• Range, Median, LQ, UQ,

The minimum is the value of x where we see the first jump; the maximum is the value of x where F(x) reaches 1. To find the Median, LQ and UQ, we find the corresponding x for F(x) = 0.5, 0.25 and 0.75 respectively.


• Symmetry

If the ECDF looks symmetric then the data is symmetric; otherwise the data is skewed.

• Skewness

Once we find the Median, LQ and UQ, we can easily judge whether the data is skewed.

• Outliers

Even though outliers are not easily seen from the ECDF, we may notice them if the first or last few jumps are far away from the rest.


Example 2.4.4. Data = {1, 2, 2, 3, 5}. This is what the ECDF looks like:

[Plot: Empirical Cumulative Distribution Function, F(x) versus x, stepping from 0 to 1 at the data values.]
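A minimal Python sketch (assuming numpy and matplotlib) that computes and plots this ECDF:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([1, 2, 2, 3, 5])
xs = np.sort(data)
# F(x) = proportion of the sample <= x, evaluated at the data points
F = np.arange(1, len(xs) + 1) / len(xs)

plt.step(xs, F, where="post")    # the usual staircase form
plt.xlabel("x")
plt.ylabel("F(x)")
plt.title("Empirical Cumulative Distribution Function")
plt.show()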


Example 2.4.5. (One data set looked at 3 different ways)

A time series of length 100 of the numbers of users connected to the Internet through a server every minute (this is an R dataset named WWWusage). The ECDF looks like this:


Figure 2.2: ECDF for WWWusage data

Viewing the same data using a histogram or boxplot, we have the following.

For practice, you should try to convert from each graphic to the others in your head, and then compare what you get to the true graphic.


Figure 2.3: Histogram for WWWusage data


Figure 2.4: Boxplot for WWWusage data


Topic 3

Discrete Random Variables and Distributions

Reading: Croft — §29.2–10 (not 29.4)


3.1 Random Variables

Recall that

• A random variable X assigns values to outcomes in the sample space.

• A discrete random variable is best described by its probability distribution, which gives the values that X may assume and their probabilities.

Example 3.1.1. Roll a pair of dice. The sample space is shown in the image below.

[Image from http://www.tc3.edu/instruct/sbrown/swt/pic/dice-combinations3d-bw.png]

• Let the random variable X be

“the number of times the number 1 faces up”.


• What values can X take? Since the number 1 can be face up on none, one or both of the dice, the random variable X can take on the values 0, 1 or 2 only.

• What is the probability of X taking these values? The table below counts, for each combination of the two dice, how many times the number 1 faces up:

        1  2  3  4  5  6
    1   2  1  1  1  1  1
    2   1  0  0  0  0  0
    3   1  0  0  0  0  0
    4   1  0  0  0  0  0
    5   1  0  0  0  0  0
    6   1  0  0  0  0  0

Therefore, we would write

P(X = 0) = 25/36,   P(X = 1) = 10/36,   P(X = 2) = 1/36

or summarise the probability distribution in a table:

xi          0      1      2
P(X = xi)   25/36  10/36  1/36


3.2 Probability Distribution

Notes —

• Random variables — use capital letters, e.g., X, Y , Z

• Observed values — use lower case letters, e.g., x, y, z.

• We write statements like "P(X = x) = 1/4". Here A is the event "X = x" and P(A) = 1/4.

• The notation F(x) = P(X ≤ x) denotes the probability of the event that the random variable X takes values ≤ x. This is known as the cumulative distribution function (cdf) of X. Note that F(x) is defined for all x ∈ R (real numbers).

• The notation p(x) = P(X = x) stands for the probability that the random variable X takes the particular value x.

• For many situations it will not be necessary to make a list of all probabilities in order to specify a probability distribution, because some simple formula (called the probability function) can be found.

• When all the possible values of a random variable have been considered, the sum of the probabilities must be one.


• A probability distribution must assign probabilities to outcomes so that

(1) 0 ≤ P(X = xi) ≤ 1 for all outcomes xi in the sample space;

(2) ∑_i P(X = xi) = 1.

• A probability distribution may be represented graphically in the form of a bar chart.

Example 3.2.1. Let X be the random variable defined as

“the result of rolling a fair six-sided die”.

then

P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6.

[Bar chart: P(X = x) versus x for the result of rolling a fair die; each bar has height 1/6 ≈ 0.167.]

305

Page 51: ENGR123: Engineering Mathematics with Logic and Statistics

Example 3.2.2. Roll a pair of dice and let the random variable Y be the sum of the numbers that face up.

The table below gives the sum for each combination of the two dice:

        1   2   3   4   5   6
    1   2   3   4   5   6   7
    2   3   4   5   6   7   8
    3   4   5   6   7   8   9
    4   5   6   7   8   9   10
    5   6   7   8   9   10  11
    6   7   8   9   10  11  12

yi          2     3     4     5     6     7     8     9     10    11    12
P(Y = yi)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36


3.3 Mean and Variance of discrete RVs

A discrete random variable assigns values to a discrete set, usually a subset of the integers. The probability distribution of a discrete random variable, X, gives the probabilities of all the possible values of X. This means that if the possible values are arranged in order, there is a gap between each value and the next one. The set of possible values may be infinite.

In particular, a discrete random variable X takes on the values

x1, x2, x3, . . .

with probabilities

P(X = x1), P(X = x2), P(X = x3), . . .

The mean is defined to be the expected value

E(X) = ∑_i xi P(X = xi).

This sum may contain an infinite number of terms.


Notes —

• The expected value of a function f of a random variable X is defined to be

E(f(X)) = ∑_i f(xi) P(X = xi)

• For example, using f(x) = x²,

E(X²) = ∑_i (xi)² P(X = xi)

The variance of a discrete random variable X is defined to be

var(X) = E((X − µ)²) = ∑_i (xi − µ)² P(X = xi)

where µ = E(X).


Notes —

• The variance is an

“average of squared distances (deviations) from the mean”

• An equivalent formula for the variance is

var(X) = E(X²) − (E(X))²

This is the formula we usually use to solve problems.

• The standard deviation is the square root of the variance:

standard deviation(X) = √var(X)

Example 3.3.1. Let X be the number of times the number 1 comes up when rolling two dice (see Example 3.1.1).

xi          0      1      2
P(X = xi)   25/36  10/36  1/36


Then in this problem, the mean and variance of X are as follows:

E(X) = 0 × 25/36 + 1 × 10/36 + 2 × 1/36 = 12/36 = 1/3

E(X²) = 0² × 25/36 + 1² × 10/36 + 2² × 1/36 = 14/36 = 7/18

var(X) = E(X²) − (E(X))² = 7/18 − (1/3)² = 5/18
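These hand calculations are easy to verify numerically; a minimal sketch (assuming numpy):

import numpy as np

x = np.array([0, 1, 2])
p = np.array([25, 10, 1]) / 36     # P(X = 0), P(X = 1), P(X = 2)

EX = (x * p).sum()                 # 1/3
EX2 = (x**2 * p).sum()             # 7/18
print(EX, EX2, EX2 - EX**2)        # the variance is 5/18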


Example 3.3.2. Let the random variable Y be the sum of the numbers that face up when rolling a pair of dice (see Example 3.2.2).

yi          2     3     4     5     6     7     8     9     10    11    12
P(Y = yi)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

E(Y) = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36) + 8(5/36) + 9(4/36) + 10(3/36) + 11(2/36) + 12(1/36)
     = (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12)/36
     = 252/36 = 7

E(Y²) = 2²(1/36) + 3²(2/36) + 4²(3/36) + 5²(4/36) + 6²(5/36) + 7²(6/36) + 8²(5/36) + 9²(4/36) + 10²(3/36) + 11²(2/36) + 12²(1/36)
      = (4 + 18 + 48 + 100 + 180 + 294 + 320 + 324 + 300 + 242 + 144)/36
      = 1974/36 = 329/6

var(Y) = E(Y²) − (E(Y))² = 329/6 − (7)² = 35/6


Example 3.3.3. Consider the random variable X with the probability distribution given below.

xi          2    5
P(X = xi)   0.4  0.6

This is another simple example for more practice. We have

E(X) = 2 × 0.4 + 5 × 0.6 = 3.8

E(X²) = 2² × 0.4 + 5² × 0.6 = 16.6

var(X) = E(X²) − (E(X))² = 16.6 − (3.8)² = 2.16


Rules for Expectation and Variance:

• E(aX + b) = aE(X) + b

• E(aX + bY ) = aE(X) + bE(Y )

• Var(aX + b) = a²Var(X)

The above three rules are true for any random variables X and Y, either discrete or continuous.

In addition, if X and Y are INDEPENDENT random variables, then:

• Var(aX + bY) = a²Var(X) + b²Var(Y)
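These rules are easy to check by simulation; a minimal sketch (assuming numpy), with arbitrarily chosen distributions for the independent X and Y:

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, -3.0
X = rng.normal(1.0, 2.0, size=100_000)       # X and Y generated independently
Y = rng.exponential(1.5, size=100_000)

lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y)
print(lhs, rhs)                              # close, up to simulation error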


3.4 Bernoulli Distribution

Here the random variable X can take only the values 0 and 1 with

P(X = 0) = 1 − p,   P(X = 1) = p

where 0 < p < 1.

Example 3.4.1. Let X be the number of heads obtained on a single toss of a biased (unfair) coin, where the probability of a head is p.

Question: What are the mean and variance of a Bernoulli random variable?

Simply applying the formulae for the mean and variance, we obtain

E(X) = 0 × (1 − p) + 1 × p = p

E(X²) = 0² × (1 − p) + 1² × p = p

var(X) = E(X²) − (E(X))² = p − p² = p(1 − p)

Hence, in a short summary, we have

If X ∼ Bernoulli(p) then

P (X = 0) = 1− p

P (X = 1) = p

E(X) = p

var(X) = p(1− p)


3.5 Binomial Distribution

Reading: Croft — §29.10

• Consider a population of N elements, e.g., a bucket with N marbles.

• Of these, M elements have a certain property, e.g., M marbles are white.

• We take a sample of n < N elements.

• Before taking a marble, we put the previous marble back, i.e., sampling with replacement. Consequently the bucket contains N marbles every time we start taking a marble. We may take the same marble we sampled before.

• Let p denote the probability of "success" (a white marble), so p = M/N.

• Let X be the number of successes (white marbles) in the sample.


Then X follows the binomial distribution and

P(X = k) = C(n, k) p^k (1 − p)^(n−k)

with k = 0, 1, . . . , n and 0 ≤ p ≤ 1, where the binomial coefficient is

C(n, k) = n! / (k!(n − k)!)

Binomial conditions

(1) There are only two possible outcomes for each trial, called "success" and "failure".

(2) The probabilities of a success and a failure remain the same from trial to trial.

(3) There are n trials, and n is constant.

(4) The n trials are independent of one another.


If X ∼ Binomial(n, p) then

X = "number of successes from n independent trials"

P(X = k) = C(n, k) p^k (1 − p)^(n−k)

E(X) = np

var(X) = np(1 − p)

Example 3.5.1. Suppose we toss a biased coin (p = P(H) = 0.2) ten times and count the number of heads.

(a) Find the probability of getting exactly 3 heads.

(b) Find the probability of getting less than 3 heads.

(c) Find the probability of getting more than 3 heads.

Solution:

Let X be the number of heads from n = 10 trials with p = 0.2.

Using the formula . . .

(a) P(X = 3) = C(10, 3) (0.2)³ (0.8)⁷
             = 120 × 0.008 × 0.2097152
             = 0.201326592 exactly


(b) P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
             = C(10, 0)(0.2)⁰(0.8)¹⁰ + C(10, 1)(0.2)¹(0.8)⁹ + C(10, 2)(0.2)²(0.8)⁸
             = 0.1073741824 + 0.268435456 + 0.301989888
             = 0.6777995264 exactly

(c) P(X > 3) = 1 − (P(X = 3) + P(X < 3))
             = 1 − (0.201326592 + 0.6777995264)
             = 0.1208738816 exactly
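The same three probabilities can be looked up with scipy.stats (a sketch; pmf, cdf and sf give P(X = k), P(X ≤ k) and P(X > k) respectively):

from scipy.stats import binom

n, p = 10, 0.2
print(binom.pmf(3, n, p))    # (a) P(X = 3)
print(binom.cdf(2, n, p))    # (b) P(X < 3) = P(X <= 2)
print(binom.sf(3, n, p))     # (c) P(X > 3)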


Using tables . . .

Binomial Probabilities P(X = x), X ∼ Binomial(n = 10, p)

x \ p   0.01    0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50
0       0.9044  0.5987  0.3487  0.1969  0.1074  0.0563  0.0282  0.0135  0.0060  0.0025  0.0010
1       0.0914  0.3151  0.3874  0.3474  0.2684  0.1877  0.1211  0.0725  0.0403  0.0207  0.0098
2       0.0042  0.0746  0.1937  0.2759  0.3020  0.2816  0.2335  0.1757  0.1209  0.0763  0.0439
3       0.0001  0.0105  0.0574  0.1298  0.2013  0.2503  0.2668  0.2522  0.2150  0.1665  0.1172
4               0.0010  0.0112  0.0401  0.0881  0.1460  0.2001  0.2377  0.2508  0.2384  0.2051
5               0.0001  0.0015  0.0085  0.0264  0.0584  0.1029  0.1536  0.2007  0.2340  0.2461
6                       0.0001  0.0012  0.0055  0.0162  0.0368  0.0689  0.1115  0.1596  0.2051
7                               0.0001  0.0008  0.0031  0.0090  0.0212  0.0425  0.0746  0.1172
8                                       0.0001  0.0004  0.0014  0.0043  0.0106  0.0229  0.0439
9                                                       0.0001  0.0005  0.0016  0.0042  0.0098
10                                                                      0.0001  0.0003  0.0010

Cumulative Binomial Probabilities P(X ≥ x), X ∼ Binomial(n = 10, p)

x \ p   0.01    0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50
0       1       1       1       1       1       1       1       1       1       1       1
1       0.0956  0.4013  0.6513  0.8031  0.8926  0.9437  0.9718  0.9865  0.9940  0.9975  0.9990
2       0.0043  0.0861  0.2639  0.4557  0.6242  0.7560  0.8507  0.9140  0.9536  0.9767  0.9893
3       0.0001  0.0115  0.0702  0.1798  0.3222  0.4744  0.6172  0.7384  0.8327  0.9004  0.9453
4               0.0010  0.0128  0.0500  0.1209  0.2241  0.3504  0.4862  0.6177  0.7340  0.8281
5               0.0001  0.0016  0.0099  0.0328  0.0781  0.1503  0.2485  0.3669  0.4956  0.6230
6                       0.0001  0.0014  0.0064  0.0197  0.0473  0.0949  0.1662  0.2616  0.3770
7                               0.0001  0.0009  0.0035  0.0106  0.0260  0.0548  0.1020  0.1719
8                                       0.0001  0.0004  0.0016  0.0048  0.0123  0.0274  0.0547
9                                                       0.0001  0.0005  0.0017  0.0045  0.0107
10                                                                      0.0001  0.0003  0.0010


(a) Using the table of binomial probabilities P (X = x)

P (X = 3) = 0.2013 (4dp)

(b) Using the table of cumulative binomial probabilities P(X ≥ x):

P(X < 3) = 1 − P(X ≥ 3) = 1 − 0.3222 = 0.6778 (4dp)

(c) P (X > 3) = P (X ≥ 4) = 0.1209 (4dp)


For your reference, the following graphs show the SHAPES of Binomial distributions according to the probability of success p. The red curve gives the theoretical probabilities. The histogram gives relative frequencies obtained from a particular sample drawn from a binomial distribution.


Figure 3.1: Small p



Figure 3.2: Medium p


Figure 3.3: Large p


3.6 Geometric Distribution

Suppose that we have a biased (unfair) coin with

P(H) = p,   P(T) = 1 − p

where 0 < p < 1.

This time we make a succession of tosses and define the random variable X to be

the number of tosses up to and including the first head

(a “success”).

P(X = 1) = P(H) = p

P(X = 2) = P(TH) = (1 − p)p

P(X = 3) = P(TTH) = (1 − p)²p

...

P(X = k) = P((k − 1) tails then a head) = (1 − p)^(k−1) p

The sequence

P(X = 1), P(X = 2), P(X = 3), P(X = 4), . . .

is the geometric sequence

p, (1 − p)p, (1 − p)²p, (1 − p)³p, . . .

with first term a = p and multiplier r = (1 − p).

So the infinite sum

p + (1 − p)p + (1 − p)²p + (1 − p)³p + · · · = a/(1 − r) = p/(1 − (1 − p)) = p/p = 1,

i.e., the probabilities P(X = xi) add up to one.

Question: What are the mean and variance of a Geometric random variable?

Let q = 1 − p. Then (watch this clever trick)

E(X) = 1·p + 2·qp + 3·q²p + 4·q³p + · · ·
     = (1 − q) + 2(1 − q)q + 3(1 − q)q² + 4(1 − q)q³ + · · ·
     = (1 − q) + (2q − 2q²) + (3q² − 3q³) + (4q³ − 4q⁴) + · · ·
     = 1 + (−q + 2q) + (−2q² + 3q²) + (−3q³ + 4q³) + · · ·
     = 1 + q + q² + q³ + · · ·   (geometric series)
     = 1/(1 − q) = 1/p

A little bit of trickier maths (with a similar trick) gives

var(X) = (1 − p)/p²


In summary, we have

If X ∼ Geometric(p) then

X = "number of trials up to and including the first success"

P(X = k) = (1 − p)^(k−1) p,   k = 1, 2, 3, . . .

E(X) = 1/p

var(X) = (1 − p)/p²

Geometric conditions

(1) There are only two possible outcomes for each trial (binary outcomes);

(2) Probability of “success” (p) remains constant

(3) The trials are independent of one another.
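As a quick check, scipy.stats has a geometric distribution that counts trials up to and including the first success, matching the definition above (a sketch):

from scipy.stats import geom

p = 0.25
print(geom.pmf(3, p))              # (1 - p)^2 * p = 0.140625
print(geom.mean(p), geom.var(p))   # 1/p = 4.0 and (1 - p)/p^2 = 12.0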


3.7 Poisson distribution

Reading: Croft — §29.11

• Consider the scenario where we need to count the number of events in a continuum (a given interval of time or space).

• Events within that specific interval of time (or space) occur:

– singly: meaning that no two events happen at exactly the same time;

– at a constant rate: meaning that the probabilities of observing an event are the same for two intervals of the same length;

– independently: meaning that the fact that some event happens does not affect the probability that some other event happens.

• Let X denote the number of events occurring in a period of time (or a given interval of space).


Then X follows the Poisson distribution and

P(X = k) = e^(−λ) λ^k / k!,

with k = 0, 1, 2, . . . . The parameter λ is actually the mean of the distribution (shown later). In notation, we write

X ∼ Poi(λ).

Example 3.7.1. Several examples of variables which could possibly be modelled as Poisson random variables are given below. Do you think that a Poisson model is a good one?

• Number of visits to a website on a given day;

• The number of misprints on a page of a book;

• The number of wrong telephone numbers dialled in a day;

• The number of α-particles discharged in a fixed period of timefrom some radioactive source;

For your reference, the following graphs show the SHAPES of Poisson distributions for some particular λ values. As before, the red curve gives the actual probabilities and the histogram gives relative frequencies obtained by sampling.


[Plots: relative-frequency histograms of Poisson samples for λ = 2, λ = 5 and λ = 10, each with the theoretical probabilities overlaid.]


Question: What are the mean and variance of a Poisson random variable?

Using the fact that

∑_{k=0}^{∞} P(X = k) = ∑_{k=0}^{∞} e^(−λ) λ^k / k! = 1

we have (watch this clever trick)

E(X) = ∑_{k=0}^{∞} k e^(−λ) λ^k / k!
     = ∑_{k=1}^{∞} k e^(−λ) λ^k / k!
     = ∑_{k=1}^{∞} λ e^(−λ) λ^(k−1) / (k − 1)!
     = λ ∑_{k=0}^{∞} e^(−λ) λ^k / k!
     = λ.

With the same trick we will be able to calculate

E(X²) = λ² + λ.

Therefore,

var(X) = E(X²) − [E(X)]² = (λ² + λ) − λ² = λ.

In summary,


If X ∼ Poi(λ) then

X = "number of events occurring in an interval of time (or space)"

and

P(X = k) = λ^k e^(−λ) / k!,   k = 0, 1, . . .

E(X) = λ

var(X) = λ.

Example 3.7.2. Suppose that the number of customers in a one-minute interval coming to a checkout counter in a fast-food store follows a Poisson distribution with λ = 2.

1. Find the probability of having less than 2 customers in a minute.

2. Find the probability of having exactly 3 customers in a minute.

3. Find the probability of having greater than 3 customers in a minute.

Solutions: X ∼ Poi(2).


1. P(X < 2) = P(X = 0) + P(X = 1)
            = e^(−2) × 2⁰/0! + e^(−2) × 2¹/1!
            = 0.1353 + 0.2707 = 0.406.

2. P(X = 3) = e^(−2) × 2³/3! = 0.1804.

3. P(X > 3) = 1 − [P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)]
            = 1 − (0.1353 + 0.2707 + 0.2707 + 0.1804)
            = 0.1429.

ANOTHER WAY is to look up the probabilities in a table.
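Yet another way is scipy.stats; a minimal sketch checking the three answers above:

from scipy.stats import poisson

lam = 2
print(poisson.cdf(1, lam))   # 1. P(X < 2) = P(X <= 1)
print(poisson.pmf(3, lam))   # 2. P(X = 3)
print(poisson.sf(3, lam))    # 3. P(X > 3)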


3.8 Poisson/Binomial overlap

In fact, one common way to think of the Poisson distribution is as an approximation to the binomial distribution when n is large and the probability of success p is small. We illustrate with an example.

A mass contains 10000 atoms of a radioactive substance. The probability that a given atom will decay in a one-minute time period is 0.0002. Let X represent the number of atoms that decay in one minute. Now, each atom can be thought of as a Bernoulli trial, where success occurs if the atom decays. Thus X is the number of successes in 10000 independent Bernoulli trials, each with probability 0.0002, so the distribution of X is Bin(10000, 0.0002). Then the mean of X is µX = 10000 × 0.0002 = 2.

Another mass contains 5000 atoms, and each of these atoms has probability 0.0004 of decaying in a one-minute time interval. Let Y represent the number of atoms that decay in one minute from this mass. By the reasoning in the previous paragraph, Y ∼ Bin(5000, 0.0004) and µY = 5000 × 0.0004 = 2.

In each of these cases, the number of trials n and the success probability p are different, but the mean number of successes, which is equal to the product np, is the same. Look at the following two probabilities:

P(X = 3) = [10000!/(3! 9997!)] (0.0002)³ (0.9998)^9997 = 0.180465091

P(Y = 3) = [5000!/(3! 4997!)] (0.0004)³ (0.9996)^4997 = 0.180483143.

It turns out that these probabilities are very nearly equal to each other. This suggests that when n is large and p is small, the mass function depends almost entirely on the mean np, and very little on the specific values of n and p. We can therefore approximate the binomial mass function with a quantity that depends on the product np only. Specifically, if n is large and p is small, let λ = np; then it can be shown by advanced methods that X is well-approximated by a Poisson random variable with mean λ = np.
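A minimal scipy.stats sketch of this comparison, using the two binomials above and their common Poisson limit with λ = 2:

from scipy.stats import binom, poisson

print(binom.pmf(3, 10000, 0.0002))   # P(X = 3), approx 0.18047
print(binom.pmf(3, 5000, 0.0004))    # P(Y = 3), approx 0.18048
print(poisson.pmf(3, 2))             # Poisson(2) gives approx 0.18045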

Conversely, suppose that X is the number of events in T seconds and E(X) = λ. Also assume that the usual Poisson assumptions hold.

[Timeline from 0 to T: events occur singly, randomly and at a constant rate; you expect λ events.]

Divide the interval [0, T] into n equal sub-intervals and let p be the probability of at least one event occurring in a sub-interval. Denote by X1 the number of sub-intervals containing one or more events; then

X1 ∼ Bin(n, p),   E(X1) = np.

[Timeline from 0 to T divided into sub-intervals: events occur singly, randomly and at a constant rate in every sub-interval.]

Imagine that we make the sub-intervals smaller and smaller (big n) with λ = np; then X and X1 are similar.


In summary,

• A Poisson random variable X ∼ Poi(λ) can be approximated by a Binomial random variable X1 ∼ Bin(n, λ/n) provided that n is large enough.

• A random variable X following Bin(n, p) with really big n and small p is approximated by a Poisson random variable with parameter λ = np.

Question: Why bother with Poisson if we can fudge it with a binomial?

Answer: Poisson is a more direct model and simpler!

Example 3.8.1. A manufacturer produces VLSI chips, 1% of which are defective. Find the probability (to 3 decimal places) that in a box containing 100 chips no defectives are found, using both the binomial calculation and the Poisson approximation.

Solutions: X = number of defective VLSI chips. X ∼ Bin(100, 0.01).

Using Binomial calculation

P(X = 0) = [100!/(0! 100!)] (0.01)^0 (0.99)^100 = 0.366.

Using Poisson approximation: X ∼ Poi(1) (since np = 100×0.01 = 1).

P(X = 0) = (e^{−1} × 1^0)/0! = 0.367.
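As a quick sanity check, in the spirit of the scipy.stats look-ups shown below (a sketch, not part of the original solution):

from scipy.stats import binom, poisson

print(binom.pmf(0, 100, 0.01))   # 0.3660 -- exact binomial calculation
print(poisson.pmf(0, 1))         # 0.3679 -- Poisson approximation, lambda = np = 1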


We can use ipython --pylab to look up probabilities.

from scipy.stats import *

# If X follows Binomial(n=10, p=0.4) then P(X=3) is
prob = binom.pmf(3, 10, 0.4)

# If X follows Geometric(p=0.4) then P(X=3) is
prob = geom.pmf(3, 0.4)

# If X follows Poisson(lambda=2) then P(X=3) is
prob = poisson.pmf(3, 2)

# Plot the Binomial(10, 0.4) pmf (clf, linspace, bar, ... come from --pylab)
clf()
x = linspace(0, 10, 11)
y = binom.pmf(x, 10, 0.4)
bar(x, y)
xlabel("x", fontsize=16)
ylabel("P(X=x)", fontsize=16)
title("Binomial(10,0.4) distribution", fontsize=16)
axis([0, 10, 0, 0.3])


3.9 Recap

• A random variable X assigns values to outcomes in the sample space.

• A probability distribution gives the values that X may assume and their probabilities.

• 0 ≤ P(X = xi) ≤ 1 and Σi P(X = xi) = 1

• The mean and variance of X are given by

E(X) = Σi xi P(X = xi)

E(X²) = Σi xi² P(X = xi)

var(X) = E(X²) − (E(X))²

• If X ∼ Bernoulli(p) then P(X = 0) = 1 − p, P(X = 1) = p, E(X) = p and var(X) = p(1 − p)

• If X ∼ Geometric(p) then X is the “number of trials up to and including the first success”, P(X = k) = (1 − p)^{k−1} p for k = 1, 2, . . ., E(X) = 1/p and var(X) = (1 − p)/p²

• If X ∼ Binomial(n, p) then X is the “number of successes from n independent trials”, P(X = k) = (n choose k) p^k (1 − p)^{n−k}, E(X) = np and var(X) = np(1 − p)

• If X ∼ Poisson(λ) then X is the “number of events occurring in a fixed interval of time (or space)”, P(X = k) = e^{−λ} λ^k / k!, E(X) = λ and var(X) = λ


– We can look up the probability mass P(X = k) for the Binomial distribution using Microsoft Excel

= BINOMDIST(k,n,p,0)

and look up (n choose k) with

= COMBIN(n,k)

Similarly, we can use Microsoft Excel for the Poisson distribution and, in the next topic, for continuous distributions.


Topic 4

Continuous Random Variables and Distributions

Reading: Croft — §29.4,7–8,12–14


4.1 Continuous Random Variables

A continuous random variable is a random variable which can take on any value in a given range.

Example 4.1.1. Heights and weights of people, magnitudes of earthquakes, levels of reservoirs, blood cholesterol levels, etc.

We cannot define probabilities for continuous random variables in the same way as for discrete random variables.

• If X ∼ Binomial(n, p) then X is discrete and takes on the values 0, 1, 2, . . . , n. Hence we can easily write down the probabilities

P(X = 0), P(X = 1), . . . , P(X = n)

• Suppose Y is a continuous random variable which can take on any value in R.

– Y has an uncountably infinite set of possible values.

– We cannot list probabilities which describe the behaviour of Y.

– Instead it turns out that the best way to describe the behaviour of Y is to look at probabilities of the form

P(Y ≤ y).


4.1.1 Probability Distribution (PDF and CDF)

For any continuous random variable X there is a function f(x) such that

P(X ≤ x) = ∫_{−∞}^{x} f(u) du   for all real numbers x

and

f(x) ≥ 0   for all real numbers x

The function f(x) is called the probability density function (pdf) of X.

The function

F(u) = P(X ≤ u) = ∫_{−∞}^{u} f(x) dx

is called the cumulative distribution function (cdf) of X


For any continuous random variable X with probability density function (pdf) f(x)

(1) P(a < X < b) = ∫_{a}^{b} f(x) dx = F(b) − F(a)

(2) ∫_{−∞}^{∞} f(x) dx = 1


[Figure: the cdf F_X(x) rises from 0 to 1, with P(a < X < b) = F(b) − F(a); for the pdf f_X(x), P(a < X < b) is the area under the curve between x = a and x = b. Total area under the curve = 1.]


Example 4.1.2. Let X be a continuous random variable with pdf given by

f(x) = { 3x²   0 ≤ x ≤ 1
       { 0     otherwise

• For u < 0 we have F(u) = 0

• For 0 ≤ u ≤ 1 we have

F(u) = ∫_{−∞}^{u} f(x) dx = 0 + ∫_{0}^{u} f(x) dx
     = ∫_{0}^{u} 3x² dx = [x³]_{0}^{u} = u³ − 0³ = u³

• For u > 1 we have F(u) = 1

• In summary

F(u) = { 0    u < 0
       { u³   0 ≤ u ≤ 1
       { 1    u > 1

So


P(X ≤ 1/2) = F(1/2) = (1/2)³ = 1/8

P(X ≥ 2/3) = 1 − F(2/3) = 1 − (2/3)³ = 19/27

P(X = 5/6) = 0   Why is this ??

P(X ≤ 3) = F(3) = 1


Checklist

CDF F (u)

• is a probability, i.e., F (u) = P (X ≤ u)

• must satisfy 0 ≤ F (u) ≤ 1 for all u

• Whenever a ≤ b we have P(X ≤ a) ≤ P(X ≤ b) so

F(a) ≤ F(b) whenever a ≤ b

i.e., F(u) is a non-decreasing function of u.

• Since P(X ≤ −∞) = 0 and P(X ≤ ∞) = 1 we must have

lim_{u→−∞} F(u) = 0   and   lim_{u→∞} F(u) = 1

PDF f (x)

• is not a probability and so f(x) > 1 is possible (and ok).

• must have f(x) ≥ 0 for all x.

• must be a piece-wise continuous function.

• The probability P(a ≤ X ≤ b) is given by the area under f(x) between x = a and x = b.

• Since the probability that X lies in (−∞, ∞) must be 1 we must have

∫_{−∞}^{∞} f(x) dx = 1


Example 4.1.3. A company has been monitoring its daily telephone usage. The daily use of time conforms to the following pdf (measured in hours). [Using your modelling skills - why is this unlikely to be a good model in the real world?]

f(x) = { (3/64) x² (4 − x)   0 ≤ x ≤ 4
       { 0                   otherwise

(a) Sketch a graph of f(x).

Clearly f(x) = 0 when x = 0 or x = 4.

f(x) = −(3/64)x³ + (3/16)x², a cubic,

f′(x) = −(9/64)x² + (3/8)x = 0

x = 0 or (9/64)x = 3/8, i.e. x = 8/3

so the turning points are at (0, 0) and (8/3, 4/9).


(b) Find the cdf for daily telephone usage X.

For 0 ≤ u ≤ 4,

∫_{−∞}^{u} f(x) dx = ∫_{0}^{u} (3/64) x² (4 − x) dx
                   = ∫_{0}^{u} (−(3/64)x³ + (3/16)x²) dx
                   = [−(3/256)x⁴ + (1/16)x³]_{0}^{u}
                   = −(3/256)u⁴ + (1/16)u³

Therefore

F(u) = { 0                         u < 0
       { −(3/256)u⁴ + (1/16)u³     0 ≤ u ≤ 4
       { 1                         u > 4


(c) The current budget of the company covers only 3 hours of daily telephone usage. What is the probability that the budgeted figure is exceeded?

P(X > 3) = 1 − P(X ≤ 3) = 1 − F(3)
         = 1 − (−(3/256)(3)⁴ + (1/16)(3)³)
         = 67/256 = 0.2617 (4dp)


Checking Example 4.1.2 with SymPy —

>>> from sympy import *
>>> x, u = symbols("x u")
>>> f = 3*x**2
>>> F = integrate(f, (x,0,u))
>>> F
u**3
>>> F.subs(u,Rational(1,2))
1/8
>>> 1-F.subs(u,Rational(2,3))
19/27

Checking Example 4.1.3 with SymPy —

>>> f = Rational(3,64)*x**2*(4-x)
>>> F = integrate(f, (x,0,u))
>>> F
-3*u**4/256 + u**3/16
>>> 1-F.subs(u,3)
67/256


4.1.2 Mean and Variance

Compare the following definitions with those in Section 3.3 for discrete random variables.

For a continuous random variable X with pdf f(x) the expected value of X (also called the mean of X) is

E(X) = ∫_{−∞}^{∞} x f(x) dx

Notes —

• X is a random variable and E(X) is a number.

• The expected value of a function g of a random variable X is defined to be

E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx

• For example, using g(x) = x²

E(X²) = ∫_{−∞}^{∞} x² f(x) dx


• If X is a continuous random variable with pdf f(x) then

E(aX + b) = aE(X) + b

for any a, b ∈ R, i.e., expectation is a linear operator.

For a continuous random variable X with pdf f(x)

var(X) = E((X − µ)²) = E(X²) − µ²

where µ = E(X).

Example 4.1.4. (continuing Example 4.1.3)

A company has been monitoring its daily telephone usage. The daily use of time conforms to the following pdf (measured in hours).

f(x) = { (3/64) x² (4 − x)   0 ≤ x ≤ 4
       { 0                   otherwise

(d) Find the mean and variance of the distribution of the company’s


daily telephone usage.

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{0}^{4} x (3/64) x² (4 − x) dx
     = ∫_{0}^{4} (−(3/64)x⁴ + (3/16)x³) dx = [−(3/320)x⁵ + (3/64)x⁴]_{0}^{4}
     = 12/5 = 2.4 hours

E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_{0}^{4} x² (3/64) x² (4 − x) dx
      = ∫_{0}^{4} (−(3/64)x⁵ + (3/16)x⁴) dx = [−(3/384)x⁶ + (3/80)x⁵]_{0}^{4}
      = 32/5 = 6.4

var(X) = E(X²) − (E(X))² = 6.4 − (2.4)² = 0.64


Example 4.1.5. Let X be a continuous random variable with pdf given by (see Example 4.1.2)

f(x) = { 3x²   0 ≤ x ≤ 1
       { 0     otherwise

Find the mean and variance of X.

Solution:

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{0}^{1} x · 3x² dx
     = ∫_{0}^{1} 3x³ dx = [(3/4)x⁴]_{0}^{1} = 3/4

E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_{0}^{1} x² · 3x² dx
      = ∫_{0}^{1} 3x⁴ dx = [(3/5)x⁵]_{0}^{1} = 3/5

var(X) = E(X²) − (E(X))² = 3/5 − (3/4)² = 3/80


Check using SymPy —

from sympy import *
x = Symbol("x")

f = 3*x**2
E = integrate(x*f, (x,0,1))        # 3/4
E2 = integrate(x**2*f, (x,0,1))    # 3/5
var = E2 - E**2                    # 3/80


4.2 Exponential Distribution

Situations where

• the data take on positive values only,

• are more likely to be near the origin,

• are increasingly less likely to be in an interval of fixed width as it moves further to the right.

A common situation is when we are measuring times between certain events of interest. As long as the events occur in a manner similar to events whose number is well modelled by a Poisson variable, then the time between them can be well approximated as an exponential random variable. Hence, we must check that the events occur (at least approximately) singly, independently and at constant rate.

Example 4.2.1. Some random variables below might be well-modelled by an exponential distribution. What do you think? Is the model likely to be good or bad?

• Time between vehicles passing a fixed point on a motorway,

• Time between accidents on an airline,

• How long you have to wait for a lift in an office building,

• etc.


Definition 4.2.2.

A continuous random variable X has an exponential distribution with parameter θ (θ > 0) if the pdf of X is

f(x) = { θ e^{−θx}   x ≥ 0
       { 0           x < 0

We write X ∼ Exp(θ).

4.2.1 Examples of Exponential PDF

[Figure: PDF of exponential distributions for λ = 1, 2, 4.]


The cdf of X ∼ Exp(θ) is given by

F_X(x) = { 0             x < 0
         { 1 − e^{−θx}   x ≥ 0

Proof. For x > 0

F_X(x) = ∫_{−∞}^{x} f(u) du = ∫_{0}^{x} θ e^{−θu} du = [−e^{−θu}]_{0}^{x} = 1 − e^{−θx}
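As a quick numerical check with scipy.stats (a sketch; note scipy parameterises the exponential by scale = 1/θ, and the θ = 2 below is just an illustrative choice):

from math import exp
from scipy.stats import expon

theta = 2.0                             # illustrative rate
print(expon.cdf(1.5, scale=1/theta))    # cdf at x = 1.5
print(1 - exp(-theta*1.5))              # matches 1 - e^{-theta x}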


4.2.2 Examples of Exponential CDF

[Figure: CDF of exponential distributions for λ = 1, 2, 4.]


Let X ∼ Exp(θ). Then

µ_X = E(X) = 1/θ

σ²_X = var(X) = 1/θ²

Proof. Mean (using “integration by parts”)

E(X) = ∫_{0}^{∞} x θ e^{−θx} dx
     = [x(−e^{−θx})]_{0}^{∞} − ∫_{0}^{∞} (−e^{−θx}) dx
     = (0 − 0) + [e^{−θx}/(−θ)]_{0}^{∞}
     = 1/θ


Second moment (using “integration by parts”)

E(X²) = ∫_{0}^{∞} x² θ e^{−θx} dx
      = [x²(−e^{−θx})]_{0}^{∞} − ∫_{0}^{∞} 2x (−e^{−θx}) dx
      = (0 − 0) + (2/θ) ∫_{0}^{∞} x θ e^{−θx} dx
      = (2/θ) E(X)
      = 2/θ²

Variance

var(X) = E(X²) − (E(X))²
       = 2/θ² − (1/θ)²
       = 1/θ²


Example 4.2.3. The duration, X, in minutes of phone calls from company business phones is a continuous random variable with pdf

f(x) = { (1/6) e^{−x/6}   x ≥ 0
       { 0                x < 0

(a) Derive the cdf, and show that this function is, in fact, a satisfactory cdf.

(b) Calculate the probability that a call will last between 3 and 6 minutes.

(c) Find the expected length of a call.

(d) Determine the variance of call length.

Solution:

(a) For x ≥ 0 we have

F_X(x) = P(X ≤ x) = ∫_{0}^{x} (1/6) e^{−u/6} du
       = [−e^{−u/6}]_{0}^{x}
       = −e^{−x/6} − (−1)
       = 1 − e^{−x/6}


Hence the CDF is

F_X(x) = { 1 − e^{−x/6}   x ≥ 0
         { 0              x < 0

We can confirm that this is a satisfactory cdf

• F(−∞) = 0 by definition.

• F(∞) = 1 − 0 = 1 since lim_{x→∞} e^{−x/6} = 0

• F′(x) = f(x) ≥ 0 so F(x) is an increasing function of x

(b)

P(3 < X ≤ 6) = ∫_{3}^{6} f(x) dx
             = F(6) − F(3)
             = (1 − e^{−6/6}) − (1 − e^{−3/6})
             = e^{−1/2} − e^{−1}
             = 0.6065306 − 0.3678794
             = 0.2387

(c) The expected length of a call, using the result above, is

µ = E(X) = 1/θ = 1/(1/6) = 6 minutes


(d) Using the result above

var(X) = σ² = (1/θ)² = (1/(1/6))² = 36
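A quick check of (b)–(d) with scipy.stats (a sketch; scipy's expon takes scale = 1/θ = 6 here):

from scipy.stats import expon

X = expon(scale=6)            # theta = 1/6, so scale = 6
print(X.cdf(6) - X.cdf(3))    # P(3 < X <= 6) = 0.2387
print(X.mean(), X.var())      # 6.0 and 36.0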

Example 4.2.4. A sugar refinery has three processing factories, all of which receive raw sugar in bulk. The amount of sugar that one factory can process in one day can be modelled as having an exponential distribution with a mean of 4 tons for each of the three plants.

(a) If the factories operate independently, find the probability that exactly two of the three factories will process more than 4 tons on a given day.

(b) Consider one particular factory. How much raw sugar should be stocked for that factory each day so that the chance of running out of product is only 0.05?

Solution: X ∼ Exp(1/4) since E(X) = 4.

(a)

P(X > 4) = 1 − P(X ≤ 4)
         = 1 − (1 − e^{−4/4})
         = e^{−1} = 0.3679 (4dp)


Next is a Binomial situation

P(exactly two factories process > 4 tons) = (3 choose 2)(0.3679)²(1 − 0.3679) = 0.2567 (4dp)

(b)

P(X > a) = e^{−a/4} = 0.05

a = −4 ln(0.05) = 11.98 (2dp)
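Part (b) can be sanity-checked with the exponential quantile function (a sketch with scipy.stats):

from scipy.stats import expon

# a is the 95th percentile of Exp(1/4), i.e. of expon(scale=4)
print(expon.ppf(0.95, scale=4))    # 11.98..., matching -4 ln(0.05)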

4.2.3 The exponential and Poisson distributions

Assume that events occur singly, independently and at constant rate. Let the rate be called λ. Hence, the expected number of events per unit time is λ. Therefore, in a period of time of length t, the expected number of events is λt. Hence, the number of events, X, is a Poisson random variable with X ∼ Poi(λt). If T represents the waiting time from any starting point until the next event, then T ∼ Exp(λ).

Example 4.2.5. Assume that the number of cars passing a traffic light on a quiet road follows a Poisson distribution with a rate of 60 per hour.

Then, the amount of time (in minutes) between arrivals can be modelled by an exponential distribution with rate λ = 1.


4.2.4 Memoryless (or Forgetfulness) Property of the Exponential Distribution

If A ∼ Exp(λ) then for all t > 0 and h > 0

P (A > t+ h | A ≥ t) = P (A > h)

Proof.

P(A > t + h | A ≥ t) = P[(A ≥ t) ∩ (A > t + h)] / P(A ≥ t)
                     = P(A > t + h) / P(A ≥ t)
                     = e^{−λ(t+h)} / e^{−λt}
                     = e^{−λh}
                     = P(A > h)


• The above result says that the probability distribution of the time until the next event does not depend on how long it has been since the last event.

• The exponential distribution is the only continuous distribution with this memoryless property.


Example 4.2.6. Suppose that the amount of time a person spends in a bank is exponentially distributed with mean ten minutes, that is, λ = 1/10.

(a) What is the probability that a customer will spend more than fifteen minutes in the bank?

(b) What is the probability that a customer will spend more than fifteen minutes in the bank given that she is still in the bank after ten minutes?

Solution:

(a)

P(X > 15) = e^{−15λ} = e^{−15/10} = 0.223

(b) We want P(X > 15 | X > 10).

However, since the exponential distribution does not “remember” that the customer has already spent ten minutes in the bank, this must equal the probability that an entering customer spends at least five minutes in the bank.

P(X > 5) = e^{−5λ} = e^{−5/10} = 0.607


Example 4.2.7.

Consider a post office that is run by two clerks.

Suppose that when Mr Smith enters the system he discovers that Mr Jones is being served by one of the clerks and Mr Brown by the other.

Suppose also that Mr Smith is told that his service will begin as soon as either Mr Jones or Mr Brown leaves.

If the amount of time that a clerk spends with a customer is exponentially distributed with mean 1/λ, what is the probability that, of the three customers, Mr Smith is the last to leave the post office?

Solution:

Consider the time at which Mr Smith first finds a free clerk. At this point either Mr Jones or Mr Brown would have just left and the other one would still be in service. However, by the lack of memory of the exponential, it follows that the amount of time that this other man (either Jones or Brown) would still have to spend in the post office is exponentially distributed with mean 1/λ. That is, it is the same as if he were just starting his service at this point. Hence, by symmetry, the probability that he finishes before Smith must equal 1/2.


4.2.5 Minimum of Several Independent Exponential Random Variables

Suppose T1, T2, . . . , Tn are independent exponential random variables, i.e.,

T1 ∼ Exp(λ1), T2 ∼ Exp(λ2), . . . , Tn ∼ Exp(λn)

Let U be the random variable that takes on the value equal to the minimum of the values actually taken on by T1, T2, . . . , Tn, i.e.,

U = min{T1, T2, . . . , Tn}

So if Ti represents the time until a particular kind of event occurs, then U represents the time until the first of the n different events occurs.

P(U > t) = P(T1 > t and T2 > t and · · · and Tn > t)
         = P(T1 > t) P(T2 > t) · · · P(Tn > t)
         = e^{−λ1 t} e^{−λ2 t} · · · e^{−λn t}
         = e^{−(λ1 + λ2 + · · · + λn) t}

P(U ≤ t) = 1 − e^{−(λ1 + λ2 + · · · + λn) t}

As a result,

U ∼ Exp(α) where α = Σ_{i=1}^{n} λi
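This is easy to see empirically. The sketch below (the rates 0.5, 1.0, 1.5 are illustrative choices) simulates the minimum and compares its sample mean with 1/(λ1 + λ2 + λ3):

import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 1.5])                     # illustrative rates
t = rng.exponential(scale=1/lam, size=(100000, 3))  # samples of T1, T2, T3
u = t.min(axis=1)                                   # U = min{T1, T2, T3}
print(u.mean())                                     # close to 1/sum(lam) = 1/3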


4.2.6 Counting Events in Small Time Intervals

Suppose X ∼ Exp(λ) represents the lifetime of a component switched on at time 0.

Question. Given that the component is working at time t, what is the probability that it fails during the interval (t, t + ∆t)?

That is, what is

P (t < X < t+ ∆t | X > t)

By the memoryless property, this is equal to

P (0 < X < ∆t) ≈ λ∆t

Can you see why this probability is approximately λ∆t? (Hint: P(0 < X < ∆t) = 1 − e^{−λ∆t} ≈ λ∆t when ∆t is small.)

Given that the component is working at time t, the probability that it fails during the next interval ∆t is approximately λ∆t.


4.3 Gamma distribution

A continuous random variable X has a gamma distribution if the pdf of X is defined by

f(x) = { [1/(Γ(k) θ^k)] x^{k−1} exp(−x/θ),   x > 0
       { 0,                                  otherwise

where

• k > 0 is called the shape parameter

• θ > 0 is called the scale parameter

• Γ(k) is the gamma function evaluated at k:

Γ(k) = ∫_{0}^{∞} x^{k−1} e^{−x} dx.

In notation, we write X ∼ Gamma(k, θ).

Question: What are the mean and variance of a gamma random variable?


Let X ∼ Gamma(k, θ) then

• E(X) = kθ

• Var(X) = kθ2

Like the exponential distribution, the gamma distribution can sometimes be used to model inter-arrival times, but this time it is a little more general. It might be used to model the time between adjacent events, or the time between one event and the next event but one, etc. The fact is that some gamma random variables can be thought of as the sum of a sequence of independent exponential random variables.

We have the following relationship:

• If k is a positive integer, then the gamma random variable X is the sum of k independent exponentially distributed random variables, each of which has a mean of θ.

• If X ∼ Exp(λ) then X ∼ Gamma(1, 1/λ).
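The mean and variance formulas, and the sum-of-exponentials view, can be checked numerically; a sketch with illustrative parameters k = 3 and θ = 2:

import numpy as np
from scipy.stats import gamma

k, theta = 3, 2.0                      # illustrative shape and scale
print(gamma.mean(k, scale=theta))      # k*theta = 6.0
print(gamma.var(k, scale=theta))       # k*theta**2 = 12.0

# sum of k independent Exp variables with mean theta behaves like Gamma(k, theta)
rng = np.random.default_rng(0)
s = rng.exponential(scale=theta, size=(100000, k)).sum(axis=1)
print(s.mean(), s.var())               # close to 6.0 and 12.0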


4.4 Uniform Distribution

Reading: Croft — 29.12

A uniform continuous random variable lies in a fixed interval [a, b] and is equally likely to lie in any subinterval of length d no matter where the subinterval lies in [a, b].

A continuous random variable X has a uniform distribution on [a, b] if the pdf of X is

f(x) = { 1/(b − a)   if a ≤ x ≤ b
       { 0           otherwise

We write X ∼ U(a, b).


[Figure: pdf of U(a, b) — constant height 1/(b − a) between x = a and x = b, zero elsewhere.]


Question: What is the mean and variance of a uniform distribution?

If X ∼ U(a, b) then

F(x) = { 0                   x < a
       { (x − a)/(b − a)     a ≤ x ≤ b
       { 1                   x > b

E(X) = (a + b)/2

var(X) = (b − a)²/12


Proof. For a < x < b we have

F_X(x) = ∫_{−∞}^{x} f_X(u) du = ∫_{a}^{x} 1/(b − a) du = [u/(b − a)]_{a}^{x} = (x − a)/(b − a)

E(X) = ∫_{a}^{b} x · 1/(b − a) dx = [x²/(2(b − a))]_{a}^{b} = (a + b)/2

E(X²) = ∫_{a}^{b} x² · 1/(b − a) dx = [x³/(3(b − a))]_{a}^{b} = (b³ − a³)/(3(b − a))

var(X) = E(X²) − (E(X))²
       = (b³ − a³)/(3(b − a)) − ((a + b)/2)²
       = (b − a)(a² + ab + b²)/(3(b − a)) − (a² + 2ab + b²)/4
       = (b − a)²/12 (with a bit of algebra!)


Example 4.4.1. The continuous random variable X is uniformly distributed on the interval [a, b]. Given

E(X) = 4    var(X) = 3

find

(a) a and b    (b) P(X > 5)

Solution:

(a) E(X) = (a + b)/2 and var(X) = (b − a)²/12, so

(a + b)/2 = 4
a + b = 8
b = 8 − a

and

(b − a)²/12 = 3
(b − a)² = 36
((8 − a) − a)² = 36
(8 − 2a)² = 36
8 − 2a = ±√36
8 − 2a = −6 or 8 − 2a = 6
2a = 14 or 2a = 2
a = 7 or a = 1


When a = 7 we have b = 1 — but this makes no sense since we want a < b.

When a = 1 we have b = 7 — which does make sense.

(b) P(X > 5) = 1 − P(X ≤ 5) = 1 − (5 − 1)/(7 − 1) = 1 − 4/6 = 1/3
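A quick check with scipy.stats (a sketch; scipy's uniform takes loc = a and scale = b − a):

from scipy.stats import uniform

X = uniform(loc=1, scale=6)    # U(1, 7)
print(X.mean(), X.var())       # 4.0 and 3.0
print(1 - X.cdf(5))            # P(X > 5) = 1/3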


4.5 Normal (or Gaussian) Distribution

Reading: Croft — 29.14


History. The normal distribution was discovered in 1733 by Abraham De Moivre (1667–1754) in his investigation of approximating coin-tossing probabilities. He named the PDF of his discovery the exponential bell-shaped curve.

http://www-groups.dcs.st-and.ac.uk/~history/BiogIndex.html


History. In 1809, Carl Friedrich Gauss (1777–1855) firmly established the importance of the normal distribution by using it to predict the location of astronomical bodies. As a result, the normal distribution then became known as the Gaussian distribution, a terminology that is still used, especially in engineering.

Later, in the last half of the 19th century, researchers discovered that many variables have distributions that follow or are well-approximated by a Gaussian distribution. Roughly speaking, researchers found that it is quite usual, or “normal”, for a variable to have a Gaussian distribution. Consequently, following the lead of noted British statistician Karl Pearson (1857–1936), the Gaussian distribution began to be referred to as the normal distribution.


A continuous random variable X has a normal distribution with parameters µ and σ² (must have σ > 0) if the pdf of X is

f(x) = [1/(σ√(2π))] e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞

We write X ∼ N(µ, σ²).

Notes —

• If X ∼ N(µ, σ²) then

E(X) = µ    var(X) = σ²

• The standard normal distribution has µ = 0 and σ² = 1 and so we would write Z ∼ N(0, 1) with pdf

f(z) = [1/√(2π)] e^{−z²/2}


Figure 4.1: PDF of normal distributions


Figure 4.2: CDF of normal distributions


How do we know (mathematically) that if Z ∼ N(0, 1) then

E(Z) = 0 and var(Z) = 1 ?

Let

g(z) = z f(z) = z [1/√(2π)] e^{−z²/2}


Then g(z) is an odd function, i.e., g(−z) = −g(z).

E(Z) = ∫_{−∞}^{∞} z f(z) dz = ∫_{−∞}^{∞} g(z) dz = 0

Further

E(Z²) = ∫_{−∞}^{∞} z² [1/√(2π)] e^{−z²/2} dz = [1/√(2π)] ∫_{−∞}^{∞} z² e^{−z²/2} dz

We can use “integration by parts” to evaluate this (nasty) integral. Setting

u = z and dv/dz = z e^{−z²/2}

we have

du/dz = 1 and v = ∫ z e^{−z²/2} dz = −e^{−z²/2}


Remember the rule is

∫ u (dv/dz) dz = uv − ∫ (du/dz) v dz

E(Z²) = [1/√(2π)] ∫_{−∞}^{∞} z (z e^{−z²/2}) dz
      = [1/√(2π)] { [−z e^{−z²/2}]_{−∞}^{∞} − ∫_{−∞}^{∞} −e^{−z²/2} dz }
      = [1/√(2π)] (0 − 0) + ∫_{−∞}^{∞} [1/√(2π)] e^{−z²/2} dz
      = ∫_{−∞}^{∞} f(z) dz = 1

So now we can easily determine the variance of Z.

var(Z) = E(Z²) − (E(Z))² = 1 − (0)² = 1

In the investigation of probability distributions we have brought together a lot of the tools from earlier in the course.

• Geometric series

• Binomial coefficients (Pascal’s triangle)

• Exponential functions


• Integration (probabilities are area under the pdf f(x))

• Integration by parts


Back on track . . .

If Z ∼ N(0, 1) then

E(Z) = 0

var(Z) = 1

The standard normal density (pdf) is often denoted by φ (Greek little “phi”)

φ(z) = [1/√(2π)] e^{−z²/2}

and the standard normal distribution function (cdf) is often denoted by Φ (Greek capital “phi”)

Φ(z) = ∫_{−∞}^{z} φ(u) du = ∫_{−∞}^{z} [1/√(2π)] e^{−u²/2} du

which cannot be written down in closed form.


Table 1. Standard Normal Distribution

P (0 ≤ Z ≤ z) where Z ∼ N(0, 1)


z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
3.6 0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.7 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.8 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.9 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000


Table 2. Inverse Normal Probabilities

p = P (0 ≤ Z ≤ z) where Z ∼ N(0, 1)

p z p z p z p z p z

0.00 0.0000 0.10 0.2533 0.20 0.5244 0.30 0.8416 0.40 1.2816
0.01 0.0251 0.11 0.2793 0.21 0.5534 0.31 0.8779 0.41 1.3408
0.02 0.0502 0.12 0.3055 0.22 0.5828 0.32 0.9154 0.42 1.4051
0.03 0.0753 0.13 0.3319 0.23 0.6128 0.33 0.9542 0.43 1.4758
0.04 0.1004 0.14 0.3585 0.24 0.6433 0.34 0.9945 0.44 1.5548
0.05 0.1257 0.15 0.3853 0.25 0.6745 0.35 1.0364 0.45 1.6449
0.06 0.1510 0.16 0.4125 0.26 0.7063 0.36 1.0803 0.46 1.7507
0.07 0.1764 0.17 0.4399 0.27 0.7388 0.37 1.1264 0.47 1.8808
0.08 0.2019 0.18 0.4677 0.28 0.7722 0.38 1.1750 0.48 2.0537
0.09 0.2275 0.19 0.4959 0.29 0.8064 0.39 1.2265 0.49 2.3263

0.470 1.8808 0.480 2.0537 0.490 2.3263 0.4990 3.0902 0.49990 3.7190
0.471 1.8957 0.481 2.0749 0.491 2.3656 0.4991 3.1214 0.49991 3.7455
0.472 1.9110 0.482 2.0969 0.492 2.4089 0.4992 3.1559 0.49992 3.7750
0.473 1.9268 0.483 2.1201 0.493 2.4573 0.4993 3.1947 0.49993 3.8082
0.474 1.9431 0.484 2.1444 0.494 2.5121 0.4994 3.2389 0.49994 3.8461
0.475 1.9600 0.485 2.1701 0.495 2.5758 0.4995 3.2905 0.49995 3.8906
0.476 1.9774 0.486 2.1973 0.496 2.6521 0.4996 3.3528 0.49996 3.9444
0.477 1.9954 0.487 2.2262 0.497 2.7478 0.4997 3.4316 0.49997 4.0128
0.478 2.0141 0.488 2.2571 0.498 2.8782 0.4998 3.5401 0.49998 4.1075
0.479 2.0335 0.489 2.2904 0.499 3.0902 0.4999 3.7190 0.49999 4.2649


Example 4.5.1. Suppose Z ∼ N(0, 1).

Numbers in boxes represent numbers looked up in the standard normal table (Table 1).

(a) P (Z ≤ 1.5) = P (Z ≤ 0) + P (0 ≤ Z ≤ 1.5)

= 0.5 + 0.4332

= 0.9332

(b) P (Z ≥ 1.5) = 0.5− P (0 ≤ Z ≤ 1.5)

= 0.5− 0.4332

= 0.0668

Alternatively, note that P (Z ≥ 1.5) = 1− P (Z ≤ 1.5)

(c) P (Z < −2) = P (Z > 2)

= 0.5− P (0 ≤ Z ≤ 2)

= 0.5− 0.4772

= 0.0228

(d) P (−2 ≤ Z ≤ 1) = P (−2 ≤ Z ≤ 0) + P (0 ≤ Z ≤ 1)

= P (0 ≤ Z ≤ 2) + P (0 ≤ Z ≤ 1)

= 0.4772 + 0.3413

= 0.8185

(e) Find the value a such that P (0 ≤ Z ≤ a) = 0.35.

Directly from the inverse standard normal table (Table 2):

a = 1.0364
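These look-ups can be reproduced with scipy.stats (a sketch; norm.cdf gives P(Z ≤ z), so table values are norm.cdf(z) − 0.5):

from scipy.stats import norm

print(norm.cdf(1.5))                  # (a) 0.9332
print(1 - norm.cdf(1.5))              # (b) 0.0668
print(norm.cdf(-2))                   # (c) 0.0228
print(norm.cdf(1) - norm.cdf(-2))     # (d) 0.8185
print(norm.ppf(0.5 + 0.35))           # (e) a = 1.0364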


4.5.1 Standardising a Normal Random Variable

Notes —

• If X ∼ N(µ, σ²) then the random variable

Z = (X − µ)/σ

is an N(0, 1) random variable

• If Z ∼ N(0, 1) and X = aZ + b then

X ∼ N(b, a²)

Proof. We start with X ∼ N(µ, σ²).

Remember that

P(X ≤ y) = F_X(y) = ∫_{x=−∞}^{x=y} f_X(x) dx
         = ∫_{x=−∞}^{x=y} [1/(σ√(2π))] exp(−(1/2)((x − µ)/σ)²) dx

Now we use the substitution

z = (x − µ)/σ,   i.e., x = µ + σz so dx = σ dz


So

P(Z ≤ w) = P(X ≤ µ + σw) = F_X(µ + σw)
         = ∫_{x=−∞}^{x=µ+σw} [1/(σ√(2π))] exp(−(1/2)((x − µ)/σ)²) dx
         = ∫_{z=−∞}^{z=w} [1/(σ√(2π))] exp(−(1/2)z²) σ dz
         = ∫_{z=−∞}^{z=w} [1/√(2π)] e^{−z²/2} dz
         = F_Z(w) = P(Z ≤ w)

which is the standard normal cdf.

So the given Z ∼ N(0, 1).

Example 4.5.2. Find the following probabilities for a normal random variable X that has a mean of 10 and a standard deviation of 2.

Numbers in boxes represent numbers looked up in the standard normal table (Table 1).

(a)


P(10 ≤ X ≤ 12) = P((10 − 10)/2 ≤ (X − 10)/2 ≤ (12 − 10)/2)
               = P(0 ≤ Z ≤ 1)
               = 0.3413

(b) P(8 ≤ X ≤ 10) = P((8 − 10)/2 ≤ (X − 10)/2 ≤ (10 − 10)/2)
                  = P(−1 ≤ Z ≤ 0)
                  = P(0 ≤ Z ≤ 1)
                  = 0.3413

(c) P(11 ≤ X ≤ 14) = P((11 − 10)/2 ≤ (X − 10)/2 ≤ (14 − 10)/2)
                   = P(0.5 ≤ Z ≤ 2)
                   = P(0 ≤ Z ≤ 2) − P(0 ≤ Z ≤ 0.5)
                   = 0.4772 − 0.1915
                   = 0.2857

(d) P(9 ≤ X ≤ 11) = P((9 − 10)/2 ≤ (X − 10)/2 ≤ (11 − 10)/2)
                  = P(−0.5 ≤ Z ≤ 0.5)
                  = 2 × P(0 ≤ Z ≤ 0.5)
                  = 2 × 0.1915
                  = 0.3830

(e)


P(7 ≤ X ≤ 13) = P((7 − 10)/2 ≤ (X − 10)/2 ≤ (13 − 10)/2)
              = P(−1.5 ≤ Z ≤ 1.5)
              = 2 × P(0 ≤ Z ≤ 1.5)
              = 2 × 0.4332
              = 0.8664
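A sketch of the same computations with scipy.stats, working directly with X ∼ N(10, 2²):

from scipy.stats import norm

X = norm(loc=10, scale=2)
print(X.cdf(12) - X.cdf(10))   # (a) 0.3413
print(X.cdf(14) - X.cdf(11))   # (c) 0.2857
print(X.cdf(13) - X.cdf(7))    # (e) 0.8664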


Example 4.5.3.

A firm that manufactures and bottles apple juice has a machine that automatically fills bottles with 1.6 litres of juice. (The bottle can hold up to 1.7 litres.) Over a long period, the average amount dispensed into the bottle has been 1.6 litres. However, there is some variability in how much juice is put in each bottle; the distribution of these amounts has a standard deviation of 0.1 litres.

(a) If the litres of fill per bottle can be assumed to be normally distributed, find the probability that the machine will overflow any one bottle.

(b) Suppose that another machine similar to the one described above is operating in such a way that the litres of fill have a mean value equal to the dial setting for “amount of liquid” but also has a variance of 0.0144 (so σ = 0.12 litres). Find the proper setting for the dial so that the 1.7 litre bottle will overflow only 5% of the time. Assume that the amounts dispensed have a normal distribution.

Solution

(a) Let X denote the amount of liquid (in litres) dispensed into one bottle by the filling machine.

Then X is assumed to be normally distributed with a mean of 1.6 and a standard deviation of 0.1, i.e.,

X ∼ N(1.6, 0.1²)

A bottle will overflow if the machine attempts to put more than 1.7


litres in it.

P(X > 1.7) = P((X − µ)/σ > (1.7 − µ)/σ)
           = P(Z > (1.7 − 1.6)/0.1)
           = P(Z > 1)
           = 0.5 − P(0 < Z < 1)
           = 0.5 − 0.3413
           = 0.1587

(b) Let X denote the amount of liquid dispensed; we now look for a value of µ such that P(X > 1.7) = 0.05. Now

P(X > 1.7) = P((X − µ)/σ > (1.7 − µ)/σ) = P(Z > (1.7 − µ)/0.12)

If P(Z > z0) = 0.05 then P(0 < Z < z0) = 0.45 and from the inverse standard normal tables, we have z0 = 1.6449.

Thus, it must be that (1.7 − µ)/0.12 = 1.6449 so

µ = 1.7 − 0.12(1.6449) = 1.5026 (4dp)
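Both parts can be checked with scipy.stats (a sketch):

from scipy.stats import norm

print(1 - norm.cdf(1.7, loc=1.6, scale=0.1))   # (a) 0.1587
print(1.7 - 0.12*norm.ppf(0.95))               # (b) dial setting 1.5026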


Example 4.5.4. At a temperature of 25◦C, the resistances of a type of thermistor are normally distributed with a mean of 10000 ohms and a standard deviation of 4000 ohms. The thermistors are to be checked, and those having resistances between 8000 and 15000 ohms are to be shipped to a vendor. What proportion of these thermistors will actually be shipped?

Solution: X ∼ N(10000, 4000²)

P(8000 ≤ X ≤ 15000) = P((8000 − 10000)/4000 ≤ Z ≤ (15000 − 10000)/4000)
                    = P(−1/2 ≤ Z ≤ 5/4)
                    = P(0 ≤ Z ≤ 0.5) + P(0 ≤ Z ≤ 1.25)
                    = 0.1915 + 0.3944
                    = 0.5859


Properties of Normal Distribution

Consider Z ∼ N(0, 1).

f(z) = [1/√(2π)] e^{−z²/2}

f′(z) = −z [1/√(2π)] e^{−z²/2} = 0 when z = 0

f″(z) = z² [1/√(2π)] e^{−z²/2} − [1/√(2π)] e^{−z²/2}
      = −[1/√(2π)] e^{−z²/2} (1 − z²) = 0 when z² = 1

• When Z = −1 we have X = µ− σ.

• When Z = 1 we have X = µ+ σ.

Points of inflection are one standard deviation from the mean.


Notes —

• P(−1 < Z < 1) = 2(0.3413) = 0.6826

Or just over 2/3 of normal observations lie within one standard deviation of the mean (between points of inflection on the curve).

• P(−2 < Z < 2) = 2(0.4772) = 0.9544

Or just over 95% of normal observations lie within 2 standard deviations of the mean.

• P(−3 < Z < 3) = 2(0.4987) = 0.9974

Or almost all normal observations lie within 3 standard deviations of the mean.


Recap

• Checklist for pdf:

– Show f(x) ≥ 0 for all x

– Show ∫_{−∞}^{∞} f(x) dx = 1

• Checklist for cdf:

F(u) = ∫_{−∞}^{u} f(x) dx

– Show 0 ≤ F(x) ≤ 1 for all x

– Show F(x) is non-decreasing

– Show F(−∞) = 0 and F(∞) = 1

• Expected value (mean) and variance

E(X) = ∫_{−∞}^{∞} x f(x) dx

E(X²) = ∫_{−∞}^{∞} x² f(x) dx

var(X) = E(X²) − (E(X))²

• Exponential distribution X ∼ Exp(λ) (pdf, cdf, mean, variance)

• Uniform distribution X ∼ U(a, b) (pdf, cdf, mean, variance)

• Standard Normal distribution Z ∼ N(0, 1) (pdf, mean, variance, cdf from tables)

f_Z(z) = [1/√(2π)] e^{−z²/2}


• Normal distribution X ∼ N(µ, σ²) (mean, variance)

– Standardising: Z = (X − µ)/σ ∼ N(0, 1).

Exponential — Parameters: λ > 0
X ∼ Exp(λ)    PDF: f(x) = λ e^{−λx} for x ≥ 0, and 0 otherwise
Mean: 1/λ    Variance: 1/λ²

Uniform — Parameters: a and b where a < b
X ∼ U(a, b)    PDF: f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
Mean: (a + b)/2    Variance: (b − a)²/12

Normal — Parameters: µ and σ² where σ > 0
X ∼ N(µ, σ²)    PDF: f(x) = [1/(√(2π)σ)] e^{−(x−µ)²/(2σ²)}
Mean: µ    Variance: σ²


We can use Microsoft Excel to find the Normal probabilities.

If X ∼ N(µ, σ2) then P (X ≤ x) is given by

= NORMDIST(x,mu,sigma,1)


Of course we can use SymPy to do the integration.

See http://docs.scipy.org/doc/scipy-0.14.0/reference/tutorial/stats.html for how to look up the pdf, cdf, etc. for common distributions.

[Figure: pmf plots of Binomial(n, 0.4) for n = 20, 30, 40, 60, 80, 100, 150, 200, 500.]


Topic 5

Confidence Intervals, Estimation and Samples


5.1 How do we get statistical distributions?

In general, we follow one of the following three routes to get a suitable statistical distribution for our model.

Route 1: prior work or theory gives the distribution together with its parameters (e.g., Bin with known n, p).

Route 2: prior work or theory gives the distribution; we then collect data and estimate the parameters, ending with a distribution plus parameters (e.g., Bin + n, p).

Route 3: knowledge of the situation (type of RVs, range of RVs, . . . ) is used to propose a distribution; we then collect data and estimate the parameters.


5.2 Model fitting cycle

It is not always easy to choose a good model immediately. Therefore, you often come across the model fitting cycle, summarised as follows: starting from the SITUATION, propose a model using route 1, 2 or 3; fit the parameters; assess the goodness of fit, i.e., compare the fitted model with the data. If the model is OK, STOP; if the model is not good enough, go back and propose another model.


5.3 Point estimates and interval estimates

When data are collected, it is often with the purpose of estimating some numerical characteristic of the population from which they came. Additionally, as you can see from the three routes above, estimating parameters is often unavoidable when building a model. In terms of estimation, we have two types of estimates: point estimates and interval estimates.

• A point estimate of a population parameter is a single number calculated from the sample that is our guess of the true value of the population parameter.

• An interval estimate (known as a confidence interval) for a population parameter is a range of plausible values for the parameter. The interval is designed so we have some level of confidence that the interval contains the unknown parameter.

For any population parameter, we can find its point estimate and interval estimate. Often, interval estimates are derived from point estimates in combination with a theorem known as the Central Limit Theorem.

5.4 Central Limit Theorem

The Central Limit Theorem is by far the most important result in statistics. Many commonly used statistical methods rely on this theorem for their validity. In broad terms, the Central Limit Theorem (CLT) says that if we draw a large enough sample from a population, then the distribution of the sample mean is approximately normal, no matter what population the sample was drawn from. This allows us to


compute probabilities for sample means using the normal distribution, even though the population from which the sample was drawn is NOT necessarily normal.

Central Limit Theorem
Let X1, X2, . . . , Xn be a simple random sample from a population with mean µ and variance σ². Let

X̄ = (X1 + X2 + · · · + Xn)/n

be the sample mean. Then if n is sufficiently large, the sample mean is approximately normally distributed. Specifically, we have the approximate result:

X̄ ∼ N(µ, σ²/n)

Here we are concerned with inference for several important population parameters, including the mean and the proportion.
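The CLT is easy to see in a simulation. The sketch below draws many samples of size n = 50 from a decidedly non-normal (exponential) population, with an illustrative mean of 2, and looks at the resulting sample means:

import numpy as np

rng = np.random.default_rng(0)
# 10000 samples of size n = 50 from an exponential population (mu = 2, sigma^2 = 4)
samples = rng.exponential(scale=2.0, size=(10000, 50))
xbar = samples.mean(axis=1)      # 10000 sample means
print(xbar.mean())               # close to mu = 2
print(xbar.var())                # close to sigma^2/n = 4/50 = 0.08
# a histogram of xbar would look approximately normal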


5.5 Sample mean

5.5.1 Sample mean as a point estimate of the true (population) mean

As you have seen in Topic 1, we have several different ways of sampling. Here we focus only on the simplest case: simple random sampling (SRS).

Suppose we have a SRS from a distribution with mean µ and variance σ². The sample mean is defined as

X̄ = (X1 + X2 + · · · + Xn)/n.

This is a statistic used to estimate the true mean µ of the population from which your data come.

The sample mean X̄ is called the point estimate of the population mean µ.

When talking about point estimates, we usually address the following:

Question: How good is the point estimate X̄ for the true mean µ?

This can be answered by looking at the properties of the sample mean X̄.

5.5.2 Properties of the sample mean

The sample mean possesses two desirable properties that a point estimate should have: unbiasedness and consistency.


• Sample mean X̄ is UNBIASED:

E(X̄) = E((Σ_{i=1}^{n} Xi)/n) = (1/n) Σ_{i=1}^{n} E(Xi)     (5.5.1)
      = (1/n) Σ_{i=1}^{n} µ = (1/n) nµ = µ.

Note that we could take many samples from some population. Each time, a sample gives us a sample mean as an estimate for the true mean. This sample mean is a random quantity due to random sampling. Then the unbiasedness of the sample mean X̄ means that we can expect that the average of X̄ is the same as the true (population) mean.

• Moreover, we have

Var(X̄) = Var((Σ_{i=1}^{n} Xi)/n) = (1/n²) Σ_{i=1}^{n} Var(Xi)
        = (1/n²) Σ_{i=1}^{n} σ² = (1/n²) nσ² = σ²/n → 0 as n → ∞.

This, together with (5.5.1), implies that X̄ is a CONSISTENT estimate of µ.

The meaning of this consistency is that as the sample size is getting larger and larger, the sample mean is getting closer and closer to the true mean. Theoretically, in the limit as the sample size goes to infinity then the sample mean is exactly equal to the true mean.


As a short summary, so far we have obtained the following:

E(X̄) = µ

Var(X̄) = σ²/n

Question: What is the full distribution of X̄?

Recall the CLT from Section 5.4; we have:

X̄ ∼ N(µ, σ²/n)

as long as the sample size n is large enough (usually, n ≥ 30).

This can be used to find the interval estimate (the confidence interval) of the true mean by following Sections 5.5.3 and 5.5.4 below.

5.5.3 Standardization of the sample mean

As

X̄ ∼ N(µ, σ²/n)

we have

Z = (X̄ − µ)/√(σ²/n) = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

Therefore, X̄ can be studied through the standard normal distribution N(0, 1).


[Figure: standard normal density f(z) with the central 95% between −1.96 and 1.96 and 2.5% in each tail.]

In order to study the sample mean X̄ through the standard normal distribution, it will be useful to remember the following probabilities:

P (−1 ≤ Z ≤ 1) = 0.68

P (−1.645 ≤ Z ≤ 1.645) = 0.9

P (−1.96 ≤ Z ≤ 1.96) = 0.95

P (−2.576 ≤ Z ≤ 2.576) = 0.99

5.5.4 Confidence Intervals (CIs) for the sample mean

The Confidence Interval (CI) for the population mean µ is constructed based on the approximate distribution of the sample mean X̄ (resulting from the CLT).

Recall that

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).


Hence,

P(−1.96 < Z < 1.96) = 0.95

⇒ P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = 0.95

⇒ P(−1.96 σ/√n < X̄ − µ < 1.96 σ/√n) = 0.95

⇒ P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) = 0.95

This is equivalent to saying that:

(X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n) is a 95% CI for µ

Similarly, we can construct other confidence intervals (90%, 99% for example) for µ just by changing the corresponding constant given in the formula.

5.5.5 Interpretation and Verbalizing CIs

In terms of interpreting the result from CIs, the following interpretations are correct:


• We are 95% confident that µ is in the CI (✓)

• In repeated sampling, 95% of the CIs will contain µ (✓)

The meaning of this interpretation can be demonstrated as in Figure 5.1.

[Figure 5.1: Interpretation of CIs — CIs from repeated samples (sample 1, sample 2, . . .): 95% will contain µ, 5% will not.]

It is incorrect to say that “µ is in a particular CI with probability 0.95”. This is a common mistake, so please be well aware of that.


5.5.6 Practicality of the CIs

As we have seen, theoretically we have

X̄ ± 1.96 σ/√n is a 95% CI for µ.

However, the theoretical standard deviation σ is usually unknown. Therefore, we replace σ by its estimate: the sample standard deviation s:

s = √[ (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)² ].

In practice, we say that

X̄ ± 1.96 s/√n is a 95% CI for µ.

Then,

1.96 s/√n is known as the margin of error (MOE) for the 95% CIs of µ.

It is worth noting that

• MOE = 1.645 s/√n for 90% CIs (X̄ ± 1.645 s/√n)

• MOE = 2.576 s/√n for 99% CIs (X̄ ± 2.576 s/√n)


5.5.7 Minimum sample size required for a certain range of estimation error

Since the 95% CI contains the true population mean (µ) 95% of the time, the estimation error is less than the MOE 95% of the time. Therefore, if you want the estimation error (for 95% CIs) to be less than E, solve the equation

E = MOE = 1.96 s/√n

which implies

n = (1.96 s / E)².

The smallest integer which is greater than the n calculated above is the minimum sample size required for the estimation error to be less than E with 95% confidence.

Example 5.5.1. A web developer keeps track of the number of visits to her newly designed website every day for a month (30 days) and found that, on average, she got 100 visits per day with a standard deviation of 5.

1. Find the 95% CI for the average number of visits per day to the new website.

2. If she wants the estimation error, for the 95% CIs, to be less than 2, how long should she keep observing?

3. Do you have any criticisms of this procedure?


Solution: The number the developer got from her experiment is 100, which is the sample average of a sample of size n = 30; in notation, X̄ = 100 and s = 5.

1. The 95% CI for the average number of visits per day is

100 ± 1.96 × 5/√30 = (98.21, 101.79)

2. Since

(1.96 × 5 / 2)² = 24.01,

the number of days the developer should keep observing is 25 days.
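A sketch of the same computation in Python:

from math import ceil, sqrt

xbar, s, n = 100, 5, 30
moe = 1.96*s/sqrt(n)
print(xbar - moe, xbar + moe)    # about (98.21, 101.79)

# minimum sample size for estimation error below E = 2
print(ceil((1.96*s/2)**2))       # 25 days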


5.6 Sample proportion

5.6.1 Sample proportion as a point estimate of the population proportion

Consider the case when the population has a Bernoulli distribution, i.e., only 2 possible outcomes for each individual Xi: either Success (Xi = 1) or Failure (Xi = 0), and the probability of success is a constant p.

Given a sample X1, X2, . . . , Xn from that population of interest, let X count the number of successes in the sample:

X = Σ_{i=1}^{n} Xi.

Then as we know, this is a typical binomial scenario and X is a binomial random variable with parameters n and p. In notation, we write

X ∼ Bin(n, p).

The probability of success, p, is unknown and is what we would like to estimate. Let

p̂ = X/n = (Σ_{i=1}^{n} Xi)/n

be the sample proportion of successes.

The sample proportion p̂ can be a point estimate of the probability p.

5.6.2 Properties of the sample proportion

The sample proportion also possesses the two desirable properties of a point estimate.


• Sample proportion p̂ is an UNBIASED estimate of p

E(p̂) = E(X/n) = (1/n) E(X) = (1/n) np = p

• Sample proportion p̂ is a CONSISTENT estimator of p. This is true due to the fact that

Var(p̂) = Var(X/n) = (1/n²) Var(X) = (1/n²) np(1 − p) = p(1 − p)/n

and we note that Var(p̂) → 0 as n → ∞, combined with the unbiasedness of p̂.

Question: Do we have a full distribution of p̂?

The answer is Yes. By the CLT, p̂ is also approximately normal. Specifically,

p̂ ∼ N(p, p(1 − p)/n)

5.6.3 CIs of the true proportion p

Due to the result from the CLT, set Z = (p̂ − p)/√(p(1 − p)/n); then Z ∼ N(0, 1). Therefore, we have


P(−1.96 < Z < 1.96) = 0.95

P(−1.96 < (p̂ − p)/√(p(1 − p)/n) < 1.96) = 0.95

P(p̂ − 1.96 √(p(1 − p)/n) < p < p̂ + 1.96 √(p(1 − p)/n)) = 0.95

Consequently,

The 95% CI of p is

(p̂ − 1.96 √(p(1 − p)/n), p̂ + 1.96 √(p(1 − p)/n))

Again, in practice, we can replace the unknown p by its point estimate p̂ in the formula of the CIs, so that


Practically, the 95% CI of p is

(p̂ − 1.96 √(p̂(1 − p̂)/n), p̂ + 1.96 √(p̂(1 − p̂)/n))

Only the 1.96 changes according to the % of the CIs:

(p̂ − 1.645 √(p̂(1 − p̂)/n), p̂ + 1.645 √(p̂(1 − p̂)/n)) is the 90% CI

(p̂ − 2.576 √(p̂(1 − p̂)/n), p̂ + 2.576 √(p̂(1 − p̂)/n)) is the 99% CI

On a side note, replacing the unknown p by p̂ for the CI is a natural approach and widely used. However, it turns out that this replacement tends to make the confidence interval too short in some cases, even for some fairly large sample sizes. Some research involving simulation studies has shown that this effect can be largely compensated for by modifying both n and p slightly. However, we do not discuss such modification here and use the formula given above provided that the sample is sufficiently large.

Example 5.6.1. A random sample of 400 electronic components manufactured by a certain process are tested, and 30 are found to be defective.


1. Let p represent the proportion of components manufactured by this process that are defective. Find a 95% confidence interval for p.

2. How many components must be sampled so that the 95% CI will specify the proportion defective to within ±0.02?

Solution:

1. The sample proportion is

p̂ = 30/400 = 0.075.

The 95% CI for the population proportion p of defective components is

(0.075 − 1.96 √(0.075(1 − 0.075)/400), 0.075 + 1.96 √(0.075(1 − 0.075)/400)) = (0.049, 0.101)

2. We would like to have MOE = 0.02, which means

1.96 √(0.075(1 − 0.075)/n) = 0.02,

which implies

n = 1.96² × 0.075(1 − 0.075)/0.02² = 666.27

so practically we need 667 components to be sampled.
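A sketch of both parts in Python:

from math import ceil, sqrt

phat, n = 30/400, 400
moe = 1.96*sqrt(phat*(1 - phat)/n)
print(phat - moe, phat + moe)                      # about (0.049, 0.101)

# sample size so that the MOE is 0.02
print(ceil(1.96**2 * phat*(1 - phat) / 0.02**2))   # 667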


5.7 Comparison of two population means

Suppose that we have 2 populations:

Population 1: mean µ1, variance σ1², SRS of size n1
Population 2: mean µ2, variance σ2², SRS of size n2

Our interest when comparing two samples is often whether the two samples have the same mean and, if not, what the difference is. That is, we consider µ1 − µ2, whose estimate is X̄1 − X̄2. We also have unbiasedness and consistency of the estimate X̄1 − X̄2:

• Unbiased: E(X̄1 − X̄2) = µ1 − µ2

• Consistent: Var(X̄1 − X̄2) = Var(X̄1) + Var(X̄2) = σ1²/n1 + σ2²/n2, and Var(X̄1 − X̄2) → 0 as n1 → ∞ and n2 → ∞.

Moreover, by the CLT we have

X̄1 − X̄2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2)

Therefore,


The CI for µ1 − µ2 is

X̄1 − X̄2 ± z √(s1²/n1 + s2²/n2)

where

• z is selected according to the % of the CIs (i.e., z = 1.96 for 95% CI, z = 1.645 for 90% CI, etc.)

• s1 and s2 are sample standard deviations, used as replacements for σ1 and σ2

• n1 and n2 must be large
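A minimal sketch of this formula in Python (the helper name two_mean_ci and the default z value are illustrative, not part of the notes):

import numpy as np

def two_mean_ci(x1, x2, z=1.96):
    # large-sample CI for mu1 - mu2 using sample standard deviations
    n1, n2 = len(x1), len(x2)
    diff = np.mean(x1) - np.mean(x2)
    se = np.sqrt(np.var(x1, ddof=1)/n1 + np.var(x2, ddof=1)/n2)
    return diff - z*se, diff + z*se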
