probability & statistical inference lecture 3

PROBABILITY & STATISTICAL INFERENCE LECTURE 3MSc in Computing (Data Analytics)

Lecture Outline A quick recap Continuous distributions. Question Time

A Quick Recap

Probability & Statistics We want to make

decisions based on evidence from a sample i.e. extrapolate from sample evidence to a general population

To make such decisions we need to be able to quantify our (un)certainty about how good or bad our sample information is.

Population

Representative Sample

Sample Statistic

Describe

Make

Inference

Some Definitions An experiment that can result in different outcomes, even

though it is repeated in the same manner every time, is called a random experiment.

The set of all possible outcomes of a random experiment is called the sample space of an experiment and is denote by S

A sample space is discrete if it consists of a finite or countable infinite set if outcomes.

A sample space is continuous if it contains an interval or real numbers.

An event is a subset of the sample space of a random experiment.

Some Definitions A sample space is discrete if it consists

of a finite or countable infinite set if outcomes.

A sample space is continuous if it contains an interval or real numbers.

An event is a subset of the sample space of a random experiment.

Probability Whenever a sample space consists of n possible outcomes that

are equally likely, the probability of the outcome 1/n.

For a discrete sample space, the probability of an event E, denoted by P(E), equals the sum of the probabilities of the outcome in E.

Some rules for probabilities: For a given sample space containing n events E1, E2,

E3, ........,En

1. All simple event probabilities must lie between 0 and 1:0 <= P(Ei) <= 1 for i=1,2,........,n

2. The sum of the probabilities of all the simple events within a sample space must be equal to 1:1)(

1

n

iiEP

Discrete Random Variable

A Random Variable (RV) is obtained by assigning a numerical value to each outcome of a particular experiment.

Probability Distribution: A table or formula that specifies the probability of each possible value for the Discrete Random Variable (DRV)

DRV: a RV that takes a whole number value only

Summary Continued… For Discrete RV we often have a mathematical formula

which is used to calculate probabilities,

i.e. P(x) = some formula This formula is called the Probability Mass Function

(PMF) Given the PMF you can calculate the mean and

variance by:

When the summation is over all possible values of x

222 )(

)(

xPx

xxP

Binomial Distribution – General Formula

This all leads to a very general rule for calculating binomial probabilities:

In General Binomial (n,p)

n = no. of trials

p = probability of a success

x = RV (no. of successes)

Where P(X=x) is read as the probability of seeing x successes.

xnx ppx

nxXP

)1()(

Binomial Distribution If X is a binomial random variable with

the paramerters p and n then

)1(

)1(2

pnp

pnp

np

Poisson Probability Distribution Probability Distribution for Poisson

Where is the known mean:

x is the value of the RV with possible values 0,1,2,3,….e = irrational constant (like ) with value 2.71828…

The standard deviation , , is given by the simple relationship;

=

!)(

x

exXP

x

Continuous Probability Distributions

Continuous Probability Distributions

Experiments can lead to continuous responses i.e. values that do not have to be whole numbers. For example: height could be 1.54 meters etc.

In such cases the sample space is best viewed as a histogram of responses.

The Shape of the histogram of such responses tells us what continuous distribution is appropriate – there are many.

Lifetime of Component

De

nsity

0.0 0.5 1.0 1.5 2.0 2.5

01

23

4

Waiting TimeD

en

sity

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Normal Distribution (AKA Gaussian)

• The Histogram below is symmetric & 'bell shaped'

• This is characteristic of the Normal Distribution

• We can model the shape of such a distribution (i.e. the histogram) by a Curve

Normal Distribution The Curve may not fit the histogram 'perfectly' -

but should be very close Normal Distribution - two parameters,

µ = mean, = standard deviation,

The mathematical formula that gives a bell shaped symmetric curve

f(x) = Height of curve at x =

2

2

2

)(

22

1

x

e

Normal Distribution Why Not P(x) as before?

=> because response is continuous

What is the probability that a person sampled at random is 6 foot?

Equivalent question: what proportion of people are 6 foot?

=> really mean what proportion are 'around 6 foot' ( as good as the measurement device

allows) - so not really one value, but many values close together.

Example: What proportion of graduates earn €35,000?

Would we exclude €35,000.01 or €34,999.99?

Round to the nearest €, €10, €100, €1000?

Continuous measure => more useful to get proportion from €35,000 - €40,000

Some Mathematical Jargon:

The formula for the normal distribution is formally called the normal probability density function (pdf)

The Shaded portion of the Histogram is the Proportion of interest

Can visualise this using the histogram of salaries.

Since the histogram of salaries is symmetric and bell shaped, we model this in statistics with a Normal distribution curve.

Proportion = the proportion of the area of the curve that is shaded

So proportions = proportional area under the curve = a probability of interest

Need;• To know , • To be able to find area under

curve

Area under a curve is found using integration in mathematics.

In this case would need a technique called numerical integration.

Total area under curve is 1. However, the values we need are in

Normal Probability Tables.

The Tables are for a Normal Distribution with = 0 and = 1

• this is called the Standard Normal• Can 'convert' a value from any normal to the

standard normal using standard scores (Z scores)

Value from any NormalDistribution

Standardize

Corresponding Value from

Normal = 0 = 1

Z :score edstandardis

x

Standard Normal

Z scores are a unit-less quantity, measuring how far above/below a certain score (x) is, in standard deviation units.

Example: A score of 35, from a normal distribution with

= 25 and = 5.Z = ( 35 − 25) / 5 => 10/5 = 2

So 35 is 2 standard deviation units above the mean

What about a score of 20 ?

Z = ( 20 - 25) / 5 => − 5 / 5 = − 1

So 20 is 1 standard unit below the mean

Z-Score Example

Positive Z score => score is above the meanNegative Z score => score is below the mean

By subtracting and dividing by the we convert any normal to = 0, =1, so only need one set of tables!

Z-Score Example

From looking at the histogram of peoples weekly receipts, a supermarket knows that the amount people spend on shopping per week is normally distributed with:

= €58 = €15.

Example:

What is the probability that a customer sampled at random will spend less than €83.50 ?

Z = ( x − ) / = ( €83.50 - €58 ) / €15 => 1.7

Area from Z=1.7 to the left can be read in tables

From tables area less than Z = 1.7 => 0.9554

So probability is 0.9554 Or 95.54%

What is the probability that a customer sampled at random will spend more than €83.50 ?

Z = ( x − ) / = ( €83.50 - €58 ) / €15 => 1.7

Area from Z=1.7 to the right can be read in tables

From tables area greater than Z = 1.7 => 1- 0.9554 = 0.0446

So probability is 0.0446 Or 4.46%

Exercise Find the proportion of people who spend

more than €76.75 Find the proportion of people who spend

less than €63.50

Note: The tables can also be used to find other areas

(less than a particular value, or the area between two points)

Characteristics of Normal Distributions

Standard Deviation has particular relevance to Normal distribution

Normal Distribution => Empirical Rule

68%

95%

99.7% Between Z (lower, upper)

%Area

-1,1 68 %

-2,2 95 %

-3,3 99.7 %

-∞, +∞ 100%

The normal distribution is just one of the known continuous probability distributions.

Each have their own probability density function, giving different shaped curves.

In each case, we find probabilities by calculating areas under these curves using integration.

However, the Normal is the most important – as it plays a major role in Sampling Theory.

Other important continuous probability distributions include

• Exponential distribution – especially positively skewed lifetime data.

• Uniform distribution.

• Weibull – especially for ‘time to event’ analysis.

• Gamma distribution – waiting times between Poisson events in time etc.

• Many others…..

Summary – Random Variables

There are two types – discrete RVs and continuous RVs

For both cases we can calculate a mean (μ) and standard deviation (σ)

μ can be interpreted as average value of the RV

σ can be interpreted as the standard deviation of the RV

Summary Continued… For Discrete RV we often have a mathematical formula

which is used to calculate probabilities,

i.e. P(x) = some formula This formula is called the Probability Mass Function

(PMF) Given the PMF you can calculate the mean and

variance by:

When the summation is over all possible values of x

222 )(

)(

xPx

xxP

Summary Continued… For continuous RVs, we use a Probability Density

Function (PDF) to define a curve over the histogram of the values of the random variables.

We integrate this PDF to find areas which are equal to probabilities of interest.

Given the PDF you can calculate the mean and variance by:

Where f(x) is usual mathematical notation for the PDF

-dx )(dx )( 22 xfxxxf

Question Time

Next Week Next week we will start with the practical

part of the course. We will move to Lab 1005 in Aungier Street

probability & statistical inference lecture 3

Documents