TR 555 Statistics “Refresher”
Lecture 1: Probability Concepts

References:
– Penn State University, Dept. of Statistics, Statistical Education Resource Kit: a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003
– Statistics: Making Sense of Data (MIT), William Stout, John Marden and Kenneth Travers, http://www.introductorystatistics.com/, Sept. 2003
– Tom Maze, stat course prepared for KDOT, 2003

Outline

Overview of statistics
Types of data
Describing data numerically and graphically
Probability and random variables

Probability and Statistics

Probability is the likelihood of an event occurring relative to all other events
– Example: If a coin is flipped, what is the probability of getting heads? 0.5
– Given that the last flip was heads, what is the probability that the next will be heads? 0.5

Statistics is the measurement and modeling of random variables
– Example: If our state averages 200 fatal crashes per year, what is the probability of having exactly one fatal crash today?
– Poisson distribution: $\lambda$ = average per time period = 200/365 ≈ 0.55 per day
– $P(X = x) = \frac{(\lambda t)^x}{x!} e^{-\lambda t}$, so $P(X = 1) = \frac{(0.55 \cdot 1)^1}{1!} e^{-0.55(1)} \approx 0.32$
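The Poisson calculation above can be checked numerically. The following is a minimal sketch, not part of the original lecture; the function name `poisson_pmf` and the variable names are illustrative assumptions.

```python
import math

def poisson_pmf(x: int, rate: float, t: float = 1.0) -> float:
    """P(X = x) for a Poisson count whose mean is rate * t."""
    mean = rate * t
    return (mean ** x) / math.factorial(x) * math.exp(-mean)

# 200 fatal crashes per year, expressed as a rate per day (figure from the slide)
daily_rate = 200 / 365

# Probability of exactly one fatal crash in one day
p_one_today = poisson_pmf(1, daily_rate)
print(f"P(exactly one fatal crash today) = {p_one_today:.2f}")
```

Changing `t` gives the probability for a longer window, for example `poisson_pmf(1, daily_rate, t=7)` for one week.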

Data Collection

Designing experiments
– Does aspirin help reduce the risk of heart attacks?

Observational studies
– Polls: Clinton’s approval rating

Variable Types

Deterministic
– Assume away variation and randomness
– Known with certainty
– One-to-one mapping of independent variable to dependent variable

[Figure: a relationship curve mapping the value X1 to the single value Y1]

Variable Types Continued

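Random or Stochastic
– Recognized uncertainty of an event
– One-to-distribution mapping of independent variable to dependent variable (each value of X maps to a distribution of possible Y values)

[Figure: distribution of possible outcomes, captioned “Probability that it could be any of these values”, with the central value marked “Most Likely” and the values on either side marked “Less Likely”]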

Population

The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought

Sample

A subset of the population data that are actually collected in the course of a study.

WHO CARES?

In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.

Organization and Description of Data

Qualitative vs. Quantitative data
Discrete vs. Continuous Data
Graphical Displays
Measures of Center
Measures of Variation

Qualitative (Categorical) Data

The raw (unsummarized) data are merely labels or categories

Quantitative (Numerical) Data

The raw (unsummarized) data are numerical

Qualitative Data Examples

Class Standing (Fr, So, Ju, Sr)
Section # (1,2,3,4,5,6)
Automobile Make (Ford, Chevrolet, Nissan)
Questionnaire response (disagree, neutral, agree)

Quantitative Data Examples (measures)

Voltage
Height
Weight
SAT Score
Number of students arriving late for class
Time to complete a task

Discrete Data

Only certain values are possible (there are gaps between the possible values)

Continuous Data

Theoretically, any value within an interval is possible with a fine enough measuring device

Discrete Data Examples

Number of students late for class
Number of crimes reported to SC police
Number of times the word “number” is used

(generally, discrete data are counts)

Discrete Variable Model: Poisson Distribution

$P(X = x) = \frac{(0.55\,t)^x}{x!} e^{-0.55 t}$

[Figure: bar chart “Probability of # of Fatals per one day”; horizontal axis: # of Fatal Crashes (0 to 15); vertical axis: Probability (0 to 0.7)]

Continuous Data Examples

Voltage
Height
Weight
Time to complete a homework assignment

Continuous Variable Model: Exponential Distribution

[Figure: “Fatality Probability Density Function”; horizontal axis: Time till the first fatal accident (0 to 5.5); vertical axis: Probability (0 to 0.6)]

Probability density of the first fatal at time t: $f(t) = \lambda e^{-\lambda t}$

Continuous Probability Function

[Figure: “Cumulative Probability till first fatal”; horizontal axis: Days (0 to 5.5); vertical axis: Cumulative Probability (0 to 1.2)]

Cumulative probability of time till first fatal: $F(t) = 1 - e^{-\lambda t}$
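As a rough illustration of the exponential model for the waiting time until the first fatal crash, the sketch below evaluates the standard exponential density and cumulative distribution. It assumes the rounded rate of 0.55 fatal crashes per day used earlier in the lecture; the function names `exp_pdf` and `exp_cdf` are illustrative, not from the slides.

```python
import math

RATE = 0.55  # fatal crashes per day (rounded value from the lecture)

def exp_pdf(t: float, rate: float = RATE) -> float:
    """Exponential density: rate * exp(-rate * t) for t >= 0."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0

def exp_cdf(t: float, rate: float = RATE) -> float:
    """Probability that the first event occurs by time t: 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

# Tabulate both functions at a few of the plotted time points (in days)
for days in (0.8, 1.6, 2.4, 3.2, 4.0):
    print(f"t = {days:.1f}  density = {exp_pdf(days):.3f}  cumulative = {exp_cdf(days):.3f}")

p_within_one_day = exp_cdf(1.0)
```

Under these assumptions the chance that the first fatal crash happens within one day is a little over 40%.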

Nominal Data

A type of categorical data in which objects fall into unordered categories, for example:
– Hair color: blonde, brown, red, black, etc.
– Race: Caucasian, African-American, Asian, etc.
– Smoking status: smoker, non-smoker

Ordinal Data

A type of categorical data in which order is important. For example …
– Class: freshman, sophomore, junior, senior, super senior
– Degree of illness: none, mild, moderate, severe, …, going, going, gone
– Opinion of students about riots: ticked off, neutral, happy

Binary Data

A type of categorical data in which there are only two categories.

Binary data can either be nominal or ordinal, for example …
– Smoking status: smoker, non-smoker
– Attendance: present, absent
– Class: lower classman, upper classman

Interval and Ratio Data

Interval
– Interval is important, but no meaningful zero
– e.g., temperature in Fahrenheit

Ratio
– Has a meaningful zero value
– e.g., temperature in Kelvin, crash rate

Who Cares?

The type(s) of data collected in a study determine the type of statistical analysis used.

Proportions

Categorical data are commonly summarized using “percentages” (or “proportions”).
– 11% of students have a tattoo
– 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors

Averages

Measurement data are typically summarized using “averages” (or “means”).
– Average number of siblings Fall 1998 Stat 250 students have is 1.9.
– Average weight of male Fall 1998 Stat 250 students is 173 pounds.
– Average weight of female Fall 1998 Stat 250 students is 138 pounds.

Descriptive statistics

Describing data with numbers: measures of location

Mean

Another name for average.
If describing a population, denoted as $\mu$, the Greek letter “mu”.
If describing a sample, denoted as $\bar{x}$, called “x-bar”.
Appropriate for describing measurement data.
Seriously affected by unusual values called “outliers”.

Calculating Sample Mean

Formula: $\bar{X} = \frac{\sum X_i}{n}$

That is, add up all of the data points and divide by the number of data points.

Data (# of classes skipped): 2 8 3 4 1

Sample Mean = (2+8+3+4+1)/5 = 3.6

Do not round! Mean need not be a whole number.
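The sample mean calculation can be reproduced in a couple of lines. This is a small sketch using the classes-skipped data from the slide; the variable names are illustrative.

```python
# Data from the slide: number of classes skipped by five students
classes_skipped = [2, 8, 3, 4, 1]

# Add up all of the data points and divide by the number of data points
sample_mean = sum(classes_skipped) / len(classes_skipped)
print(f"Sample mean = {sample_mean}")
```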

Population Mean

The mean of a random variable X is called the population mean and is denoted $\mu$.

It is also called the expected value of X or the expectation of X and is denoted E(X).

$\mu = E(X) = \sum_i x_i f(x_i)$
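To make the expected-value sum concrete, here is a minimal sketch for a discrete random variable. The fair six-sided die is a hypothetical example chosen for illustration and does not appear in the lecture; `expected_value` is an illustrative name.

```python
def expected_value(pmf: dict) -> float:
    """Sum of x * f(x) over all possible values x of a discrete random variable."""
    return sum(x * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die, each face with probability 1/6
die_pmf = {face: 1 / 6 for face in range(1, 7)}

mean_die = expected_value(die_pmf)
print(f"E(X) for a fair die = {mean_die:.2f}")
```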

Median

Another name for 50th percentile.
Appropriate for describing measurement data.
“Robust to outliers,” that is, not affected much by unusual values.

Calculating Sample Median

Order data from smallest to largest.

If odd number of data points, the median is the middle value.

Data (# of classes skipped): 2 8 3 4 1

Ordered Data: 1 2 3 4 8

Median = 3 (the middle value)

Calculating Sample Median

Order data from smallest to largest.

If even number of data points, the median is the average of the two middle values.

Data (# of classes skipped): 2 8 3 4 1 8

Ordered Data: 1 2 3 4 8 8

Median = (3+4)/2 = 3.5
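Both median cases (odd and even number of data points) can be sketched as follows. The function name `sample_median` is an illustrative assumption; the data are the two classes-skipped examples from the slides.

```python
def sample_median(data):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(data)          # order data from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # odd count: take the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

odd_median = sample_median([2, 8, 3, 4, 1])
even_median = sample_median([2, 8, 3, 4, 1, 8])
print(odd_median, even_median)
```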

Mode

The value that occurs most frequently.
One data set can have many modes.
Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.

Most appropriate measure of location

Depends on whether or not data are “symmetric” or “skewed”.

Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Symmetric and Unimodal

Symmetric and Bimodal

Skewed Right

[Figure: histogram “Number of Music CDs of Spring 1998 Stat 250 Students”; horizontal axis: Number of Music CDs (0 to 400); vertical axis: Frequency (0 to 20)]

Skewed Left

Choosing Appropriate Measure of Location

If data are symmetric, the mean, median, and mode will be approximately the same.

If data are multimodal, report the mean, median and/or mode for each subgroup.

If data are skewed, report the median.

Descriptive statistics

Describing data with numbers: measures of variability

Range

The difference between largest and smallest data point.

Highly affected by outliers.

Best for symmetric data with no outliers.

[Figure: histogram “GPAs of Spring 1998 Stat 250 Students”; horizontal axis: GPA (2.0 to 4.0); vertical axis: Frequency (0 to 20)]

Interquartile range

The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values.

IQR = Q3 - Q1
Robust to outliers or extreme observations.
Works well for skewed data.

[Figure: histogram “GPAs of Spring 1998 Stat 250 Students”; horizontal axis: GPA (2.0 to 4.0); vertical axis: Frequency (0 to 20)]

Variance

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

1. Find difference between each data point and mean.

2. Square the differences, and add them up.

3. Divide by one less than the number of data points.
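The three steps above can be sketched directly in code. The data are the classes-skipped values used earlier in the lecture; the function name `sample_variance` is an illustrative assumption.

```python
import math

def sample_variance(data):
    """Sample variance with the n - 1 divisor, following the three steps."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]             # step 1: differences from the mean
    sum_of_squares = sum(d ** 2 for d in deviations)  # step 2: square and add up
    return sum_of_squares / (n - 1)                   # step 3: divide by n - 1

classes_skipped = [2, 8, 3, 4, 1]
s_squared = sample_variance(classes_skipped)
s = math.sqrt(s_squared)  # sample standard deviation, back in the original units
print(f"s^2 = {s_squared:.2f}, s = {s:.2f}")
```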

Variance

If measuring variance of population, denoted by $\sigma^2$ (“sigma-squared”).

If measuring variance of sample, denoted by $s^2$ (“s-squared”).

Measures average squared deviation of data points from their mean.

Highly affected by outliers. Best for symmetric data.

Problem is units are squared.

Population Variance

The variance of a random variable X is called the population variance and is denoted $\sigma^2$.

$\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i)$

Standard deviation

Sample standard deviation is square root of sample variance, and so is denoted by s.

Units are the original units.
Measures average deviation of data points from their mean.
Also, highly affected by outliers.

Population Standard Deviation

The population standard deviation is the square root of the population variance and is denoted $\sigma$.

$\sigma = \sqrt{\sum_i (x_i - \mu)^2 f(x_i)}$

What is the variance or standard deviation?

(MPH)

Variance or standard deviation

Sex     N    Mean    Median  TrMean  StDev  SE Mean
female  126  91.23   90.00   90.83   11.32  1.01
male    100  106.79  110.00  105.62  17.39  1.74

Sex     Minimum  Maximum  Q1     Q3
female  65.00    120.00   85.00  98.25
male    75.00    162.00   95.00  118.75

Females: s = 11.32 mph and s² = 11.32² = 128.1 mph²

Males: s = 17.39 mph and s² = 17.39² = 302.4 mph²

Coefficient of Variation (COV) – not covariance!

Ratio of sample standard deviation to sample mean multiplied by 100.

Measures relative variability, that is, variability relative to the magnitude of the data.

Unitless, so good for comparing variation between two groups.

Coefficient of variation (MPH)

Sex     N    Mean    Median  TrMean  StDev  SE Mean
female  126  91.23   90.00   90.83   11.32  1.01
male    100  106.79  110.00  105.62  17.39  1.74

Sex     Minimum  Maximum  Q1     Q3
female  65.00    120.00   85.00  98.25
male    75.00    162.00   95.00  118.75

Females: CV = (11.32/91.23) x 100 = 12.4

Males: CV = (17.39/106.79) x 100 = 16.3
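The two coefficients of variation can be recomputed from the summary table. This is a minimal sketch; the function name `coefficient_of_variation` is illustrative, and the inputs are the standard deviations and means reported on the slide.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Sample standard deviation divided by sample mean, multiplied by 100."""
    return std_dev / mean * 100

# Fastest-ever driving speed summary statistics from the slide (mph)
cv_female = coefficient_of_variation(11.32, 91.23)
cv_male = coefficient_of_variation(17.39, 106.79)

print(f"Females: CV = {cv_female:.1f}")
print(f"Males:   CV = {cv_male:.1f}")
```

Because the result is unitless, the two groups can be compared even though their means differ.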

Choosing Appropriate Measure of Variability

If data are symmetric, with no serious outliers, use range and standard deviation.

If data are skewed, and/or have serious outliers, use IQR.

If comparing variation across two data sets, use coefficient of variation.

Descriptive Statistics

Summarizing data using graphs

Which graph to use?

Depends on type of data
Depends on what you want to illustrate
Depends on available statistical software

Bar Chart

Summarizes categorical data.
Horizontal axis represents categories, while vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”).
Used to illustrate the differences in percentages (or counts) between categories.

[Figure: bar chart “Birth Order of Spring 1998 Stat 250 Students”, n=92 students; horizontal axis: Birth Order (Middle, Oldest, Only, Youngest); vertical axis: Percent (10 to 40)]

Histogram

Divide measurement up into equal-sized categories.
Determine number (or percentage) of measurements falling into each category.
Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories.
Label and title appropriately.

[Figure: histogram “Age of Spring 1998 Stat 250 Students”, n=92 students; horizontal axis: Age (in years), 18 to 27; vertical axis: Frequency (Count), 0 to 50]

Use common sense in determining number of categories to use.

(Trial-and-error works fine, too.)

Number of ranges (see Tufte)

[Figure: histogram “Age of Spring 1998 Stat 250 Students”, n=92 students; horizontal axis: Age (in years), 18 to 28; vertical axis: Frequency (Count), 0 to 60]

[Figure: histogram “GPAs of Spring 1998 Stat 250 Students”, n=92 students; horizontal axis: GPA, 2 to 4; vertical axis: Frequency (Count), 0 to 7]

Dot Plot

Summarizes measurement data.

Horizontal axis represents measurement scale.

Plot one dot for each data point.

[Figure: dot plot “Fastest Ever Driving Speed”, 226 Stat 100 Students, Fall '98; horizontal axis: Speed (70 to 160); separate rows for Women (n = 126) and Men (n = 100)]

Stem-and-Leaf Plot

Summarizes measurement data.

Each data point is broken down into a “stem” and a “leaf.”

First, “stems” are aligned in a column.

Then, “leaves” are attached to the stems.

Boxplot

smallest observation = 3.20
Q1 = 43.645
Q2 (median) = 60.345
Q3 = 84.96
largest observation = 124.27

[Figure: boxplot of these five values on a horizontal axis from 0 to 130]

Box Plot

“Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile.

– Whiskers are useful for identifying outliers.

“Outliers,” or extreme observations, are denoted by asterisks.

– Generally, data points falling beyond the whiskers are considered outliers.

Useful for comparing two distributions
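The whisker rule can be sketched numerically. The example below reuses the quartiles and extremes from the boxplot summary given earlier in the lecture (Q1 = 43.645, Q3 = 84.96, smallest = 3.20, largest = 124.27); the names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative assumptions.

```python
# Five-number summary values from the earlier boxplot slide
q1, q3 = 43.645, 84.96
smallest, largest = 3.20, 124.27

iqr = q3 - q1                  # length of the box
lower_fence = q1 - 1.5 * iqr   # whiskers may not extend below this value
upper_fence = q3 + 1.5 * iqr   # whiskers may not extend above this value

def is_outlier(value: float) -> bool:
    """True if the value lies more than 1.5 box lengths beyond either quartile."""
    return value < lower_fence or value > upper_fence

print(f"IQR = {iqr:.3f}, fences = ({lower_fence:.3f}, {upper_fence:.3f})")
print("smallest is outlier:", is_outlier(smallest))
print("largest is outlier:", is_outlier(largest))
```

For this data set both extremes fall inside the fences, so the whiskers would reach the smallest and largest observations and no asterisks would be drawn.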

[Figure: box plot “Amount of sleep in past 24 hours of Spring 1998 Stat 250 Students”; vertical axis: Hours of sleep (0 to 10)]

Using Box Plots to Compare

[Figure: side-by-side box plots “Fastest Ever Driving Speed”, 226 Stat 100 Students, Fall 1998; horizontal axis: Gender (female, male); vertical axis: Fastest Speed (mph), 60 to 160]

Scatter Plots

Summarizes the relationship between two measurement variables.

Horizontal axis represents one variable and vertical axis represents second variable.

Plot one point for each pair of measurements.

[Figure: scatter plot “Foot sizes of Spring 1998 Stat 250 students”, n=88 students; horizontal axis: Left foot (in cm), 22 to 31; vertical axis: Right foot (in cm), 22 to 31]

No relationship

[Figure: scatter plot “Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students”, n=89 students; horizontal axis: Head circumference (in cm), 52 to 62; vertical axis: Left forearm (in cm), 22 to 32]

Closing comments

Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don’t summarize your data too much or too little.
When creating graphs, label everything for others. Remember you are trying to communicate something to others!

Probability

You’ll probably like it!

Before we begin …

What is the probability that 2 or more people share the same birthday if …
– 5 people are in the sample?
– 23 people?
– 50 people?
– This class?
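One way to explore the birthday question is to compute the complement: the probability that all birthdays are distinct. The sketch below assumes 365 equally likely birthdays and independent people, which the slide does not state; the function name `p_shared_birthday` is illustrative.

```python
def p_shared_birthday(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday), assuming equally likely days."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

for group_size in (5, 23, 50):
    print(f"{group_size} people: {p_shared_birthday(group_size):.3f}")
```

Under these assumptions the probability passes one half at 23 people, which is what makes the question surprising.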

Probability Properties

The probability of an event “A” (the proportion of times the event is expected to occur in repeated experiments), is denoted P(A).

All probabilities are between 0 and 1 (i.e., 0 ≤ P(A) ≤ 1).

The sum of the probabilities of all possible outcomes must be 1.

Probability Basics

Given that a crash has occurred, what is the probability that it is a fatal crash?
– Possible events: fatal, injury, and property damage only (PDO)

Event            Crashes      Probability
Fatal               37,000    P(F) = 0.59%
Injury           2,026,000    P(I) = 32.16%
PDO              4,226,000    P(D) = 67.08%
Total Crashes    6,300,000
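The crash-severity probabilities are simple relative frequencies. The sketch below divides each count by the total stated on the slide (6,300,000, which appears to be rounded, since the three counts add up to 6,289,000); the dictionary keys and variable names are illustrative.

```python
# Crash counts by severity, as given on the slide
crash_counts = {"fatal": 37_000, "injury": 2_026_000, "pdo": 4_226_000}

# Total used on the slide (appears rounded; the counts themselves sum to 6,289,000)
total_crashes = 6_300_000

# Relative frequency of each event given that a crash has occurred
probabilities = {event: count / total_crashes for event, count in crash_counts.items()}

for event, p in probabilities.items():
    print(f"P({event}) = {p:.2%}")
```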

Complement

The complement of an event A, denoted by $\bar{A}$, is the set of outcomes that are not in A

$\bar{A}$ means A does not occur

$P(\bar{A}) = 1 - P(A)$

Some texts use $A^c$ to denote the complement of A

Union

The union of two events A and B, denoted by A U B, is the set of outcomes that are in A, or B, or both

If A U B occurs, then either A or B or both occur

Intersection

The intersection of two events A and B, denoted by AB, is the set of outcomes that are in both A and B.

If AB occurs, then both A and B occur

Combinations of Events

[Figure: Venn diagram within All Fatal Crashes (37,795) showing two overlapping sets, Single Vehicle Crashes (21,052) and Speed Related Crashes (13,357); callouts mark the union of fatal speed related and run-off the road crashes and the intersection of fatal and run-off the road crashes]

Addition Law

P(A U B) = P(A) + P(B) - P(AB)

(The probability of the union of A and B is the probability of A plus the probability of B minus the probability of the intersection of A and B)
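As a worked illustration of the addition law, the sketch below applies it to the fatal-crash counts used in this lecture's conditional probability example (37,795 fatal crashes, 21,052 single vehicle, 13,357 speed related, 8,600 both). The variable names are illustrative assumptions.

```python
# Fatal-crash counts from the lecture's crash example
total_fatal = 37_795
single_vehicle = 21_052
speed_related = 13_357
both = 8_600  # single vehicle AND speed related

p_sv = single_vehicle / total_fatal
p_sp = speed_related / total_fatal
p_sv_and_sp = both / total_fatal

# Addition law: P(A U B) = P(A) + P(B) - P(AB)
p_sv_or_sp = p_sv + p_sp - p_sv_and_sp
print(f"P(single vehicle or speed related) = {p_sv_or_sp:.4f}")
```

Subtracting the intersection avoids counting the 8,600 crashes that belong to both groups twice.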

Mutually Exclusive Events

Two events are mutually exclusive if their intersection is empty.

Two events, A and B, are mutually exclusive if and only if P(AB) = 0

P(A U B) = P(A) + P(B)

Conditional Probability

The probability of event A occurring, given that event B has occurred, is called the conditional probability of event A given event B, denoted P(A|B)

Multiplication Rule

General form: P(A|B) = P(A,B)/P(B), or equivalently P(A,B) = P(A|B)P(B)

e.g., what is the probability of a single vehicle accident given that it was speed related?

Conditional Probability Example

Total fatal crashes: 37,795
Total speed related crashes: 13,357
Total single vehicle crashes: 21,052
Total single vehicle, speed related crashes: 8,600

If the crash was speed related, what is the probability that it was a single vehicle crash?
– P(sv|sp) = 8,600/13,357 = 64.38%

If the crash was speed related, what is the probability that it was not a single vehicle crash?
– P(not sv|sp) = 1 – 0.6438 = 35.62%

[Figure: Venn diagram of All Fatal Crashes (37,795) with overlapping sets Single Vehicle Crashes (21,052) and Speed Related Crashes (13,357); the overlap SR+SV contains 8,600 crashes]

Conditional Probability Example (Cont)

Probability that a fatal crash was speed related = P(sp)
– 13,357/37,795 = 35.34%

Probability that a fatal crash was a single vehicle crash = P(sv)
– 21,052/37,795 = 55.70%

Probability that a fatal crash is both speeding related and a single vehicle crash = P(sv,sp)
– 8,600/37,795 = 22.75%
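The marginal, joint, and conditional probabilities in this crash example can be recomputed from the four counts. This is a minimal sketch with illustrative variable names; it works from the raw counts rather than the rounded percentages.

```python
# Fatal-crash counts from the slides
total_fatal = 37_795
speed_related = 13_357
single_vehicle = 21_052
both = 8_600  # single vehicle AND speed related

p_sp = speed_related / total_fatal   # P(sp)
p_sv = single_vehicle / total_fatal  # P(sv)
p_sv_and_sp = both / total_fatal     # P(sv, sp)

# Conditional probability: P(sv | sp) = P(sv, sp) / P(sp)
p_sv_given_sp = p_sv_and_sp / p_sp
p_not_sv_given_sp = 1 - p_sv_given_sp  # complement within speed related crashes

print(f"P(sp) = {p_sp:.4f}, P(sv) = {p_sv:.4f}, P(sv,sp) = {p_sv_and_sp:.4f}")
print(f"P(sv|sp) = {p_sv_given_sp:.4f}, P(not sv|sp) = {p_not_sv_given_sp:.4f}")
```

Dividing the joint probability by P(sp) gives the same value as dividing the counts directly, 8,600/13,357.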

Bayes’ Theorem

P(A|B)P(B) = P(B|A)P(A)
P(B|A) = P(A|B)P(B)/P(A)

P(sv) = 55.70%
P(sp) = 35.34%
P(sv|sp) = 64.38%
P(sp|sv) = ?

P(sp|sv) = ((0.6438)*(0.3534))/0.5570 = 0.4085
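Bayes' theorem for this example can be verified against the raw counts. The sketch below is illustrative (the function name `bayes` and the variable names are assumptions); it computes P(sp|sv) from P(sv|sp), P(sp), and P(sv), which gives about 0.41 and agrees with the direct ratio 8,600/21,052.

```python
def bayes(p_a_given_b: float, p_b: float, p_a: float) -> float:
    """P(B | A) = P(A | B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Fatal-crash counts from the slides
total_fatal = 37_795
speed_related = 13_357
single_vehicle = 21_052
both = 8_600

p_sp = speed_related / total_fatal
p_sv = single_vehicle / total_fatal
p_sv_given_sp = both / speed_related

# Probability that a single vehicle fatal crash was speed related
p_sp_given_sv = bayes(p_sv_given_sp, p_sp, p_sv)
print(f"P(sp|sv) = {p_sp_given_sv:.4f}")
```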

Bayes’ Theorem Problem

Given
– There were 11,696 off-road fixed object fatal crashes involving a single vehicle
– There were 13,357 fatal crashes involving a speeding vehicle
– There were 8,600 fatal crashes involving speeding and single vehicles
– There were 5,400 fatal crashes involving single vehicles, speeding, and off-road fixed object crashes
– The total number of fatal crashes is 37,795
– Given that a crash is speeding related, what is the probability that it will be an off-road single vehicle crash?


Bayes’ Problem Answer

What we need to know: P(or,sv/sp)
What we know
– P(or,sv) = 30.95%
– P(sp) = 35.34%
– P(sv) = 55.70%
– P(sv,sp) = 22.75%
– P(sp,or,sv) = 14.29%
– P(or,sv/sv) = 55.56%


Answer Continued

Multiplication Rule
– P(sp/or,sv)P(or,sv) = P(sp,or,sv)
– P(sp/or,sv) = P(sp,or,sv)/P(or,sv)
– 46.17% = 0.1429/0.3095
Bayes’ Theorem
– P(or,sv/sp) = (P(sp/or,sv)*P(or,sv))/P(sp)
– 40.43% = (0.4617*0.3095)/0.3534
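The two steps above (multiplication rule, then Bayes' theorem) can be traced numerically from the counts given in the problem statement. This is a sketch with illustrative variable names, not code from the lecture.

```python
# Counts from the problem statement.
total = 37_795
offroad_sv = 11_696         # off-road fixed object, single vehicle
speed_related = 13_357
sp_and_offroad_sv = 5_400   # speeding AND off-road single vehicle

p_or_sv = offroad_sv / total
p_sp = speed_related / total
p_sp_or_sv = sp_and_offroad_sv / total

# Step 1, multiplication rule: P(sp/or,sv) = P(sp,or,sv) / P(or,sv)
p_sp_given_or_sv = p_sp_or_sv / p_or_sv

# Step 2, Bayes' theorem: P(or,sv/sp) = P(sp/or,sv) * P(or,sv) / P(sp)
p_or_sv_given_sp = p_sp_given_or_sv * p_or_sv / p_sp

print(f"P(sp/or,sv) = {p_sp_given_or_sv:.4f}")
print(f"P(or,sv/sp) = {p_or_sv_given_sp:.4f}")
```

The final answer equals 5,400/13,357, which is the direct reading of "off-road single vehicle crashes among speeding related crashes".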


Independence

Two events A and B are independent if

P(A|B) = P(A)

or

P(B|A) = P(B)

or

P(AB) = P(A)P(B)
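As an illustration, the product test P(AB) = P(A)P(B) can be applied to the fatal crash counts from the conditional probability example, taking A = single vehicle and B = speed related. This is a minimal sketch. The 1% relative tolerance is an arbitrary assumption, since observed proportions rarely match exactly even when events are independent.

```python
import math

# Counts from the conditional probability example.
total = 37_795
speed_related = 13_357
single_vehicle = 21_052
both = 8_600

p_sv = single_vehicle / total
p_sp = speed_related / total
p_both = both / total

# Independence would require P(sv,sp) = P(sv) * P(sp).
product = p_sv * p_sp
independent = math.isclose(p_both, product, rel_tol=0.01)  # tolerance is an assumption

print(f"P(sv,sp)     = {p_both:.4f}")
print(f"P(sv)*P(sp)  = {product:.4f}")
print("independent?", independent)
```

Here the joint probability is noticeably larger than the product, which is consistent with P(sv/sp) being higher than P(sv) in the earlier example.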


Probability Concepts

Randomness
Independence


Thought Question 1

What does it mean to say that a deck of cards is “randomly” shuffled?
– Every ordering of the cards is equally likely
– There are about 8 followed by 67 zeros (52! ≈ 8 × 10^67) possible orderings of a 52 card deck
– Every card has the same probability to end up in any specified location


The question continued

A 52 card deck is randomly shuffled
How often will the tenth card down from the top be a Club?
– 1/4 of the time
– Every card has the same chance to end up 10th. There are 13 clubs and 13/52 = 1/4
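Both claims in this thought question can be checked with a short simulation. This sketch assumes Python's `random.shuffle` is an adequate stand-in for a "random" shuffle. The function name, seed, and trial count are illustrative choices.

```python
import math
import random

SUITS = ["Clubs", "Diamonds", "Hearts", "Spades"]
# A deck as (rank, suit) pairs; ranks 1..13 stand in for Ace..King.
deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]

# Number of possible orderings of 52 cards.
orderings = math.factorial(52)

def tenth_card_club_frequency(trials, seed=0):
    """Shuffle repeatedly; return how often the 10th card is a Club."""
    rng = random.Random(seed)
    cards = deck.copy()
    hits = 0
    for _ in range(trials):
        rng.shuffle(cards)
        if cards[9][1] == "Clubs":  # index 9 is the tenth card from the top
            hits += 1
    return hits / trials

freq = tenth_card_club_frequency(20_000)
print(f"52! has {len(str(orderings))} digits")
print(f"Tenth card was a Club in {freq:.3f} of shuffles")
```

The simulated frequency should land near 1/4, though it will vary a little with the seed and the number of trials.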


Law of Large Numbers

Relative frequency of an event gets closer to true probability as number of trials gets larger
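A simulated fair coin gives a feel for this statement. The sketch below tracks the relative frequency of heads at a few checkpoints. The checkpoints, seed, and function name are illustrative assumptions.

```python
import random

def running_frequency(n_flips, checkpoints, seed=1):
    """Flip a simulated fair coin; record the heads frequency at checkpoints."""
    rng = random.Random(seed)
    heads = 0
    results = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1
        if i in checkpoints:
            results[i] = heads / i
    return results

freqs = running_frequency(100_000, {10, 100, 1_000, 10_000, 100_000})
for n in sorted(freqs):
    print(f"{n:>7} flips: relative frequency of heads = {freqs[n]:.4f}")
```

Early frequencies can stray well away from 0.5. The later ones typically settle much closer to it, although the approach is not guaranteed to be monotone.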


Probability values

Probabilities are between 0 and 1
Total probabilities of all possible outcomes = 1
Probability = 1 means an event always happens
Probability = 0 means an event never happens


Does a prior event matter?

A fair coin is flipped four times.
First three flips are heads
What’s the probability that the fourth flip is heads?
– 1/2 assuming flips are independent
– Results of first three flips don’t matter


Independence

The chance that B happens is not affected by whether A has happened.


Does prior event matter?

Ten cards are drawn without replacement from a 52 card deck.
2 Aces are among these 10 cards
What’s the probability the next card is an Ace?
– 2/42 = 1/21
– After ten draws, 42 cards remain, 2 of them are Aces


Dependence

The chance that B happens is affected by whether A has happened.


Sequence of Events

You guess at five True/False questions. What’s the probability you get them all right?


Five right in five guesses

For each question, Pr(correct) = 1/2
Multiply probabilities
(1/2) x (1/2) x (1/2) x (1/2) x (1/2) = 1/32 = 0.031


Card Example

Two cards are taken from normal 52 card deck.

What’s the probability that both are Hearts?

Note - there’s dependence between the two cards

Answer = (13/52) x (12/51) = 1/17 = 0.059
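The effect of dependence shows up clearly when the same question is worked with and without replacement. This sketch uses exact fractions. The with-replacement case is added purely for contrast and is not part of the original slide.

```python
from fractions import Fraction

# Without replacement: the second draw depends on the first.
# After one Heart is removed, 12 Hearts remain among 51 cards.
without_replacement = Fraction(13, 52) * Fraction(12, 51)

# With replacement (contrast case): the draws would be independent.
with_replacement = Fraction(13, 52) * Fraction(13, 52)

print("Both Hearts, without replacement:", without_replacement,
      "=", round(float(without_replacement), 3))
print("Both Hearts, with replacement:   ", with_replacement,
      "=", round(float(with_replacement), 4))
```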


The Birthday Problem

What is the probability that at least two people in this class share the same birthday?


Assumptions

Only 365 days each year.
Birthdays are evenly distributed throughout the year, so that each day of the year has an equal chance of being someone’s birthday.


Take group of 5 people….

Let A = event no one in group shares same birthday.

Then A^C = event at least 2 people share same birthday.

P(A) = 365/365 × 364/365 × 363/365 × 362/365 × 361/365 = 0.973

P(A^C) = 1 - 0.973 = 0.027

That is, about a 3% chance that in a group of 5 people at least two people share the same birthday.


Take group of 23 people….

Let A = event no one in group shares same birthday.

Then A^C = event at least 2 people share same birthday.

P(A) = 365/365 × 364/365 × … × 343/365 = 0.493

P(A^C) = 1 - 0.493 = 0.507

That is, about a 50% chance that in a group of 23 people at least two people share the same birthday.


Take group of 50 people….

Let A = event no one in group shares same birthday.

Then A^C = event at least 2 people share same birthday.

P(A) = 365/365 × 364/365 × … × 316/365 = 0.03

P(A^C) = 1 - 0.03 = 0.97

That is, “virtually certain” that in a group of 50 people at least two people share the same birthday.
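The three group sizes above follow the same pattern, so the calculation generalizes to a small function. This is a sketch under the stated assumptions (365 equally likely birthdays). The function name is illustrative.

```python
import math

def prob_shared_birthday(n_people, n_days=365):
    """P(at least two of n_people share a birthday), assuming equally likely days."""
    # P(A): everyone has a different birthday.
    # The k-th person (counting from 0) must avoid the k birthdays already taken.
    p_all_different = math.prod((n_days - k) / n_days for k in range(n_people))
    # P(complement of A) = 1 - P(A)
    return 1 - p_all_different

for n in (5, 23, 50):
    print(f"group of {n:>2}: {prob_shared_birthday(n):.3f}")
```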


Two-way Tables

And various probabilities...


Two-way table of counts

Rows: gender     Columns: pierced ears

            N        Y      All
M          71       19       90
F           4       84       88
All        75      103      178

Cell Contents -- Count


Joint probabilities

Rows: gender     Columns: pierced ears

            N        Y      All
M          71       19       90
        39.89    10.67    50.56
F           4       84       88
         2.25    47.19    49.44
All        75      103      178
        42.13    57.87   100.00

Cell Contents -- Count
                 % of Tbl


Row conditional probabilities

Rows: gender     Columns: pierced ears

            N        Y      All
M          71       19       90
        78.89    21.11   100.00
F           4       84       88
         4.55    95.45   100.00
All        75      103      178
        42.13    57.87   100.00

Cell Contents -- Count
                 % of Row


Column conditional probabilities

Rows: gender     Columns: pierced ears

            N        Y      All
M          71       19       90
        94.67    18.45    50.56
F           4       84       88
         5.33    81.55    49.44
All        75      103      178
       100.00   100.00   100.00

Cell Contents -- Count
                 % of Col
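The three kinds of percentages (of table, of row, of column) differ only in the denominator used. The sketch below recomputes them from the four cell counts. The dictionary layout and helper name are illustrative choices, not the statistics package that produced the slides.

```python
# Cell counts from the two-way table: gender by pierced ears (N / Y).
counts = {
    "M": {"N": 71, "Y": 19},
    "F": {"N": 4, "Y": 84},
}

total = sum(sum(row.values()) for row in counts.values())
row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {c: sum(counts[g][c] for g in counts) for c in ("N", "Y")}

def pct(x):
    """Express a proportion as a percentage rounded to two decimals."""
    return round(100 * x, 2)

cells = [(g, c) for g in counts for c in ("N", "Y")]

# Joint: divide by the grand total.         e.g. P(M and N)
joint = {(g, c): pct(counts[g][c] / total) for g, c in cells}
# Row conditional: divide by the row total. e.g. P(N / M)
row_cond = {(g, c): pct(counts[g][c] / row_totals[g]) for g, c in cells}
# Column conditional: divide by the column total. e.g. P(M / N)
col_cond = {(g, c): pct(counts[g][c] / col_totals[c]) for g, c in cells}

for g, c in cells:
    print(g, c, joint[(g, c)], row_cond[(g, c)], col_cond[(g, c)])
```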


Expected Value

Coincidences


Roulette Color Bet

18 black, 18 red, and 2 green numbers
Bet on one of black or red
If correct, win $1
If wrong, lose $1


Is the bet fair?

Fair game: expected value is 0
Expected value = sum of (outcome x prob)
Exp. Val. = (+1)(18/38) + (-1)(20/38) = -2/38
Not fair since expected value is not 0.
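The "sum of (outcome x prob)" rule translates directly into code. This is a minimal sketch using exact fractions. The function and variable names are illustrative.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of payoff * probability over (payoff, probability) pairs."""
    return sum(payoff * prob for payoff, prob in outcomes)

# Color bet: win $1 on 18 of 38 slots, lose $1 on the other 20.
color_bet = [(+1, Fraction(18, 38)), (-1, Fraction(20, 38))]

ev_color = expected_value(color_bet)
print("Expected value per $1 color bet:", ev_color, "=", round(float(ev_color), 4))
print("Fair game?", ev_color == 0)
```

`Fraction` reduces -2/38 to -1/19, about a 5 cent loss per dollar bet in the long run.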


Color Bet versus Number bet

Both have same expected value
How are the bets the same?
– Long run result is same
How are they different?
– Short run results can be quite different


Prob of Five Straight Losses

Color Bet = (20/38)^5 = 0.04, 4%
Number Bet = (37/38)^5 = 0.88, 88%
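The streak probabilities are the single-bet loss probability raised to the fifth power, assuming independent spins. The sketch below also checks that the number bet has the same expected value as the color bet. The slides do not state the number bet payout, so the 35-to-1 figure is an assumption based on the usual American roulette payout.

```python
from fractions import Fraction

p_lose_color = Fraction(20, 38)   # 18 of 38 slots win a color bet
p_lose_number = Fraction(37, 38)  # 1 of 38 slots wins a single-number bet

# Five straight losses, assuming spins are independent.
streak_color = float(p_lose_color ** 5)
streak_number = float(p_lose_number ** 5)

# ASSUMPTION: a winning number bet pays 35 to 1 (not stated on the slides).
number_payout = 35
ev_number = number_payout * Fraction(1, 38) + (-1) * p_lose_number
ev_color = (+1) * Fraction(18, 38) + (-1) * p_lose_color

print(f"Five straight losses, color bet:  {streak_color:.2f}")
print(f"Five straight losses, number bet: {streak_number:.2f}")
print("Same expected value?", ev_number == ev_color)
```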


A Spectacular Coincidence ?

Many states draw four digit lottery numbers

Several years ago Mass. and N.H. both drew the same number on the same night

Associated Press wrote that this was a spectacular 1 in 100 million coincidence


Was Associated Press Right ?

Only if number picked is specified in advance of the draws.
Chance both pick the same pre-specified number, for example 2963, is (1/10,000) (1/10,000)
This is 1 in 100 million
But the match could have been on any of 10,000 possibilities


The correct analysis

First state could have picked any number
Chance the second state matches is 1/10,000
Answer for two specific states is 1/10,000
But there were 15 states doing this almost every night.


The prob that the 15 states all differ

First state can be any number
Prob second state differs = 9,999/10,000
Prob third state is unique = 9,998/10,000
And so on, for 15 states
Multiply these prob.'s to get probability that all 15 differ
Answer is about 0.99 that all picked different numbers


Prob at least two states are same

Opposite from all different
Prob at least two the same = 1 - Prob(all differ)
1 - 0.99 = 0.01
About 1 in 100; a far cry from 1 in 100 million
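The lottery calculation has the same structure as the birthday problem, with 10,000 possible numbers in place of 365 days and 15 states in place of people. This is a minimal sketch assuming each state draws uniformly and independently. The function name is illustrative.

```python
import math

def prob_all_differ(n_states, n_numbers=10_000):
    """P(all states draw different numbers), assuming uniform independent draws."""
    # The k-th state (counting from 0) must avoid the k numbers already drawn.
    return math.prod((n_numbers - k) / n_numbers for k in range(n_states))

p_differ = prob_all_differ(15)
p_match = 1 - p_differ  # at least two states draw the same number

print(f"P(all 15 differ)         = {p_differ:.4f}")
print(f"P(at least two the same) = {p_match:.4f}")

# For comparison: both of two specific states hit one pre-specified number.
p_prespecified = (1 / 10_000) ** 2
print(f"P(pre-specified match)   = {p_prespecified:.0e}")
```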