chapter 5 exploring data: distributions - seongchun kwon's …skwon.org/statistics.pdf ·  ·...

Chapter 5 Exploring Data: Distributions 5.1. Displaying Distributions: Histograms

Upload: phamngoc

Post on 24-Mar-2018


Page 1: Chapter 5 Exploring Data: Distributions - Seongchun Kwon's …skwon.org/Statistics.pdf ·  · 2012-10-15Group neighboring data values into consecutive non-overlapping ... 40 60 80

Chapter 5 Exploring Data: Distributions

5.1. Displaying Distributions: Histograms


Learning Objective

• How to read statistical data?

• How to visualize statistical data?

• How to draw a histogram (bar graph)?


A small part of a data set collected from the students in a large statistics class by anonymous responses to a class questionnaire.


Notation

• First column: student number, 2 to 8

• Column A: sex (female or male)

• Column B: handedness (right-handed or left-handed)

• Column C: height in inches

• Column D: time spent studying, in minutes per weeknight

• Column E: number of coins each student is carrying


How can we read the data in the table?

• What is the height of student number 2?

• Who spends the least amount of time studying per weeknight?

• How many students are right-handed?

• How many students study at least 2 hours per weeknight?


• What does each row represent?

• What does each column represent?


Terminology

• A variable is any characteristic of an individual.

• Individuals are the objects described by a set of data.


Variables, Individuals

• What are variables in the table?

• How many individuals are we considering?


The distribution of a variable gives information (as a table, graph, or formula) about how often the variable takes certain values or intervals of values.


Examples of Frequency Distributions

• Distribution of Sex

• Distribution of Coins

Value F M

Frequency


Relative frequency distribution

The relative frequency distribution of a variable states all observed values of the variable and what fraction (or percentage) of the time each value occurs.

$$\text{relative frequency} = \frac{\text{frequency}}{\text{total number of individuals}}$$
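As a minimal sketch of this formula, the following Python code builds a frequency and a relative frequency distribution. The list `sex` is hypothetical illustration data (it is not the class data set), and the function name `relative_frequency` is our own choice.

```python
from collections import Counter

def relative_frequency(values):
    """Return {value: (frequency, relative frequency)} for a list of observations."""
    counts = Counter(values)
    total = len(values)          # total number of individuals
    return {v: (c, c / total) for v, c in counts.items()}

# Hypothetical responses, for illustration only (not the class data).
sex = ["F", "M", "F", "F", "M", "F", "M", "F"]

dist = relative_frequency(sex)
for value, (freq, rel) in sorted(dist.items()):
    # rel is a decimal; the :.1% format shows the same number as a percent
    print(f"{value}: frequency = {freq}, relative frequency = {rel:.3f} ({rel:.1%})")
```

The relative frequencies always add up to 1 (that is, 100%), which is a quick check on the arithmetic.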


relative frequency distribution

• Conversion of decimal to percent and percent to decimal

• Relative frequency distribution of Sex

• Relative frequency distribution of Coins


Grouped frequency distribution

If there are many individuals, then it is better to analyze the data using a grouped frequency distribution.

Remark:

The visualization of a grouped frequency distribution is a histogram (bar graph).


Individuals, Variables

• What are individuals?

• How many variables do we have?


Procedure to make a grouped frequency distribution

1. Find the minimum and maximum data values.

2. Group neighboring data values into consecutive non-overlapping intervals: you need to decide the interval length which shows the data effectively.

3. Record the relevant frequencies.
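The three steps can be sketched in Python as follows. This is only an illustration: the function `grouped_frequency` and the values in `data` are hypothetical, and the classes are taken to include their left endpoint and exclude their right endpoint.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count the values in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)      # index of the class that contains v
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} lies outside the chosen classes")
        counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

# Hypothetical data values, for illustration only.
data = [0.7, 3.2, 4.9, 5.0, 8.1, 12.4, 17.6, 28.3, 42.1]

# Step 1: minimum and maximum.
print("min =", min(data), "max =", max(data))

# Steps 2 and 3: nine classes of length 5 covering 0 to 45, then the counts.
table = grouped_frequency(data, start=0, width=5, n_classes=9)
for low, high, count in table:
    print(f"{low:>3} to under {high:<3} {count}")
```

Changing `width` and `n_classes` is a quick way to see how the choice of interval length changes the table.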


Method to record the grouped frequency distribution: table or histogram (bar graph)

Class           Count
0.0 to 4.9      27
5.0 to 9.9      13
10.0 to 14.9    2
15.0 to 19.9    4
20.0 to 24.9    0
25.0 to 29.9    1
30.0 to 34.9    2
35.0 to 39.9    0
40.0 to 44.9    1

• In the table above, the data values range from 0.7 to 42.1. In this case, it is convenient to use classes that cover 0 to 45 with interval length 5.

• Data values are recorded to one decimal place. This affects how the classes are written in the table (0.0 to 4.9, 5.0 to 9.9, and so on), rather than how the histogram is drawn.


Histogram: What is the interpretation of bars a, b, and c?


Miss Smith's Math class has just taken a test. In order to come up with meaningful grades, Miss Smith will make a histogram to represent the distribution of grades.

Data

Student Grade

Bullwinkle 84

Rocky 91

Bugs 75

Daffy 68

Wylie 98

Mickey 78

Minnie 77

Lucy 86

Linus 94

Asterix 64

Obelix 59

Donald 54

Sam 89

Taz 76

1. What is the highest score?

2. What is the lowest score?

3. What does the horizontal axis in the histogram represent?

4. What does the vertical axis in the histogram represent?

5. Fill in the blanks in the table with five bins, and draw the histogram.

Class      Count
50 - 59
60 - 69
70 - 79
80 - 89
90 - 99


6. Construct the table with 10 bins and draw the histogram.

7. Which histogram shows students’ overall academic performance in a better way? Why?
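One way to check the two tables is the short Python sketch below, which counts the grades listed under Data and prints a rough text histogram for each choice of bins. The helper `bin_counts` is our own, and the ten-bin version assumes classes of length 5 (50 - 54, 55 - 59, and so on), which the exercise does not specify.

```python
grades = {
    "Bullwinkle": 84, "Rocky": 91, "Bugs": 75, "Daffy": 68, "Wylie": 98,
    "Mickey": 78, "Minnie": 77, "Lucy": 86, "Linus": 94, "Asterix": 64,
    "Obelix": 59, "Donald": 54, "Sam": 89, "Taz": 76,
}

def bin_counts(scores, low, width, n_bins):
    """Return a list of ((first, last), count) for whole-number score classes."""
    counts = [0] * n_bins
    for s in scores:
        counts[(s - low) // width] += 1    # index of the class that contains s
    return [((low + k * width, low + (k + 1) * width - 1), c)
            for k, c in enumerate(counts)]

five_bins = bin_counts(grades.values(), low=50, width=10, n_bins=5)
ten_bins = bin_counts(grades.values(), low=50, width=5, n_bins=10)   # assumed classes

for title, table in (("Five bins", five_bins), ("Ten bins", ten_bins)):
    print(title)
    for (first, last), count in table:
        # one '#' per student, so each printed row plays the role of one bar
        print(f"  {first} - {last}  {'#' * count} ({count})")
```

Comparing the two printouts side by side is a useful starting point for question 7.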


5.2. Interpreting Histograms


Global shape of histograms: Determining the global shape of a histogram can be somewhat subjective, although sometimes the histogram shows a clear pattern.

Skewed to the left (negatively skewed): the longer tail of the histogram is on the left side.

Skewed to the right (positively skewed): the longer tail of the histogram is on the right side.


Symmetric: the right and left sides of the histogram are approximately mirror images of each other

Non-specific


Outlier: an individual value (observation) that falls outside the overall pattern. It may show some particular aspect of the data. However, it may also suggest that there is an error in recording.


Important aspects to discuss when interpreting a histogram

Outliers, shape (peaks, skewed distribution, symmetric distribution), center (next section), spread (next section)


Example (Percent of Adult Population of Hispanic Origin by State in 2000 Census revisited)

Class           Count
0.0 to 4.9      27
5.0 to 9.9      13
10.0 to 14.9    2
15.0 to 19.9    4
20.0 to 24.9    0
25.0 to 29.9    1
30.0 to 34.9    2
35.0 to 39.9    0
40.0 to 44.9    1

How many peaks are in the graph?


What is the pattern of the graph? Skewed to the right? Skewed to the left? Or Symmetric?


What are outliers?


5.4. Describing Center: Mean and Median


Terminology

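• Mean: the average of the data values.

$$\bar{x} = \frac{\text{sum of all data values}}{\text{number of data values}} = \frac{x_1 + \cdots + x_n}{n}$$

• Median: the middle number in the ordered list. With n values, the median is the value at position

$$\frac{1}{2}(n + 1)$$

in the ordered list. When n is even, this position falls halfway between the two middle values, and the median is their average.

• Mode: the number that is repeated more often than any other. If no number is repeated, then there is no mode for the list.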


To calculate the mean, median, and mode, we need the individual data values. A histogram shows only how many values fall in each interval, so it does not give enough information to find the mean and median exactly. However, the mean, median, and mode do give some information about the shape of the histogram.


Example

Find the mean, median, and mode for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13
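A quick way to check a hand calculation for this list is Python's standard `statistics` module. The sketch below is only one possible check; the variable names are our own.

```python
from statistics import mean, median, multimode

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

ordered = sorted(values)
n = len(ordered)

print("ordered list:", ordered)
print("mean   =", mean(values))               # sum of all values / number of values
print("median position =", (n + 1) / 2)       # position in the ordered list
print("median =", median(values))
print("mode(s) =", multimode(values))         # most frequent value(s), as a list
```

Note that `multimode` returns every value when no value is repeated, whereas the definition above says that such a list has no mode.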


Remark

Mean and median don’t have to be a value from the original list.


Remark

Half of the values in the data set lie below the median and half lie above the median.


Remark

The median is the most commonly quoted figure used to measure property prices. The use of the median avoids the problem of the mean property price which is affected by a few expensive properties that are not representative of the general property market.


Example

Find the mean, median, and mode for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13


Group Discussion

The marks of nine students in a physics test that had a maximum possible mark of 50 are given below:

50 35 37 32 38 39 36 34 35

Find the mean, median and mode of this set of data values. Round to the nearest tenth.


Group Discussion

We have a distribution which is skewed to the left.

1. Draw any histogram which is skewed to the left. You don’t have to mark any values.

2. List the median, mean, and mode in order from the least to the greatest. Explain why you think so.


5.5. Describing Spread: The Quartiles


Remark:

• The mean can be strongly affected by outliers. For example, several very expensive houses in Barbourville can raise the average value of houses in Barbourville.

• The median is affected much less by outliers.

• Comparing the mean and the median still tells us something: the mean is pulled toward the outliers or the longer tail, so a large gap between the two suggests outliers or skewness. To describe the spread more clearly, we consider the range and the quartiles.


Terminology:

• Range = largest value − smallest value

Quartiles

• First quartile (Q1) = lower quartile = 25th percentile: cuts off the lowest 25% of the data

• Second quartile (Q2) = median M = 50th percentile: cuts the data set in half

• Third quartile (Q3) = upper quartile = 75th percentile: cuts off the highest 25% of the data (equivalently, the lowest 75%)


Remark:

The method of calculation differs slightly depending on whether the data set contains an even or an odd number of values.


Example 1 (even number of data values)

After sorting, the city mileages of the 12 gasoline-powered midsized cars are:

15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27

1. Find the range.

2. Find the first quartile, median, third quartile.


Example 2 (odd number of data values)

After adding one more car with city mileage 48, the sorted city mileages of the 13 midsized cars are:

15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27, 48

1. Find the range.

2. Find the first quartile, median, third quartile.
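The two examples can be checked with the Python sketch below. It assumes one common convention, which the slides do not spell out: Q1 is the median of the lower half and Q3 is the median of the upper half, and when the number of values is odd the overall median is left out of both halves. Other textbooks and software use slightly different rules and may give slightly different quartiles.

```python
from statistics import median

def quartiles(sorted_values):
    """Return (Q1, M, Q3) using the median-of-halves convention.

    Assumption: when the number of values is odd, the overall median
    is excluded from both the lower and the upper half.
    """
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half:] if n % 2 == 0 else sorted_values[half + 1:]
    return median(lower), median(sorted_values), median(upper)

example1 = [15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27]
example2 = example1 + [48]

for name, data in (("Example 1", example1), ("Example 2", example2)):
    q1, m, q3 = quartiles(data)
    print(f"{name}: range = {data[-1] - data[0]}, Q1 = {q1}, M = {m}, Q3 = {q3}")
```

Notice how the single value 48 changes the range a great deal but moves the quartiles only slightly.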


5.6. The Five-Number Summary and Boxplots


The five-number summary

Minimum Q1 M Q3 Maximum


Boxplot (or box-and-whisker diagram)

• A visualization of the five-number summary.

• Boxplots can be drawn either horizontally or vertically.

• A boxplot helps visualize the rough shape of the histogram, because it shows information on spread, skewness, and outliers.


Boxplot

[Figure: the same boxplot drawn two ways, vertical visualization and horizontal visualization]


Group Activity

Below are the exam scores of 30 students. Make a boxplot of these data.

24 31 38 49 51 55 56 59 62

63 65 66 69 72 72 74 76 81

84 84 86 86 86 88 88 88 91

91 92 99
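Before drawing the boxplot, it helps to compute the five-number summary. The Python sketch below does this for the exam scores above; the function `five_number_summary` is our own, and it assumes the median-of-halves convention for the quartiles (the overall median is left out of both halves when the number of values is odd).

```python
from statistics import median

scores = [24, 31, 38, 49, 51, 55, 56, 59, 62,
          63, 65, 66, 69, 72, 72, 74, 76, 81,
          84, 84, 86, 86, 86, 88, 88, 88, 91,
          91, 92, 99]

def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]           # lower half of the ordered list
    upper = data[(n + 1) // 2:]      # upper half (skips the median when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary(scores)
print("Minimum, Q1, M, Q3, Maximum =", summary)
```

The box of the boxplot runs from Q1 to Q3 with a line at M, and the whiskers extend to the minimum and the maximum.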


Answer


Group Activity

Below are the ages of 30 people who died in a city hospital in one month. Make a boxplot of these data.

7 22 25 31 37 38 41 48 49

50 55 58 62 62 64 65 66 66

72 75 76 76 76 85 86 88 88

88 92 94


Answer


5.7. Describing Spread: The Standard Deviation


Word meaning of deviation

Deviate: to turn aside or move away from what is considered a correct or normal course, standard of behavior, way of thinking, etc (deviated, deviating)


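Survey on the quality of a restaurant

[Figure: bar chart of the number of responses (vertical axis from 0 to 120) in each rating category: Terrible (1), Poor (2), Average (3), Very Good (4), Excellent (5)]

Responses by item (rating columns: Terrible (1), Poor (2), Average (3), Very Good (4), Excellent (5)):

Food          1   99
Service       5   15   30   35   15
Value         25  75
Atmosphere    10  90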


Terminology

(Sample) standard deviation s (of n observations x1, ..., xn): it measures the typical (roughly, average) amount by which the observations deviate from their mean.

Formula:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$


Remark

• The standard deviation is also denoted by σ, usually when it describes a whole population or a distribution such as a normal curve; s is used for a sample.

• The standard deviation is zero only when there is no spread, that is, when all observations have the same value.

• The standard deviation is more sensitive to outliers than the mean. If there are outliers, or the distribution is strongly skewed, then the standard deviation doesn’t give much information about the spread of the distribution.


In the following figures, the standard deviation of a is bigger than the standard deviation of b. However, the standard deviation of c may be bigger than that of a because of outliers.


Calculation of Standard deviation

• Calculate the mean.

• Write the list of deviations: deviation = observation value − mean.

• Write the list of the squared deviations.

• Add all values in the list of the squared deviations.

• Divide by n − 1, where n is the number of observations.

• Take the square root of the result.

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$


Example

On six consecutive Sundays, a tow-truck operator received 9, 7, 11, 10, 13, 7 service calls. Calculate s.

• Calculate the mean.

• Use

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
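The calculation for the tow-truck data can be followed step by step in the Python sketch below. The variable names are our own, and the last line compares the result with `statistics.stdev` from the standard library, which also divides by n − 1 and takes the square root.

```python
from math import sqrt
from statistics import stdev

calls = [9, 7, 11, 10, 13, 7]    # service calls on six consecutive Sundays

n = len(calls)
mean = sum(calls) / n                      # the mean
deviations = [x - mean for x in calls]     # observation value - mean
squared = [d ** 2 for d in deviations]     # squared deviations
total = sum(squared)                       # sum of the squared deviations
variance = total / (n - 1)                 # divide by n - 1
s = sqrt(variance)                         # take the square root

print("mean =", mean)
print("deviations =", deviations)
print("squared deviations =", squared)
print("sum of squared deviations =", total)
print("s =", round(s, 3))
print("agrees with statistics.stdev:", abs(s - stdev(calls)) < 1e-12)
```

Printing the intermediate lists makes it easy to compare each step with a hand calculation.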


Group Activity

Find the mean and then, standard deviation for the following data series:

12, 6, 7, 3, 15, 10, 18, 5


5.8. Normal distribution

Normal distribution (bell-shaped distribution): a distribution whose shape is described by a normal curve


Normal curve

• A normal curve is a smoothed-out histogram. It is symmetric, and its mean, median, and mode are all equal.

• The area under the curve is exactly 1.


Normal curve

The area under the curve between two vertical lines = the proportion (%) of all values of the variable that lie in that interval.


Examples of normal distributions

Heights of men or women, blood pressure, and marks on a standardized test such as the SAT.

Remark: If you say “That man is tall” or “That man has normal height”, you are speaking in a rough statistical sense. Relate that to the figure above.


Learning Objective of the remaining section

• Understanding the geometric meaning of the standard deviation in a normal curve.

• Use the standard deviation to obtain the first quartile(25%) and the third quartile(75%) in a normal curve


Concave Up and Down

Concave up

Concave down


Change of Curvature

The point where the curve changes its concavity.

Concave down to concave up

Or

Concave up to concave down


Standard deviation and Change of Curvature in normal curve

(Geometric Meaning)

Standard deviation = distance from the center to the change-of-curvature points on either side


Standard Deviation and Quartiles in Normal Distribution

(Quantitative meaning of standard deviation)

• First quartile ≈ mean − (0.67 × standard deviation)

• Third quartile ≈ mean + (0.67 × standard deviation)


Example

The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches.

a. What is the first quartile? What is the interpretation of this?

b. What is the third quartile?

c. Between what two values do the middle 50% of heights lie?
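The quartile rule can be tried on this example with the Python sketch below. The second part is an addition of ours, not part of the slides: it uses `NormalDist` from Python's standard `statistics` module to compute the exact normal quartiles, which shows that 0.67 is a rounded value.

```python
from statistics import NormalDist

mean, sd = 64.5, 2.5      # heights in inches, taken from the example

# Rule from the slides: the quartiles lie 0.67 standard deviations from the mean.
q1_rule = mean - 0.67 * sd
q3_rule = mean + 0.67 * sd
print(f"rule:  Q1 = {q1_rule:.3f}, Q3 = {q3_rule:.3f}")

# Cross-check with the exact 25th and 75th percentiles of the normal curve.
heights = NormalDist(mu=mean, sigma=sd)
q1_exact = heights.inv_cdf(0.25)
q3_exact = heights.inv_cdf(0.75)
print(f"exact: Q1 = {q1_exact:.3f}, Q3 = {q3_exact:.3f}")
```

The interval from Q1 to Q3 is the answer to part c, because the middle 50% of a distribution lies between its first and third quartiles.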


Example

The scores of students on a standardized test form a normal distribution with a mean score of 500 and a standard deviation of 100. Between what two values do the middle 50% of scores lie?


Example

The distribution of the scores on a standardized exam is approximately normal with mean 250 and standard deviation 20. Between what two values do the middle 50% of scores lie?


Summary of Normal Curve

• Area under the normal curve= 1

• Middle 50% = from the first quartile to the third quartile

• First quartile ≈ mean − (0.67 × standard deviation)

• Third quartile ≈ mean + (0.67 × standard deviation)


5.9. The 68-95-99.7 Rule


If you use the 68-95-99.7 rule, then you can interpret the data more effectively. The following figure illustrates the 68-95-99.7 rule when the mean is 0 and the standard deviation is 1.


Normal Distributions 68-95-99.7 Rule

• 68% of the observations fall within 1 standard deviation of the mean.

• 95% of the observations fall within 2 standard deviations of the mean.

• 99.7% of the observations fall within 3 standard deviations of the mean.
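The three percentages are rounded areas under the normal curve. As a check (an addition of ours, not part of the slides), the Python sketch below computes those areas for the normal curve with mean 0 and standard deviation 1, using `NormalDist` from the standard `statistics` module.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve between -k and +k standard deviations, for k = 1, 2, 3.
areas = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s) of the mean: {area:.2%}")
```

The printed values are close to 68%, 95%, and 99.7%, which is where the rule gets its name.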


Example 1 ( Heights of American Women)

The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches. Use the 68-95-99.7 rule to interpret the data.

Calculate the intervals of 1, 2 and 3 standard deviations:

• The interval of 1 standard deviation:

• The interval of 2 standard deviations:

• The interval of 3 standard deviations:


Example 1 ( Heights of American Women)

• The interval of 1 standard deviation: [64.5 − 2.5, 64.5 + 2.5] = [62, 67]

• The interval of 2 standard deviations: [64.5 − 2 × 2.5, 64.5 + 2 × 2.5] = [59.5, 69.5]

• The interval of 3 standard deviations: [64.5 − 3 × 2.5, 64.5 + 3 × 2.5] = [57, 72]

Apply the 68-95-99.7 rule.

• About 68% of young women are between 62 and 67 inches tall.

• About 95% of young women are between 59.5 and 69.5 inches tall.

• About 99.7% of young women are between 57 and 72 inches tall.

• About 2.5% of young women are taller than 69.5 inches: the 5% outside the 2-standard-deviation interval is split equally between the two tails, because the normal curve is symmetric.
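The tail percentage follows from the symmetry of the normal curve, and the same arithmetic gives the percentage between two cut points. The Python sketch below illustrates both for the heights example; the names `RULE`, `percent_above`, and `percent_between` are our own, and the results are only as accurate as the rounded 68-95-99.7 figures.

```python
# Percent of observations within k standard deviations of the mean (68-95-99.7 rule).
RULE = {1: 68.0, 2: 95.0, 3: 99.7}

def percent_above(k):
    """Percent of observations more than k standard deviations above the mean."""
    return (100.0 - RULE[k]) / 2           # the two tails are equal by symmetry

def percent_between(k_low, k_high):
    """Percent between k_low and k_high standard deviations above the mean."""
    return (RULE[k_high] - RULE[k_low]) / 2

mean, sd = 64.5, 2.5                       # heights in inches, from Example 1
for k in (1, 2, 3):
    print(f"taller than {mean + k * sd} inches: about {percent_above(k):.2f}%")
print(f"between {mean + sd} and {mean + 2 * sd} inches: "
      f"about {percent_between(1, 2):.1f}%")
```

The same two helper functions apply to any normal distribution once a value is expressed as a number of standard deviations from the mean.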


Example

The scores of students on a standardized test form a normal distribution with a mean of 400 and a standard deviation of 30. One thousand students took the test. Find the number of students who score above 460.


Example

The distribution of the scores on a standardized exam is approximately normal with mean 100 and standard deviation 15. What percentage of scores lie between 115 and 130?


Chapter 7 Data for Decisions


Sampling Terminology and Examples

• Population: The population is the entire group from which you are getting information.

Example: The population for a study of childhood cancer in the USA is all childhood cancer patients in the USA.

• Sample: A sample is used when data are collected from only part of the population. This sample must be representative of the population. Valid conclusions are obtained when the sample results represent those of the population.

Example: A sample might be childhood cancer patients in the largest children’s hospital in each state.

• Census: When a census is conducted, data are collected from the entire population.

Potential Problems

Problems may arise in data collection if a person does not consider bias, use of language, ethics, cost and time, timing, privacy, and cultural sensitivity.

Potential Problem, What It Means, Example

• Bias
What it means: The question influences answers in favor of, or against, the topic of the data collection.
Example: Suppose a person asks, ‘Don’t you think the calories in McDonald’s foods are too high?’ This person has a bias against the calories in McDonald’s foods. The bias influences how the survey questions are written.

• Use of Language
What it means: The language used in a question could lead people to give a particular answer.
Example: ‘Don’t you think the calories in McDonald’s foods are too high?’ may lead people to answer yes. A better question would be ‘Do you think the calories in McDonald’s foods are too low, low, medium, high, or too high?’

• Timing
What it means: When the data are collected could lead to particular results.
Example: A survey is conducted to find opinions on the need for winter tires. The answers may vary depending on whether Barbourville has had a lot of snow at the time of the survey.

Potential Problem, What It Means, Example

• Privacy
What it means: If the topic of the data collection is personal, a person may not want to participate or may give an untrue answer on purpose. Anonymous surveys may help.
Example: Suppose you are a grade 9 teacher and plan to conduct a survey about smoking in the classes you teach. Students who smoke may be afraid of punishment and may try to avoid participating in the survey.

• Cultural Sensitivity
What it means: Cultural sensitivity means you are aware of other cultures. You must avoid being offensive and avoid asking questions that do not apply to that culture.
Example (of what to avoid): You go to a Muslim community and survey people on their favorite method of cooking pork. For example, ‘Circle your favorite method of cooking pork: Fry, Barbecue, Bake.’

Potential Problem, What It Means, Example

• Ethics
What it means: Ethics dictate that the collected data must not be used for purposes other than those told to the participants. Otherwise, your actions are considered unethical.
Example: Suppose you tell your classmates that you want to know their favorite snacks to help you plan your birthday party. If you then use that information to sell those snacks to your classmates, that is unethical.

• Cost
What it means: The cost of collecting data must be taken into account.
Example: Printing questionnaires; paying people to collect data.

• Time
What it means: The time needed for collecting data must be considered.
Example: A survey that takes an hour to complete may be too long. This will limit the number of people who are willing to participate.

Inferences

Learning Objective

Statistical Inference: How do we generalize from data collected in samples to the whole population?

• Parameter vs. Statistic

• Confidence interval: related to the 68-95-99.7 rule

• Central limit theorem: related to the size of each sample

• Law of large numbers: related to the number of samplings (experiments)

A survey question:

‘I like buying new clothes, but shopping is often frustrating and time-consuming.’

Circle your opinion:

Agree Disagree

Sampling method: nationwide random selection of 2500 adults

Terminology, Meaning, Example

• Statistical inference
Meaning: Methods for drawing conclusions about the entire population on the basis of data from a sample.
Example: Drawing conclusions about an entire population of 230 million American adults.

Terminology, Meaning, Example

• Parameter (notation: p)
Meaning: A fixed (usually unknown) number that describes a population, such as a proportion, mean, or standard deviation.
Example: Suppose that 60% of the entire American adult population agreed. Then 0.6, or 60%, is the parameter.

• Statistic (notation: $\hat{p}$)
Meaning: A number that describes a sample. It is known for the sample we take, varies from sample to sample, and is useful for estimating an unknown parameter.
Example: Suppose that 1650 adults from the random sample of 2500 adults answered that they agree. Then the statistic is the proportion $\hat{p} = 1650/2500 = 0.66$ (or 66%).

To draw a conclusion for the entire American adult population, we take the following steps:

Steps, Terminology, Property, and Example

• Simulation: Drawing many samples at random from a population that we specify.
Assumption in SRS (simple random sampling): All possible samples of n objects are equally likely to occur.
Example: Using a computer program, draw 1000 separate simple random samples of size 100 from a population that we suppose has parameter value p = 0.6.


• Step: Take a large number of random samples from the same population.
Example: Draw 1000 separate samples of size 100 from a population that we suppose has parameter value p = 0.6.

• Step: Calculate the sample proportion $\hat{p}$ for each sample.
Example: $\hat{p} = \dfrac{\text{count of successes in the sample}}{\text{size of sample}} = \dfrac{\text{number of Agree}}{100}$
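The two steps above (draw many samples, then compute the sample proportion of each) can be sketched in Python. This is a minimal sketch, not the program used for the slides: it assumes the population is large enough that each respondent can be modeled as agreeing independently with probability p, and the function name, the `seed` argument, and the default values are illustrative choices.

```python
import random

def simulate_sample_proportions(p=0.6, sample_size=100, num_samples=1000, seed=0):
    """Draw num_samples samples and return the sample proportion of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each respondent agrees with probability p, independently of the others.
        agree_count = sum(rng.random() < p for _ in range(sample_size))
        proportions.append(agree_count / sample_size)
    return proportions

p_hats = simulate_sample_proportions()
print(len(p_hats), "sample proportions; the first five are", p_hats[:5])
```

Each entry of `p_hats` is one statistic; the values differ from sample to sample even though the parameter p stays fixed at 0.6.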


• Step: Make a histogram of the values of $\hat{p}$.


• Step: Examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations.
Example: When we analyze the curve, we get the following information. Center and spread are based on the actual calculation with the specific sampling (1000 separate samples of size 100).

Shape: The sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.

Center: 0.598

Spread: Standard deviation = 0.051
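A rough Python sketch of this examination step follows. It reruns a simulation of 1000 samples of size 100 with p = 0.6, again modeling each respondent as an independent draw, so its center and spread will be close to, but not identical to, the 0.598 and 0.051 reported above. The seed and the bin width of 0.05 are arbitrary choices made here.

```python
import random
import statistics
from collections import Counter

rng = random.Random(1)
p, n, num_samples = 0.6, 100, 1000

# Number of "Agree" answers in each simulated sample of size n.
counts = [sum(rng.random() < p for _ in range(n)) for _ in range(num_samples)]
p_hats = [count / n for count in counts]

center = statistics.mean(p_hats)   # center of the sampling distribution
spread = statistics.stdev(p_hats)  # spread (standard deviation)
print(f"center = {center:.3f}, spread = {spread:.3f}")

# Text histogram with bins of width 0.05. Because n = 100, the count in a
# sample equals its percentage, so integer arithmetic gives exact bins.
bins = Counter(count // 5 * 5 for count in counts)
for start in sorted(bins):
    bar = "#" * (bins[start] // 10)  # one '#' per 10 samples
    print(f"{start / 100:.2f}-{(start + 5) / 100:.2f} | {bar}")
```

The printed bars should look roughly bell-shaped and centered near 0.6, which is the shape, center, and spread the slide describes.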

Theorem (Sampling Distribution of a Sample Proportion)

Assumption: Choose an SRS of size n from a large population that contains population proportion p of successes.

• Shape: For large sample sizes (n ≥ 30), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)

• Center: The mean of the sampling distribution of $\hat{p}$ equals the parameter p.

• Spread: The standard deviation of the sampling distribution of $\hat{p}$ is $\sqrt{\dfrac{p(1-p)}{n}}$.

Theorem (Sampling Distribution of a Sample Proportion) and Example (Continued)

• Assumption
Theorem: Choose an SRS of size n from a large population that contains population proportion p of successes.
Example: n = 100. We took simple random samples from a population with proportion p = 0.6 of successes.

Theorem (Sampling Distribution of a Sample Proportion) and Example (Continued)

• Shape
Theorem: For large sample sizes (n ≥ 30), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)
Example: The sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.

• Center
Theorem: The mean of the sampling distribution of $\hat{p}$ equals the parameter p.
Example: The mean of the sampling distribution of $\hat{p}$ is 0.598 ≈ 0.6 = p (the parameter p).

• Spread
Theorem: The standard deviation of the sampling distribution of $\hat{p}$ is $\sqrt{\dfrac{p(1-p)}{n}}$.
Example: The standard deviation of the sampling distribution of $\hat{p}$ is $\sqrt{\dfrac{0.6(1-0.6)}{100}} = \sqrt{0.0024} \approx 0.04899$.
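The spread formula from the theorem is easy to evaluate directly. The helper below is a small illustrative sketch; the function name is chosen here and does not come from the slides.

```python
import math

def sd_of_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

# Example from the slides: p = 0.6 and samples of size n = 100.
theoretical_sd = sd_of_sample_proportion(0.6, 100)
print(round(theoretical_sd, 5))  # 0.04899
```

This theoretical value is close to the standard deviation of 0.051 observed in the simulation, as the theorem predicts.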

Terminology and Example

• Margin of error = $2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Meaning: We are 95% confident that the true p is within the range $\hat{p} \pm 2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$.
Why do we use the margin of error? In real life, we don’t know what the true parameter p is. Thus, we want to draw a conclusion about the true parameter p from the chosen simple random sample.

Example: With the sample we have been considering ($\hat{p} = 0.598$, n = 100), the margin of error is $2\sqrt{\dfrac{0.598(1-0.598)}{100}} \approx 0.098$.

We are 95% confident that the true p is within the range $\hat{p} \pm 2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = 0.598 \pm 0.098$, that is, from 0.500 to 0.696.

Remark: We can see that the true parameter p (= 0.6) is in the range from 0.500 to 0.696.
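The margin-of-error formula can be sketched in Python by substituting a sample proportion and a sample size. The function names are illustrative. The first call uses 0.598 and n = 100 from the running example; the second applies the same formula to the clothes-shopping survey (1650 of 2500 adults agreeing), a calculation the slides do not carry out.

```python
import math

def margin_of_error(p_hat, n):
    """Approximate 95% margin of error: 2 * sqrt(p_hat * (1 - p_hat) / n)."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(p_hat, n):
    """Return (low, high) = p_hat minus/plus the margin of error."""
    moe = margin_of_error(p_hat, n)
    return p_hat - moe, p_hat + moe

# Running example: sample proportion 0.598 with samples of size 100.
low, high = confidence_interval(0.598, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")

# Survey example: 1650 of 2500 adults agreed.
survey_low, survey_high = confidence_interval(1650 / 2500, 2500)
print(f"95% interval for the survey: {survey_low:.3f} to {survey_high:.3f}")
```

Because n appears under the square root, the sample of 2500 gives a much narrower interval than a sample of 100.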

Law of large numbers (informal)

As the number of times a situation is repeated becomes larger and larger, the proportion of successes (i.e., the proportion of repetitions on which the outcome of interest actually happens) tends to come closer and closer to the actual probability of success.

Law of large numbers (formal)

Observe any random phenomenon having numerical outcomes with finite mean $\mu$. As the random phenomenon is repeated a large number of times,

• The proportion of trials on which each outcome occurs gets closer and closer to the probability of that outcome.

• The mean $\bar{x}$ of the observed values gets closer and closer to $\mu$.
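The law of large numbers can be illustrated with a small simulation. This sketch assumes a success probability of 0.6 (matching the running example) and records the running proportion of successes at a few checkpoints; the function name, seed, and checkpoint values are arbitrary choices made here.

```python
import random

def running_proportion(p=0.6, trials=100_000, seed=0,
                       checkpoints=(10, 100, 1_000, 10_000, 100_000)):
    """Repeat a success/failure trial and record the proportion of successes so far."""
    rng = random.Random(seed)
    successes = 0
    results = {}
    for trial in range(1, trials + 1):
        successes += rng.random() < p  # adds 1 on a success, 0 otherwise
        if trial in checkpoints:
            results[trial] = successes / trial
    return results

results = running_proportion()
for trial, proportion in results.items():
    print(f"after {trial:>6} trials: proportion of successes = {proportion:.4f}")
```

With few trials the proportion can be far from 0.6; as the number of trials grows it settles close to the true probability.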