
Chapter 4 - Averages and Standard Deviation

PART II: DESCRIPTIVE STATISTICS

Dr. Joseph Brennan

Math 148, BU

Dr. Joseph Brennan (Math 148, BU) Chapter 4 - Averages and Standard Deviation 1 / 44

Description of Distributions

Similar to Chapter 3, we will only be handling variables that are quantitative in nature.

To describe the distribution of a quantitative variable we should specify :

The overall shape of the distribution:

Number of modes,

Types of symmetry,

Types of skew.

Numerical descriptions of the distribution. These are measures of:

Center,

Spread.


Measures of Center and Spread

We will consider 3 measures of the center of a distribution :

the mode,

the average (mean),

the median.

We will also discuss 2 measures of the spread of a distribution :

the standard deviation,

the interquartile range.

Notation

Suppose we have a data set which consists of n observations.

Denote the observations by x1, x2, ..., xn.

Here x1 is the first observation and xn is the last (n-th) observation.

The subscripts on the observations, xi, are just a way of keeping the n observations distinct. They do not necessarily indicate order or any other special facts about the data.


The Mode


The mode is the number that occurs most frequently in a given data set. A mode can be visually determined from a histogram, as it coincides with a peak. There may be several modes!


The Mean (Average)

The Mean

The mean is the numerical center of data. It is the common average by which you have been graded since childhood.

Typically, the mean of a set of data will be denoted by x̄. The mean of a population (found through a census) is denoted µ.

The mean x̄ for a set of observations is determined by adding all values together and dividing by the number of observations. Typically, the number of observations will be denoted by n.

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The Σ (capital Greek sigma) in the above formula is short for “add them all up”.


Example 1 (Student Height)

The heights (in inches) of 10 students are given below :

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

What is a mode height? There happen to be three:

68, 69, 71

What is the mean height?

$$\bar{x} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5 \text{ (inches)}$$
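The mode and mean of the height data can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, multimode

# Heights (in inches) of the 10 students from Example 1.
heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]

# multimode returns every value tied for the highest count,
# so it copes with data sets that have several modes.
modes = sorted(multimode(heights))
average = mean(heights)

print(modes)    # [68, 69, 71]
print(average)  # 68.5
```

`multimode` is used rather than `mode` because the latter reports only one value even when several are tied.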


Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour. The first 5 successive measurements,

63, 66, 69, 70, and 75,

were taken at 8, 9, 10, and 11 a.m. and at 12 noon, respectively.

What is the mode temperature? There isn’t one: every value occurs exactly once.

What is the mean temperature?

$$\bar{x} = \frac{63 + 66 + 69 + 70 + 75}{5} = 68.6 \text{ (degrees Fahrenheit)}$$


Example 2 (Temperature)

Suppose that we recorded the last number wrong and accidentally recorded a temperature of 105 instead of 75. How will the average change?

$$\bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 105}{5} = 74.6$$

We can track the change between the true mean and the mean found in error:

$$\bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 75}{5} + \frac{30}{5} = \bar{x}_{\text{true}} + 6$$

A single wrong temperature (an outlier!) has shifted the mean temperature up by 6 degrees!
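The same bookkeeping works for any single recording error: changing one of n observations by d moves the mean by d/n. A minimal Python sketch of that check, with illustrative variable names:

```python
from statistics import mean

true_temps = [63, 66, 69, 70, 75]
error_temps = [63, 66, 69, 70, 105]  # last reading mistyped as 105

# Observed shift in the mean caused by the bad reading.
shift = mean(error_temps) - mean(true_temps)

# Predicted shift: the size of the error divided by the number of observations.
predicted_shift = (105 - 75) / len(true_temps)

print(round(shift, 6), predicted_shift)
```

With only 5 observations, an error of 30 degrees is diluted by a factor of just 5, which is why the mean moves so much.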


Weakness of the Average

Example 2 illustrates an important weakness of the average.

The average is sensitive to the influence of extreme observations.

Extreme observations may be outliers, but a skewed distribution that has no outliers will also shift the mean towards the long tail. This will be discussed in detail later. In statistical language we say that the average is not a robust measure of the center.


Properties of the Average

1. The average is always between the smallest and the biggest number in the data set.

2. The average is the center of gravity of the histogram.

3. The average is not resistant (robust) to outliers or to skewness of a distribution. The average shifts towards the long tail of the distribution.

4. The average x̄ estimates the (unknown) population mean µ.

5. The average value x̄ is the best single-number predictor for a future value of a variable in the least-squares sense: no other single number has a smaller sum of squared deviations from the observed data.


The Median

The median is the midpoint of a distribution. For a data set, the median is the number such that half of the observations are smaller and the other half are larger. Typically, the median will be denoted x̃.

To find the median of a distribution:

1. Arrange all observations in order of size, from smallest to largest. Be sure to list all observations, even if the same values are repeated several times.

2. If the number of observations n is odd, the median x̃ is the center observation in the ordered list. The location of the median can be found by counting (n + 1)/2 observations up from the bottom of the list.

3. If the number of observations n is even, the median x̃ is any number between the two center observations in the ordered list.

When n is even, we will usually take the median x̃ to be the average of the two center observations in the ordered list.
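The procedure above can be sketched as a short Python function. The function name and the choice to raise an error on empty input are assumptions made for illustration; for even n it uses the usual convention of averaging the two center observations.

```python
def median(values):
    """Median by the sort-and-count method described above."""
    ordered = sorted(values)           # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # step 2: odd n, take the (n + 1)/2-th observation (index n // 2)
        return ordered[middle]
    # step 3: even n, average the two center observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```

Python's built-in `statistics.median` follows the same convention, so in practice it can be used instead of a hand-written version.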


Example 1 (Height) and 2 (Temperature) Revisited

Example 1. The sample size for the height of students is n = 10, which is even. The median (69) is the average of the two middle observations in the list:

62 65 68 68 69 69 70 71 71 72

Example 2. The sample size for the temperature is n = 5, which is odd. The median (69) is the 3rd observation in this list:

63 66 69 70 75

When the last temperature was recorded wrong, the median is again 69:

63 66 69 70 105

As we can see, this outlier does not change the median!

The median is resistant (robust) to the influence of extreme observations.


Properties of the Median

1. The median is always between the smallest and the biggest number in the data set.

2. The median is the value which divides the area of the histogram in half. The area of the histogram to the left of the median is equal to the area of the histogram to the right of the median.

3. The median is resistant (robust) to the influence of extreme observations and to the skewness of a distribution.


What Measure of Center is Applicable?

Different measures of center are appropriate in different situations. Consider several examples:

1. A shoe store is interested in which size of shoes is in greatest demand. This question is about the mode of the distribution of shoe sizes.

2. An economist studying household incomes is interested in the middle income value. The economist wants to de-emphasize the impact of the few very high incomes that are typically present in such a data set, and so is interested in the median income.

3. An instructor is asked what the average score on the test was. The instructor is interested in the mean grade.


What Measure of Center is Applicable?

It is important to note that there is no unique notion of center! A proper measure of center should be chosen based upon the study question and the available data.

Data whose distribution is roughly symmetric has a median, x̃, and mean, x̄, close together.

Note that for a perfectly symmetric distribution the mean equals the median. Real data sets, however, rarely have perfectly symmetric distributions.

Data whose distribution is highly skewed (either left or right) has a noticeably separated median and mean.

Note that for a highly skewed distribution the median is the preferred center because, unlike the mean, it is robust and is not pulled by outliers or by the long tail.


Center Measurements on a Histogram

Figure 5. Measures of center.


Measures of Spread

Example 3. Consider two sets of data:

Set A: 48, 49, 51, 53, 45, 47, 55, 50, 51, 51

Set B: 10, 65, 17, 89, 100, 40, 21, 99, 34, 25

Figure 6. Histograms for Sets A and B.

Both sets have the same mean of 50, but the spread of the distribution in set B is much greater than it is in set A.

The Sample Standard Deviation

The sample standard deviation measures the spread by looking at how far the observations are from their average x̄. Typically, the standard deviation will be denoted by s.

The formula for the standard deviation is:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

NOTE: There is an alternative way to compute s, which is more efficient in some cases:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\overline{x^2} - \bar{x}^2},$$

where $\overline{x^2}$ is the average of the squared data values.
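As a sanity check, both formulas can be coded and compared on Sets A and B from Example 3. The sketch below divides by n, as the slides do; the function names are illustrative.

```python
from math import sqrt


def sd_definition(data):
    """Root of the mean squared deviation from the average (divides by n)."""
    n = len(data)
    x_bar = sum(data) / n
    return sqrt(sum((x - x_bar) ** 2 for x in data) / n)


def sd_shortcut(data):
    """Alternative form: root of (average of squares minus square of average)."""
    n = len(data)
    x_bar = sum(data) / n
    mean_of_squares = sum(x * x for x in data) / n
    return sqrt(mean_of_squares - x_bar ** 2)


set_a = [48, 49, 51, 53, 45, 47, 55, 50, 51, 51]
set_b = [10, 65, 17, 89, 100, 40, 21, 99, 34, 25]

for name, data in (("A", set_a), ("B", set_b)):
    print(name, round(sd_definition(data), 2), round(sd_shortcut(data), 2))
```

Both sets have mean 50, yet Set B's standard deviation comes out roughly twelve times larger. In floating-point arithmetic the shortcut form can lose precision when the mean is large relative to the spread, so the definition form is usually the safer default in code.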


Interpreting the Standard Deviation

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

1. Begin with $x_i - \bar{x}$:

First compute the sample mean x̄.

Second, for each data point $x_i$, record the difference between $x_i$ and x̄.

$x_i - \bar{x}$ measures how far a data point deviates from the mean.

Deviations may be positive or negative. However, the sum of the deviations is zero.

2. Proceed to $(x_i - \bar{x})^2$:

Squaring the deviations makes them nonnegative.

3. Proceed to $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$:

We now find the mean of the squared deviations. Do not confuse this mean with the sample mean x̄.

4. Finish by taking a square root, effectively undoing the square.


Properties of the Sample Standard Deviation

1. s measures spread about the mean. Among the measures of center, the standard deviation is connected only to the mean.

2. s = 0 only when there is no spread. This happens only when all the observations are identical. Otherwise s is positive. As observations become more spread out about their mean, s gets larger.

3. s, like the average x̄, is not robust. Distributions with outliers and strongly skewed distributions have large standard deviations.

4. s has the same unit of measurement as the original observations.

5. For bell-shaped distributions, s can be interpreted as the deviation of a typical observation from the mean.


Example 2 (Temperature)

Let us compute the standard deviation for the original data set: 63, 66, 69, 70, 75.

Step 1. Compute the average x̄: x̄ = 68.6 (completed earlier).

Step 2. Compute the deviations: see column 2 below.

Step 3. Square the deviations: see column 3 below.

Observation   Deviation            Squared deviation
63            63 − 68.6 = −5.6     31.36
66            66 − 68.6 = −2.6      6.76
69            69 − 68.6 =  0.4      0.16
70            70 − 68.6 =  1.4      1.96
75            75 − 68.6 =  6.4     40.96
Sum           0                    81.2


Example 2 (Temperature)

Step 4. Average the squared deviations:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{5}(31.36 + 6.76 + 0.16 + 1.96 + 40.96) = 16.24$$

Step 5. Take the square root of the averaged squared deviations:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{16.24} \approx 4.03$$

We have computed that the mean temperature is 68.6 degrees Fahrenheit, while the standard deviation among the data points is about 4.03 degrees Fahrenheit.
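The five steps can be reproduced in Python as a check on the hand computation. This is a sketch with illustrative variable names. Note that the slides divide by n, which matches `statistics.pstdev`, whereas `statistics.stdev` divides by n − 1 and would give a different number.

```python
from math import sqrt
from statistics import pstdev

temps = [63, 66, 69, 70, 75]
n = len(temps)

x_bar = sum(temps) / n                      # Step 1: the average
deviations = [x - x_bar for x in temps]     # Step 2: deviations from the average
squared = [d ** 2 for d in deviations]      # Step 3: squared deviations
mean_squared = sum(squared) / n             # Step 4: average squared deviation
s = sqrt(mean_squared)                      # Step 5: square root

# Print the same three columns as the table in the example.
for x, d, sq in zip(temps, deviations, squared):
    print(f"{x:>4} {d:>7.1f} {sq:>8.2f}")
print(f"mean squared deviation = {mean_squared:.2f}, s = {s:.2f}")

# Cross-check against the library routine that also divides by n.
library_s = pstdev(temps)
```

The printed summary line should agree with the values 16.24 and 4.03 obtained by hand.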


Example 2 (Temperature)

Now let us recompute the standard deviation for the case when the last observation was recorded wrong (105 instead of 75). We have already calculated the mean for this case, $\bar{x}_{\text{error}} = 74.6$. Computing:

Observation   Deviation                Squared deviation
63            63 − 74.6 = −11.6        134.56
66            66 − 74.6 = −8.6          73.96
69            69 − 74.6 = −5.6          31.36
70            70 − 74.6 = −4.6          21.16
105           105 − 74.6 = 30.4        924.16
Sum           0                        1185.2


Examples

In this case, when the last observation is recorded wrong,

$$s_{\text{error}} = \sqrt{\frac{1185.2}{5}} \approx 15.40$$

which is almost 4 times greater than the standard deviation for the original data set.

Remember, the standard deviation is, like the mean, NOT robust.

Example 5. What is the standard deviation for the data

5, 5, 5, 5, 5, 5, 5, 5, 5, 5 ?
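Both points on this slide can be checked numerically. The sketch below uses `statistics.pstdev`, which divides by n as the slides do; the variable names are illustrative. The constant data set is the extreme case of the earlier property that s = 0 exactly when all observations are identical.

```python
from statistics import pstdev

true_temps = [63, 66, 69, 70, 75]
error_temps = [63, 66, 69, 70, 105]  # last reading mistyped as 105

s_true = pstdev(true_temps)
s_error = pstdev(error_temps)
ratio = s_error / s_true             # how much one outlier inflates s

constant_data = [5] * 10             # ten identical observations
s_constant = pstdev(constant_data)   # no spread at all

print(round(s_true, 2), round(s_error, 2), round(ratio, 2), s_constant)
```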


Visual Intuition

There are physically intuitive interpretations of mean, median and mode. Consider any sketch of a density histogram.

The peaks are the modes and are the easiest to spot.

The median is the first place where a vertical line splits the area under the density histogram equally.

The mean is the center of gravity of the histogram, thought of as a physical mass.


Visual Intuition

To illustrate the point that the mean plays the role of center of gravity, and why it should differ from the median, consider two density histograms which have similar shapes but are still different:

[Figure: two density histograms, each with its mean marked.]

Although the median stays the same, the mean moves to the right as the second peak moves to the right!

Percentiles

The p-th percentile of the data is a value such that p percent of the observations fall at or below it.

The median is the 50th percentile.

The most commonly used percentiles other than the median are the quartiles.

The first quartile, Q1, is the 25th percentile, and the third quartile, Q3, is the 75th percentile. The second quartile is the median itself.

NOTE: 50% of the observations are located between the quartiles Q1 and Q3.

Percentiles and, in particular, quartiles are useful numerical characteristics of the data distribution. The quartiles are used to compute the interquartile range.


The Interquartile Range

The Interquartile Range, IQR, is the distance between the first and third

IQR = Q3 − Q1.

To calculate the quartiles:

Arrange the observations in increasing order.

If the number of observations n is even, split the ordered data set into two halves of n/2 observations each. The first quartile Q1 is the median of the lower half of the data set. Similarly, the third quartile Q3 is the median of the upper half.

If the number of observations n is odd, split the ordered data set into two halves by excluding the central value (the median). Then find Q1 and Q3 as the medians of the corresponding halves.
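The quartile rule above can be sketched in Python as follows. The function names are illustrative, and the sketch assumes at least two observations. Statistical software often uses other quartile conventions (for example, interpolation-based ones), so library routines may return slightly different values on small data sets.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data."""
    if len(data) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # first n // 2 observations
    upper = ordered[-half:]   # last n // 2 observations; skips the median when n is odd
    return median_of_sorted(lower), median_of_sorted(upper)


def iqr(data):
    q1, q3 = quartiles(data)
    return q3 - q1


heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]  # even n
temps = [63, 66, 69, 70, 75]                        # odd n

print(quartiles(heights), iqr(heights))  # (68, 71) 3
print(quartiles(temps), iqr(temps))      # (64.5, 72.5) 8.0
```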


Example 1 (Heights)

The heights (in inches) of 10 students are given below:

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

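Notice that we are dealing with an even number of observations, so the ordered list splits into two halves of five values:

62 65 68 68 69 | 69 70 71 71 72

The medians of the two halves give Q1 = 68 and Q3 = 71, so IQR = 71 − 68 = 3.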


Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour for 5 hours:

63, 66, 69, 70, and 75.

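Notice that we are dealing with an odd number of observations, so the median (69) is excluded before splitting the ordered list:

63 66 | 69 | 70 75

The medians of the two halves give Q1 = 64.5 and Q3 = 72.5, so IQR = 72.5 − 64.5 = 8.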


Summary of Center and Spread

For the distribution of a quantitative variable we now have multiple ways to describe the center and spread:

Center:

Mean,

Median,

Mode.

Spread:

Standard Deviation,

Interquartile Range.


Linear Transformations

Suppose we have a set of numbers which we want to transform to another set of numbers which will have different units of measurement. We will consider only linear transformations of variables, which have the following form:

$$x_{\text{new}} = a + bx, \tag{1}$$

where $x_{\text{new}}$ is the variable in new units, x is the old variable, and a and b are numbers.

The key word is linear. Linear transformations graph as a straight line with y-intercept a and slope b.


Examples of Linear Transformations

The distance in kilometers is converted into distance in miles using the formula

$$M = 0.62K, \tag{2}$$

where M is the distance in miles, and K is the distance in kilometers. For instance, a 10-kilometer race covers 6.2 miles.

The temperature in degrees Fahrenheit is converted into temperature in degrees Celsius as

$$C = \frac{5}{9}(F - 32) = -\frac{160}{9} + \frac{5}{9}F, \tag{3}$$

where C is the temperature in degrees Celsius, and F is the temperature in degrees Fahrenheit. For instance, 95° F translates into 35° C while −40° F translates into −40° C.
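Both conversions fit the template a + bx, which a short Python sketch makes explicit. The helper `linear` and the other function names are illustrative, not part of the slides.

```python
from math import isclose


def linear(x, a, b):
    """General linear transformation: a + b * x."""
    return a + b * x


def km_to_miles(km):
    # Scale change: a = 0, b = 0.62 (the rounded factor used on the slide).
    return linear(km, a=0.0, b=0.62)


def fahrenheit_to_celsius(f):
    # Shift and scale together: a = -160/9, b = 5/9.
    return linear(f, a=-160 / 9, b=5 / 9)


print(round(km_to_miles(10), 2))             # 6.2
print(round(fahrenheit_to_celsius(95), 2))   # 35.0
print(round(fahrenheit_to_celsius(-40), 2))  # -40.0

# The a + b*F form agrees with the more familiar (5/9)(F - 32).
same_form = isclose(fahrenheit_to_celsius(212), (5 / 9) * (212 - 32))
```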


The Effect of a Linear Transformation.

How do the numerical measures of center and spread change after alinear transformation?

We will separately consider 2 special cases of linear transformation:

Data Shifts: a special case of the transformation (1) when b = 1:

$$x_{\text{new}} = x + a,$$

which corresponds to adding a constant a to every observation.

Scale Changes: a special case of the transformation (1) when a = 0, b ≠ 0:

$$x_{\text{new}} = bx,$$

which corresponds to multiplying each observation by a constant b (positive or negative). Transformation (2) is an example of a scale change.


Effects of a Shift Transformation

If we act upon a data set by a shift transformation:

$$x_{\text{new}} = x + a,$$

the changes in center and spread are as follows:

Mean: $\bar{x}_{\text{new}} = \bar{x} + a$

Median: $\tilde{x}_{\text{new}} = \tilde{x} + a$

Standard Deviation: $s_{\text{new}} = s$

1st Quartile: $Q_{1,\text{new}} = Q_1 + a$

3rd Quartile: $Q_{3,\text{new}} = Q_3 + a$

Interquartile Range: $\text{IQR}_{\text{new}} = \text{IQR}$


Example (Football team)

Wrong Scale: Every player of a high school football team was weighed using the same scale. It was discovered later that the scale read 10 lb under, so we need to add 10 lb to every weight. What would happen to each of the following measurements?

Characteristic       Original   After Adjustment
Average              230 lb     240 lb
Median               240 lb     250 lb
Q1                   200 lb     210 lb
Q3                   280 lb     290 lb
Standard Deviation   50 lb      50 lb
IQR                  80 lb      80 lb


Effects of a Scale Transformation

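If we act upon a data set by a scale transformation:

$$x_{\text{new}} = bx,$$

the changes in center and spread are as follows:

Mean: $\bar{x}_{\text{new}} = b\bar{x}$

Median: $\tilde{x}_{\text{new}} = b\tilde{x}$

Standard Deviation: $s_{\text{new}} = |b| \cdot s$, where $|\cdot|$ denotes absolute value.

1st Quartile: $Q_{1,\text{new}} = bQ_1$ (for b > 0)

3rd Quartile: $Q_{3,\text{new}} = bQ_3$ (for b > 0)

Interquartile Range: $\text{IQR}_{\text{new}} = |b| \cdot \text{IQR}$

If b < 0, the order of the data reverses, so the quartiles trade places: $Q_{1,\text{new}} = bQ_3$ and $Q_{3,\text{new}} = bQ_1$.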


Example 7 (Football team)

Now suppose we found out that we are supposed to report weights and all the summary measures in kilograms, not in pounds! Recall that 1 lb ≈ 0.453 kilograms. What would happen to each of the following measurements?

Characteristic       Original   After Adjustment
Mean                 230 lb     104 kg
Median               240 lb     109 kg
Q1                   200 lb     91 kg
Q3                   280 lb     127 kg
Standard Deviation   50 lb      23 kg
IQR                  80 lb      36 kg
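The adjusted column is just every summary multiplied by the scale factor b = 0.453 and rounded to the nearest whole number. A small Python sketch of that arithmetic, using the factor as rounded on the slide and illustrative names:

```python
LB_TO_KG = 0.453  # conversion factor as given on the slide

summary_lb = {
    "Mean": 230,
    "Median": 240,
    "Q1": 200,
    "Q3": 280,
    "Standard Deviation": 50,
    "IQR": 80,
}

# b is positive, so centers, quartiles, and spreads all scale by the same factor.
summary_kg = {name: round(LB_TO_KG * value) for name, value in summary_lb.items()}

for name, value in summary_kg.items():
    print(f"{name:<20}{value} kg")
```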


Effects of a General Linear Transformation

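If we act upon a data set by any linear transformation:

$$x_{\text{new}} = bx + a,$$

the changes in center and spread are as follows:

Mean: $\bar{x}_{\text{new}} = b\bar{x} + a$

Median: $\tilde{x}_{\text{new}} = b\tilde{x} + a$

Standard Deviation: $s_{\text{new}} = |b| \cdot s$, where $|\cdot|$ denotes absolute value.

1st Quartile: $Q_{1,\text{new}} = bQ_1 + a$ (for b > 0)

3rd Quartile: $Q_{3,\text{new}} = bQ_3 + a$ (for b > 0)

Interquartile Range: $\text{IQR}_{\text{new}} = |b| \cdot \text{IQR}$

If b < 0, the order of the data reverses, so the quartiles trade places: $Q_{1,\text{new}} = bQ_3 + a$ and $Q_{3,\text{new}} = bQ_1 + a$.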
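These rules can be spot-checked numerically. The sketch below converts the orchard temperatures from Fahrenheit to Celsius (b = 5/9 > 0) and compares summaries before and after. It also negates the data (b = −1) to show that with a negative slope the ordering reverses, so the two quartiles trade places while the IQR stays nonnegative. Function and variable names are illustrative, and the quartiles follow the median-of-halves rule from the earlier slide.

```python
from math import isclose
from statistics import mean, median, pstdev


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (needs n >= 2)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def summary(data):
    q1, q3 = quartiles(data)
    return {
        "mean": mean(data),
        "median": median(data),
        "sd": pstdev(data),  # divides by n, as in the slides
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


a, b = -160 / 9, 5 / 9  # Fahrenheit to Celsius
temps_f = [63, 66, 69, 70, 75]
temps_c = [a + b * x for x in temps_f]

old, new = summary(temps_f), summary(temps_c)
flipped = summary([-x for x in temps_f])  # b = -1, a = 0

for key in ("mean", "median", "q1", "q3"):
    print(key, isclose(new[key], a + b * old[key]))  # centers and quartiles: b*x + a
for key in ("sd", "iqr"):
    print(key, isclose(new[key], abs(b) * old[key]))  # spreads: |b| * x
print(flipped["q1"], flipped["q3"], flipped["iqr"])
```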


The Empirical Rule

As a general rule, if the data distribution is unimodal, roughly symmetric, and bell-shaped, then:

Approximately 68% of the observations fall within one standard deviation of the average, i.e., approximately 68% of data values are between x̄ − s and x̄ + s.

Approximately 95% of the observations fall within 2 standard deviations of the average, i.e., approximately 95% of data values are between x̄ − 2s and x̄ + 2s.

Approximately 99.7% of the observations fall within 3 standard deviations of the average, i.e., approximately 99.7% of data values (almost all the observations) are between x̄ − 3s and x̄ + 3s.
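One way to see the rule in action is to simulate bell-shaped data and count. The sketch below draws 10,000 values from a normal distribution with Python's `random` module; the sample size, seed, and names are arbitrary choices for illustration, and the shares it prints are approximate, as the rule itself is.

```python
import random
from statistics import mean, pstdev


def share_within(data, k):
    """Fraction of observations within k standard deviations of the average."""
    x_bar = mean(data)
    s = pstdev(data)
    low, high = x_bar - k * s, x_bar + k * s
    return sum(low <= x <= high for x in data) / len(data)


rng = random.Random(0)                                 # fixed seed for repeatability
sample = [rng.gauss(0, 1) for _ in range(10_000)]      # simulated bell-shaped data

shares = {k: share_within(sample, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

Running the same function on strongly skewed data would show the percentages drifting away from 68%, 95%, and 99.7%.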


The Empirical Rule

Figure 1. Empirical Rule Illustration.


WARNING: The Empirical Rule Strikes Back

Don’t throw caution (and, perhaps, lightsabers) to the wind while using the empirical rule! Keep in mind:

The Empirical Rule does NOT give the exact percentages of observations within one, two, or three standard deviations, just approximate percentages.

The Empirical Rule works well only for symmetric bell-shaped histograms.

The Empirical Rule will not be far off for lightly skewed distributions, but it can be very wrong for moderately or heavily skewed distributions.


Example (HANES5 study)

HANES5 (The Health and Nutrition Examination Survey) was a study done in 2003-2004 recording the height (in inches) of women (see p. 58).

The mean height was x̄ = 63.5 inches and the standard deviation s = 3 inches.

The histogram (found on the next slide) is approximately symmetric and bell-shaped, so the Empirical Rule should hold.


Example (HANES5 study)

The shaded region corresponds to the women with height within 1 standard deviation of the average:

$$\bar{x} \pm s = 63.5 \pm 3 = (60.5, 66.5)$$

The true area of the shaded region is 72%, which is fairly close to 68%.

The shaded region corresponds to the women with height within 2 standard deviations of the average:

$$\bar{x} \pm 2s = 63.5 \pm 2 \cdot 3 = (57.5, 69.5)$$

The true area of the shaded region is 97%, which is fairly close to 95%.
