part ii : descriptive statistics - binghamton...

Chapter 4 - Averages and Standard Deviation

PART II : DESCRIPTIVE STATISTICS

Dr. Joseph Brennan

Math 148, BU

Dr. Joseph Brennan (Math 148, BU) Chapter 4 - Averages and Standard Deviation 1 / 44

Description of Distributions

Similar to Chapter 3, we will only be handling variables that are quantitative in nature.

To describe the distribution of a quantitative variable we should specify :

The overall shape of the distribution:

Number of modes, Types of Symmetry, Types of Skew.

Numerical descriptions of the distribution. These are measures of

Center, Spread.


Measures of Center and Spread

We will consider 3 measures of the center of a distribution :

the mode,

the average (mean),

the median.

We will also discuss 2 measures of the spread of a distribution :

the standard deviation,

the interquartile range.

Notation

Suppose we have a data set which consists of n observations.

Denote the observations by x_1, x_2, ..., x_n.

Consider x_1 as the first observation and x_n as the last (n-th) observation.

The subscripts on the observations, x_i, are just a way of keeping the n observations distinct. They do not necessarily indicate order or any other special facts about the data.


The Mode


The mode is the value that occurs most frequently in a given data set. A mode can be visually determined from a histogram, as it will coincide with a peak. There may be several modes!


The Mean (Average)

The Mean

The mean is the numerical center of data. It is the common average by which you have been graded since childhood.

Typically, the mean of a set of data will be denoted by x̄. The mean of a population (found through a census) is denoted µ.

The mean x̄ for a set of observations is determined by adding all values together and dividing by the number of observations. Typically, the number of observations will be denoted by n.

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

The Σ (capital Greek sigma) in the above formula is short for “add them all up”.


Example 1 (Student Height)

The heights (in inches) of 10 students are given below :

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

What is a mode height? There happen to be three:

68 69 71

What is the mean height?

$$\bar{x} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5 \text{ (inches)}$$
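The two answers in Example 1 can be checked with a few lines of Python. This is only a sketch using the standard library `statistics` module; the variable names are illustrative and not part of the lecture.

```python
from statistics import mean, multimode

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]  # inches, from Example 1

# multimode returns every value tied for the highest frequency
modes = sorted(multimode(heights))
average = mean(heights)

print(modes)    # [68, 69, 71]
print(average)  # 68.5
```

`multimode` is used rather than `mode` because the data set has several modes, and `mode` would report only one of them.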


Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour. The first 5 successive measurements,

63, 66, 69, 70, and 75,

were taken at 8, 9, 10, and 11 a.m. and at 12 noon, respectively.

What is the mode temperature? There isn’t one!

What is the mean temperature?

$$\bar{x} = \frac{63 + 66 + 69 + 70 + 75}{5} = 68.6 \text{ (degrees Fahrenheit)}$$



Suppose that we recorded the last number wrong and accidentally recorded a temperature of 105 instead of 75. How will the average change?

$$\bar{x}_{error} = \frac{63 + 66 + 69 + 70 + 105}{5} = 74.6$$

We can track the change between the actual mean and the mean found in error:

$$\bar{x}_{error} = \frac{63 + 66 + 69 + 70 + 75}{5} + \frac{30}{5} = \bar{x}_{true} + 6$$

A single wrong temperature (an outlier!) has shifted the mean temperature up by 6 degrees!
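The size of the shift can be confirmed numerically. The snippet below is a small sketch with illustrative variable names; it compares the two means and the predicted shift of 30/n.

```python
from statistics import mean

true_temps = [63, 66, 69, 70, 75]
error_temps = [63, 66, 69, 70, 105]   # last value misrecorded as 105

shift = mean(error_temps) - mean(true_temps)

# The error adds 30 to the total, so the mean moves by 30 / n
expected_shift = (105 - 75) / len(true_temps)

print(round(shift, 2), expected_shift)
```

The same error in a larger data set would move the mean less, since the extra 30 is divided by a larger n.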


Weakness of the Average

The misrecorded temperature above illustrates an important weakness of the average.

The average is sensitive to the influence of extreme observations.

Extreme observations may be outliers, but a skewed distribution that has no outliers will also shift the mean towards the long tail. This will be discussed in detail later. In statistical language we say that the average is not a robust measure of the center.


Properties of the Average

1 The average is always between the smallest and the biggest number in the data set.

2 The average is the center of gravity of the histogram.

3 The average is not resistant (robust) to outliers or to skewness of a distribution. The average shifts towards the long tail of the distribution.

4 The average x̄ estimates the (unknown) population mean µ.

5 The average value x̄ is the best predictor for a future value of a variable.


The Median

The median is the midpoint of a distribution. For a data set the median is the number such that half of the observations are smaller and the other half are larger. Typically, the median will be denoted x̃.

To find the median of a distribution:

1 Arrange all observations in order of size, from smallest to largest. Be sure to list all observations, even if the same values are repeated several times.

2 If the number of observations n is odd, the median x̃ is the center observation in the ordered list. The location of the median can be found by counting (n + 1)/2 observations up from the bottom of the list.

3 If the number of observations n is even, the median x̃ is any number between the two center observations in the ordered list.

When n is even, we will usually take the median x̃ to be the average of the two center observations in the ordered list.
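The procedure above can be sketched in Python as follows. The function name `median` is an illustrative choice; the even case follows the convention of averaging the two center observations.

```python
def median(values):
    """Median by the procedure above; averages the two center values when n is even."""
    ordered = sorted(values)           # step 1: sort, keeping repeated values
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                     # step 2: odd n, position (n + 1) / 2 counting from 1
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even n


print(median([71, 70, 68, 69, 68, 65, 72, 69, 71, 62]))  # 69.0
print(median([63, 66, 69, 70, 75]))                      # 69
print(median([63, 66, 69, 70, 105]))                     # 69
```

The three calls use the height data, the temperature data, and the temperature data with the misrecorded value 105.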


Example 1 (Height) and 2 (Temperature) Revisited

Example 1 The sample size for the height of students is n = 10, which is even.

The median (69) is the average of the two middle observations in the list:

62 65 68 68 69 69 70 71 71 72

Example 2 The sample size for the temperature is n = 5, which is odd. The median is the 3rd observation in this list:

63 66 69 70 75

When the last temperature was recorded wrong, the median is again 69 :

63 66 69 70 105

As we can see, an outlier does not change the median!

The median is resistant (robust) to the influence of extreme observations.


Properties of the Median

1 The median is always between the smallest and the biggest number in the data set.

2 The median is the value which divides the area of the histogram in half. The area of the histogram to the left of the median is equal to the area of the histogram to the right of the median.

3 The median is resistant (robust) to the influence of extreme observations and to the skewness of a distribution.


What Measure of Center is Applicable?

Different measures of center are appropriate in different situations. Consider several examples:

1 A shoe store is interested in which size of shoes is in greatest demand. This question is about the mode of the distribution of shoe sizes.

2 An economist studying household incomes is interested in the middle income value and wants to de-emphasize the impact of the few very high incomes that are typically present in such a data set. The economist is interested in the median income.

3 An instructor is asked what the average score on the test was. The instructor is interested in the mean grade.


What Measure of Center is Applicable?

It is important to note that there is no unique notion of center! A proper measure of center should be chosen based on the study question and on the available data.

Data whose distribution is roughly symmetric has a median, x̃, and mean, x̄, close together.

Note that for a perfectly symmetric distribution the mean equals the median. Real data sets, however, never have perfectly symmetric distributions.

Data whose distribution is highly skewed (either left or right) has a noticeably separate median and mean.

Note that for a highly skewed distribution the median is the preferred center, as it is robust, unlike the mean, and isn't affected by outliers.


Center Measurements on a Histogram

Figure 5. Measures of center.


Measures of Spread

Example 3 Consider two sets of data :

Set A: 48, 49, 51, 53, 45, 47, 55, 50, 51, 51

Set B: 10, 65, 17, 89, 100, 40, 21, 99, 34, 25

Figure 6. Histograms for Sets A and B.

Both sets have the same mean of 50, but the spread of the distribution in set B is much greater than it is in set A.

The Sample Standard Deviation

The sample standard deviation measures the spread by looking at how far the observations are from their average x̄. Typically, the standard deviation will be denoted by s.

The formula for the standard deviation is:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

NOTE: There is an alternative way to compute s, which is more efficient in some cases:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\overline{x^2} - \bar{x}^2},$$

where $\overline{x^2}$ is the average of the squared data values.
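As a rough check, both formulas can be coded and compared on Sets A and B from Example 3. The function names below are illustrative, and both divide by n, matching the formulas in these slides.

```python
from math import sqrt

def sd_definition(xs):
    """Root of the mean squared deviation from the mean (divides by n)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / n)

def sd_shortcut(xs):
    """Root of (mean of squares) minus (square of the mean)."""
    n = len(xs)
    xbar = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    # max(...) guards against a tiny negative value caused by rounding
    return sqrt(max(mean_sq - xbar ** 2, 0.0))

set_a = [48, 49, 51, 53, 45, 47, 55, 50, 51, 51]
set_b = [10, 65, 17, 89, 100, 40, 21, 99, 34, 25]

for name, data in (("A", set_a), ("B", set_b)):
    print(name, round(sd_definition(data), 2), round(sd_shortcut(data), 2))
```

Note that Python's `statistics.pstdev` divides by n like the formulas here, while `statistics.stdev` divides by n - 1, so the latter gives slightly larger values. The shortcut formula subtracts two large, nearly equal numbers, so it can lose precision in floating point when the mean is large relative to the spread.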


Interpreting the Standard Deviation

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

1 Begin with x_i − x̄:

First compute the sample mean x̄.
Second, for each data point x_i, record the difference between x_i and x̄.
x_i − x̄ measures how far a data point deviates from the mean.
Deviations may be positive or negative. However, the sum of the deviations is zero.

2 Proceed to (x_i − x̄)²:

Squaring the deviations makes them non-negative.

3 Proceed to $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$:

We now find the mean of the squared deviations. Do not confuse this mean with the sample mean x̄.

4 Finish by taking a square root, effectively undoing the square.


Properties of the Sample Standard Deviation

1 s measures spread about the mean. Among the measures of center, the standard deviation is tied only to the mean.

2 s = 0 only when there is no spread. This happens only when all the observations are identical. Otherwise s is positive. As observations become more spread out about their mean, s gets larger.

3 s, like the average x̄, is not robust. Distributions with outliers and strongly skewed distributions have large standard deviations.

4 s has the same unit of measurement as the original observations.

5 For bell-shaped distributions, s can be interpreted as the typical deviation of an observation from the mean.



Let us compute the standard deviation for the original data set: 63, 66, 69, 70, 75.

Step 1. Compute the average x̄ : x̄ = 68.6 (completed earlier).

Step 2. Compute the deviations: See column 2 below:

Observation   Deviation           Squared deviation
63            63 - 68.6 = -5.6    31.36
66            66 - 68.6 = -2.6    6.76
69            69 - 68.6 = 0.4     0.16
70            70 - 68.6 = 1.4     1.96
75            75 - 68.6 = 6.4     40.96

Sum           0                   81.2

Step 3. Square the deviations: See column 3 above.



Step 4. Average squared deviations:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{5}(31.36 + 6.76 + 0.16 + 1.96 + 40.96) = 16.24$$

Step 5. Take the square root of the averaged squared deviations:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{16.24} \approx 4.03$$

We have computed that the mean temperature is 68.6 degrees Fahrenheit, while the standard deviation among the data points is 4.03 degrees Fahrenheit.
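The five steps translate almost line for line into Python. The sketch below uses illustrative variable names and rounds only when printing.

```python
from math import sqrt

temps = [63, 66, 69, 70, 75]

n = len(temps)
xbar = sum(temps) / n                      # Step 1: average
deviations = [x - xbar for x in temps]     # Step 2: deviations from the average
squared = [d ** 2 for d in deviations]     # Step 3: squared deviations
mean_squared = sum(squared) / n            # Step 4: average squared deviation
s = sqrt(mean_squared)                     # Step 5: square root

print(round(xbar, 1))                      # 68.6
print([round(d, 1) for d in deviations])   # [-5.6, -2.6, 0.4, 1.4, 6.4]
print(round(mean_squared, 2))              # 16.24
print(round(s, 2))                         # 4.03
```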



Now let us recompute the standard deviation for the case when the last observation was recorded wrong (105 instead of 75). We have already calculated the mean for this case, x̄error = 74.6. Computing:

Observation   Deviation             Squared deviation
63            63 - 74.6 = -11.6     134.56
66            66 - 74.6 = -8.6      73.96
69            69 - 74.6 = -5.6      31.36
70            70 - 74.6 = -4.6      21.16
105           105 - 74.6 = 30.4     924.16

Sum           0                     1185.2


Examples

In this case, when the last observation is recorded wrong,

$$s_{error} = \sqrt{\frac{1185.2}{5}} \approx 15.40$$

which is almost 4 times greater than the standard deviation for the original data set.

Remember, the standard deviation, like the mean, is NOT robust.

Example 5 What is the standard deviation for the data

5, 5, 5, 5, 5, 5, 5, 5, 5, 5 ?
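Both cases can be checked numerically. This sketch relies on `statistics.pstdev`, which divides by n as the formula for s does here; the variable names are illustrative.

```python
from statistics import pstdev  # divides by n, matching the formula for s

original = [63, 66, 69, 70, 75]
with_error = [63, 66, 69, 70, 105]
constant = [5] * 10                      # the data from Example 5

s_original = pstdev(original)
s_error = pstdev(with_error)

print(round(s_error, 2))                 # 15.4
print(round(s_error / s_original, 2))    # how many times larger the spread became
print(pstdev(constant))                  # identical observations have no spread
```

The last line agrees with the property that s = 0 exactly when all the observations are identical.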


Visual Intuition

There are physically intuitive interpretations of the mean, median, and mode. Consider the outline of any density histogram. The peaks are the modes and are the easiest to spot.

The median is the first place where a vertical line splits the area under the density histogram equally.

The mean is the centre of gravity of the histogram, thought of as a physical mass.


Visual Intuition

To illustrate the point that the mean plays the role of center of gravity, and why it should differ from the median, consider two density histograms which have similar shapes but are still different:

[Figure: two density histograms, each with its mean marked.]

Although the median stays the same, the mean moves to the right as the second peak moves to the right!

Percentiles

The p-th percentile of the data is a value such that p percent of the observations fall at or below it.

The median is the 50th percentile.

The most commonly used percentiles other than the median are the quartiles.

The first quartile, Q1, is the 25th percentile, and the third quartile, Q3, is the 75th percentile. The second quartile, obviously, is the median itself.

NOTE: 50% of the observations are located between the quartiles Q1 and Q3.

Percentiles, and, in particular, quartiles are useful numerical characteristics of the data distribution. The quartiles are used to compute the interquartile range.


The Interquartile Range

The Interquartile Range, IQR, is the distance between the first and third quartiles:

IQR = Q3 − Q1.

To calculate the quartiles:

Arrange the observations in increasing order.

If the number of observations n is even, split the ordered data set into two equal halves. Find the first quartile Q1 as the median of the first half of the data set. Similarly, the third quartile Q3 is the median of the second half of the original data set.

If the number of observations n is odd, split the data set in two halves by excluding the central value (the median). After that find Q1 and Q3 as the medians of the corresponding halves.
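The split-into-halves rule above can be sketched in Python as follows. The function names are illustrative. Statistical software often uses other quartile conventions (for example, interpolation between observations), so library routines may return slightly different values than this method.

```python
def median(ordered):
    """Median of an already sorted list (average of the two center values if n is even)."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]            # first half
    upper = ordered[half + n % 2:]    # second half; skips the central value when n is odd
    return median(lower), median(upper)

def iqr(values):
    """Interquartile range Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

print(quartiles([71, 70, 68, 69, 68, 65, 72, 69, 71, 62]))  # (68, 71)
print(quartiles([63, 66, 69, 70, 75]))                      # (64.5, 72.5)
```

The two calls use the height data (n even) and the temperature data (n odd) from Examples 1 and 2.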


Example 1 (Heights)

The heights (in inches) of 10 students are given below:

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

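Notice that we are dealing with an even number of observations, so the ordered list splits into two halves of five:

62 65 68 68 69 | 69 70 71 71 72

Q1 = 68 (the median of the first half), Q3 = 71 (the median of the second half), and IQR = 71 − 68 = 3 inches.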


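Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour for 5 hours:

63, 66, 69, 70, and 75.

Notice that we are dealing with an odd number of observations, so the central value (the median, 69) is excluded before splitting:

63 66 | 69 | 70 75

Q1 = (63 + 66)/2 = 64.5, Q3 = (70 + 75)/2 = 72.5, and IQR = 72.5 − 64.5 = 8 degrees Fahrenheit.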


Summary of Center and Spread

For a data distribution using a quantitative variable we now have multiple ways to describe the center and spread:

Center:

Mean,

Median,

Mode.

Spread:

Standard Deviation,

Interquartile Range.


Linear Transformations

Suppose we have a set of numbers which we want to transform to another set of numbers which will have different units of measurement. We will consider only the linear transformations of variables, which have the following form:

xnew = a + bx,    (1)

where xnew is the variable in new units, x is the old variable, and a and b are numbers.

The key word is linear. Linear transformations graph as a straight line with y-intercept a and slope b.


Examples of Linear Transformations

The distance in kilometers is translated into distance in miles using the formula

M = 0.62K,    (2)

where M is the distance in miles, and K is the distance in kilometers. For instance, a 10-kilometer race covers 6.2 miles.

The temperature in degrees Fahrenheit is translated into temperature in degrees Celsius as

$$C = \frac{5}{9}(F - 32) = -\frac{160}{9} + \frac{5}{9}F, \qquad (3)$$

where C is the temperature in degrees Celsius, and F is the temperature in degrees Fahrenheit. For instance, 95° F translates into 35° C while −40° F translates into −40° C.
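Both conversions are instances of xnew = a + bx and can be written as one-line functions. The function names in this sketch are illustrative.

```python
def km_to_miles(km):
    """Equation (2): a = 0, b = 0.62."""
    return 0.62 * km

def fahrenheit_to_celsius(f):
    """Equation (3): a = -160/9, b = 5/9."""
    return (f - 32) * 5 / 9

print(km_to_miles(10))               # the 10-kilometer race, in miles
print(fahrenheit_to_celsius(95))     # 35.0
print(fahrenheit_to_celsius(-40))    # -40.0
```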


The Effect of a Linear Transformation.

How do the numerical measures of center and spread change after a linear transformation?

We will separately consider 2 special cases of linear transformation:

Data Shifts: a special case of the transformation (1) when b = 1:

xnew = x + a,

which corresponds to adding a constant a to every observation.

Scale Changes: a special case of the transformation (1) when a = 0, b ≠ 0:

xnew = bx,

which corresponds to multiplying each observation by a constant b (positive or negative). Transformation (2) is an example of a scale transformation.


Effects of a Shift Transformation

If we act upon a data set by a shift transformation:

xnew = x + a,

the resulting changes in center and spread are:

Mean

x̄new = x̄ + a

Median

x̃new = x̃ + a

Standard Deviation

snew = s

1st Quartile

Q1,new = Q1 + a

3rd Quartile

Q3,new = Q3 + a

Interquartile Range

IQRnew = IQR


Example (Football team)

Wrong Scale: Every player on a high school football team was weighed using the same scale. It was discovered later that the scale read 10 lb under, so we need to add 10 lb to every weight. What would happen to each of the following measurements?

Characteristic        Original   After Adjustment

Average               230 lb     240 lb

Median                240 lb     250 lb

Q1                    200 lb     210 lb

Q3                    280 lb     290 lb

Standard Deviation    50 lb      50 lb

IQR                   80 lb      80 lb


Effects of a Scale Transformation

If we act upon a data set by a scale transformation:

xnew = bx,

the resulting changes in center and spread are:


Mean

x̄new = bx̄

Median

x̃new = bx̃

Standard Deviation

snew = |b| · s

where | · | denotes absolute value.

1st Quartile

Q1,new = bQ1

3rd Quartile

Q3,new = bQ3

Interquartile Range

IQRnew = |b| · IQR


Example 7 (Football team)

Now suppose we found out that we are supposed to report weights and all the summary measures in kilograms, not in pounds! Recall that 1 lb ≈ 0.453 kilograms. What would happen to each of the following measurements?

Characteristic        Original   After Adjustment

Mean                  230 lb     104 kg

Median                240 lb     109 kg

Q1                    200 lb     91 kg

Q3                    280 lb     127 kg

Standard Deviation    50 lb      23 kg

IQR                   80 lb      36 kg
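The two football adjustments can be reproduced by applying the shift and scale rules directly to the summary measures. In this sketch both are treated as special cases of xnew = a + bx from equation (1); the dictionary layout and the function name `transform` are illustrative.

```python
LB_TO_KG = 0.453   # conversion factor used in the example

summary_lb = {
    "Mean": 230, "Median": 240, "Q1": 200, "Q3": 280,
    "Standard Deviation": 50, "IQR": 80,
}
SPREAD = {"Standard Deviation", "IQR"}

def transform(summary, a, b):
    """Apply xnew = a + b*x to a table of summary measures."""
    out = {}
    for name, value in summary.items():
        if name in SPREAD:
            out[name] = abs(b) * value      # spread ignores the shift a
        else:
            out[name] = a + b * value       # center and quartiles follow the data
    return out

shifted = transform(summary_lb, a=10, b=1)        # scale read 10 lb under
in_kg = transform(summary_lb, a=0, b=LB_TO_KG)    # pounds to kilograms

print(shifted)
print({name: round(value) for name, value in in_kg.items()})
```

Only the measures of location (mean, median, quartiles) respond to the shift; the standard deviation and IQR respond only to the scale factor.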


Effects of a General Linear Transformation

If we act upon a data set by any linear transformation:

xnew = bx + a,

the resulting changes in center and spread are:


Mean

x̄new = bx̄ + a

Median

x̃new = bx̃ + a

Standard Deviation

snew = |b| · s

where | · | denotes absolute value.

1st Quartile

Q1,new = bQ1 + a

3rd Quartile

Q3,new = bQ3 + a

Interquartile Range

IQRnew = |b| · IQR
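These rules can be spot-checked on raw data by transforming every observation and recomputing the summaries. The sketch below uses the temperature data; the helper name `rules_hold` and the second test transformation (a = 3, b = −2) are arbitrary choices for illustration.

```python
from math import isclose
from statistics import mean, median, pstdev

def rules_hold(data, a, b):
    """Check the mean, median, and standard deviation rules for xnew = a + b*x."""
    new = [a + b * x for x in data]
    return (
        isclose(mean(new), a + b * mean(data))
        and isclose(median(new), a + b * median(data))
        and isclose(pstdev(new), abs(b) * pstdev(data))
    )

temps_f = [63, 66, 69, 70, 75]

print(rules_hold(temps_f, a=-160 / 9, b=5 / 9))   # Fahrenheit to Celsius, equation (3)
print(rules_hold(temps_f, a=3, b=-2))             # a negative b reverses the order
```

The comparison uses `isclose` rather than `==` because the transformed values are floating-point numbers. A negative b is included to show why the spread rule needs |b| rather than b.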


The Empirical Rule

As a general rule, if the data distribution is unimodal, roughly symmetric, and bell-shaped, then:

Approximately 68% of the observations fall within one standard deviation of the average, i.e., approximately 68% of data values are between x̄ − s and x̄ + s.

Approximately 95% of the observations fall within 2 standard deviations of the average, i.e., approximately 95% of data values are between x̄ − 2s and x̄ + 2s.

Approximately 99.7% of the observations fall within 3 standard deviations of the average, i.e., approximately 99.7% of data values (almost all the observations) are between x̄ − 3s and x̄ + 3s.
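One way to get a feel for the rule is to count how many observations fall within k standard deviations of the average. The sketch below does this for simulated data, not real measurements: one bell-shaped sample and one heavily right-skewed sample, both generated with Python's `random` module. The function name `share_within`, the sample size, and the seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

def share_within(data, k):
    """Fraction of observations between mean - k*s and mean + k*s."""
    xbar, s = mean(data), pstdev(data)
    inside = sum(1 for x in data if xbar - k * s <= x <= xbar + k * s)
    return inside / len(data)

random.seed(0)                                              # reproducible illustration
bell = [random.gauss(0, 1) for _ in range(20_000)]          # simulated, bell-shaped
skewed = [random.expovariate(1.0) for _ in range(20_000)]   # simulated, right-skewed

for k in (1, 2, 3):
    print(k, round(share_within(bell, k), 3), round(share_within(skewed, k), 3))
```

For the bell-shaped sample the three fractions should land near 68%, 95%, and 99.7%. For the skewed sample the fraction within one standard deviation is noticeably higher than 68%, so the rule should not be relied on there.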


The Empirical Rule

Figure 1. Empirical Rule Illustration.


WARNING: The Empirical Rule Strikes Back

Don't throw caution (and, perhaps, lightsabers) to the wind while using the empirical rule! Keep in mind:

The Empirical Rule does NOT give the exact percentages of observations within one, two, or three standard deviations, just approximate percentages.

The Empirical Rule works well only for symmetric bell-shaped histograms.

The Empirical Rule will not be too far off for lightly skewed distributions, but it will be very wrong for moderately or heavily skewed distributions.


Example (HANES5 study)

HANES5 (The Health and Nutrition Examination Survey) was a study done in 2003-2004 recording the height (in inches) of women (see p. 58).

The mean height was x̄ = 63.5 inches and the standard deviation s = 3 inches.

The histogram (found on the next slide) is approximately symmetric and bell-shaped, so the Empirical Rule should hold.


Example (HANES5 study)

The shaded region corresponds to the women with height within 1 standard deviation of the average:

x̄ ± s = 63.5 ± 3 = (60.5, 66.5)

The true area of the shaded region is 72%, which is fairly close to 68%.

The shaded region corresponds to the women with height within 2 standard deviations of the average:

x̄ ± 2s = 63.5 ± 2 · 3 = (57.5, 69.5)

The true area of the shaded region is 97%, which is fairly close to 95%.

