Describing Distributions with Numbers Chapter 2

Upload: malcolm-harrington

Post on 21-Jan-2016


Page 1: Describing Distributions with Numbers Chapter 2. What we will do We are continuing our exploration of data. In the last chapter we graphically depicted

Describing Distributions with Numbers

Chapter 2

What we will do

• We are continuing our exploration of data.

• In the last chapter we graphically depicted data.

• Now we are going to look at how we can describe data using “summary” statistics.

• We will look at statistics that provide measures of central tendency.

• We will also look at statistics that provide measures of dispersion.

Sometimes Statistics are So Simple…

• Sometimes statistics are so simple we have to do something to make them look fancier than they are. Enter “The Mean”.

• The mean simply means taking the average of something.

• You all know how to do this. You add up the group, then you divide it by the number of items in the group.

But just to make sure you know I know what I am doing I have a formula

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
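As a minimal sketch of the mean (add up the group, then divide by the number of items), assuming a small made-up list of values; the numbers and variable names are illustrative only and do not come from the course data:

```python
# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 5, 10]

# Add up the group, then divide by the number of items in the group.
n = len(values)
mean = sum(values) / n

print(mean)  # 5.0
```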

We may talk about these formulas but…

• Don’t worry, we may talk about the formulas that mathematically describe statistics so you can get a better understanding of how they work.

• I might also hand calculate a few to demonstrate this.

• But no one today hand calculates real data.

• Neither should you; that is why we have software.

The Median

• The Median is the mid point of a distribution. Half the observations have values less than the median, half have values more.

• The formula looks like this:

$$M = \frac{N + 1}{2}$$

• Note the formula gives the location of the median (the observation which has a value equal to the median), not its value.
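A short sketch of how the location (N + 1) / 2 turns into a value. The function name and the sample numbers are hypothetical, and averaging the two middle observations when N is even is the usual convention, assumed here rather than stated on the slide:

```python
def median_by_location(values):
    """Find the median by locating position (N + 1) / 2 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position; ends in .5 when N is even

    if n % 2 == 1:
        # Odd N: the location points at a single observation.
        return ordered[int(location) - 1]

    # Even N: the location falls between two observations; average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median_by_location([7, 1, 5]))     # 5
print(median_by_location([7, 1, 5, 3]))  # 4.0
```

For N = 3 the location is 2, the second observation in sorted order. For N = 4 the location is 2.5, halfway between the second and third observations.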

Here is where Stem & Leaf Graphs can come in handy (N=20)

Mean and Median: which one?

• In general the Mean is more susceptible to distortion by

– abnormally large cases, in the language of the book a distribution skewed to the right

– or abnormally small cases, in the language of the book a distribution skewed to the left.

• For example, one Bill Gates among a thousand people will seriously distort the “Mean” income of this sample. However, it will have little or no impact on the “Median” Income
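To make the point concrete, here is a rough sketch using invented incomes (999 people at 50,000 and one extreme earner at one billion). The figures are assumptions for illustration only, not real data:

```python
from statistics import mean, median

# Hypothetical incomes: 999 ordinary earners plus one extreme case.
incomes = [50_000] * 999 + [1_000_000_000]

mean_income = mean(incomes)      # dragged far above any typical income
median_income = median(incomes)  # still the typical income

print(mean_income)
print(median_income)
```

With these made-up numbers the mean lands above one million even though 999 of the 1,000 people earn 50,000, while the median stays at 50,000.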

Level of Measure Matters Also

• You cannot take the mean of a categorical variable (one measured at the nominal or ordinal level).

• You can however calculate the median of a variable measured at the ordinal level.

• This is a good point to stop and remind you about the stupidity of machines.

• Unless the variables are tagged in the data set as to level of measure, your computer really won’t care and will happily chug along calculating even meaningless statistics such as the mean of your categorical variables.

One more

• The Mode is the measure of central tendency for nominal data. It is simply the category with the largest number of cases.
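A minimal sketch of finding the mode by counting cases per category. The category labels are invented for illustration:

```python
from collections import Counter

# Hypothetical nominal data: how six people travel to work.
transport = ["car", "bus", "car", "bike", "car", "bus"]

# Count the cases in each category and take the largest.
counts = Counter(transport)
mode, frequency = counts.most_common(1)[0]

print(mode, frequency)  # car 3
```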

If all we knew was how well the data clumped together…

• Even though the Median is less susceptible to distortion by an abnormally large or small case, it can still provide a very weak description of your data if the observations are widely dispersed.

• This is why we are often interested in the Quartiles

Just like the Median only smaller

• Quartiles are just like the Median only on a smaller scale. Instead of defining the mid point of the distribution they define the break-points between:

– The first quarter and the second quarter

– The second quarter and the third quarter (which is the Median, by the way)

– The third quarter and the fourth quarter

The Five-Number Summary

• Moore is very big on the use of the five-number summary to summarily describe data.

• Minimum value

• Q1

• M

• Q3

• Maximum value
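The five numbers can be sketched in a few lines. This assumes one common textbook convention, taking Q1 and Q3 as the medians of the observations below and above the median's location; statistical packages use several slightly different quartile rules, so their output may not match exactly. The function name and data are hypothetical:

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) for two or more observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations below the median's location
    upper = ordered[half + n % 2:]  # observations above it (skips M when N is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


data = [1, 3, 4, 6, 7, 9, 12]  # hypothetical values
print(five_number_summary(data))  # (1, 3, 6, 9, 12)
```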

You can graphically depict this with a box plot

• Fortunately all the computer programs we are employing can easily generate both the numerical summary and the accompanying box plots

• SPSS can generate all this and more using its “Frequencies” and “Explore” commands. Excel does the job just as nicely.

Here is an example of an SPSS Box plot for before tax income for men and women in Ontario from the Survey of Household Spending

• Notice on the previous slide how the distances from the first quartile to the median and from the median to the third quartile are not necessarily symmetrical, and that the whiskers on the box plot are also not symmetrical. This is an indication of skew.

• Unlike the example in the book, my whiskers indicate not the maximum and minimum values but percentiles.

Here is the five number summary for Men and Women

Spotting outliers

• Obviously our box plots provide an excellent way to spot outliers.

• A statistic that can also help is the “interquartile range”. This is just the range between quartile one and three.

• When an observation lies more than 1.5 times the interquartile range above quartile three or below quartile one, it is often considered to be an outlier.
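A rough sketch of the 1.5 × IQR rule. It assumes quartiles are computed as the medians of the lower and upper halves of the sorted data (other quartile conventions shift the fences slightly), and the function name and data are invented for illustration:

```python
from statistics import median


def iqr_outliers(values):
    """Return observations more than 1.5 * IQR below Q1 or above Q3."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2:])
    iqr = q3 - q1                 # interquartile range
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]


data = [1, 3, 4, 6, 7, 9, 40]  # hypothetical values with one extreme case
print(iqr_outliers(data))  # [40]
```

Here Q1 = 3 and Q3 = 9, so the interquartile range is 6 and the fences sit at -6 and 18; only 40 falls outside them.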

While I used ratio level data…

• While I used ratio level data for my example of the five-number summary, it should be noted that there is nothing here (quartiles, Median, maximum, minimum value) that would not work with data measured at the interval or ordinal level

Range

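• Along with quartiles (which work when data are measured at least at the ordinal level) we must also remember to look at the “Range”, the span from the minimum value to the maximum value. Like the quartiles, it needs data that can at least be put in order, so it works at the ordinal level and above but not at the nominal level.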

Standard Deviation

• The best way to describe Standard Deviation (notation $S$) is that it is the square root of Variance (notation $S^2$).

• So why do you need variance? A bit of math if you look at the formula in your book.

The Formula for $S^2$

• Variance is the sum of the squared distances of each observation from the mean, divided by n − 1 (n − 1 being the degrees of freedom).

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

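The Formula for $S^2$ involves a squaring

• We have to square these distances because otherwise the positive and negative distances would cancel each other out (the distances from the mean always sum to zero, whatever the shape of the distribution) and there would appear to be no variance.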

• The problem with variance is all that squaring produces numbers that are very large and not too intuitive to read on their own (though you will see later that variance is an important tool and even a building block for other things).
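A small sketch of why the squaring is needed, using invented values chosen so the arithmetic is exact. The raw distances from the mean cancel, while the squared distances do not:

```python
# Hypothetical values, deliberately lopsided.
values = [0, 1, 2, 3, 14]

mean = sum(values) / len(values)         # 4.0
deviations = [x - mean for x in values]  # -4, -3, -2, -1, 10
squared = [d ** 2 for d in deviations]   # 16, 9, 4, 1, 100

sum_of_deviations = sum(deviations)      # positives and negatives cancel
sum_of_squares = sum(squared)            # nothing cancels after squaring

print(sum_of_deviations)  # 0.0
print(sum_of_squares)     # 130.0
```

The sum of squares is also in squared units and much larger than any of the original values, which is the readability problem the square root addresses.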

• Taking the square root produces a much more usable number (S).

• Quite simply, when you know $\bar{X}$ and $S$, you can go up and down a list of numbers and figure out which list is more concentrated about its mean, which is more diffuse, and which are similar.

If you want a quick example

Frequency   Value
1           0
1           1
1           2
1           3
1           4
1           5
1           6
1           7
1           8
1           9
1           10
N = 11      ∑ = 55

Mean = 5, $S^2$ = 11, $S$ = 3.3

Frequency   Value
1           0
1           2
1           4
1           6
1           8
1           10
1           12
1           14
1           16
1           18
1           20
N = 11      ∑ = 110

Mean = 10, $S^2$ = 44, $S$ = 6.6
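The figures in the quick example can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by N − 1. The variable names below are my own:

```python
from statistics import mean, stdev, variance

first = list(range(0, 11))      # 0, 1, 2, ..., 10
second = list(range(0, 21, 2))  # 0, 2, 4, ..., 20

for values in (first, second):
    print(
        len(values),                # N
        sum(values),                # sum of the values
        mean(values),               # mean
        variance(values),           # sample variance (N - 1 denominator)
        round(stdev(values), 1),    # standard deviation, one decimal place
    )
```

Doubling every value doubles the mean and the standard deviation but quadruples the variance, which is why the second list shows 44 against 11.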

But once again, keep in mind…

If the mean is susceptible to distortion from extreme values, S is doubly so due to all those squarings.
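As a hedged illustration, take the values 0 through 10 and swap the largest for a single invented extreme case of 100, then compare how much the mean and S each grow:

```python
from statistics import mean, stdev

base = list(range(0, 11))         # 0, 1, ..., 10
with_extreme = base[:-1] + [100]  # replace 10 with a hypothetical extreme value

# How many times larger each statistic becomes once the extreme case is added.
mean_growth = mean(with_extreme) / mean(base)
s_growth = stdev(with_extreme) / stdev(base)

print(round(mean_growth, 1))
print(round(s_growth, 1))
```

In this made-up case the standard deviation grows by a much larger factor than the mean does.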