statistics - loudoun county public schools · statistics part i: introduction to statistics...

Statistics

Part I: Introduction to Statistics

Statistics can be used as a tool to help demystify data. Everyday examples of the use of statistics include

election polls, market research, exercise regimes, and drug abuse surveys. Statistics are a means of

organizing and analyzing data (numbers) systematically so that they have meaning. In fact, that might be

a good working definition for statistics – giving meaning to numbers.

Frequency Distribution.

A group of numbers has little meaning until it is organized. Frequency distributions and graphs allow us

to visually interpret sets of data and look at data in a meaningful way.

An organized list enables us to see clusters or patterns in data that would be less obvious in an

unorganized list. Scores are listed in ascending or descending order so that groupings of same scores can

be recognized. For example, scores on an exam might include:

91, 92, 87, 99, 83, 84, 82, 93, 89, 91, 85, 94, 91, 98, 90

Frequency Distribution of ungrouped scores (frequency counts of zero are always included in the table, as shown):

Score   Frequency
99      1
98      1
97      0
96      0
95      0
94      1
93      1
92      1
91      3
90      1
89      1
88      0
87      1
86      0
85      1
84      1
83      1
82      1
81      0
80      0

N = 15 (N represents the total number of observations or scores.)

Grouped Frequency Distribution of the same scores

Interval   Frequency
95 – 99    2
90 – 94    7
85 – 89    3
80 – 84    3

N = 15

The width of the intervals in grouped frequency tables must be equal, and there should be no overlap between the intervals.

The grouped distribution shows that the scores tend to cluster in the 90 – 94 range. The class can then see

how their own score compares in relation to the groups. This clustering is not apparent from the original

random listing of test scores.
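As a rough illustration, both tables can be rebuilt from the raw list of exam scores with a few lines of Python. This is only a sketch: the variable names are ours, and the interval width of 5 and the 80 to 99 span are copied from the tables above rather than derived from any general rule.

```python
from collections import Counter

scores = [91, 92, 87, 99, 83, 84, 82, 93, 89, 91, 85, 94, 91, 98, 90]

# Ungrouped distribution: every score from 99 down to 80, zeros included.
counts = Counter(scores)
ungrouped = {score: counts[score] for score in range(99, 79, -1)}

# Grouped distribution: equal-width, non-overlapping intervals of 5 points.
width = 5
grouped = {}
for low in range(95, 79, -width):
    high = low + width - 1
    grouped[(low, high)] = sum(1 for s in scores if low <= s <= high)

for (low, high), freq in grouped.items():
    print(f"{low} - {high}: {freq}")
print("N =", sum(grouped.values()))
```

Changing `width` (and the starting value of the loop) is a quick way to try out other groupings of the same scores.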

Part II: Graphs

Graphs allow us to quickly summarize the data collected. At a glance we can attain some level of meaning from the numbers.

Pie Chart

A pie chart is a circle within which all of the data points or numbers are contained in the form

of percentages. Suppose that you are a car dealer and are interested in knowing what

percentage of cars you should order in each color. A pie chart divided by the percentages of

each color would be a quick visual representation of the data from the previous year’s car sales.

Bar Graph

A common way to represent nominal data is a bar graph. The height of each bar indicates the percentage or frequency of its category.

Line Graph

Line graphs indicate changes that occur during experiments. They show how the dependent variable changes with the independent variable. Perhaps you are investigating the amount of change in SAT scores between groups retaking the test: those who take an SAT prep course and those who do not. By convention, the independent variable (the two groups in this case) goes on the horizontal axis and the dependent variable (the SAT scores) goes on the vertical axis.

Frequency Polygon

This line graph has the same vertical and horizontal labels as the histogram. However, each score’s

frequency of occurrence is marked with a point of the graph, and then all points are connected with a line.

More than one distribution of scores can be plotted on the same polygon for visual comparisons. For

instance, if you wanted to compare sales of red cars with sales of white cars between the years of 1985

and 1995, a frequency polygon would be an easy way to visually compare the data from sales of each

color.

The frequency polygon is especially useful in showing the asymmetry of data. When the data is not evenly distributed, it is referred to as a skewed distribution and the graph is not symmetrical.

A negatively skewed graph shows a high frequency (or clustering) of data on the high end with a few data points on the low end. Skew is always named for the direction of the “tail”, the end of the graph with a low frequency of occurrences; here the tail points toward the low end.

A positively skewed graph would be indicated by a high frequency (or clustering) of data on the low

end with a few data points on the high end.

Activity 1: Graphing

Students will draw a graph of the Grouped Frequency Distribution of the sample grades above using a

Frequency Polygon. Make sure that the grouped scores are listed on the horizontal (or X) axis and the

frequencies on the vertical (or Y) axis.

Students will then create a second Grouped Frequency Distribution of the scores, this time using different groupings. You will again use four groupings, but group them differently than how they are grouped above. You will then draw a graph of the second Grouped Frequency Distribution.

Are the two graphs skewed in any way? If so, how?

End product

1. Second method of Grouped Frequency Distribution of the scores

2. Two graphs (one of the original Grouped Frequency Distribution above, and one of the Grouped

Frequency Distribution that the student regrouped in another manner)

3. Identification of skew in graphs

Remember when making a graph the increments should go from low to high, away from the axis.

Part III: Correlation

Background & Independent and Dependent Variables

Correlation describes the relationship between two variables. The result of a psychological study may

reveal that one trait or behavior accompanies another. In other words: when one trait is present, another

trait is also present. When this occurs, we say that there is a correlation between the two traits.

Example #1: How is studying related to grades?

One variable may be measured by the number of hours a student studied for an exam, and the other

variable by the score on the exam.

The “hours studied” variable is treated as the independent variable (x) because it is presumed to influence the “exam grade” variable, which is treated as the dependent variable (y). The independent variable is the one thought to drive change in the dependent variable, although a correlation by itself cannot prove that it does.

If the findings show that the more a student studied, the higher the grade the student received on the

exam, then this would be an example of a positive correlation. A positive correlation indicates a direct

relationship between variables: an increase in one variable is accompanied by an increase in another

variable or a decrease in one variable is accompanied by a decrease in another variable.

Example #2: How is television viewing related to grades?

In this example, television could be measured by asking an individual to calculate the average number of

hours a day he or she watches. Grades could be measured by the grade point average (GPA) obtained at

the end of the school year. It is quite likely that such a study would show that as the amount of television

viewing increases, the GPA decreases. If that were the case, this would be an example of a negative

correlation. A negative correlation indicates an inverse relationship between variables: an increase in

one variable is accompanied by a decrease in another variable, or a decrease in one variable is

accompanied by an increase in another variable.

Correlation Coefficients

Correlations are measured with numbers ranging from -1.0 to +1.0. These numbers are called correlation

coefficients. The correlation coefficient is a statistical measure of relationship: it reveals how closely two

things vary together and thus how well either one predicts the other. A correlation coefficient is

represented mathematically as the Greek letter rho – or a small r.

A coefficient of zero indicates no correlation between the variables. In other words, there is no linear relationship between the two variables.

As the number moves closer to +1.0, the positive correlation grows stronger. As the number moves closer to -1.0, the negative correlation grows stronger.

+1.0 and -1.0 are perfect correlations while 0 is no correlation.

Perfect correlations rarely occur in the real world. For example, the correlation between studying and

grades is less than a perfect one. Not all students who study many hours always get the highest grades,

and sometimes students who claim to have studied very little get very high grades. Therefore, the

correlation between studying and grades will be less than 1.0.

Thus correlations of +.79 and -.79 are both strong correlations, with

the first being a strong positive correlation and the second a strong

negative correlation. Therefore, -.88 is a stronger correlation than .45

and .75 is a stronger correlation than -.22. It is the absolute value of the

number that indicates the strength of the correlation. The sign (+ or -)

indicates the direction of the relationship between the variables.
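The handout does not say how r is computed. One common version, the Pearson coefficient, can be sketched as below. The function name and the tiny four-point data sets are made up purely to show the two perfect cases; they are not from any study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the two means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: y rises in lockstep with x, then falls in lockstep.
hours = [1, 2, 3, 4]
rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]

r_up = pearson_r(hours, rising)     # perfect positive correlation
r_down = pearson_r(hours, falling)  # perfect negative correlation
print(r_up, r_down)
```

Real data will almost never line up this neatly, so the result will land somewhere between the two extremes.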

Scatter Plots

Scatter plots give a visual representation of correlations: the x variable (the independent variable) is plotted on the horizontal axis and the y variable (the dependent variable) on the vertical axis.

Though the correlation coefficient might be very high or strong, this does not imply that one variable

causes the other. Correlation does not imply causation. This point often confuses people. Though there

may be causation, the correlation research process is not one of proof. It is rather a process that indicates

that when one variable exists, the other variable will also be present or affected. For example, there is a

positive correlation between wearing glasses and IQ. However, wearing glasses does not cause a person

to have a high IQ. Nor does having a high IQ cause a person to have to wear glasses. However, there is a

third factor – the amount a person reads – that can influence both wearing glasses and IQ. The takeaway is that, despite a strong correlation coefficient, correlation does not equal causation!

Graph #1: r = .992558

Graph #2: r = .958648

Graph #3: r = .8057

The above three graphs show examples where correlation does not equal causation. Each shows a high correlation between two factors that common sense would tell you have no connection with each other.

Correlation is useful in showing a connection between two factors that was not previously observed. If a high correlation (either positive or negative) is observed, it will generally be investigated to see if one factor is, in fact, causing the other. In the above cases, however, little investigation is needed to show that there is no real connection between the two factors.

Basically, if there was a cause and effect (but there isn't), you'd be saying:

Divorce causes people to eat more margarine (at least in Maine!): or, eating margarine causes people

to get divorced (at least in Maine), for Graph #1

Eating mozzarella cheese causes people to get their Ph.D. in civil engineering: or, getting a Ph.D. in

civil engineering causes you to eat more mozzarella cheese, for Graph #2

The more people die from spider bites, the longer they make the winning words in the Scripps Spelling Bee: or, the longer the winning words are in the Scripps Spelling Bee, the more spiders are compelled to kill people, for Graph #3.

Obviously none of the above is true, which shows the danger of believing there to be causation whenever there is a strong correlation.


Activity 2: Correlation

Students will indicate if the below pairs of Independent and Dependent Variables would represent a

positive or a negative correlation.

    Independent Variable                 Dependent Variable
4.  Size of television                   Selling price of television
5.  Number of absences from school       GPA
6.  Number of hours working              Amount of free time
7.  Level of advertising                 Amount of sales
8.  Size of sports team payroll          Number of games won

Students will plot the below data on grade point averages and the number of hours of television watched

per week.

GPA   TV hours     GPA   TV hours     GPA   TV hours
3.9   10           2.5   8            2.7   30
3.2   15           3.5   10           3.0   20
2.1   44           4.0   6            3.7   12
1.5   39           3.8   7            2.4   33
1.8   35           3.6   9            2.2   25
2.5   22           2.9   18           4.0   30

9. Students will then describe the correlation that is indicated by the plotted data (both direction and

strength)

10. Is there a correlation between the number of hours of TV a student watches a week and his/her GPA?

11. Does this mean that we can say that TV watching has an effect on GPA? Why or why not?

12. What other factors might be involved?

End product

Answers to questions 4 - 8

Scatter plot

Answers to questions 9 – 12

The Purpose of Correlation & Illusory Correlation

Statistics can help us see what the naked eye sometimes misses. Wondering

if tall people are more or less easygoing, you collect two sets of data: scores

of men’s heights and men’s temperaments. You measure the heights of 20

men and have someone else independently assess their temperaments (from zero for extremely calm to 100 for highly reactive).

With all the relevant data right in front of you, you can tell whether there is:

1) a positive correlation between height and reactive temperament

2) very little or no correlation

3) a negative correlation

Comparing the columns in Table 1.1, most people detect very little

relationship between height and temperament. In fact, the correlation in this

imaginary example is moderately positive, +.63, as we can see if we display

the data as a scatter plot. In the below scatter plot, the upward, oval shaped

slope of the cluster of points as one moves to the right shows that our two

imaginary sets of scores (height and reactivity) tend to rise together.

If we fail to see the relationship when data are presented as systematically as in Table 1.1, how much less likely are we to notice it in everyday life? To see what is in front of us, we sometimes need statistical illumination.

Just as correlation can show that there are relationships between two factors that aren’t immediately obvious, correlational research also restrains us from “seeing” relationships that do not actually exist. A perceived but nonexistent correlation is an illusory correlation. Such illusory thinking helps explain why for so many

years people believed (and still believe) that sugar made children hyperactive, that getting cold and wet

caused one to catch a cold, and that weather changes trigger arthritis pain. For the last one, researchers

recorded both the patients’ pain reports and the daily weather – temperature, humidity, and barometric

pressure. Despite patients’ beliefs, the weather was uncorrelated with their discomfort, either on the same

day or up to two days earlier or later.

Part IV: Measures of Central Tendency

These are numbers that attempt to describe the “typical” or “average” score in a distribution. There are three measures of central tendency: mode, median, and mean. Which of these is used depends on the situation.

Mode

The mode is the most frequently occurring score in a set of scores. If two different scores occur most

frequently, then it is a bimodal distribution.

Median

The median is the score that falls in the middle when the scores are ranked in ascending or descending

order. Half the numbers will be below that number and half the numbers will be above it. The median

score is the best indicator of central tendency when there is a skew, because the median score is

unaffected by extreme scores, sometimes referred to as outliers.

Mean

This is the arithmetic average of a set of scores. This is the score used by teachers to indicate your

semester grade. The mean is calculated by dividing the sum of all the scores by the total number of

scores. The mean is always pulled in the direction of extreme scores – the mean is pulled toward any

skew in the distribution.

Example

Below are listed 5 temperatures from each of the past two weeks. If we wanted to know what the temperatures during each of these weeks were MOST like, the median would be the best indicator.

Week 1: 71 74 76 79 98    mean = 79.6    median = 76
Week 2: 70 74 76 77 78    mean = 75      median = 76

In the above example, the mean is affected by extreme scores. The mean for week 1 (79.6) falls between the 4th and 5th scores, indicating a higher daily temperature than actually happened on most days. In this case, the median of 76 is far more indicative of the week’s daily temperature.
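The pull of one extreme score on the mean can be checked quickly with Python's standard `statistics` module. This is just a sketch using the two weeks of temperatures from the example; the variable names are ours.

```python
from statistics import mean, median

week1 = [71, 74, 76, 79, 98]  # one unusually hot day
week2 = [70, 74, 76, 77, 78]  # no extreme days

for name, temps in (("Week 1", week1), ("Week 2", week2)):
    print(name, "mean =", mean(temps), "median =", median(temps))

# How far the single hot day drags week 1's mean above its median.
pull = mean(week1) - median(week1)
```

The median is the same for both weeks, while the mean shifts by several degrees because of the single 98.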

Which Measure to Use

Measures of central tendency can be misleading. Suppose your mother has planned a family reunion on Sunday when you and other family members have other things to do. Your family protests, saying that they don’t want to spend the day with a bunch of “old fogies.” Your mother attempts to convince each family member separately that the reunion won’t be so bad. Mom says to your younger sister that the average age is 10, tells you that the average age is 18, and tells your dad that the average age is 36. Now each family member feels better about spending the day at the family gathering, as long as they don’t talk to each other and figure out that something must be amiss, since the “average” age each is expecting is different. In fact, your mother can claim that she told each of you the truth, but she was deceptive in that she used a different measure of “average” for each of you. She used the mode for your sister, the median for you, and the mean for your dad. The ages of those who are expected to attend are shown in the following table:

Age in years Name/relation

3 Cousin Susie

7 Cousin Joey

10 Twin Shanda

10 Twin Wanda

15 Cousin Marty

17 Cousin Juan

18 Cousin Pat

44 Aunt Harriet

49 Uncle Stewart

58 Aunt Rose

59 Uncle Don

82 Grandma Faye

96 Great Aunt Lucy

What would have helped you to tell the whole story accurately? If the family members had known the value and meaning of each measure of central tendency, they would have figured out the mother’s scheme. Knowing that the median is 18 and the mean is 36 would immediately indicate that there is a skew in the ages, or that there are some significantly older relatives coming to the gathering. One can sometimes choose the measure of central tendency that comes closest to supporting a particular viewpoint or bias. You must ask yourself, “What measure of average is being used?”
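Mom's three “averages” can be reproduced from the table of ages. The sketch below uses Python's standard `statistics` module; the variable names are ours.

```python
from statistics import mean, median, mode

ages = [3, 7, 10, 10, 15, 17, 18, 44, 49, 58, 59, 82, 96]

age_mode = mode(ages)      # most frequent age (what your sister was told)
age_median = median(ages)  # middle age of the 13 guests (what you were told)
age_mean = mean(ages)      # arithmetic average (what Dad was told)

print("mode:", age_mode, "median:", age_median, "mean:", age_mean)
```

Seeing all three numbers side by side is what exposes the skew: the mean sits at twice the median because a handful of much older relatives pull it upward.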

Activity 3: Calculating Central Tendency

13. Students will calculate the mode, median, and mean for the heights of an AP Psychology class (data

below)

The students will then graph the data using:

14. Bar Graph

15. Frequency Polygon

End product

Mean, median, & mode of data

Two graphs

Note: You will not need to know how to calculate the median for AP Psychology. For this activity, use outside resources to find the formula and determine the median of the data. For the class, you just need to know the differences among the three measures of central tendency and when you would use each one.

Student  Height    Student  Height    Student  Height
A        5’ 2”     I        5’ 8”     Q        5’ 9”
B        6’ 0”     J        5’ 6”     R        5’ 6”
C        5’ 6”     K        5’ 7”     S        5’ 9”
D        5’ 10”    L        5’ 8”     T        5’ 2”
E        5’ 3”     M        5’ 8”     U        5’ 4”
F        5’ 7”     N        6’ 1”     V        4’ 11”
G        5’ 6”     O        5’ 5”
H        5’ 4”     P        5’ 11”

Suggestion: It will be much easier if you convert the heights to inches.

Part V: Measures of Variation

Sometimes referred to as “Measures of Variability”

Measures of variation indicate how much spread there is in a distribution. If you collect data on the ages

of those students in the 11th grade, there will be very little variability. However, if you collected data on

the shoe sizes of those same students, there would be a great deal of variability. The amount of variability

within or between groups has an impact on the research process itself. So it is important to have measures of variability.

Range

The range is the difference between the lowest and highest scores. This is a gross indicator of variance within a group of scores. The range can be increased significantly by a single outlying score. For instance, test scores from two different classes of the same course might be:

Class one: 94, 92, 85, 81, 80, 73, 62    range = 32
Class two: 85, 83, 82, 81, 80, 79, 77    range = 8

For each of the classes, the mean is 81 and the median is 81. Overall however, the classes did not perform

the same. The scores in class one are much more spread out than in class two. This is indicated by the

difference in the ranges of the two data sets. The scores for the second class tend to cluster closer to the

mean.
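The comparison of the two classes can be checked with a short sketch. The helper name `score_range` is ours; the scores are the ones listed for the two classes.

```python
from statistics import mean, median

class_one = [94, 92, 85, 81, 80, 73, 62]
class_two = [85, 83, 82, 81, 80, 79, 77]

def score_range(scores):
    """Difference between the highest and lowest score."""
    return max(scores) - min(scores)

for name, scores in (("Class one", class_one), ("Class two", class_two)):
    print(name, "mean =", mean(scores), "median =", median(scores),
          "range =", score_range(scores))
```

The mean and median come out identical for both classes, so only the range reveals that class one is far more spread out.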

Variance

This is a measure of how different the scores are from each other. The difference between the scores is

measured by the distance of each score from the mean of all the scores. In calculating the variance, each

score has an impact on the overall variance. The variance is how much spread there is overall among the

scores. Interval and ratio data are necessary for calculating variance. Note that variance is reported in

squared units rather than the units in which the data was originally collected.

To calculate variance:
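The formula that followed this line did not survive in this copy. A standard form is sketched below. It assumes the scores are treated as the whole group of interest (dividing by N rather than N − 1), which is the version that reproduces the Standard Deviation figures quoted for the bowling scores later in this part. The symbols are our notation: x_i is each score, x̄ is the mean of the scores, and N is the number of scores.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2
```

In words: subtract the mean from each score, square each difference, add up the squares, and divide by the number of scores.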

Standard Deviation

This measure of variability is also based on how different the scores are from each other. Calculating the

variance is the first step in determining the standard deviation. However, remember that the variance is

reported in squared units rather than the original units. The standard deviation is the square root of the

variance of the scores. This puts the variability measure in the same units as the original data. So if you

wanted to know the variability of scores on a test, the standard deviation would tell you how much the

scores varied in terms of test points rather than squared test points. It is more meaningful.

Activity 4: Calculating and Understanding Standard Deviation/Variance

Students will calculate the standard deviation of the heights of an AP Psychology class (use the same data

set as the previous activity)

Now assume that the students are all standing on chairs, which adds one foot to their heights. Calculate

the standard deviation of the heights of the class with the students standing on their chairs.

Has the standard deviation changed from the first set of data to the second? Explain why.

End product

16. Standard deviation of the first set of data

17. Standard deviation of the second set of data

18. Explanation of connection between the standard deviation of the two sets of data.

There are two ways to think about Standard Deviation:

1. Consistency. Generally, the more consistent the numbers, the lower the Standard Deviation.

2. The average distance from the average. The closer the scores are to the average, the lower the

Standard Deviation. The further away the scores are from the average, the higher the Standard

Deviation.

Take a look at the comparison of the bowling scores from ten games played by two people, Anna and Sally. The highest possible score is 300.

Game #               Anna     Sally
1                    250      290
2                    275      190
3                    280      210
4                    270      260
5                    260      285
6                    260      275
7                    270      230
8                    275      200
9                    273      295
10                   263      200

Mean                 267.6    243.5
Median               270      245
Standard deviation   8.7      39.7
Max score            280      295
Minimum score        250      190
Range                30       105

Looking at Anna’s scores: her mean score is 267.6 with a Standard Deviation of 8.7, which means that the majority of her scores should fall between 258.9 and 276.3 (the mean minus the Standard Deviation and the mean plus the Standard Deviation). And if you look at the numbers, that is the case: eight of her ten scores fall in that band. The more the numbers cluster around the mean/median (the more consistent the scores), the lower the Standard Deviation. Also, generally if you have a low Standard Deviation, the range also tends to be small.

Looking at Sally’s scores: her mean score is 243.5 with a Standard Deviation of 39.7, so the band of one Standard Deviation around her mean runs from 203.8 to 283.2, about four and a half times as wide as Anna’s. The more the numbers are spread out from the mean/median (the less consistent the scores), the higher the Standard Deviation. And the range is much higher than Anna’s.
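The Standard Deviation figures in the table can be reproduced with a short sketch. It assumes the table used the “divide by N” (population) form, which is what `statistics.pstdev` computes; the helper `within_one_sd` is a name of our own.

```python
from statistics import mean, pstdev

anna = [250, 275, 280, 270, 260, 260, 270, 275, 273, 263]
sally = [290, 190, 210, 260, 285, 275, 230, 200, 295, 200]

def within_one_sd(scores):
    """Count the scores lying within one standard deviation of the mean."""
    centre = mean(scores)
    spread = pstdev(scores)  # population form: divide by N, not N - 1
    return sum(1 for s in scores if abs(s - centre) <= spread)

for name, scores in (("Anna", anna), ("Sally", sally)):
    print(name, "SD =", round(pstdev(scores), 1),
          "scores within one SD of the mean:", within_one_sd(scores))
```

Eight of Anna's ten games land within one Standard Deviation of her mean, but only four of Sally's do. Her scores bunch at both the low and the high end, which is another way of seeing how inconsistent they are.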

Part V: The Normal Curve

Going back to graphing, we looked at what happens when a graph is asymmetrically distributed, such as having a left or a right skew. But what if the data is symmetrical?

When the data is symmetrical and grouped around a central value, with no bias left or right, it is said to have a Normal Distribution.

If scores distributed in a “normal manner” are graphed, the data is symmetrical about the center and the results typically form a bell-shaped curve. The resulting curve is called a Bell curve or Normal curve.

Results fall mostly in the middle, with fewer and fewer scores at the extremes, either above or below the median.

With Normal Distribution: Mean = median = mode (or very, very close)

50% of values less than the mean

50% greater than the mean

What type of data results in a normal curve?

Data with any sort of natural variation often comes out close to a Normal Curve, such as people’s heights, weights, IQs, errors in measurements, etc. The Normal Curve is used a lot in the natural sciences, social sciences, and statistics.

Example

Often when you’re doing a lab in science, you run multiple trials. Rarely would you get the same results each time. The results might vary for various reasons: user error, contamination of the material, faulty equipment, etc.

Let’s take a simple example, determining the boiling point of water in Celsius. You run 20 trials and

receive the below results:

Trial   Temp (Celsius)     Trial   Temp (Celsius)
1       99.8               11      101.0
2       100.2              12      99.9
3       100.3              13      100.2
4       99.5               14      100.3
5       100.6              15      99.8
6       100.2              16      99.7
7       100.8              17      100.4
8       99.8               18      100.6
9       99.5               19      99.9
10      100.2              20      100.2

Mean                 100.1
Median               100.2
Mode                 100.2
Standard deviation   0.4
Range                1.5

The resulting histogram will look like a Normal Curve. Keep in mind, in nature, rarely do Bell-Curves

look exactly like a bell.
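The summary figures for the 20 trials, and a rough text version of the histogram, can be produced with the sketch below. The names are ours, values are rounded to one decimal place to match the table, and the Standard Deviation uses the “divide by N” form.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev

temps = [99.8, 100.2, 100.3, 99.5, 100.6, 100.2, 100.8, 99.8, 99.5, 100.2,
         101.0, 99.9, 100.2, 100.3, 99.8, 99.7, 100.4, 100.6, 99.9, 100.2]

summary = {
    "mean": round(mean(temps), 1),
    "median": round(median(temps), 1),
    "mode": mode(temps),
    "standard deviation": round(pstdev(temps), 1),
    "range": round(max(temps) - min(temps), 1),
}
print(summary)

# Text histogram: one row per observed temperature, one star per trial.
tally = Counter(temps)
for temp in sorted(tally):
    print(f"{temp:6.1f} {'*' * tally[temp]}")
```

With only 20 trials the stars form a lumpy shape rather than a smooth bell, which is what real measurements usually look like.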

Activity 5: The Normal Curve

During the month of October, you record how long it takes you to drive to school each morning. Let’s

call it from the time you get in your car until the time you park your car in a spot. It should take you 25

minutes, but some mornings it might take you shorter (all the traffic lights are green, you find a spot right

away, etc.) and some mornings it takes you more time (you get behind a school bus, a family of ducks is

crossing the road, you have trouble finding a spot, etc.).

The results of how long it takes you are below:

Date                      Time in minutes
Monday, October 03        23
Tuesday, October 04       21
Wednesday, October 05     25
Thursday, October 06      23
Friday, October 07        24















[Figure: “Boiling Point of Water in Celsius”]

19: Students will calculate the mean of the data

20: Students will calculate the median of the data

21: Students will calculate the mode of the data

22: Students will calculate the standard deviation of the data

23: Students will calculate the range of the data

24: Students will create a Normal Curve using the above data

Part VI: The Empirical Rule

With Normal Distribution, we can expect the following:

68% of the values will fall within one standard deviation of the mean

95% of the values will fall within two standard deviations of the mean

99.7% of the values will fall within three standard deviations of the mean

This is referred to as the Empirical Rule.

For the most part, we only deal with one Standard Deviation.

Things to keep in mind regarding the Empirical Rule:

The data has to have a Normal Distribution.

With Normal Distribution: mean = median = mode (or very close).

If 68% falls within one Standard Deviation, what percentage falls within one Standard Deviation below the mean? Because the median is the point where half the numbers fall below it and half fall above it, and the mean and median coincide in a Normal Distribution, the 68% splits evenly: 34% will be between the mean and one Standard Deviation below it, and 34% will be between the mean and one Standard Deviation above it.

What percentage does not fall within one Standard Deviation of the median and mean? That’s easy: 32%. So if 68% of the numbers fall within one Standard Deviation, then 32% are more than one Standard Deviation from the average, 16% below and 16% above.
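The percentages in the Empirical Rule are rounded values taken from the Normal curve itself. As a sketch, they can be recovered with `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the helper name `share_within` is ours.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.2f}%")

# Half of what is left over lies more than one SD above the mean.
one_tail = (1 - share_within(1)) / 2
print(f"more than 1 SD above the mean: {one_tail * 100:.2f}%")
```

The exact values differ slightly from 68%, 95%, and 99.7%, which is why the rule is best read as an approximation.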

Activity 6: The Empirical Rule

25. The AP Psychology midterm has normally distributed scores, a mean of 75, and a standard deviation

of 10. Approximately what percentage of test takers scored 85 or above? Show your work.

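[Figure: Normal curve with 34% of values on each side of the mean within one Standard Deviation, and 16% in each tail]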
