statistics - loudoun county public schools · statistics part i: introduction to statistics...
Statistics
Part I: Introduction to Statistics
Statistics can be used as a tool to help demystify data. Everyday examples of the use of statistics include
election polls, market research, exercise regimes, and drug abuse surveys. Statistics are a means of
organizing and analyzing data (numbers) systematically so that they have meaning. In fact, that might be
a good working definition for statistics – giving meaning to numbers.
Frequency Distribution.
A group of numbers has little meaning until it is organized. Frequency distributions and graphs allow us
to visually interpret sets of data and look at data in a meaningful way.
An organized list enables us to see clusters or patterns in data that would be less obvious in an
unorganized list. Scores are listed in ascending or descending order so that groupings of same scores can
be recognized. For example, scores on an exam might include:
91, 92, 87, 99, 83, 84, 82, 93, 89, 91, 85, 94, 91, 98, 90
Frequency Distribution of ungrouped scores (frequency counts of zero are always included in the table, as shown):
Score Frequency
99 1
98 1
97 0
96 0
95 0
94 1
93 1
92 1
91 3
90 1
89 1
88 0
87 1
86 0
85 1
84 1
83 1
82 1
81 0
80 0
N = 15 (N represents the total number of observations or scores.)
Grouped Frequency Distribution of the same scores
Interval Frequency
95 – 99 2
90 – 94 7
85 – 89 3
80 – 84 3
N = 15
The width of the intervals in grouped frequency tables must be equal. There should be no overlap between the intervals.
The grouped distribution shows that the scores tend to cluster in the 90 – 94 range. The class can then see
how their own score compares in relation to the groups. This clustering is not apparent from the original
random listing of test scores.
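Both tables above can be rebuilt mechanically from the raw scores. The following is a minimal sketch in Python; the function name `grouped_frequencies` and its parameters are illustrative choices, not part of the handout.

```python
from collections import Counter

scores = [91, 92, 87, 99, 83, 84, 82, 93, 89, 91, 85, 94, 91, 98, 90]

# Ungrouped table: one row per possible score from 99 down to 80,
# including scores that never occurred (frequency 0).
counts = Counter(scores)
ungrouped = [(score, counts[score]) for score in range(99, 79, -1)]

def grouped_frequencies(values, low=80, high=99, width=5):
    """Count values in equal-width, non-overlapping intervals, highest first."""
    table = []
    for start in range(high - width + 1, low - 1, -width):
        end = start + width - 1
        freq = sum(1 for v in values if start <= v <= end)
        table.append((start, end, freq))
    return table

grouped = grouped_frequencies(scores)

for score, freq in ungrouped:
    print(score, freq)
for start, end, freq in grouped:
    print(f"{start} - {end}", freq)
print("N =", len(scores))
```

Because every interval has the same width and no two intervals share a score, each score is counted exactly once and the frequencies still add up to N.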
Part II: Graphs
Graphs allow us to quickly summarize the data collected. At a glance we can get some level of
meaning from the numbers.
Pie Chart
A pie chart is a circle within which all of the data points or numbers are contained in the form
of percentages. Suppose that you are a car dealer and are interested in knowing what
percentage of cars you should order in each color. A pie chart divided by the percentages of
each color would be a quick visual representation of the data from the previous year’s car sales.
Bar Graph
A common method used for representing nominal data is a bar graph. The height of each bar indicates
the percentage or frequency of that category.
Line Graph
Line graphs indicate changes that occur during experiments. They show how the dependent variable changes
in relation to the independent variable. Perhaps you are investigating the amount of change in SAT
scores between groups retaking the test: those who take an SAT prep course and those who do not. The
independent variable (the two groups in this case) is always on the horizontal axis and the dependent
variable (the SAT scores) is always on the vertical axis.
Frequency Polygon
This line graph has the same vertical and horizontal labels as the histogram. However, each score’s
frequency of occurrence is marked with a point on the graph, and then all points are connected with a line.
More than one distribution of scores can be plotted on the same polygon for visual comparisons. For
instance, if you wanted to compare sales of red cars with sales of white cars between the years of 1985
and 1995, a frequency polygon would be an easy way to visually compare the data from sales of each
color.
The frequency polygon is especially useful in showing the asymmetry of data. When the data are not
evenly distributed, the distribution is referred to as skewed and the graph is not symmetrical.
A negatively skewed graph would be indicated by a high frequency (or clustering) of data on the high
end with a few data points on the low end. Skewness is always named for the “tail” of the graph, the
side where the frequency of occurrences is low; here the tail is on the low end.
A positively skewed graph would be indicated by a high frequency (or clustering) of data on the low
end with a few data points on the high end.
Activity 1: Graphing
Students will draw a graph of the Grouped Frequency Distribution of the sample grades above using a
Frequency Polygon. Make sure that the grouped scores are listed on the horizontal (or X) axis and the
frequencies on the vertical (or Y) axis.
Students will use a Grouped Frequency Distribution of the scores, but this time using different groupings.
You will again use four groupings, but group them differently than how they are grouped above. You will
then draw a graph of the second Grouped Frequency Distribution.
Are the two graphs skewed in any way? If so, how?
End product
1. Second method of Grouped Frequency Distribution of the scores
2. Two graphs (one of the original Grouped Frequency Distribution above, and one of the Grouped
Frequency Distribution that the student regrouped in another manner)
3. Identification of skew in graphs
Remember when making a graph the increments should go from low to high, away from the axis.
Part III: Correlation
Background & Independent and Dependent Variables
Correlation describes the relationship between two variables. The result of a psychological study may
reveal that one trait or behavior accompanies another. In other words: when one trait is present, another
trait is also present. When this occurs, we say that there is a correlation between the two traits.
Example #1: How is studying related to grades?
One variable may be measured by the number of hours a student studied for an exam, and the other
variable by the score on the exam.
The “hours studied” variable is considered the independent variable (x) because it is presumed to influence
the “exam grade” variable, which is considered the dependent variable (y). The independent
variable is the presumed cause; the dependent variable is the presumed effect.
If the findings show that the more a student studied, the higher the grade the student received on the
exam, then this would be an example of a positive correlation. A positive correlation indicates a direct
relationship between variables: an increase in one variable is accompanied by an increase in another
variable or a decrease in one variable is accompanied by a decrease in another variable.
Example #2: How is television viewing related to grades?
In this example, television could be measured by asking an individual to calculate the average number of
hours a day he or she watches. Grades could be measured by the grade point average (GPA) obtained at
the end of the school year. It is quite likely that such a study would show that as the amount of television
viewing increases, the GPA decreases. If that were the case, this would be an example of a negative
correlation. A negative correlation indicates an inverse relationship between variables: an increase in
one variable is accompanied by a decrease in another variable, or a decrease in one variable is
accompanied by an increase in another variable.
Correlation Coefficients
Correlations are measured with numbers ranging from -1.0 to +1.0. These numbers are called correlation
coefficients. The correlation coefficient is a statistical measure of relationship: it reveals how closely two
things vary together and thus how well either one predicts the other. A correlation coefficient is
represented mathematically as the Greek letter rho – or a small r.
A coefficient of zero indicates no correlation between variables. In other words: there is no relationship between
the two variables.
As the number moves closer to +1.0, the coefficient shows an increasingly strong positive correlation. As the
number moves closer to -1.0, it shows an increasingly strong negative correlation.
+1.0 and -1.0 are perfect correlations while 0 is no correlation.
Perfect correlations rarely occur in the real world. For example, the correlation between studying and
grades is less than a perfect one. Not all students who study many hours always get the highest grades,
and sometimes students who claim to have studied very little get very high grades. Therefore, the
correlation between studying and grades will be less than 1.0.
Thus correlations of +.79 and -.79 are both strong correlations with
the first being a strong positive correlation and the second a strong
negative correlation. Therefore, -.88 is a stronger correlation than .45
and .75 is a stronger correlation than -.22. It is the absolute value of the
number that indicates the strength of the correlation. The sign (+ or -)
indicates the direction of the relationship between the variables.
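The handout does not give a formula for the coefficient. The sketch below assumes the standard Pearson product-moment formula and applies it to the GPA and TV-hours pairs from the Activity 2 table later in this part. The function name `pearson_r` is illustrative.

```python
from math import sqrt

# (GPA, TV hours per week) pairs copied from the Activity 2 data table.
data = [
    (3.9, 10), (3.2, 15), (2.1, 44), (1.5, 39), (1.8, 35), (2.5, 22),
    (2.5, 8), (3.5, 10), (4.0, 6), (3.8, 7), (3.6, 9), (2.9, 18),
    (2.7, 30), (3.0, 20), (3.7, 12), (2.4, 33), (2.2, 25), (4.0, 30),
]

def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs.

    Assumes the standard formula: the sum of the products of deviations
    divided by the square root of the product of the summed squared deviations.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

r = pearson_r(data)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
strength = abs(r)  # the absolute value carries the strength

print("r =", round(r, 2), "| direction:", direction, "| strength:", round(strength, 2))
```

The sign and the absolute value are reported separately because, as described above, they answer two different questions: which way the relationship runs and how strong it is.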
Scatter Plots
Scatter plots give a visual representation of correlations: the x variable (the independent variable) on the
horizontal axis and the y variable (the dependent variable) on the vertical axis.
Though the correlation coefficient might be very high or strong, this does not imply that one variable
causes the other. Correlation does not imply causation. This point often confuses people. Though there
may be causation, the correlation research process is not one of proof. It is rather a process that indicates
that when one variable exists, the other variable will also be present or affected. For example, there is a
positive correlation between wearing glasses and IQ. However, wearing glasses does not cause a person
to have a high IQ. Nor does having a high IQ cause a person to have to wear glasses. However, there is a
third factor (the amount a person reads) that can influence both wearing glasses and IQ. The takeaway is
that, despite a strong correlation coefficient, correlation does not equal causation!
[Graph #1: r = .992558]
[Graph #2: r = .958648]
[Graph #3: r = .8057]
The above three graphs show examples where correlation does not equal causation. The three graphs
show a high correlation between two factors that common sense would tell you have no connection with
each other.
Correlation is useful in showing a connection between two factors that was not previously observed. If a
high correlation (either positive or negative) is observed, generally it will be investigated to see if one
factor is, in fact, causing the other. However, in the above cases, there really does not need to be much
investigation done to show that really there is no connection between the two factors.
Basically, if there were a cause and effect (but there isn't), you'd be saying:
Divorce causes people to eat more margarine (at least in Maine!), or eating margarine causes people
to get divorced (at least in Maine), for Graph #1.
Eating mozzarella cheese causes people to get their Ph.D. in civil engineering, or getting a Ph.D. in
civil engineering causes you to eat more mozzarella cheese, for Graph #2.
The more people die from spider bites, the longer they make the winning words in the Scripps Spelling
Bee, or the longer the winning words are in the Scripps Spelling Bee, the more spiders are compelled to kill
people, for Graph #3.
Obviously none of the above is true, which shows the danger of believing there to be causation when there
is a strong correlation.
Activity 2: Correlation
Students will indicate if the below pairs of Independent and Dependent Variables would represent a
positive or a negative correlation.
Independent Variable Dependent Variable
4. Size of television Selling price of television
5. Number of absences from school GPA
6. Number of hours working Amount of free time
7. Level of advertising Amount of sales
8. Size of sports team payroll Number of games won
Students will plot the below data on grade point averages and the number of hours of television watched
per week.
GPA TV hours GPA TV hours GPA TV hours
3.9 10 2.5 8 2.7 30
3.2 15 3.5 10 3.0 20
2.1 44 4.0 6 3.7 12
1.5 39 3.8 7 2.4 33
1.8 35 3.6 9 2.2 25
2.5 22 2.9 18 4.0 30
9. Students will then describe the correlation that is indicated by the plotted data (both direction and
strength)
10. Is there a correlation between the number of hours of TV a student watches a week and his/her GPA?
11. Does this mean that we can say that TV watching has an effect on GPA? Why or why not?
12. What other factors might be involved?
End product
Answers to questions 4 - 8
Scatter plot
Answers to questions 9 – 12
The Purpose of Correlation & Illusory Correlation
Statistics can help us see what the naked eye sometimes misses. Wondering
if tall people are more or less easygoing, you collect two sets of data: scores
of men’s heights and men’s temperaments. You measure the heights of 20
men and have someone else independently assess their temperaments (from
zero for extremely calm to 100 for highly reactive).
With all the relevant data right in front of you, you can tell whether there is:
1) a positive correlation between height and reactive temperament
2) very little or no correlation
3) a negative correlation
Comparing the columns in Table 1.1, most people detect very little
relationship between height and temperament. In fact, the correlation in this
imaginary example is moderately positive, +.63, as we can see if we display
the data as a scatter plot. In the below scatter plot, the upward, oval shaped
slope of the cluster of points as one moves to the right shows that our two
imaginary sets of scores (height and reactivity) tend to rise together.
If we fail to see the relationship when data are presented as systematically as in Table 1.1, how much less
likely are we to notice it in everyday life? To see what is in front of us, we sometimes need statistical
illumination.
Just as correlation can show that there are relationships between two factors that aren’t immediately
obvious, correlations also restrain our “seeing” relationships that actually do not exist. A perceived
but nonexistent correlation is an illusory correlation. Such illusory thinking helps explain why for so many
years people believed (and still believe) that sugar made children hyperactive, that getting cold and wet
caused one to catch a cold, and that weather changes trigger arthritis pain. For the last one, researchers
recorded both the patients’ pain reports and the daily weather – temperature, humidity, and barometric
pressure. Despite patients’ beliefs, the weather was uncorrelated with their discomfort, either on the same
day or up to two days earlier or later.
Part IV: Measures of Central Tendency
These are numbers that attempt to describe the “typical” or “average” score in a distribution. There are three
measures of central tendency: mode, median, and mean. When each of these is used depends on the
situation.
Mode
The mode is the most frequently occurring score in a set of scores. If two different scores occur most
frequently, then it is a bimodal distribution.
Median
The median is the score that falls in the middle when the scores are ranked in ascending or descending
order. Half the numbers will be below that number and half the numbers will be above it. The median
score is the best indicator of central tendency when there is a skew, because the median score is
unaffected by extreme scores, sometimes referred to as outliers.
Mean
This is the arithmetic average of a set of scores. This is the score used by teachers to indicate your
semester grade. The mean is calculated by dividing the sum of all the scores by the total number of
scores. The mean is always pulled in the direction of extreme scores – the mean is pulled toward any
skew in the distribution.
Example
Below are listed 5 temperatures from each of the past two weeks. If we wanted to know what the
temperatures during each of these weeks were MOST like, the median would be the best indicator.
Week 1: 71 74 76 79 98 mean = 79.6 median = 76
Week 2: 70 74 76 77 78 mean = 75 median = 76
In the above example, the mean is affected by extreme scores. The mean for week 1 (79.6) falls between the 4th
and 5th scores, indicating a higher daily temperature than actually happened on most days. In this case, the median of 76
is far more indicative of the week’s daily temperature.
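The figures in this example can be reproduced with Python's standard `statistics` module. This is only a sketch; the variable names `week1` and `week2` are illustrative.

```python
from statistics import mean, median

week1 = [71, 74, 76, 79, 98]  # contains one extreme reading (98)
week2 = [70, 74, 76, 77, 78]  # no extreme readings

for label, temps in (("Week 1", week1), ("Week 2", week2)):
    print(label, "mean =", mean(temps), "median =", median(temps))

# How far the single extreme reading pulls the mean away from the median.
pull_week1 = mean(week1) - median(week1)
pull_week2 = mean(week2) - median(week2)
print("mean minus median:", round(pull_week1, 1), round(pull_week2, 1))
```

The median is the same in both weeks, so the entire difference between the two means comes from the one 98-degree day.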
Which Measure to Use
Measures of central tendency can be misleading. Suppose your mother has planned a family reunion on
Sunday when you and other family members have other things to do. Your family protests, saying that
they don’t want to spend the day with a bunch of “old fogies.” Your mother attempts to convince each
family member separately that the reunion won’t be so bad. Mom says to your younger sister that the
average age is 10, tells you that the average age is 18, and tells your dad that the average age is 36. Now
each family member feels better about spending the day at the family gathering, as long as they don’t talk
to each other and figure out that something must be amiss in that the “average” age that each is expecting
at the gathering is different. In fact, your mother can claim that she told each of you the truth,
but she was deceptive in that she used a different measure of “average” for each of you. She used the
mode for your sister, the median for you, and the mean for your dad. The ages of those who are expected
to attend are shown in the following table:
Age in years Name/relation
3 Cousin Susie
7 Cousin Joey
10 Twin Shanda
10 Twin Wanda
15 Cousin Marty
17 Cousin Juan
18 Cousin Pat
44 Aunt Harriet
49 Uncle Stewart
58 Aunt Rose
59 Uncle Don
82 Grandma Faye
96 Great Aunt Lucy
What would have helped you to tell the whole story accurately? If the family members had known the
value of each measure of central tendency and the meaning of each, they would have figured out the mother’s
scheme. Knowing that the median is 18 and the mean is 36 would immediately indicate that there is a
skew in the ages: there are some significantly older relatives coming to the gathering. One can
sometimes choose the measure of central tendency that comes closest to supporting a particular viewpoint
or bias. You must ask yourself, “What measure of average is being used?”
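The three “averages” in the reunion story can be checked against the age table. This is a minimal sketch using Python's standard `statistics` module; the name `ages` is illustrative.

```python
from statistics import mean, median, mode

# Ages copied from the reunion table, youngest to oldest.
ages = [3, 7, 10, 10, 15, 17, 18, 44, 49, 58, 59, 82, 96]

age_mode = mode(ages)      # most frequent age (the twins)
age_median = median(ages)  # middle value of the 13 ages
age_mean = mean(ages)      # sum of the ages divided by 13

print("mode =", age_mode, "median =", age_median, "mean =", age_mean)

# A mean well above the median is a hint of a tail on the high end.
print("mean exceeds median by", age_mean - age_median)
```

Reporting all three numbers together is what exposes the skew; any one of them alone hides it.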
Activity 3: Calculating Central Tendency
13. Students will calculate the mode, median, and mean for the heights of an AP Psychology class (data
below)
The students will then graph the data using:
14. Bar Graph
15. Frequency Polygon
End product Mean, median, & mode of data
Two graphs
Note You will not need to know how to calculate median for AP Psychology. For this activity, use outside
resources to find the formula/determine the median of the data.
For the class, you just need to know the differences amongst the three measures of central tendency and
when you would use each one.
Student Height Student Height Student Height
A 5’ 2” I 5’ 8” Q 5’ 9”
B 6’ 0” J 5’ 6” R 5’ 6”
C 5’ 6” K 5’ 7” S 5’ 9”
D 5’ 10” L 5’ 8” T 5’ 2”
E 5’ 3” M 5’ 8” U 5’ 4”
F 5’ 7” N 6’ 1” V 4’ 11”
G 5’ 6” O 5’ 5”
H 5’ 4” P 5’ 11”
Suggestion: It will be much easier if you convert the heights to inches.
Part V: Measures of Variation
Sometimes referred to as “Measures of Variability”
Measures of variation indicate how much spread there is in a distribution. If you collect data on the ages
of those students in the 11th grade, there will be very little variability. However, if you collected data on
the shoe sizes of those same students, there would be a great deal of variability. The amount of variability
within or between groups has an impact on the research process itself. So it is important to have measures of
variability.
Range
The range is the difference between the lowest and highest scores. This is a gross indicator of variance
within a group of scores. The range of scores can be increased significantly with a single outlying
score. For instance, test scores from two different classes of the same course might be:
Class one: 94, 92, 85, 81, 80, 73, 62 range = 32
Class two: 85, 83, 82, 81, 80, 79, 77 range = 8
For each of the classes, the mean is 81 and the median is 81. Overall however, the classes did not perform
the same. The scores in class one are much more spread out than in class two. This is indicated by the
difference in the ranges of the two data sets. The scores for the second class tend to cluster closer to the
mean.
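A short sketch of the comparison above, showing two classes with identical means and medians but very different ranges. The helper name `score_range` is illustrative.

```python
from statistics import mean, median

class_one = [94, 92, 85, 81, 80, 73, 62]
class_two = [85, 83, 82, 81, 80, 79, 77]

def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)

for label, scores in (("Class one", class_one), ("Class two", class_two)):
    print(label, "mean =", mean(scores), "median =", median(scores),
          "range =", score_range(scores))
```

Since the range uses only the two most extreme scores, changing the single lowest score in class one would change its range without any other score moving.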
Variance
This is a measure of how different the scores are from each other. The difference between the scores is
measured by the distance of each score from the mean of all the scores. In calculating the variance, each
score has an impact on the overall variance. The variance is how much spread there is overall among the
scores. Interval and ratio data are necessary for calculating variance. Note that variance is reported in
squared units rather than the units in which the data was originally collected.
To calculate variance:
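One standard way to write the calculation is given below. It assumes the population form (dividing by N rather than N − 1), which is the form that reproduces the standard deviations reported in the bowling table later in this part. Here N is the number of scores, x_i is an individual score, and μ is the mean.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2
```

In words: find the mean, subtract the mean from each score, square each difference, add the squared differences, and divide by the number of scores.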
Standard Deviation
This measure of variability is also based on how different the scores are from each other. Calculating the
variance is the first step in determining the standard deviation. However, remember that the variance is
reported in squared units rather than the original units. The standard deviation is the square root of the
variance of the scores. This puts the variability measure in the same units as the original data. So if you
wanted to know the variability of scores on a test, the standard deviation would tell you how much the
scores varied in terms of test points rather than squared test points. It is more meaningful.
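As a sketch of the two steps (variance first, then its square root), the code below uses the two classes from the Range example. It assumes the population form, which Python's standard library provides as `pvariance` and `pstdev`; the sample form (`variance`, `stdev`) would give slightly larger values.

```python
from math import sqrt
from statistics import pstdev, pvariance

class_one = [94, 92, 85, 81, 80, 73, 62]
class_two = [85, 83, 82, 81, 80, 79, 77]

for label, scores in (("Class one", class_one), ("Class two", class_two)):
    var = pvariance(scores)  # in squared test points
    sd = pstdev(scores)      # back in test points
    print(label, "variance =", round(var, 2), "standard deviation =", round(sd, 2))

# The standard deviation is the square root of the variance.
assert abs(pstdev(class_one) - sqrt(pvariance(class_one))) < 1e-9
```

Both classes have a mean of 81, yet class one's standard deviation is roughly four times that of class two, which matches the difference already visible in their ranges.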
Activity 4: Calculating and Understanding Standard Deviation/Variance
Students will calculate the standard deviation of the heights of an AP Psychology class (use the same data
set as the previous activity)
Now assume that the students are all standing on chairs, which adds one foot to their heights. Calculate
the standard deviation of the heights of the class with the students standing on their chairs.
Has the standard deviation changed from the first set of data to the second? Explain why.
End product 16. Standard deviation of the first set of data
17. Standard deviation of the second set of data
18. Explanation of connection between the standard deviation of the two sets of data.
There are two ways to think about Standard Deviation:
1. Consistency. Generally, the more consistent the numbers, the lower the Standard Deviation.
2. The average distance from the average. The closer the scores are to the average, the lower the
Standard Deviation. The further away the scores are from the average, the higher the Standard
Deviation.
Take a look at the comparison of the bowling scores from ten games for two people, Anna and Sally. The
highest possible score is 300.
Game # Anna Sally
1 250 290
2 275 190
3 280 210
4 270 260
5 260 285
6 260 275
7 270 230
8 275 200
9 273 295
10 263 200
Mean 267.6 243.5
Median 270 245
Standard deviation 8.7 39.7
Max score 280 295
Minimum score 250 190
Range 30 105
Look at Anna’s scores. Her mean score is 267.6 with a Standard Deviation of 8.7, which means that
the majority of her scores should fall between 258.9 and 276.3 (the mean minus the Standard Deviation
and the mean plus the Standard Deviation). And if you look at the numbers, that is the case: eight of her
ten scores fall in that range. The more the numbers cluster around the mean/median (the more consistent
the scores), the lower the Standard Deviation. Also, generally if you have a low Standard Deviation, the
range also tends to be small.
Now look at Sally’s scores. Her mean score is 243.5 with a Standard Deviation of 39.7, so one Standard
Deviation on either side of the mean runs from 203.8 to 283.2. That band is much wider than Anna’s, and
even so only four of Sally’s ten scores fall inside it. The more the numbers are spread out from the
mean/median (the less consistent the scores), the higher the Standard Deviation. And the range is much
higher than Anna’s.
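The summary rows of the bowling table can be recomputed as a check. The sketch below assumes the population standard deviation (`pstdev`), since that is the form that matches the 8.7 and 39.7 in the table, and it also counts how many scores lie within one standard deviation of the mean. The function name `summarize` is illustrative.

```python
from statistics import mean, median, pstdev

anna = [250, 275, 280, 270, 260, 260, 270, 275, 273, 263]
sally = [290, 190, 210, 260, 285, 275, 230, 200, 295, 200]

def summarize(scores):
    """Recompute the table's summary rows for one bowler."""
    m = mean(scores)
    sd = pstdev(scores)  # population form (divide by N), an assumption
    within = [s for s in scores if m - sd <= s <= m + sd]
    return {
        "mean": round(m, 1),
        "median": median(scores),
        "sd": round(sd, 1),
        "max": max(scores),
        "min": min(scores),
        "range": max(scores) - min(scores),
        "within_one_sd": len(within),
    }

anna_summary = summarize(anna)
sally_summary = summarize(sally)
print("Anna ", anna_summary)
print("Sally", sally_summary)
```

The `within_one_sd` count is the most direct way to see consistency: a tight cluster keeps most scores inside the band, while widely scattered scores leave many outside it even though the band itself is wider.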
Part V: The Normal Curve
Going back to graphing: we looked at graphs that are asymmetrically distributed, such as those with a left or
a right skew. But what if the data are symmetrical?
When the data are symmetrical, they are said to have a Normal Distribution: the data tend to be
grouped around a central value with no bias left or right.
If graphed, scores distributed in a “normal manner” are symmetrical about the center, and the results
typically form a bell-shaped curve. The resulting curve is called a Bell curve or Normal curve.
Results fall mostly in the middle, with fewer and fewer scores at the extremes, either above or below the
median.
With Normal Distribution: Mean = median = mode (or very, very close)
50% of values less than the mean
50% greater than the mean
What type of data results in a normal curve?
Any data with some sort of natural variation usually come out on a Normal Curve, such as
people’s heights, weights, IQs, errors in measurement, etc. The Normal Curve is used a lot in the natural
sciences, social sciences, and statistics.
Example
Often when you’re doing a lab in science, you run multiple trials. Rarely would you get the same results
each time. The results might vary for various reasons: user error, contamination of the material, faulty
equipment, etc.
Let’s take a simple example, determining the boiling point of water in Celsius. You run 20 trials and
receive the below results:
Trial Temp: Celsius Trial Temp: Celsius
1 99.8 11 101.0
2 100.2 12 99.9
3 100.3 13 100.2
4 99.5 14 100.3
5 100.6 15 99.8
6 100.2 16 99.7
7 100.8 17 100.4
8 99.8 18 100.6
9 99.5 19 99.9
10 100.2 20 100.2
Mean 100.1
Median 100.2
Mode 100.2
Standard deviation 0.4
Range 1.5
The resulting histogram will look like a Normal Curve. Keep in mind, in nature, rarely do Bell-Curves
look exactly like a bell.
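The summary rows for the boiling-point trials can be recomputed, and a rough text histogram gives a feel for the shape. This is a sketch only: it assumes the population standard deviation (`pstdev`), which rounds to the same 0.4 as the sample form for these data.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev

# The 20 trial temperatures from the table, in trial order.
temps = [99.8, 100.2, 100.3, 99.5, 100.6, 100.2, 100.8, 99.8, 99.5, 100.2,
         101.0, 99.9, 100.2, 100.3, 99.8, 99.7, 100.4, 100.6, 99.9, 100.2]

print("mean  ", round(mean(temps), 1))
print("median", median(temps))
print("mode  ", mode(temps))
print("sd    ", round(pstdev(temps), 1))
print("range ", round(max(temps) - min(temps), 1))

# Text histogram: one row per distinct reading, one '#' per trial.
tally = Counter(temps)
for reading in sorted(tally):
    print(f"{reading:6.1f} {'#' * tally[reading]}")
```

With only 20 trials the bars are uneven, which is consistent with the caution above that real data rarely trace a perfect bell.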
Activity 5: The Normal Curve
During the month of October, you record how long it takes you to drive to school each morning. Let’s
call it from the time you get in your car until the time you park your car in a spot. It should take you 25
minutes, but some mornings it might take you less time (all the traffic lights are green, you find a spot right
away, etc.) and some mornings it takes you more time (you get behind a school bus, a family of ducks is
crossing the road, you have trouble finding a spot, etc.).
The results of how long it takes you are below:
Date Time in minutes
Monday, October 03 23
Tuesday, October 04 21
Wednesday, October 05 25
Thursday, October 06 23
Friday, October 07 24
Tuesday, October 11 26
Wednesday, October 12 23
Thursday, October 13 22
Friday, October 14 21
Monday, October 17 19
Tuesday, October 18 23
Wednesday, October 19 24
Thursday, October 20 26
Friday, October 21 23
Monday, October 24 23
Tuesday, October 25 23
Wednesday, October 26 24
Thursday, October 27 25
Friday, October 28 19
[Figure: “Boiling Point of Water in Celsius” chart from the boiling point example above]
19: Students will calculate the mean of the data
20: Students will calculate the median of the data
21: Students will calculate the mode of the data
22: Students will calculate the standard deviation of the data
23: Students will calculate the range of the data
24: Students will create a Normal Curve using the above data
Part VI: The Empirical Rule
With Normal Distribution, we can expect the following:
68% of the values will fall within one standard deviation of the mean
95% of the values will fall within two standard deviations of the mean
99.7% of the values will fall within three standard deviations of the mean
This is referred to as the Empirical Rule.
For the most part, we only deal with one Standard Deviation.
Things to keep in mind regarding the Empirical Rule:
The data must have a Normal Distribution.
With Normal Distribution: mean = median = mode (or very close).
If 68% falls within one Standard Deviation, what percentage falls within one Standard Deviation below the
mean? Because a Normal Distribution is symmetrical (half the values fall below the mean and half above it),
the 68% splits evenly: 34% will fall between the mean and one Standard Deviation below it, and 34% will
fall between the mean and one Standard Deviation above it.
What percentage does not fall within one Standard Deviation of the median and mean? That’s easy:
32%. If 68% of the numbers fall within one Standard Deviation, then 32% are more than one Standard
Deviation from the average, 16% on the low end and 16% on the high end.
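The percentages in the Empirical Rule are rounded values from the normal curve itself. This sketch recovers them using `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.1f}%")

# Symmetry splits the one-SD band in half and the remainder into two tails.
each_side = share_within(1) / 2
each_tail = (1 - share_within(1)) / 2
print(f"each side of the mean: {each_side * 100:.1f}%")
print(f"each tail beyond one SD: {each_tail * 100:.1f}%")
```

The rule's 68%, 95%, and 99.7% are these values rounded, which is why the halves are quoted as 34% and the tails as 16%.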
Activity 6: The Empirical Rule
25. The AP Psychology midterm has normally distributed scores, a mean of 75, and a standard deviation
of 10. Approximately what percentage of test takers scored 85 or above? Show your work.
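[Figure: normal curve with 34% of values on each side of the mean within one Standard Deviation, and 16% in each tail]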