Statistics That Deceive
Simpson’s Paradox
It is a widely accepted rule that the larger the data set, the better
Simpson’s Paradox demonstrates that a great deal of care has to be taken when combining smaller data sets into a larger one
Sometimes the conclusions from the larger data set are the opposite of the conclusions from the smaller data sets
Example: Simpson’s Paradox
Baseball batting statistics for two players:
            First Half  Second Half  Total Season
Carson      .400        .250         .264
Kennington  .350        .200         .336
How could Carson beat Kennington in both halves individually, but then have a lower total season batting average?
Example Continued
We weren’t told how many at bats each player had:
            First Half     Second Half    Total Season
Carson      4/10 (.400)    25/100 (.250)  29/110 (.264)
Kennington  35/100 (.350)  2/10 (.200)    37/110 (.336)
Carson’s dismal second half and Kennington’s great first half had higher weights than the other two values, because each covers 100 of that player’s 110 at bats.
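The reversal can be checked directly from the hits and at bats in the table. The following is a minimal sketch in Python; the names records, batting_average and season_average are illustrative and not part of the original slides.

```python
# (hits, at bats) for each half of the season, taken from the table above
records = {
    "Carson": {"first": (4, 10), "second": (25, 100)},
    "Kennington": {"first": (35, 100), "second": (2, 10)},
}


def batting_average(hits, at_bats):
    """Batting average for a single period."""
    return hits / at_bats


def season_average(halves):
    """Season average: total hits over total at bats, not the mean of the halves."""
    total_hits = sum(hits for hits, _ in halves.values())
    total_at_bats = sum(at_bats for _, at_bats in halves.values())
    return total_hits / total_at_bats


for player, halves in records.items():
    first = batting_average(*halves["first"])
    second = batting_average(*halves["second"])
    season = season_average(halves)
    print(f"{player}: first {first:.3f}, second {second:.3f}, season {season:.3f}")
# Carson: first 0.400, second 0.250, season 0.264
# Kennington: first 0.350, second 0.200, season 0.336
```

Carson wins each half, yet the season figure is dominated by whichever half holds most of the at bats, so Kennington comes out ahead overall.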
Another Example
Average college physics grades for students in an engineering program:
                    HS Physics  No HS Physics
Number of Students  50          5
Average Grade       80          70
Average college physics grades for students in a liberal arts program:
                    HS Physics  No HS Physics
Number of Students  5           50
Average Grade       95          85
It appears that in both programs, taking high school physics improves your college physics grade by 10 points.
Example continued
In order to get better results, let’s combine our datasets.
In particular, let’s combine all the students that took high school physics.
More precisely, combine the students in the engineering program that took high school physics with those students in the liberal arts program that took high school physics.
Likewise, combine the students in the engineering program that did not take high school physics with those students in the liberal arts program that did not take high school physics.
But be careful! You can’t just take the average of the two averages, because each dataset has a different number of values.
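To see why averaging the two averages goes wrong, here is a small Python sketch that compares it with an average weighted by group size, using the students who took high school physics. The function names naive_mean and weighted_mean are made up for this illustration.

```python
def naive_mean(groups):
    """Average of the group averages, ignoring how many students are in each group."""
    return sum(avg for _, avg in groups) / len(groups)


def weighted_mean(groups):
    """Average over all students: each group average is weighted by its group size."""
    total_students = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total_students


# (number of students, average grade) for students who took high school physics
took_hs_physics = [(50, 80), (5, 95)]  # engineering, liberal arts

print(f"{naive_mean(took_hs_physics):.2f}")     # 87.50
print(f"{weighted_mean(took_hs_physics):.2f}")  # 81.36
```

The naive figure treats 5 liberal arts students as if they counted as much as 50 engineering students, which is why it lands well above the true combined average.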
Example continued
Average college physics grades for students who took high school physics:
             # Students  Grade  Weighted Grade
Engineering  50          80     50/55*80 = 72.73
Lib Arts     5           95     5/55*95 = 8.64
Total        55
Average                         72.73 + 8.64 ≈ 81.4
Average college physics grades for students who did not take high school physics:
             # Students  Grade  Weighted Grade
Engineering  5           70     5/55*70 = 6.36
Lib Arts     50          85     50/55*85 = 77.27
Total        55
Average                         6.36 + 77.27 ≈ 83.6
Did the students that did not have high school physics actually do better?
Example another way
Average college physics grades for students who took high school physics:
             # Students  Grade  Grade Pts
Engineering  50          80     4000
Lib Arts     5           95     475
Total        55                 4475
Average                         4475/55 ≈ 81.4
Average college physics grades for students who did not take high school physics:
             # Students  Grade  Grade Pts
Engineering  5           70     350
Lib Arts     50          85     4250
Total        55                 4600
Average                         4600/55 ≈ 83.6
Did the students that did not have high school physics actually do better?
The Problem
Two problems with combining the data:
There was a larger percentage of one type of student in each table
The engineering students had a more rigorous physics class than the liberal arts students, thus there is a hidden variable
So be very careful when you combine data into a larger set
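Both problems show up when the within-program comparison is set beside the pooled one. The sketch below is an illustration in Python using the figures from the two tables; the dictionary layout and the function names are assumptions made for this example.

```python
# (number of students, average grade) by program and high school physics background
data = {
    "engineering": {"hs_physics": (50, 80), "no_hs_physics": (5, 70)},
    "liberal_arts": {"hs_physics": (5, 95), "no_hs_physics": (50, 85)},
}


def within_program_gaps():
    """Grade advantage of HS physics students inside each program."""
    return {
        program: cells["hs_physics"][1] - cells["no_hs_physics"][1]
        for program, cells in data.items()
    }


def pooled_average(background):
    """Average grade over both programs combined: total grade points / total students."""
    cells = [data[program][background] for program in data]
    total_points = sum(n * avg for n, avg in cells)
    total_students = sum(n for n, _ in cells)
    return total_points / total_students


def engineering_share(background):
    """Fraction of the pooled group that comes from the engineering program."""
    total_students = sum(data[program][background][0] for program in data)
    return data["engineering"][background][0] / total_students


print(within_program_gaps())  # {'engineering': 10, 'liberal_arts': 10}
pooled_gap = pooled_average("hs_physics") - pooled_average("no_hs_physics")
print(f"{pooled_gap:+.2f}")   # -2.27
print(f"{engineering_share('hs_physics'):.2f}")     # 0.91
print(f"{engineering_share('no_hs_physics'):.2f}")  # 0.09
```

Within each program the advantage is 10 points, but the pooled high school physics group is about 91% engineering students, who sit the harder class, so the pooled comparison flips sign.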
More …
There are many real examples of this type of situation, which leads to an apparent contradiction
The deceptive result is based on this (remember this): If you view the same data in 2 different ways or break it into 2 different parts, you CAN get different results!