chi-square test chi-square test or 2 test. chi-square test countsused to test the counts of...
TRANSCRIPT
Chi-square testChi-square testor
2 test
Chi-square testChi-square test•Used to test the countscounts of
categorical data•ThreeThree types
–Goodness of fit (univariate)– Independence (bivariate)–Homogeneity (univariate with two samples)
22 distribution – distribution –
df=3
df=5
df=10
22 distribution distribution
•Different df have different curves
•Skewed right•As df increases, curve shifts
toward right & becomes more like a normal curvenormal curve
2 2 assumptionsassumptions• SRS SRS – reasonably random sample• Have countscounts of categorical data &
we expect each category to happen at least once
• Sample sizeSample size – to insure that the sample size is large enough we should expect at least five in each category.
***Be sure to list expected counts!!
Combine these together:
All expected counts are at
least 5.
2 2 formulaformula
exp
expobs 22
2 2 Goodness of fit testGoodness of fit test
•Uses univariate data•Want to see how well the observed counts “fit” what we expect the counts to be
•Use 22cdf functioncdf function on the calculator to find p-valuesp-values
Based on df –Based on df –
df = number of df = number of categoriescategories - 1 - 1
Hypotheses – written in Hypotheses – written in wordswords
H0: proportions are equal
Ha: at least one proportion is not the same
Be sure to write in context!
Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?
Aries 23 Libra 18 Leo20
Taurus 20 Scorpio 21 Virgo 19
Gemini 18 Sagittarius19 Aquarius24
Cancer 23 Capricorn 22 Pisces29
How many would you expect in each sign if there were no difference between them?
How many degrees of freedom?
I would expect CEOs to be equally born under all signs.So 256/12 = 21.333333Since there are 12 signs –
df = 12 – 1 = 11
Assumptions:
•Have a random sample of CEO’s
•All expected counts are greater than 5. (I expect 21.33 CEO’s to be born in each sign.)
H0: The proportions of CEO’s born under each sign are the same.
Ha: At least one of the proportion of CEO’s born under each sign is different.
P-value = 2cdf(5.094, 10^99, 11) = .9265 = .05
Since p-value > , I fail to reject H0. There is not
sufficient evidence to suggest that the CEOs are born under some signs more than others.
094.5
3.21
3.2129...
3.21
3.2120
3.21
3.2123222
2
A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g or hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises?
Why is the chi-square goodness-of-fit test NOT appropriate here?
What might you do instead of weighing the nuts in order to use chi-square?
Because we do NOT have countscounts
of the type of nuts.We could countcount the
number of each type of nut and then perform a
2 test.
Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short) A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. What are the expected counts? df?
Are the results consistent with the theoretical distribution predicted by the genetic model? (see next page)
Expected counts:Y & N = 56.25Y & S = 18.75E & N = 18.75E & S = 6.25We expect 9/16 of the 100
flies to have yellow and normal wings. (Y & N)
Since there are 4 categories,
df = 4 – 1 = 3
Assumptions:
•Have a random sample of fruit flies
•All expected counts are greater than 5. Expected counts:Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25
H0: The proportions of fruit flies are the same as the theoretical model.
Ha: At least one of the proportions of fruit flies is not the same as the theoretical model.
df = 4 – 1 = 3 P-value = 2cdf(5.671, 10^99, 3) = .129 = .05
Since p-value > , I fail to reject H0. There is not sufficient evidence to suggest that the distribution of fruit flies is not the same as the theoretical model.
671.5
25.625.610
...75.18
75.182025.56
25.5659 2222
2 test for independence•Used with categorical, bivariate
data from ONE sample•Used to see if the two categorical
variables are associated (dependent) or not associated (independent)
•Key phrasing: independent/ dependent, related to, affected by, etc.
Assumptions & Assumptions & formula remain the formula remain the
same!same!
Hypotheses – written in Hypotheses – written in wordswords
H0: two variables are independentHa: two variables are dependent
Be sure to write in context!
The following data are based on the results of an (old) survey to examine media preferences. Does the evidence support the claim that the type of media preferred and age group are not independent?
TV fans Radio Listeners
Newspaper Readers
Totals
18 to 34 29 41 31 101 35 to 49 22 27 29 78
50 or older 50 32 40 122Totals 101 100 100 301
If preferred media type is independent, how would we expect this table to be filled in?
TV fans Radio Listeners
Newspaper Readers
Totals
18 to 34 33.89 33.55 33.55 101
35 to 49 26.17 25.91 25.91 78
50 or older 40.94 40.53 40.53 122
Totals 101 100 100 301
Expected Counts Expected Counts
•Assuming H0 is true,
totaltable
alcolumn tot totalrow counts expected
Degrees of freedomDegrees of freedom
)1c)(1(r df
Or cover up one row & one column & count the number of cells remaining!
Assumptions:
•Have a random sample of people
•All expected counts are greater than 5.
(expected counts listed on the previous slide)
H0: age & media preference are independent
Ha: age and & media preference are dependent
P-value = .1144 df = 4 = .05
Since p-value > , I fail to reject H0. There is not sufficient evidence to suggest that age and media preference are dependent.
4399.72
22 test for homogeneity test for homogeneity•Used with a single categoricalsingle categorical
variable from two (or more) two (or more) independent samplesindependent samples
•Used to see if the two populations are the same (homogeneous)
•Key phrasing: is different from, a difference between, is the same, etc.
Assumptions & formula remain the same!
Expected counts & df are found the same way as test for independence.
OnlyOnly change is the hypotheses!
Hypotheses – written in Hypotheses – written in wordswords
H0: the proportions for the two (or more) distributions are the sameHa: At least one of the proportions for the distributions is different
Be sure to write in context!
The following data is on drinking behavior for independently chosen random samples of male and female students. Does there appear to be a gender difference with respect to drinking behavior? (Note: low = 1-7 drinks/wk, moderate = 8-24 drinks/wk, high = 25 or more drinks/wk) Men Women Total
None 140 186 326
Low 478 661 1139
Moderate 300 173 473
High 63 16 79
Total 981 1036 2017
Assumptions:
•Have 2 random sample of students
•All expected counts are greater than 5.
H0: the proportions of drinking behaviors is the same
for female & male students Ha: at least one of the proportions of
drinking behavior is different for female & male students
P-value = 8.67 x 10^-21 df = 3 = .05
Since p-value < , I reject H0. There is sufficient evidence to suggest that drinking behavior is not the same for female & male students.
53.96...
4.167
4.167186
6.158
6.158140 222
Expected Counts:
M F
0 158.6 167.4
L 554.0 585.0
M 230.1 243.0
H 38.4 40.6