Chi-Squared Tests
Introduction
• Two statistical techniques are presented. Both are used to analyze nominal data.
– A goodness-of-fit test for a multinomial experiment.
– A contingency table test of independence.
• The test statistics in both cases follow the χ² distribution.
• The hypothesis tested involves the “success” probabilities p1, p2, …, pk of a multinomial distribution.
• The multinomial experiment is an extension of the binomial experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k categories, called cells.
– The probability pi for an outcome to fall into cell i remains constant for each trial. By assumption, p1 + p2 + … + pk = 1.
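The definition of a multinomial experiment can be made concrete with a small simulation. This is only a sketch: the function name `multinomial_trial_counts`, the seed, and the three cell probabilities are illustrative assumptions, not part of the slides.

```python
import random
from collections import Counter


def multinomial_trial_counts(n, probs, seed=0):
    """Run n independent trials; each trial lands in cell i with probability probs[i]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    cells = range(len(probs))
    outcomes = rng.choices(cells, weights=probs, k=n)
    tally = Counter(outcomes)
    # Report one count per cell, including cells that were never hit.
    return [tally.get(i, 0) for i in cells]


# Hypothetical experiment: 200 trials, k = 3 cells.
counts = multinomial_trial_counts(200, [0.45, 0.40, 0.15])
print(counts)
```

The cell counts always add up to n, which is why only k − 1 of them are free to vary.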
Chi-Squared Goodness-of-Fit Test
• Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for pi.
• The hypotheses:
H0: p1 = a1, p2 = a2, ..., pk = ak
H1: At least one pi ≠ ai
• The test builds on comparing the actual and the expected frequencies of occurrences in all cells.
• Example 16.1
– Two competing companies, A and B, have been dominant players in the market. Both companies recently conducted advertising campaigns for their products.
– Market shares before the campaigns were:
  • Company A = 45%
  • Company B = 40%
  • Other competitors = 15%
An Example
• Example 16.1 – continued
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the products advertised.
– Survey results:
  • 102 customers preferred company A’s product,
  • 82 customers preferred company B’s product,
  • 16 customers preferred the competitors’ products.
• Example 16.1 – continued
Can we conclude at 5% significance level that the market shares were affected by the advertising campaigns?
• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different after the campaigns from their values prior to the campaigns?
• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.
The expected frequency for each category (cell) if the null hypothesis is true, and the actual frequency the sample returned:

Category      Expected frequency   Actual frequency
Company A     90 = 200(.45)        102
Company B     80 = 200(.40)        82
Others        30 = 200(.15)        16
• The statistic is:
\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = n p_i \]
Intuitively, this measures the extent of differences between the observed and the expected frequencies.
• The rejection region is:
\[ \chi^2 > \chi^2_{\alpha,\,k-1} \]
• Example 16.1 – continued
\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18 \]
\[ \chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147 \]
The p-value = P(χ² > 8.18) ≈ .0167
[This comes from Excel: =CHIDIST(8.18,2)]
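The arithmetic of Example 16.1 can be checked with a short script. This is a minimal sketch using only the standard library; the helper names are assumptions. Both shortcuts are noted in the comments: the tail probability uses a closed form that holds only for an even number of degrees of freedom, and the critical value uses a formula that holds only for 2 degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); closed form for EVEN df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def goodness_of_fit(observed, null_probs):
    """Return the chi-squared statistic and the expected frequencies e_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in null_probs]
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, expected


observed = [102, 82, 16]          # survey results for A, B, others
null_probs = [0.45, 0.40, 0.15]   # market shares before the campaigns (H0)

stat, expected = goodness_of_fit(observed, null_probs)
df = len(observed) - 1
p_value = chi2_sf_even_df(stat, df)

# For df = 2 only, the tail is exp(-x/2), so the critical value solves exp(-x/2) = alpha.
alpha = 0.05
critical = -2.0 * math.log(alpha)

print("expected:", expected)
print("statistic:", round(stat, 2))
print("critical value:", round(critical, 5))
print("p-value:", round(p_value, 4))
print("reject H0:", stat > critical)
```

For other degrees of freedom, a statistics library or a χ² table would be the usual source of the critical value.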
• Example 16.1 – continued
[Figure: density of the χ² distribution with 2 degrees of freedom. The rejection region (area alpha) starts at 5.99; the test statistic 8.18 lies inside it, and the p-value is the area to the right of 8.18.]
Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities pi is different. Thus, at least two market shares have changed.
Required Conditions – The Rule of Five
• The test statistic used to perform the test is only approximately Chi-squared distributed.
• For the approximation to apply, the expected cell frequency has to be at least 5 for all cells (npi ≥ 5).
• If the expected frequency in a cell is less than 5, combine it with other cells.
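The rule of five can be applied mechanically, as in the sketch below. The slides do not say which cells to combine, so merging the two cells with the smallest expected frequencies is an assumption made here for illustration; in practice the merge should also make sense for the categories involved. The function name and the data are hypothetical.

```python
def combine_small_cells(observed, null_probs, min_expected=5.0):
    """Merge cells until every expected frequency n * p_i is at least min_expected.

    Assumption: the cell with the smallest expected frequency is merged with the
    next smallest one. Cells are returned in ascending order of probability.
    """
    n = sum(observed)
    cells = sorted(zip(null_probs, observed))  # ascending by probability
    while len(cells) > 1 and n * cells[0][0] < min_expected:
        (p1, f1), (p2, f2) = cells[0], cells[1]
        cells = sorted([(p1 + p2, f1 + f2)] + cells[2:])
    counts = [f for _, f in cells]
    probs = [p for p, _ in cells]
    return counts, probs


# Hypothetical data: n = 100, so the expected frequencies are 50, 40, 7 and 3.
counts, probs = combine_small_cells([48, 40, 9, 3], [0.50, 0.40, 0.07, 0.03])
print(counts, probs)
```

After merging, the degrees of freedom are based on the reduced number of cells.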
Chi-squared Test of a Contingency Table
• This test is used to test whether…
– two nominal variables are related?
– there are differences between two or more populations of a nominal variable?
• To accomplish the test objectives, we need to classify the data according to two different criteria.
• The idea is also based on goodness of fit.
• Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and, thus, their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.

The observed values (rows: undergraduate degree; columns: MBA major)

Degree   Accounting   Finance   Marketing   Total
BA       31           13        16          60
BENG     8            16        7           31
BBA      12           10        17          39
Other    10           5         7           22
Total    61           44        47          152

There are two ways to view this problem:
– If each undergraduate degree is considered a population, do these populations differ?
– If each classification is considered a nominal variable, are these two variables dependent?
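• Solution
– The hypotheses are:
H0: The two variables are independent
H1: The two variables are dependent
– The test statistic
\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \]
where k is the number of cells in the contingency table.
– The rejection region
\[ \chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)} \]
where r and c are the numbers of rows and columns in the table.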
Since ei = npi but pi is unknown, we need to estimate the unknown probability from the data, assuming H0 is true.
Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting)*P(BA) = [61/152][60/152].

The marginal totals and the estimated marginal probabilities (rows: undergraduate degree; columns: MBA major):

Degree        Accounting   Finance   Marketing   Total   Probability
BA                                               60      60/152
BENG                                             31      31/152
BBA                                              39      39/152
Other                                            22      22/152
Total         61           44        47          152
Probability   61/152       44/152    47/152

The number of students expected to fall in the cell “Accounting - BA” is
e_{Acct-BA} = n(p_{Acct-BA}) = 152(61/152)(60/152) = [61*60]/152 = 24.08

The number of students expected to fall in the cell “Finance - BBA” is
e_{Finance-BBA} = n(p_{Finance-BBA}) = 152(44/152)(39/152) = [44*39]/152 = 11.29
Estimating the expected frequencies
• The expected frequency of the cell in row i and column j of the contingency table is calculated by:
\[ e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}} \]
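The expected-frequency formula translates directly into code. The sketch below applies it to the observed counts of Example 16.2; the function name `expected_frequencies` is an assumption made for illustration.

```python
# Observed counts from Example 16.2 (columns: Accounting, Finance, Marketing).
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]


def expected_frequencies(table):
    """e_ij = (row i total) * (column j total) / sample size, assuming independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

The expected frequencies in each row and column add up to the same marginal totals as the observed frequencies.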
Observed frequencies, with the expected frequency of each cell in parentheses (rows: undergraduate degree; columns: MBA major):

Degree   Accounting   Finance      Marketing    Total
BA       31 (24.08)   13 (17.37)   16 (18.55)   60
BENG     8 (12.44)    16 (8.97)    7 (9.59)     31
BBA      12 (15.65)   10 (11.29)   17 (12.06)   39
Other    10 (8.83)    5 (6.37)     7 (6.80)     22
Total    61           44           47           152
Calculation of the χ² statistic
\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70 \]
• Solution – continued
– The critical value in our example is:
\[ \chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916 \]
• Conclusion: Since χ² = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their choice of MBA major (and thus their course selection) are dependent.
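The whole test of Example 16.2 can be reproduced in a few lines. This is a sketch under stated assumptions: the function names are illustrative, the critical value 12.5916 is copied from the slide rather than computed, and the p-value uses a closed form that is valid only because the degrees of freedom, (4 − 1)(3 − 1) = 6, are even.

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); closed form for EVEN df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def contingency_test(table):
    """Return the chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency under H0
            stat += (f - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

stat, df = contingency_test(observed)
p_value = chi2_sf_even_df(stat, df)
critical = 12.5916  # chi-squared critical value for alpha = .05, df = 6, from the slide

print("statistic:", round(stat, 4))
print("degrees of freedom:", df)
print("p-value:", round(p_value, 4))
print("reject H0:", stat > critical)
```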
Using the computer
• Define a code to specify each nominal value. Input the data in columns, one column for each variable.

Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING

Degree   MBA Major
3        1
1        1
1        1
1        1
2        2
1        3
.        .
.        .

• Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.

Contingency Table
        1    2    3    Total
1       31   13   16   60
2       8    16   7    31
3       12   10   17   39
4       10   5    7    22
Total   61   44   47   152

Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
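Outside the spreadsheet add-in, the same step of turning coded raw data into a contingency table can be sketched as follows. This is not a reproduction of Data Analysis Plus: the function name `tabulate` is hypothetical, and the six coded pairs are only a short illustrative sample, not the full data set.

```python
from collections import Counter

# Codes as defined on the slide.
DEGREES = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJORS = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}


def tabulate(pairs):
    """Count (degree code, major code) pairs into a rows-by-columns table."""
    tally = Counter(pairs)
    return [[tally[(d, m)] for m in sorted(MAJORS)] for d in sorted(DEGREES)]


# Illustrative sample of coded observations: (degree, MBA major).
raw = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]

table = tabulate(raw)
for code, row in zip(sorted(DEGREES), table):
    print(DEGREES[code], row)
```

With the full data set, the resulting table would be the input to the test of independence.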