Chapter 10
Categorical Data Analysis
Inference for a Single Proportion ($\pi$)
• Goal: Estimate proportion of individuals in a population with a certain characteristic ($\pi$). This is equivalent to estimating a binomial probability
• Sample: Take an SRS of n individuals from the population and observe y that have the characteristic. The sample proportion is y/n and has the following sampling properties:
Sample proportion: $\hat{\pi} = y/n$
Mean and Std. Dev. of sampling distribution: $\mu_{\hat{\pi}} = \pi$, $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
Shape: approximately normal for large samples (Rule of thumb: $n\pi,\ n(1-\pi) \geq 5$)
Large-Sample Confidence Interval for $\pi$
• Take SRS of size n from population where $\pi$ is true (unknown) proportion of successes.
– Observe y successes
– Set confidence level $(1-\alpha)$ and obtain $z_{\alpha/2}$ from z-table
Point Estimate: $\hat{\pi} = y/n$
Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
Margin of error: $m = z_{\alpha/2}\, SE_{\hat{\pi}}$
C% confidence interval for $\pi$, with $C = (1-\alpha)100$: $\hat{\pi} \pm m$
Example - Ginkgo and Acetaz for AMS
• Study Goal: Measure effect of Ginkgo and Acetazolamide on occurrence of Acute Mountain Sickness (AMS) in Himalayan Trekkers
• Parameter: $\pi$ = True proportion of all trekkers receiving Ginkgo&Acetaz who would suffer from AMS.
• Sample Data: n=126 trekkers received G&A, y=18 suffered from AMS
$\hat{\pi} = 18/126 = .143$
$SE_{\hat{\pi}} = \sqrt{(.14)(.86)/126} = .031$
Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(.031) = .061$
95% CI for $\pi$: $.143 \pm .061 \equiv (.082, .204)$
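The large-sample interval above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `wald_interval` and its default `z=1.96` (95% confidence) are assumptions made for the example.

```python
import math

def wald_interval(y, n, z=1.96):
    """Large-sample confidence interval for a proportion (hypothetical helper)."""
    p_hat = y / n                              # point estimate y/n
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    m = z * se                                 # margin of error
    return p_hat, se, m, (p_hat - m, p_hat + m)

# Ginkgo and Acetazolamide data from the slide: n = 126, y = 18
p_hat, se, m, (lo, hi) = wald_interval(18, 126)
print(f"estimate={p_hat:.3f} SE={se:.3f} margin={m:.3f} CI=({lo:.3f}, {hi:.3f})")
```

With the slide's data this should agree with the hand calculation to three decimals.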
Wilson’s “Plus 4” Method
• For moderate to small sample sizes, large-sample methods may not work well wrt coverage probabilities
• Simple approach that works well in practice ($n \geq 10$):
– Pretend you have 4 extra individuals, 2 successes, 2 failures
– Compute the estimated sample proportion in light of new “data” as well as standard error:
Point Estimate: $\tilde{\pi} = (y+2)/(n+4)$
Estimated Standard Error: $SE_{\tilde{\pi}} = \sqrt{\tilde{\pi}(1-\tilde{\pi})/(n+4)}$
Margin of error: $m = z_{\alpha/2}\, SE_{\tilde{\pi}}$
$(1-\alpha)100\%$ confidence interval for $\pi$: $\tilde{\pi} \pm m$
Example: Lister’s Tests with Antiseptic
• Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870)
• n=12 patients received antiseptic, y=1 died
$\tilde{\pi} = (1+2)/(12+4) = 3/16 = .1875$
$SE_{\tilde{\pi}} = \sqrt{.1875(.8125)/16} = .0976$
Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(.0976) = .1913$
95% CI for $\pi$: $.1875 \pm .1913 \equiv (-.0038, .3788) \equiv (0, .38)$
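A short Python sketch of the "Plus 4" calculation follows. The helper name `plus_four_interval` is hypothetical, and clipping the limits to the range [0, 1] is an assumption that mirrors the slide's replacement of the negative lower limit by 0.

```python
import math

def plus_four_interval(y, n, z=1.96):
    """Plus-4 interval: add 2 successes and 2 failures (hypothetical helper)."""
    p_tilde = (y + 2) / (n + 4)                            # adjusted point estimate
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))      # adjusted standard error
    m = z * se                                             # margin of error
    # Assumption: a proportion cannot leave [0, 1], so clip the limits.
    return p_tilde, se, m, (max(0.0, p_tilde - m), min(1.0, p_tilde + m))

# Lister's data from the slide: n = 12 patients, y = 1 death
p_tilde, se, m, (lo, hi) = plus_four_interval(1, 12)
print(f"estimate={p_tilde:.4f} SE={se:.4f} margin={m:.4f} CI=({lo:.4f}, {hi:.4f})")
```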
Significance Test for a Proportion
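• Goal: test whether a proportion ($\pi$) equals some null value $\pi_0$
• $H_0: \pi = \pi_0$
Test Statistic: $z_{obs} = \dfrac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
$H_a: \pi > \pi_0$   RR: $z_{obs} \geq z_{\alpha}$   P-value: $P(Z \geq z_{obs})$
$H_a: \pi < \pi_0$   RR: $z_{obs} \leq -z_{\alpha}$   P-value: $P(Z \leq z_{obs})$
$H_a: \pi \neq \pi_0$   RR: $|z_{obs}| \geq z_{\alpha/2}$   P-value: $2P(Z \geq |z_{obs}|)$
Large-sample test works well when $n\pi_0$ and $n(1-\pi_0) \geq 5$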
Ginkgo and Acetaz for AMS
• Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?
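• $H_0: \pi = 0.25$   $H_a: \pi < 0.25$
$n = 126$, $y = 18$, $\hat{\pi} = 18/126 = 0.143$, $\pi_0 = 0.25$
Test Statistic: $z_{obs} = \dfrac{.143 - .25}{\sqrt{.25(.75)/126}} = \dfrac{-.107}{.039} = -2.75$
RR ($\alpha = .05$): $z_{obs} \leq -z_{.05} = -1.645$
P-value: $P(Z \leq -2.75) = .0030$
Strong evidence that incidence rate is below 25% ($\pi < 0.25$)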
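The test can be checked numerically with a small Python sketch. The names `normal_cdf` and `one_sample_z_test` are hypothetical. The code carries full precision, so it gives z of about -2.78 where the slide, working from rounded intermediate values, reports -2.75; the conclusion is the same.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sample_z_test(y, n, pi0, alternative="less"):
    """Large-sample z-test of H0: pi = pi0 (hypothetical helper)."""
    p_hat = y / n
    se0 = math.sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
    z = (p_hat - pi0) / se0
    if alternative == "less":
        p_value = normal_cdf(z)
    elif alternative == "greater":
        p_value = 1.0 - normal_cdf(z)
    else:                                   # two-sided
        p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

# Ginkgo and Acetazolamide data: is the AMS rate below 25%?
z, p_value = one_sample_z_test(18, 126, 0.25, alternative="less")
print(f"z_obs={z:.2f} p-value={p_value:.4f} reject_at_05={z <= -1.645}")
```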
Comparing Two Population Proportions
• Goal: Compare two populations/treatments wrt a nominal (binary) outcome
• Sampling Design: Independent vs Dependent Samples
• Methods based on large vs small samples
• Contingency tables used to summarize data
• Measures of Association: Absolute Risk, Relative Risk, Odds Ratio
Contingency Tables
• Tables representing all combinations of levels of explanatory and response variables
• Numbers in table represent Counts of the number of cases in each cell
• Row and column totals are called Marginal counts
2x2 Tables - Notation
Group            Outcome Present    Outcome Absent       Group Total
Group 1          y1                 n1-y1                n1
Group 2          y2                 n2-y2                n2
Outcome Total    y1+y2              (n1+n2)-(y1+y2)      n1+n2
Example - Firm Type/Product Quality
Group                    High Quality    Low Quality    Group Total
Not Integrated           33              55             88
Vertically Integrated    5               79             84
Outcome Total            38              134            172
• Groups: Not Integrated (Weave only) vs Vertically integrated (Spin and Weave) Cotton Textile Producers
• Outcomes: High Quality (High Count) vs Low Quality (Low Count)
Source: Temin (1988)
Notation
• Proportion in Population 1 with the characteristic of interest: $\pi_1$
• Sample size from Population 1: $n_1$
• Number of individuals in Sample 1 with the characteristic of interest: $y_1$
• Sample proportion from Sample 1 with the characteristic of interest: $\hat{\pi}_1 = y_1/n_1$
• Similar notation for Population/Sample 2
Example - Cotton Textile Producers
$\pi_1$ - True proportion of all non-integrated firms that would produce High quality
$\pi_2$ - True proportion of all vertically integrated firms that would produce High quality
$n_1 = 88$, $y_1 = 33$: $\hat{\pi}_1 = y_1/n_1 = 33/88 = 0.375$
$n_2 = 84$, $y_2 = 5$: $\hat{\pi}_2 = y_2/n_2 = 5/84 = 0.060$
Notation (Continued)
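• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions with the characteristic (2 other measures given below)
• Estimator: $D = \hat{\pi}_1 - \hat{\pi}_2$
• Standard Error (and its estimate):
$\sigma_D = \sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}$   $SE_D = \sqrt{\dfrac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \dfrac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$
• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$:
$SE_{PD} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$   where $\hat{\pi} = \dfrac{y_1 + y_2}{n_1 + n_2}$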
Cotton Textile Producers (Continued)
• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions that produce High quality output
• Estimator: $D = \hat{\pi}_1 - \hat{\pi}_2 = 0.375 - 0.060 = 0.315$
• Standard Error (and its estimate):
$SE_D = \sqrt{\dfrac{0.375(0.625)}{88} + \dfrac{0.060(0.94)}{84}} = \sqrt{.003335} = 0.0577$
• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$:
$\hat{\pi} = \dfrac{33 + 5}{88 + 84} = 0.221$   $SE_{PD} = \sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)} = 0.0633$
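The two standard errors can be computed side by side with the following Python sketch. The function name `diff_standard_errors` is an assumption for illustration; the formulas are the unpooled and pooled estimates given in the notation slide.

```python
import math

def diff_standard_errors(y1, n1, y2, n2):
    """Difference in sample proportions with unpooled and pooled SEs (hypothetical helper)."""
    p1, p2 = y1 / n1, y2 / n2
    d = p1 - p2                                                  # estimator D
    se_d = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # unpooled estimate
    p_pool = (y1 + y2) / (n1 + n2)                               # pooled proportion
    se_pd = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) # pooled estimate
    return d, se_d, p_pool, se_pd

# Cotton textile producers: 33 of 88 non-integrated, 5 of 84 integrated
d, se_d, p_pool, se_pd = diff_standard_errors(33, 88, 5, 84)
print(f"D={d:.3f} SE_D={se_d:.4f} pooled={p_pool:.3f} SE_PD={se_pd:.4f}")
```

The unpooled version is the one used for confidence intervals; the pooled version applies when the null hypothesis of equal proportions is assumed, as in the significance test.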
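Significance Tests for $\pi_1 - \pi_2$
• Deciding whether $\pi_1 = \pi_2$ can be done by interpreting “plausible values” of $\pi_1 - \pi_2$ from the confidence interval:
– If entire interval is positive, conclude $\pi_1 > \pi_2$ ($\pi_1 - \pi_2 > 0$)
– If entire interval is negative, conclude $\pi_1 < \pi_2$ ($\pi_1 - \pi_2 < 0$)
– If interval contains 0, do not conclude that $\pi_1 \neq \pi_2$
• Alternatively, we can conduct a significance test:
– $H_0: \pi_1 = \pi_2$   $H_a: \pi_1 \neq \pi_2$ (2-sided)   $H_a: \pi_1 > \pi_2$ (1-sided)
– Test Statistic: $z_{obs} = \dfrac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
– RR: $|z_{obs}| \geq z_{\alpha/2}$ (2-sided)   $z_{obs} \geq z_{\alpha}$ (1-sided)
– P-value: $2P(Z \geq |z_{obs}|)$ (2-sided)   $P(Z \geq z_{obs})$ (1-sided)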
Example - Cotton Textile Production
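$H_0: \pi_1 = \pi_2$ ($\pi_1 - \pi_2 = 0$)
$H_A: \pi_1 \neq \pi_2$ ($\pi_1 - \pi_2 \neq 0$)
TS: $z_{obs} = \dfrac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.375 - 0.060}{\sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)}} = \dfrac{0.315}{0.0633} = 4.98$
RR: $|z_{obs}| \geq z_{.025} = 1.96$
P-value: $2P(Z \geq 4.98) \approx 0$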
Again, there is strong evidence that non-integrated firms are more likely to produce high quality output than integrated firms
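As a cross-check on the test above, here is a hedged Python sketch of the pooled two-sample z-test. The function name `two_proportion_z_test` is hypothetical; the upper-tail normal probability uses `math.erfc` so that very small P-values are not lost to rounding.

```python
import math

def two_proportion_z_test(y1, n1, y2, n2):
    """Pooled z-test of H0: pi1 = pi2, two-sided (hypothetical helper)."""
    p1, p2 = y1 / n1, y2 / n2
    p_pool = (y1 + y2) / (n1 + n2)
    se_pd = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pd
    # Upper-tail normal probability P(Z >= |z|) is 0.5 * erfc(|z| / sqrt(2)).
    p_value = 2.0 * 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Cotton textile producers: 33 of 88 non-integrated vs 5 of 84 integrated
z, p_value = two_proportion_z_test(33, 88, 5, 84)
print(f"z_obs={z:.2f} two-sided p-value={p_value:.2e} reject_at_05={abs(z) >= 1.96}")
```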
Associations Between Categorical Variables
• Case where both explanatory (independent) variable and response (dependent) variable are qualitative
• Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)
Contingency Tables
• Cross-tabulations of frequency counts where the rows (typically) represent the levels of the explanatory variable and the columns represent the levels of the response variable.
• Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables
• Row and column totals are called the marginal distributions for the two variables
Example - Cyclones Near Antarctica
• Period of Study: September, 1973 - May, 1975
• Explanatory Variable: Region (40-49,50-59,60-79) (Degrees South Latitude)
• Response: Season (Aut(4),Wtr(5),Spr(4),Sum(8)) (Number of months in parentheses)
• Units: Cyclones in the study area
• Treating the observed cyclones as a “random sample” of all cyclones that could have occurred
Source: Howarth(1983), “An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent”, Annals of the Association of American Geographers, Vol.73,pp519-537
Example - Cyclones Near Antarctica
Region\Season    Autumn    Winter    Spring    Summer    Total
40-49S           370       452       273       422       1517
50-59S           526       624       513       1059      2722
60-79S           980       1200      995       1751      4926
Total            1876      2276      1781      3232      9165
For each region (row) we can compute the percentage of storms occurring during each season, which gives the conditional distribution. Of the 1517 cyclones in the 40-49S band, 370 occurred in Autumn, a proportion of 370/1517=.244, or 24.4% as a percentage.
Region\Season    Autumn    Winter    Spring    Summer    Total % (n)
40-49S           24.4      29.8      18.0      27.8      100.0 (1517)
50-59S           19.3      22.9      18.8      38.9      100.0 (2722)
60-79S           19.9      24.4      20.2      35.5      100.0 (4926)
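The conditional (row) distributions can be recomputed from the counts with a few lines of Python. This is an illustrative sketch; the names `counts` and `row_percentages` are assumptions, and the data are the cell counts from the table above.

```python
# Cyclone counts by region (rows) and season (columns), from the slide
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}

def row_percentages(row):
    """Divide each cell by its row total and multiply by 100."""
    total = sum(row)
    return [100.0 * cell / total for cell in row]

percents = {region: row_percentages(row) for region, row in counts.items()}
for region, pct in percents.items():
    cells = "  ".join(f"{s}={p:.1f}" for s, p in zip(seasons, pct))
    print(f"{region}: {cells}  (n={sum(counts[region])})")
```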
Example - Cyclones Near Antarctica
[Figure: Graphical Conditional Distributions for Regions. Bar chart of the percentage of cyclones (regpct, axis from 10 to 40) by season (Autumn, Winter, Spring, Summer), with separate bars for each region (40-49S, 50-59S, 60-79S).]
Guidelines for Contingency Tables
• Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged.
• Divide the cell counts by the row (explanatory category) total and multiply by 100 to obtain a percent; the row percents will add to 100
• Give title and clearly define variables and categories.
• Include row (explanatory) total sample sizes
Independence & Dependence
• Statistically Independent: Population conditional distributions of one variable are the same across all levels of the other variable
• Statistically Dependent: Conditional Distributions are not all equal
• When testing, researchers typically wish to demonstrate dependence (alternative hypothesis), and wish to refute independence (null hypothesis)
Pearson’s Chi-Square Test
• Can be used for nominal or ordinal explanatory and response variables
• Variables can have any number of distinct levels
• Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables)
• r = # of levels of explanatory variable
• c = # of levels of response variable
Pearson’s Chi-Square Test
• Intuition behind test statistic
– Obtain marginal distribution of outcomes for the response variable
– Apply this common distribution to all levels of the explanatory variable, by multiplying each proportion by the corresponding sample size
– Measure the difference between actual cell counts and the expected cell counts in the previous step
Pearson’s Chi-Square Test
• Notation to obtain test statistic
– Rows represent explanatory variable (r levels)
– Cols represent response variable (c levels)
         1      2      …     c      Total
1        n11    n12    …     n1c    n1.
2        n21    n22    …     n2c    n2.
…        …      …      …     …      …
r        nr1    nr2    …     nrc    nr.
Total    n.1    n.2    …     n.c    n..
Pearson’s Chi-Square Test
• Observed frequency (nij): The number of individuals falling in a particular cell
• Expected frequency (Eij): The number we would expect in that cell, given the sample sizes observed in study and the assumption of independence.
– Computed by multiplying the row total and the column total, and dividing by the overall sample size: $E_{ij} = n_{i.}\, n_{.j} / n_{..}$
– Applies the overall marginal probability of the response category to the sample size of explanatory category
Pearson’s Chi-Square Test
• Large-sample test (all Eij > 5)
• H0: Variables are statistically independent (No association between variables)
• Ha: Variables are statistically dependent (Association exists between variables)
• Test Statistic: $\chi^2_{obs} = \sum_{i,j} \dfrac{(n_{ij} - E_{ij})^2}{E_{ij}}$
• P-value: Area above $\chi^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8)
Example - Cyclones Near Antarctica
Observed Cell Counts (nij):
Region\Season    Autumn    Winter    Spring    Summer    Total
40-49S           370       452       273       422       1517
50-59S           526       624       513       1059      2722
60-79S           980       1200      995       1751      4926
Total            1876      2276      1781      3232      9165
Note that overall: (1876/9165)100%=20.5% of all cyclones occurred in Autumn. If we apply that proportion (0.2047 before rounding) to the 1517 that occurred in the 40-49S band, we would expect (0.2047)(1517)=310.5 to have occurred in the first cell of the table. The full table of Eij:
Region\Season    Autumn    Winter    Spring    Summer    Total
40-49S           310.5     376.7     294.8     535.0     1517
50-59S           557.2     676.0     529.0     959.9     2722
60-79S           1008.3    1223.3    957.3     1737.1    4926
Total            1876      2276      1781      3232      9165
Example - Cyclones Near Antarctica
Computation of $\chi^2_{obs}$:
Region    Season    n_ij    E_ij      (n-E)^2     ((n-E)^2)/E
40-49S    Autumn    370     310.5     3540.25     11.402
40-49S    Winter    452     376.7     5670.09     15.052
40-49S    Spring    273     294.8     475.24      1.612
40-49S    Summer    422     535.0     12769.00    23.867
50-59S    Autumn    526     557.2     973.44      1.747
50-59S    Winter    624     676.0     2704.00     4.000
50-59S    Spring    513     529.0     256.00      0.484
50-59S    Summer    1059    959.9     9820.81     10.231
60-79S    Autumn    980     1008.3    800.89      0.794
60-79S    Winter    1200    1223.3    542.89      0.444
60-79S    Spring    995     957.3     1421.29     1.485
60-79S    Summer    1751    1737.1    193.21      0.111
Total                                             71.229
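The expected counts and the test statistic can be reproduced with a short pure-Python sketch. The function names are hypothetical. Because the code uses unrounded expected counts, its statistic should land near 71.19 (the value in the SPSS output) instead of the 71.23 obtained from the rounded table. The upper-tail helper relies on the closed form $P(\chi^2_{2m} > x) = e^{-x/2}\sum_{k=0}^{m-1}(x/2)^k/k!$, which holds only for even degrees of freedom.

```python
import math

observed = [
    [370, 452, 273, 422],     # 40-49S
    [526, 624, 513, 1059],    # 50-59S
    [980, 1200, 995, 1751],   # 60-79S
]

def expected_counts(table):
    """E_ij = (row total)(column total) / (overall total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Return the Pearson statistic, its degrees of freedom, and the E_ij table."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

def chi2_upper_tail_even_df(x, df):
    """P(chi-square_df > x) for even df only (closed-form series)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form shown here requires positive even df")
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

stat, df, expected = pearson_chi_square(observed)
p_value = chi2_upper_tail_even_df(stat, df)
print(f"chi-square={stat:.3f} df={df} p-value={p_value:.2e}")
```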
Example - Cyclones Near Antarctica
• H0: Seasonal distribution of cyclone occurrences is independent of latitude band
• Ha: Seasonal distributions of cyclone occurrences differ among latitude bands
• Test Statistic: $\chi^2_{obs} = 71.2$
• RR: $\chi^2_{obs} \geq \chi^2_{.05,6} = 12.59$
• P-value: Area in chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2
From Table 8, $P(\chi^2 \geq 22.46) = .001 \Rightarrow P < .001$
SPSS Output - Cyclone Example
REGION * SEASON Crosstabulation
REGION                        Autumn    Winter    Spring    Summer    Total
40-49S    Count               370       452       273       422       1517
          Expected Count      310.5     376.7     294.8     535.0     1517.0
          % within REGION     24.4%     29.8%     18.0%     27.8%     100.0%
50-59S    Count               526       624       513       1059      2722
          Expected Count      557.2     676.0     529.0     959.9     2722.0
          % within REGION     19.3%     22.9%     18.8%     38.9%     100.0%
60-79S    Count               980       1200      995       1751      4926
          Expected Count      1008.3    1223.3    957.3     1737.1    4926.0
          % within REGION     19.9%     24.4%     20.2%     35.5%     100.0%
Total     Count               1876      2276      1781      3232      9165
          Expected Count      1876.0    2276.0    1781.0    3232.0    9165.0
          % within REGION     20.5%     24.8%     19.4%     35.3%     100.0%
Chi-Square Tests
                                Value        df    Asymp. Sig. (2-sided)
Pearson Chi-Square              71.189 (a)   6     .000
Likelihood Ratio                71.337       6     .000
Linear-by-Linear Association    23.418       1     .000
N of Valid Cases                9165
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 294.79.
The Asymp. Sig. (2-sided) column is the P-value.
Misuses of chi-squared Test
• Expected frequencies too small (all expected counts should be above 5, not necessary for the observed counts)
• Dependent samples (the same individuals are in each row, see McNemar’s test)
• Can be used for nominal or ordinal variables, but more powerful methods exist for when both variables are ordinal and a directional association is hypothesized
Measures of Association
• Absolute Risk (AR): $\pi_1 - \pi_2$
• Relative Risk (RR): $\pi_1 / \pi_2$
• Odds Ratio (OR): $o_1 / o_2$   ($o = \pi/(1-\pi)$)
• Note that if $\pi_1 = \pi_2$ (No association between outcome and grouping variables):
– AR=0
– RR=1
– OR=1
Relative Risk
• Ratio of the probability that the outcome characteristic is present for one group, relative to the other
• Sample proportions with characteristic from groups 1 and 2:
$\hat{\pi}_1 = \dfrac{y_1}{n_1}$   $\hat{\pi}_2 = \dfrac{y_2}{n_2}$
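Relative Risk
• Estimated Relative Risk: $RR = \dfrac{\hat{\pi}_1}{\hat{\pi}_2}$
95% Confidence Interval for Population Relative Risk:
$\left( RR\, e^{-1.96\sqrt{v}},\ RR\, e^{1.96\sqrt{v}} \right)$   where $e = 2.71828$ and $v = \dfrac{1-\hat{\pi}_1}{y_1} + \dfrac{1-\hat{\pi}_2}{y_2}$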
Relative Risk
• Interpretation
– Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
– Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
– Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
Example - Concussions in NCAA Athletes
• Units: Game exposures among college soccer players 1997-1999
• Outcome: Presence/Absence of a Concussion
• Group Variable: Gender (Female vs Male)
• Contingency Table of case outcomes:
Gender    Concussion    No Concussion    Total
Female    158           74924            75082
Male      101           75633            75734
Total     259           150557           150816
Source: Covassin, et al (2003)
Example - Concussions in NCAA Athletes
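Among Females: $\hat{\pi}_F = 158/75082 = 0.0021$ (2.1 concussions per 1000 female player/games)
Among Males: $\hat{\pi}_M = 101/75734 = 0.0013$ (1.3 concussions per 1000 male player/games)
$RR_{(F/M)} = \dfrac{\hat{\pi}_F}{\hat{\pi}_M} = \dfrac{.0021}{.0013} = 1.62$
$v = \dfrac{1 - .0021}{158} + \dfrac{1 - .0013}{101} = .0162$   $\sqrt{v} = .1273$
95% CI for Population Relative Risk:
$\left( 1.62\, e^{-1.96(.1273)},\ 1.62\, e^{1.96(.1273)} \right) \equiv (1.26, 2.08)$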
There is strong evidence that females have a higher risk of concussion
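The relative risk interval can be sketched in Python as below. The function name `relative_risk_ci` is an assumption. The code keeps full precision in the two sample proportions, so it yields a relative risk near 1.58 and a slightly different interval than a hand calculation that first rounds the proportions to .0021 and .0013; in both cases the interval lies entirely above 1.

```python
import math

def relative_risk_ci(y1, n1, y2, n2, z=1.96):
    """Estimated relative risk and its large-sample CI (hypothetical helper)."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    v = (1 - p1) / y1 + (1 - p2) / y2       # estimated variance of log(RR)
    half_width = z * math.sqrt(v)
    return rr, v, (rr * math.exp(-half_width), rr * math.exp(half_width))

# Concussions in NCAA soccer: females (group 1) vs males (group 2)
rr, v, (lo, hi) = relative_risk_ci(158, 75082, 101, 75734)
print(f"RR={rr:.2f} sqrt(v)={math.sqrt(v):.4f} 95% CI=({lo:.2f}, {hi:.2f})")
```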
Odds Ratio
• Odds of an event is the probability it occurs divided by the probability it does not occur
• Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2
• Sample odds of the outcome for each group:
$odds_1 = \dfrac{y_1/n_1}{(n_1 - y_1)/n_1} = \dfrac{y_1}{n_1 - y_1}$   $odds_2 = \dfrac{y_2}{n_2 - y_2}$
Odds Ratio
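• Estimated Odds Ratio:
$OR = \dfrac{odds_1}{odds_2} = \dfrac{y_1/(n_1 - y_1)}{y_2/(n_2 - y_2)} = \dfrac{y_1(n_2 - y_2)}{y_2(n_1 - y_1)}$
95% Confidence Interval for Population Odds Ratio:
$\left( OR\, e^{-1.96\sqrt{v}},\ OR\, e^{1.96\sqrt{v}} \right)$   where $e = 2.71828$ and $v = \dfrac{1}{y_1} + \dfrac{1}{n_1 - y_1} + \dfrac{1}{y_2} + \dfrac{1}{n_2 - y_2}$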
Odds Ratio
• Interpretation
– Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
– Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
– Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
Osteoarthritis in Former Soccer Players
• Units: 68 Former British professional football players and 136 age/sex matched controls
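• Outcome: Presence/Absence of Osteoarthritis (OA)
• Data:
• Of n1 = 68 former professionals, y1 = 9 had OA, n1-y1 = 59 did not
• Of n2 = 136 controls, y2 = 2 had OA, n2-y2 = 134 did not
$odds_1 = \dfrac{y_1}{n_1 - y_1} = \dfrac{9}{59} = .1525$   $odds_2 = \dfrac{2}{134} = .0149$
$OR = \dfrac{odds_1}{odds_2} = \dfrac{.1525}{.0149} = 10.23$
$v = \dfrac{1}{9} + \dfrac{1}{59} + \dfrac{1}{2} + \dfrac{1}{134} = .6355$   $\sqrt{v} = .797$
95% CI for Population Odds Ratio:
$\left( 10.23\, e^{-1.96(.797)},\ 10.23\, e^{1.96(.797)} \right) \equiv (2.14, 48.80)$
The entire interval is above 1.
Source: Shepard, et al (2003)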
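The odds ratio interval follows the same pattern and can be sketched in Python. The function name `odds_ratio_ci` is hypothetical. Working at full precision gives an odds ratio of about 10.22 and limits that match the hand calculation to within rounding.

```python
import math

def odds_ratio_ci(y1, n1, y2, n2, z=1.96):
    """Estimated odds ratio and its large-sample CI (hypothetical helper)."""
    odds1 = y1 / (n1 - y1)
    odds2 = y2 / (n2 - y2)
    or_hat = odds1 / odds2
    # Estimated variance of log(OR): sum of reciprocals of the four cell counts
    v = 1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2)
    half_width = z * math.sqrt(v)
    return or_hat, v, (or_hat * math.exp(-half_width), or_hat * math.exp(half_width))

# Osteoarthritis: 9 of 68 former players vs 2 of 136 controls
or_hat, v, (lo, hi) = odds_ratio_ci(9, 68, 2, 136)
print(f"OR={or_hat:.2f} sqrt(v)={math.sqrt(v):.3f} 95% CI=({lo:.2f}, {hi:.2f})")
```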