1.1
Analysis of categorical response data
Topics covered in lecture 1:
• What is categorical data
  - Response and explanatory variables
  - Measurement scales for categorical data
• Course coverage
• Tabulated count data and related questions
• Non-tabulated categorical data
• Sampling design for tables
• Links with other methods
1.2
What is categorical data? The measurement scale for
the response consists of a number of categories.
Variable       Measurement scale
Farm system    Dairy, Beef, Tillage etc.
Mortality      Dead, Alive
Food texture   Very soft, Soft, Hard, Very hard
Litter size    0, 1, 2, 3 and >3
Types of data discussed in this course
Response variable(s) is categorical
Explanatory variable(s) may be categorical or
continuous
Example 1: Does post-operative survival (categorical response) depend on the explanatory variables?
  Sex (categorical)
  Age (continuous)
Example 2: In a random sample of Irish farmers, is there a relationship between attitude to the EU and farm system?
  Farm system (categorical)
  Attitude to EU (categorical/ordinal)
(Two response variables - no explanatory variables.) Could one of these be regarded as explanatory?
1.3
Measurement scales for categorical data
Nominal - no underlying order
Variable            Measurement scale
Farm system         Dairy, Beef, Tillage etc.
Weed species        Stellaria media, Poa annua, etc.
Ordinal - underlying order in the scale
Variable            Measurement scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary
Interval - underlying numerical distance between scale
points
Variable            Measurement scale
Litter size         0, 1, 2, 3 and >3
Age class           <1, 1-2, 2-3.5, 3.5-5, >5
Education           Years in education
1.4
Tabulated count data and questions
Single level table
Example 1: A geneticist carries out a crossing
experiment between F1 hybrids of a wild type and a
mutant genotype and obtains an F2 progeny of 90
offspring with the following characteristics.
Wild type   Mutant   Total
   80         10       90
Is there evidence that the wild type is dominant, giving on
average a 3:1 ratio of offspring phenotypes in its favour?
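One way to examine the 3:1 question is a Pearson goodness-of-fit comparison of the observed counts with the counts expected under a 3:1 ratio. The sketch below is an illustration only: the choice of test and the function names are assumptions, not something stated on the slide. The p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal.

```python
import math

def chi_square_gof(observed, expected_ratio):
    """Pearson goodness-of-fit statistic for counts against a hypothesised ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

observed = [80, 10]  # wild type, mutant (counts from the table above)
statistic, expected = chi_square_gof(observed, [3, 1])
p_value = p_value_df1(statistic)

print("expected counts:", expected)
print("chi-square = %.3f on 1 df, p = %.4f" % (statistic, p_value))
```

With 90 offspring the 3:1 hypothesis expects 67.5 wild type and 22.5 mutant, so the observed 80:10 split can be judged against those values.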
Two-way table
Example 1 - A sample of 124 mice was divided into two
groups, 84 receiving a standard dose of pathogenic
bacteria followed by an antiserum and a control group
of 40 not receiving the antiserum. After 3 weeks the
numbers dead and alive in each group were counted.
                     Outcome
               Dead   Alive   Total   % dead
+ antiserum     19      65      84      23
- antiserum     18      22      40      45
Total           37      87     124
Is there an association between mortality and treatment?
Is the mortality rate the same for both treatments?
1.5
Example 2 - Categorical response and categorical
explanatory variable: The opinion poll after the Good
Friday Agreement with respondents classified by
religion (R - Catholic or Protestant)
             Favour   Oppose   Undec.   Total   % Favour
Catholic       258       32       62     352       73
Protestant     149       91      208     448       33
Total          407      123      270     800       51
% Cath          63       26       23
1. Evidence that a majority of decided voters (all
voters) support the agreement?
2. Support pattern the same for Protestants and
Catholics?
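Question 1 can be approached, for instance, with a normal-approximation test of whether the proportion in favour exceeds one half, once among decided voters and once among all voters. This is a sketch under that assumption; the slide does not name a test, and the function name is hypothetical.

```python
import math

def z_test_majority(successes, n):
    """Normal-approximation z statistic for H0: true proportion = 0.5."""
    p_hat = successes / n
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    # One-sided p-value for the alternative "proportion > 0.5".
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return p_hat, z, p_one_sided

favour, oppose, undecided = 407, 123, 270  # totals row of the table above

decided = z_test_majority(favour, favour + oppose)
all_voters = z_test_majority(favour, favour + oppose + undecided)

print("decided voters: p_hat = %.3f, z = %.2f" % decided[:2])
print("all voters:     p_hat = %.3f, z = %.2f" % all_voters[:2])
```

The two denominators give very different answers: 407 of 530 decided voters is a clear majority, while 407 of 800 respondents is barely above one half.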
1.6
Example 3 (Snedecor and Cochran): Categorical
response and interval categorical explanatory variable.
The table below shows the number of aphids alive and
dead after spraying with four concentrations of solutions
of sodium oleate. Has the higher concentration given a
significantly different percentage kill? Is there a
relationship between concentration and mortality?
          Concentration of sodium oleate (%)
          0.65    1.10    1.60    2.10    Total
Dead       55      62     100      72      289
Alive      22      13      12       5       52
Total      77      75     112      77      341
% Dead    71.4    82.7    89.3    93.5    84.8
Is mortality related to sodium oleate concentration?
1.7
Example 4 Categorical response and interval
categorical explanatory variable (Cornfield 1962):
Blood pressure (BP) was measured on a sample of
males aged 40-59, who were also classified by
whether they developed coronary heart disease (CHD)
in a 6-year follow-up period. The data were classified
by BP (interval categorical variable in 8 classes) and
CHD (CHD or No-CHD).
BP          CHD   No CHD   Total   % CHD
<117          3     153     156     1.9
117 - 126    17     235     252     6.7
127 - 136    12     272     284     4.2
137 - 146    16     255     271     5.9
147 - 156    12     127     139     8.6
157 - 166     8      77      85     9.4
167 - 186    16      83      99    16.2
>186          8      35      43    18.6
Total        92    1237    1329
1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability
of CHD and the level of BP?
1.8
Multiway table - relationship between categorical
responses or categorical response and several
categorical explanatory variables:
Example 1: The NI opinion poll with respondents further classified by where they lived in Northern Ireland (L) (ARL table)
West - rural and strong nationalist/Catholic
Belfast - mixed population
North East - industrial and Unionist/Protestant
Location     Religion     Favour   Oppose   Undecided
West         Catholic        73       20        20
             Protestant      47       34        69
Belfast      Catholic        90        9        21
             Protestant      54       23        66
North East   Catholic        95        3        21
             Protestant      48       34        73
Total                       407      123       270
1. Evidence that a majority of decided voters (all voters)
support the agreement?
2. Difference in support pattern between Protestants
and Catholics?
3. Difference in support pattern between Protestants
and Catholics consistent over region?
4. Within the Catholic (Protestant) population does the
strength of support change with region? etc.
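The three-way table contains the earlier two-way table as a margin: summing over location recovers the religion-by-attitude counts. The sketch below shows this collapsing step and the percentage in favour within each location and religion; the dictionary layout and function names are illustrative assumptions.

```python
from collections import defaultdict

# Counts copied from the three-way table above.
# Key: (location, religion); value: (favour, oppose, undecided).
arl = {
    ("West", "Catholic"): (73, 20, 20),
    ("West", "Protestant"): (47, 34, 69),
    ("Belfast", "Catholic"): (90, 9, 21),
    ("Belfast", "Protestant"): (54, 23, 66),
    ("North East", "Catholic"): (95, 3, 21),
    ("North East", "Protestant"): (48, 34, 73),
}

def collapse_over_location(table):
    """Sum the counts over location, leaving a religion-by-attitude table."""
    totals = defaultdict(lambda: [0, 0, 0])
    for (_location, religion), counts in table.items():
        for i, count in enumerate(counts):
            totals[religion][i] += count
    return {religion: tuple(counts) for religion, counts in totals.items()}

def percent_favour(counts):
    """Percentage in favour among all respondents in one group."""
    favour, oppose, undecided = counts
    return 100.0 * favour / (favour + oppose + undecided)

by_religion = collapse_over_location(arl)
print(by_religion)
for (location, religion), counts in arl.items():
    print(location, religion, round(percent_favour(counts), 1))
```

Questions 3 and 4 are about what the collapsed table hides, namely whether the within-location percentages differ from one location to another.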
1.9
Example 2: Grouped binomial data - patterns of
psychotropic drug consumption in a sample from West
London (Murray et al 1981, Psy Med 11, 551-60).
Sex   Age group   Psych. case   On drugs   Total
M         1           No            9       531
M         2           No           16       500
M         3           No           38       644
M         4           No           26       275
M         5           No            9        90
M         1           Yes          12       171
M         2           Yes          16       125
M         3           Yes          31       121
M         4           Yes          16        56
M         5           Yes          10        26
F         1           No           12       588
F         2           No           42       596
F         3           No           96       765
F         4           No           52       327
F         5           No           30       179
F         1           Yes          33       210
F         2           Yes          47       189
F         3           Yes          71       242
F         4           Yes          45        98
F         5           Yes          21        60
Is psychotropic drug use affected by gender, age or
psychological state and are there interactions among
these effects?
1.10
Non-tabulated data and questions
Example 1: Individual plants of Legousia were monitored
in an experiment to see whether they survived after
3 months. Survived - yes is scored 1 and Survived - no
is scored 0. Also recorded were:
CO2 treatment – 2 levels, low and high
Density of Legousia
Density of companion species
Height of the plant (mm) two weeks after planting.
Most individuals will have a unique profile in these
additional variables and so tabulation of the data
by them is not feasible. The individual data are
presented.
                          Density
Subject   Surv   CO2   Ht   Leg.   Comp
1          0      L    35    20     30
2          1      L    68    22     27
3          1      H    43    16     33
4          0      L    27     4     16
…          …      …    …     …      …
1. Is survival related to the explanatory variables
(CO2, height, density of Legousia, density of companions)?
2. Can the probability of survival be predicted from the
subject’s profile?
1.11
Example 2: A sample of 62 patients who had
angioplasty for coronary artery disease were
followed to see if they reblocked (restenosed) after 6
months. RS - yes is scored 1 and RS - no is scored 0 (a
binary categorical response variable). Also
recorded were:
Age in years - ‘continuous’ variate
Blood pressure (BP) - continuous variate
Sex - nominal categorical (?)
Cholesterol - continuous
Most individuals will have a unique profile in these four
additional variables and so tabulation of the data by
them is not feasible. The individual data are presented.
Subject   RS   Age   BP    Sex   Cholest.
1          0    35   117    m       1
2          1    68   154    f       5
3          1    43   123    f       2
4          0    27   110    m       3
…          …    …     …     …       …
1. Is RS related to the explanatory variables (Age, BP,
Sex and Cholesterol)?
2. Can the probability of RS be predicted from the
subject’s profile?
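The remark that tabulation is not feasible can be seen by counting how many subjects share each profile. The sketch below uses only the four subjects listed above (the remaining rows are elided in the source), and the tuple layout is an assumption made for illustration.

```python
from collections import Counter

# The four subjects listed above: (subject, RS, age, BP, sex, cholesterol).
subjects = [
    (1, 0, 35, 117, "m", 1),
    (2, 1, 68, 154, "f", 5),
    (3, 1, 43, 123, "f", 2),
    (4, 0, 27, 110, "m", 3),
]

# A profile is the combination of explanatory values (age, BP, sex, cholesterol).
profile_counts = Counter(row[2:] for row in subjects)
largest_cell = max(profile_counts.values())

print("distinct profiles:", len(profile_counts))
print("largest number of subjects sharing a profile:", largest_cell)
```

When every profile holds a single subject, a cross-tabulation has as many cells as subjects, each with a count of 0 or 1, which is why the data are kept at the individual level.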
1.12
Sampling designs - two and multiway tables
Single sample (no margin fixed) simultaneously
classified by several categorical variables. Used in
Cross-sectional studies.
Example: A simple random sample of 200 students
was classified by gender and attitude to EU
integration.
            EU integration
          Favour   Oppose   Total
Male        43       53       96
Female      61       33      104
Total      104       86
This is a snapshot of opinion at a moment in time -
hence Cross-sectional.
1.13
One margin fixed: Samples of fixed size are selected
for one category and individuals are classified by the
other category(s).
Example 1 (Clinical trial - a prospective study): Of
400 HIV positive pregnant women 200 are assigned at
random to each of Breast feeding (BF) or Formula
feeding (FF). Two years after birth the child’s HIV
status is determined.
       Child’s status (???)
       HIV +    HIV -    Total
BF       62      138      200
FF       45      155      200
Example 2 (Cohort study - a prospective study): 400
HIV positive pregnant women are asked to select
either Breast feeding (BF) or Formula feeding (FF).
Two years after birth the child’s HIV status is
determined. Here the sample totals are determined by
the mothers’ choices.
Example 3 (Case-control or retrospective study): A
sample of 200 HIV+ and another of 200 HIV- two year
old children are selected and classified by whether
they were BF or FF. Here the numbers in each HIV outcome
group are fixed by the design, so the % HIV+ among BF and
among FF children cannot be computed.
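In Example 1 the group sizes (200 BF, 200 FF) are fixed and the outcome is observed, so the proportion HIV+ in each feeding group can be computed directly from the table. The sketch below does this for the clinical trial counts; the function and variable names are hypothetical.

```python
def risk_by_group(counts):
    """Proportion HIV+ in each feeding group, from (HIV+, HIV-) counts."""
    return {group: pos / (pos + neg) for group, (pos, neg) in counts.items()}

# Clinical trial counts from Example 1: (HIV+, HIV-) for each feeding group.
trial = {"BF": (62, 138), "FF": (45, 155)}

risks = risk_by_group(trial)
risk_difference = risks["BF"] - risks["FF"]

for group, risk in risks.items():
    print("%s: %.1f%% HIV+" % (group, 100.0 * risk))
print("difference (BF - FF): %.1f percentage points" % (100.0 * risk_difference))
```

The same arithmetic applied to a case-control table would not estimate these proportions, because there the split between HIV+ and HIV- children is set by the investigator rather than observed.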
1.14
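Timing of the study designs (diagram):
Past ← Present → Future
Cohort: followed forward in time (→)
Cases and controls: traced backward in time (←)
Cross-sectional: observed at the present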
1.15
Notes on sampling designs
• In more complex studies more than one margin may
be fixed.
Example 1: Any replicated factorial experiment
where the response is binary
Example 2: Physicians' Health Study, NEJM 1988,
262-264. Four treatments:
Treatment   Aspirin   Beta carotene
A           No        No
B           Yes       No
C           No        Yes
D           Yes       Yes
Example 3: 2x2 table with both margins fixed?
• The statistical properties differ considerably between
sampling schemes, nevertheless the methods to be
discussed below apply, with some modifications, to
data collected using any of these sampling schemes.
1.16
Relationships with regression methods.
Traditionally categorical data analysis has been viewed
as completely distinct from and unconnected with
regression and ANOVA methods. We show that there
are many strong links and that many concepts transfer
naturally between the methods.
1.17
SAS Analysis of example 1
A sample of 124 mice was divided into two groups, 84
receiving a standard dose of pathogenic bacteria
followed by an antiserum and a control group of 40 not
receiving the antiserum. After 3 weeks the numbers
dead and alive in each group were counted.
                     Outcome
               Dead   Alive   Total   % dead
+ antiserum     19      65      84      23
- antiserum     18      22      40      45
Total           37      87     124
Is there an association between mortality and treatment?
Is the mortality rate the same for both treatments?
1.18
SAS program for analysis of example 1 data
OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* One record per cell of the 2x2 table: treatment, outcome, cell count */
DATA ANTISER;
  INPUT ANTISER $ MORTALI $ COUNT;
  CARDS;
A_plus  Dead  19
A_plus  Alive 65
A_minus Dead  18
A_minus Alive 22
;

/* WEIGHT COUNT makes each record stand for COUNT animals */
PROC FREQ;
  TABLES ANTISER*MORTALI / CHISQ EXPECTED DEVIATION CELLCHI2
         NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;
1.19
Table of ANTISER by MORTALI

ANTISER         MORTALI
Frequency      |
Expected       |
Deviation      |
Cell Chi-Square| Alive  | Dead   | Total
---------------+--------+--------+
A_minus        |     22 |     18 |    40
               | 28.065 | 11.935 |
               | -6.065 | 6.0645 |
               | 1.3105 | 3.0814 |
---------------+--------+--------+
A_plus         |     65 |     19 |    84
               | 58.935 | 25.065 |
               | 6.0645 | -6.065 |
               |  0.624 | 1.4673 |
---------------+--------+--------+
Total                87       37     124

Statistics for Table of ANTISER by MORTALI

Statistic                      DF     Value     Prob
Chi-Square                      1    6.4833   0.0109
Likelihood Ratio Chi-Square     1    6.2846   0.0122

Sample Size = 124
1.20
Observed counts
                     Outcome
               Dead   Alive   Total   % Dead
+ antiserum     19      65      84      23
- antiserum     18      22      40      45
Total           37      87     124      30
Expected counts if outcome is independent of treatment
(0.3 and 0.7 are the overall proportions dead and alive, 37/124 and 87/124, rounded)
                          Outcome
               Dead             Alive            Total   % Dead
+ antiserum    0.3*84 = 25.2    0.7*84 = 58.8      84      23
- antiserum    0.3*40 = 12.0    0.7*40 = 28.0      40      45
Total          37               87                124      30
Is there a discrepancy between observed and expected?
$$\chi^2 = \sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
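The comparison of observed and expected counts can be carried out by hand. The sketch below computes the expected counts from the unrounded margins (row total times column total over the grand total) and sums the cell contributions; it should agree with the Chi-Square line of the SAS output for these data. The function name is an assumption, and the p-value formula applies to 1 degree of freedom only.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Rows: + antiserum, - antiserum; columns: dead, alive.
mice = [[19, 65], [18, 22]]
statistic, df, expected = pearson_chi_square(mice)
p_value = math.erfc(math.sqrt(statistic / 2.0))  # valid for df = 1 only

print("expected counts:", [[round(e, 3) for e in row] for row in expected])
print("chi-square = %.4f on %d df, p = %.4f" % (statistic, df, p_value))
```

Using 37/124 rather than the rounded 0.3 gives expected counts of about 25.065 and 58.935 for the antiserum group, which is why they differ slightly from the 25.2 and 58.8 shown in the hand-worked table.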
1.21
SAS Analysis of example 3
The table below shows the number of aphids alive and
dead after spraying with four concentrations of solutions
of sodium oleate. Has the higher concentration given a
significantly different percentage kill? Is there a
relationship between concentration and mortality?
          Concentration of sodium oleate (%)
          0.65    1.10    1.60    2.10    Total
Dead       55      62     100      72      289
Alive      22      13      12       5       52
Total      77      75     112      77      341
Is mortality independent of sodium oleate concentration?
1.22
SAS program for analysis of Insecticide data
OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* One record per cell: concentration, outcome (1 = dead, 2 = alive), count */
DATA INSECT;
  INPUT SODOL D_AL COUNT;
  CARDS;
0.65 1  55
1.10 1  62
1.6  1 100
2.1  1  72
0.65 2  22
1.10 2  13
1.6  2  12
2.1  2   5
;

/* WEIGHT COUNT makes each record stand for COUNT insects */
PROC FREQ;
  TABLES D_AL*SODOL / CHISQ EXPECTED DEVIATION CELLCHI2
         NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;
1.23
Output from SAS PROC FREQ.

TABLE OF D_AL BY SODOL

D_AL      SODOL
FREQUENCY|
EXPECTED |
DEVIATION|
CELL CHI2|    0.65|     1.1|     1.6|     2.1|  TOTAL
---------+--------+--------+--------+--------+
1        |     55 |     62 |    100 |     72 |    289
         |   65.3 |   63.6 |   94.9 |   65.3 |
         |  -10.3 |   -1.6 |    5.1 |    6.7 |
         |1.61249 |.038436 |.271785 |.696522 |
---------+--------+--------+--------+--------+
2        |     22 |     13 |     12 |      5 |     52
         |   11.7 |   11.4 |   17.1 |   11.7 |
         |   10.3 |    1.6 |   -5.1 |   -6.7 |
         |8.96172 |.213617 | 1.5105 |3.87106 |
---------+--------+--------+--------+--------+
TOTAL          77       75      112       77     341
1.24
STATISTICS FOR TABLE OF D_AL BY SODOL

STATISTIC                      DF     VALUE     PROB
------------------------------------------------------
CHI-SQUARE                      3    17.176    0.001
LIKELIHOOD RATIO CHI-SQUARE     3    16.633    0.001
MANTEL-HAENSZEL CHI-SQUARE      1    16.157    0.000
PHI                                   0.224
CONTINGENCY COEFFICIENT               0.219
CRAMER'S V                            0.224
Conclusion: Insect mortality is not independent of dose. Mortality is not constant
as dose changes.
          Sodium oleate (%)
          0.65    1.10    1.60    2.10    Total
Dead       55      62     100      72      289
Alive      22      13      12       5       52
Total      77      75     112      77      341
% Dead    71.4    82.7    89.3    93.5    84.8
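The Pearson chi-square reported by SAS for the aphid table can be reproduced from the counts alone. This is a sketch rather than a replacement for PROC FREQ: the helper names are assumptions, and the tail-probability formula used here is the closed form for 3 degrees of freedom only.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for row, r in zip(table, row_totals):
        for observed, c in zip(row, col_totals):
            expected = r * c / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

def p_value_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Rows: dead, alive; columns: 0.65, 1.10, 1.6, 2.1 % sodium oleate.
aphids = [[55, 62, 100, 72], [22, 13, 12, 5]]
statistic, df = pearson_chi_square(aphids)
p_value = p_value_df3(statistic)

print("chi-square = %.3f on %d df, p = %.4f" % (statistic, df, p_value))
```

The statistic is the sum of the eight CELL CHI2 entries in the SAS table, with the largest contribution coming from the survivors at the lowest concentration.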
1.25
Group two lowest and two highest levels
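The grouping suggested above can be sketched as follows: the two lowest concentrations (0.65 and 1.10) are pooled, as are the two highest (1.6 and 2.1), and the resulting 2x2 table is examined with a Pearson chi-square on 1 degree of freedom. The variable names are assumptions, and the choice of a Pearson test for the collapsed table is an illustration rather than something stated on the slide.

```python
import math

# Columns: 0.65, 1.10, 1.6, 2.1 % sodium oleate (counts from the aphid table).
dead = [55, 62, 100, 72]
alive = [22, 13, 12, 5]

def group_low_high(counts):
    """Pool the two lowest and the two highest concentrations."""
    return [sum(counts[:2]), sum(counts[2:])]

dead_lh = group_low_high(dead)
alive_lh = group_low_high(alive)
totals_lh = [d + a for d, a in zip(dead_lh, alive_lh)]
percent_dead = [100.0 * d / t for d, t in zip(dead_lh, totals_lh)]

# Pearson chi-square for the collapsed 2x2 table.
overall_dead = sum(dead_lh) / sum(totals_lh)
statistic = 0.0
for d, a, t in zip(dead_lh, alive_lh, totals_lh):
    exp_dead = t * overall_dead
    exp_alive = t - exp_dead
    statistic += (d - exp_dead) ** 2 / exp_dead + (a - exp_alive) ** 2 / exp_alive
p_value = math.erfc(math.sqrt(statistic / 2.0))  # valid for df = 1 only

print("dead:", dead_lh, "alive:", alive_lh)
print("percent dead (low, high): %.1f, %.1f" % tuple(percent_dead))
print("chi-square = %.3f on 1 df, p = %.5f" % (statistic, p_value))
```

Pooling trades detail for a direct low-versus-high comparison, which speaks to the question of whether the higher concentrations give a different percentage kill.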
1.26
Analysis of CHD data
Blood pressure (BP) was measured on a sample of
males aged 40-59, who were also classified by
whether they developed coronary heart disease (CHD)
in a 6-year follow-up period. The data were classified
by BP (interval categorical variable in 8 classes) and
CHD (CHD or No-CHD).
BP          CHD   No CHD   Total   % CHD
<117          3     153     156     1.9
117 - 126    17     235     252     6.7
127 - 136    12     272     284     4.2
137 - 146    16     255     271     5.9
147 - 156    12     127     139     8.6
157 - 166     8      77      85     9.4
167 - 186    16      83      99    16.2
>186          8      35      43    18.6
Total        92    1237    1329
1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability
of CHD and the level of BP?
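For the question of independence, one option is to treat the CHD data as an 8x2 table and compare the Pearson chi-square with the upper 5% point of the chi-square distribution on 7 degrees of freedom (about 14.07). The sketch below does that and also recomputes the % CHD column; the function name and data layout are assumptions, and this test ignores the ordering of the BP classes.

```python
# (BP class, CHD, no CHD) copied from the table above.
chd_by_bp = [
    ("<117", 3, 153),
    ("117 - 126", 17, 235),
    ("127 - 136", 12, 272),
    ("137 - 146", 16, 255),
    ("147 - 156", 12, 127),
    ("157 - 166", 8, 77),
    ("167 - 186", 16, 83),
    (">186", 8, 35),
]

def pearson_chi_square(rows):
    """Pearson chi-square for an r x 2 table given (label, yes, no) rows."""
    total_yes = sum(yes for _, yes, _ in rows)
    total_no = sum(no for _, _, no in rows)
    n = total_yes + total_no
    statistic = 0.0
    for _, yes, no in rows:
        row_total = yes + no
        exp_yes = row_total * total_yes / n
        exp_no = row_total * total_no / n
        statistic += (yes - exp_yes) ** 2 / exp_yes + (no - exp_no) ** 2 / exp_no
    return statistic, len(rows) - 1

CRITICAL_5PCT_DF7 = 14.07  # upper 5% point of chi-square with 7 df

statistic, df = pearson_chi_square(chd_by_bp)
percent_chd = [round(100.0 * yes / (yes + no), 1) for _, yes, no in chd_by_bp]

print("percent CHD by BP class:", percent_chd)
print("chi-square = %.2f on %d df" % (statistic, df))
print("exceeds 5%% critical value: %s" % (statistic > CRITICAL_5PCT_DF7))
```

Because this statistic treats the BP classes as unordered, it addresses question 1 only; question 2 concerns how the percentage changes along the BP scale.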