A short introduction to epidemiology
Chapter 9: Data analysis
Neil Pearce
Centre for Public Health Research
Massey University
Wellington, New Zealand
Chapter 9
Data analysis
• Basic principles
• Basic analyses
• Control of confounding
Basic principles
• Effect estimation
• Confidence intervals
• P-values
Testing and estimation
• The effect estimate provides an estimate of the effect (e.g. relative risk, risk difference) of exposure on the occurrence of disease
• The confidence interval provides a range of values within which it is plausible that the true effect may lie
• The p-value is the probability that differences as large as, or larger than, those observed could have arisen by chance if the null hypothesis (of no association between exposure and disease) is correct
• The principal aim of an individual study should be to estimate the size of the effect (using the effect estimate and confidence interval) rather than just to decide whether or not an effect is present (using the p-value)
Problems of significance testing
• The p-value depends on two factors: the size of the effect and the size of the study
• A very small difference may be statistically significant if the study is very large, whereas a very large difference may not be significant if the study is very small.
• The purpose of significance testing is to reach a decision based on a single study. However, decisions should be based on information from all available studies, as well as non-statistical considerations such as the plausibility and coherence of the effect in the light of current theoretical and empirical knowledge (see chapter 10).
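The dependence of the p-value on study size can be illustrated with a minimal sketch. The two tables below are hypothetical (they are not from the chapter): they have the same odds ratio but differ 100-fold in size. The sketch assumes the standard one-degree-of-freedom chi-square for a 2×2 table with fixed margins.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square (1 df) for a 2x2 table, assuming fixed marginal totals."""
    m1, m0 = a + b, c + d          # totals of cases and controls
    n1, n0 = a + c, b + d          # totals of exposed and non-exposed
    t = m1 + m0
    expected = n1 * m1 / t
    variance = m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
    return (a - expected) ** 2 / variance

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical data: (a, b, c, d) = exposed cases, non-exposed cases,
# exposed controls, non-exposed controls.
small = (15, 10, 10, 10)
large = tuple(100 * n for n in small)

results = {}
for name, (a, b, c, d) in (("small", small), ("large", large)):
    odds_ratio = (a * d) / (b * c)
    p = p_value_1df(chi_square_2x2(a, b, c, d))
    results[name] = (odds_ratio, p)
    print(f"{name} study: OR = {odds_ratio:.2f}, p = {p:.4g}")
```

Both studies give the same effect estimate, but only the large one is "statistically significant".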
Basic analyses
• Measures of occurrence
– Incidence proportion (risk)
– Incidence rate
– Incidence odds
• Measures of effect
– Risk ratio
– Rate ratio
– Odds ratio
Example:

                Exposed   Non-exposed   Total
Cases              a           b          M1
Controls           c           d          M0
Total              N1          N0         T
Example: Smoking and Ovarian Cancer

                Exposed   Non-exposed   Total
Cases              24          58         82
Controls           36          40         76
Total              60          98        158
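$$OR = \frac{a/b}{c/d} = \frac{ad}{bc} = \frac{24 \times 40}{58 \times 36} = 0.46$$

$$\chi^2 = \frac{[Obs(a) - Exp(a)]^2}{Var(a)} = \frac{[a - N_1 M_1/T]^2}{M_1 M_0 N_1 N_0 / [T^2 (T-1)]} = \frac{(24 - 82 \times 60/158)^2}{82 \times 76 \times 60 \times 98 / (158 \times 158 \times 157)} = 5.45$$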
This $\chi^2$ is based on the assumptions that the marginal totals of the table (N1, N0, M1, M0) are fixed and that the proportion of exposed cases is the same as the proportion of exposed controls (i.e. that the overall proportion M1/T applies to both cases and controls)
The natural logarithm of the odds ratio has (under a binomial model) an approximate standard error of:

$$SE[\ln(OR)] = (1/a + 1/b + 1/c + 1/d)^{0.5}$$

An approximate 95% confidence interval for the odds ratio is then given by:

$$OR \times e^{\pm 1.96\, SE}$$
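The crude analysis of the smoking and ovarian cancer example can be sketched in a few lines. This is only an illustration of the formulas above; the variable names are chosen here and are not part of the original.

```python
import math

# Smoking and ovarian cancer: crude 2x2 table
a, b = 24, 58   # cases: exposed, non-exposed
c, d = 36, 40   # controls: exposed, non-exposed

m1, m0 = a + b, c + d   # case and control totals (82, 76)
n1, n0 = a + c, b + d   # exposed and non-exposed totals (60, 98)
t = m1 + m0             # grand total (158)

odds_ratio = (a * d) / (b * c)

# Chi-square with fixed margins: [Obs(a) - Exp(a)]^2 / Var(a)
expected_a = n1 * m1 / t
var_a = m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
chi2 = (a - expected_a) ** 2 / var_a

# Approximate 95% confidence interval on the log scale
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = odds_ratio * math.exp(-1.96 * se_log_or)
upper = odds_ratio * math.exp(1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, chi-square = {chi2:.2f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```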
Control of confounding
There are two methods of calculating a summary effect estimate to control confounding:
• Pooling
• Standardisation
The unadjusted (crude) findings indicate that there is a strong inverse association (OR = 0.46) between smoking and ovarian cancer. Suppose, however, that we are concerned about the possibility that the effect of smoking is confounded by use of oral contraception (this would occur if oral contraception caused ovarian cancer and if oral contraception was associated with smoking). We then need to stratify the data into those who have used oral contraceptives and those who have not.
Example of pooling:

                    OC use: Yes                  OC use: No
                    Smoking                      Smoking
                    Yes     No     Total         Yes     No     Total
Cases                15     50       65            9      8       17
Controls              4     12       16           32     28       60
Total                19     62       81           41     36       77
In those who have used oral contraceptives, the odds ratio for smoking is:

$$OR = \frac{15 \times 12}{50 \times 4} = 0.90$$

In those who have not used oral contraceptives, the odds ratio for smoking is:

$$OR = \frac{9 \times 28}{8 \times 32} = 0.98$$
Thus, the crude OR for smoking (0.46) was biased away from 1.0 by confounding by oral contraceptive (OC) use. When we remove this problem (by stratifying on oral contraceptive use) the odds ratios increase and are close to 1.0.
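The comparison of crude and stratum-specific odds ratios can be checked with a short sketch. The cell order (a, b, c, d) = (exposed cases, non-exposed cases, exposed controls, non-exposed controls) is a convention adopted here for the code.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc for one 2x2 table."""
    return (a * d) / (b * c)

# Smoking and ovarian cancer, stratified by oral contraceptive (OC) use
strata = {
    "OC use: yes": (15, 50, 4, 12),
    "OC use: no": (9, 8, 32, 28),
}

stratum_ors = {label: odds_ratio(*cells) for label, cells in strata.items()}

# Collapsing over the strata (summing cell by cell) gives the crude table
crude_cells = [sum(cells) for cells in zip(*strata.values())]
crude_or = odds_ratio(*crude_cells)

print(f"crude OR = {crude_or:.2f}")
for label, value in stratum_ors.items():
    print(f"{label}: OR = {value:.2f}")
```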
In this example, the odds ratios are not exactly the same in each stratum. If they are very different (e.g. 1.0 in one stratum and 4.0 in the other stratum) then we would usually report the findings separately for each stratum. However, if the odds ratio estimates are reasonably similar then we usually wish to summarize our findings into a single summary odds ratio by taking a weighted average of the OR estimates in each stratum.
$$OR = \frac{\sum W_i\, OR_i}{\sum W_i}$$

where $OR_i$ = OR in stratum i and $W_i$ = weight given to stratum i
One obvious choice of weights would be to weight each stratum by the inverse of its variance (precision-based estimates). However, this method of obtaining a summary odds ratio yields estimates which are unstable and highly affected by small numbers in particular strata.
A better set of weights was developed by Mantel and Haenszel. These involve using the weights $b_i c_i / T_i$:
$$OR_{MH} = \frac{\sum W_i\, OR_i}{\sum W_i} = \frac{\sum (b_i c_i/T_i)(a_i d_i / b_i c_i)}{\sum b_i c_i/T_i} = \frac{\sum a_i d_i/T_i}{\sum b_i c_i/T_i}$$
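                    Stratum 1                    Stratum 2
                    Exposed  Non-exposed  Total  Exposed  Non-exposed  Total
Cases                  15        50         65       9         8         17
Controls                4        12         16      32        28         60
Total                  19        62         81      41        36         77

$$OR_{MH} = \frac{15 \times 12/81 + 9 \times 28/77}{50 \times 4/81 + 8 \times 32/77} = 0.95$$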
This set of weights yields summary odds ratio estimates which are very close to being statistically optimal (they are very close to the maximum likelihood estimates) and are very robust in that they are not unduly affected by small numbers in particular strata (provided that the strata do not have any zero marginal totals).
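A minimal sketch of the Mantel-Haenszel summary odds ratio for the two oral contraceptive strata follows. The function name and the cell order (a, b, c, d) are choices made here for illustration.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio: sum(a*d/T) / sum(b*c/T)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# (a, b, c, d) = exposed cases, non-exposed cases,
#                exposed controls, non-exposed controls
strata = [
    (15, 50, 4, 12),   # stratum 1: OC use, T = 81
    (9, 8, 32, 28),    # stratum 2: no OC use, T = 77
]

or_mh = mantel_haenszel_or(strata)
print(f"OR_MH = {or_mh:.2f}")
```

Because the weights are b·c/T rather than inverse variances, a stratum with a small cell count contributes a small weight instead of an unstable one.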
We can calculate a corresponding chi-square:
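$$\chi^2_{MH} = \frac{[\sum Obs(a_i) - \sum Exp(a_i)]^2}{\sum Var(a_i)} = \frac{[\sum a_i - \sum N_{1i} M_{1i}/T_i]^2}{\sum M_{1i} M_{0i} N_{1i} N_{0i} / [T_i^2 (T_i - 1)]}$$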
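For the two strata (stratum 1: a = 15, M1 = 65, M0 = 16, N1 = 19, N0 = 62, T = 81; stratum 2: a = 9, M1 = 17, M0 = 60, N1 = 41, N0 = 36, T = 77):

$$\chi^2_{MH} = \frac{[(15 - 19 \times 65/81) + (9 - 41 \times 17/77)]^2}{65 \times 16 \times 19 \times 62/(81 \times 81 \times 80) + 17 \times 60 \times 41 \times 36/(77 \times 77 \times 76)} = 0.016$$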
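The Mantel-Haenszel chi-square for the same two strata can be sketched as follows. The function name is hypothetical; the calculation sums observed, expected and variance terms across strata before forming the ratio.

```python
def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square (1 df) for stratified 2x2 tables."""
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        m1, m0 = a + b, c + d      # cases, controls
        n1, n0 = a + c, b + d      # exposed, non-exposed
        t = m1 + m0
        observed += a
        expected += n1 * m1 / t
        variance += m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
    return (observed - expected) ** 2 / variance

# (a, b, c, d) = exposed cases, non-exposed cases,
#                exposed controls, non-exposed controls
strata = [
    (15, 50, 4, 12),   # stratum 1: OC use
    (9, 8, 32, 28),    # stratum 2: no OC use
]

chi2_mh = mantel_haenszel_chi2(strata)
print(f"Mantel-Haenszel chi-square = {chi2_mh:.3f}")
```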
The natural logarithm of the Mantel-Haenszel odds ratio has (under a binomial model) an approximate standard error given by:

$$SE^2 = \frac{\sum P_i R_i}{2R_+^2} + \frac{\sum (P_i S_i + Q_i R_i)}{2 R_+ S_+} + \frac{\sum Q_i S_i}{2 S_+^2}$$

where: $P_i = (a_i + d_i)/T_i$
$Q_i = (b_i + c_i)/T_i$
$R_i = a_i d_i/T_i$
$S_i = b_i c_i/T_i$
$R_+ = \sum R_i$
$S_+ = \sum S_i$
An approximate 95% confidence interval for the odds ratio is then given by:
$$OR_{MH} \times e^{\pm 1.96\, SE}$$
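As a rough sketch, the standard error and confidence interval for the Mantel-Haenszel odds ratio can be computed for the oral contraceptive strata as below. It assumes the expression above gives the variance (the square of the standard error); the function name is hypothetical.

```python
import math

def mh_odds_ratio_with_ci(strata, z=1.96):
    """Mantel-Haenszel OR with an approximate confidence interval."""
    sum_r = sum_s = sum_pr = sum_qs = sum_ps_qr = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        p, q = (a + d) / t, (b + c) / t
        r, s = a * d / t, b * c / t
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_qs += q * s
        sum_ps_qr += p * s + q * r
    or_mh = sum_r / sum_s
    variance = (sum_pr / (2 * sum_r ** 2)
                + sum_ps_qr / (2 * sum_r * sum_s)
                + sum_qs / (2 * sum_s ** 2))
    se = math.sqrt(variance)
    return or_mh, se, or_mh * math.exp(-z * se), or_mh * math.exp(z * se)

# (a, b, c, d) = exposed cases, non-exposed cases,
#                exposed controls, non-exposed controls
strata = [(15, 50, 4, 12), (9, 8, 32, 28)]

or_mh, se, lower, upper = mh_odds_ratio_with_ci(strata)
print(f"OR_MH = {or_mh:.2f}, SE = {se:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
```

The interval comfortably includes 1.0, in line with the very small Mantel-Haenszel chi-square.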
                Exposed   Non-exposed   Total
Cases              a           b          M1
Person-years       Y1          Y0         T
Rate ratios:

                Exposed    Non-exposed
Cases              350         125
Person-years     100,000     100,000
Rate             0.00350     0.00125

$$RR = \frac{a/Y_1}{b/Y_0} = \frac{350/100{,}000}{125/100{,}000} = \frac{0.00350}{0.00125} = 2.8$$
Stratifying On Tobacco

                   Tobacco: Yes            Tobacco: No
                   Alcohol                 Alcohol
                   Yes        No           Yes        No
Cases              300        50            50        75
Person-years    75,000    25,000        25,000    75,000
Rate           0.00400   0.00200       0.00200   0.00100
$$RR = \frac{a/Y_1}{b/Y_0} = \frac{aY_0}{bY_1}$$
The summary Mantel-Haenszel rate ratio uses the weights $b_i Y_{1i}/T_i$ to yield:

$$RR_{MH} = \frac{\sum a_i Y_{0i}/T_i}{\sum b_i Y_{1i}/T_i}$$
$$RR_{MH} = \frac{300 \times 25{,}000/100{,}000 + 50 \times 75{,}000/100{,}000}{50 \times 75{,}000/100{,}000 + 75 \times 25{,}000/100{,}000} = 2.0$$
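The crude, stratum-specific and Mantel-Haenszel rate ratios for the alcohol example can be compared in a short sketch. The tuple order (a, b, Y1, Y0) is a convention chosen here.

```python
def rate_ratio(a, b, y1, y0):
    """Rate ratio (a/Y1) / (b/Y0)."""
    return (a / y1) / (b / y0)

def mantel_haenszel_rate_ratio(strata):
    """Mantel-Haenszel rate ratio: sum(a*Y0/T) / sum(b*Y1/T)."""
    numerator = sum(a * y0 / (y1 + y0) for a, b, y1, y0 in strata)
    denominator = sum(b * y1 / (y1 + y0) for a, b, y1, y0 in strata)
    return numerator / denominator

# (a, b, Y1, Y0) = exposed cases, non-exposed cases,
#                  exposed person-years, non-exposed person-years
strata = [
    (300, 50, 75_000, 25_000),   # tobacco: yes
    (50, 75, 25_000, 75_000),    # tobacco: no
]

crude_rr = rate_ratio(*[sum(column) for column in zip(*strata)])
stratum_rrs = [rate_ratio(*stratum) for stratum in strata]
rr_mh = mantel_haenszel_rate_ratio(strata)

print(f"crude RR = {crude_rr:.1f}")
print("stratum RRs =", [round(rr, 1) for rr in stratum_rrs])
print(f"RR_MH = {rr_mh:.1f}")
```

The crude rate ratio of 2.8 falls to 2.0 in each tobacco stratum, and the summary estimate is 2.0.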
The equivalent Mantel-Haenszel chi-square is:
$$\chi^2_{MH} = \frac{[\sum a_i - \sum E(a_i)]^2}{\sum Var(a_i)} = \frac{[\sum a_i - \sum M_{1i} Y_{1i}/T_i]^2}{\sum M_{1i} Y_{1i} Y_{0i}/T_i^2}$$
This is very similar to the $\chi^2_{MH}$ for case-control studies, but it has some minor modifications to take account of the fact that we are using person-time data rather than binomial data.
$$\chi^2_{MH} = \frac{[(300 - 350 \times 75{,}000/100{,}000) + (50 - 125 \times 25{,}000/100{,}000)]^2}{350 \times 75{,}000 \times 25{,}000/100{,}000^2 + 125 \times 25{,}000 \times 75{,}000/100{,}000^2} = 35.5$$
An approximate standard error for the natural log of the rate ratio is:

$$SE = \frac{[\sum M_{1i} Y_{1i} Y_{0i}/T_i^2]^{0.5}}{[(\sum a_i Y_{0i}/T_i)(\sum b_i Y_{1i}/T_i)]^{0.5}}$$
An approximate 95% confidence interval for the rate ratio is then given by:
$$RR_{MH} \times e^{\pm 1.96\, SE}$$
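The chi-square and confidence interval for the Mantel-Haenszel rate ratio can be sketched for the alcohol and tobacco data as follows. The function name is hypothetical, and the tuple order (a, b, Y1, Y0) is a convention chosen here.

```python
import math

def mh_rate_analysis(strata, z=1.96):
    """Mantel-Haenszel rate ratio, chi-square and approximate CI."""
    sum_a = sum_expected = sum_var = 0.0
    numerator = denominator = 0.0
    for a, b, y1, y0 in strata:
        m1, t = a + b, y1 + y0
        sum_a += a
        sum_expected += m1 * y1 / t           # E(a) under the null
        sum_var += m1 * y1 * y0 / t ** 2      # Var(a) for person-time data
        numerator += a * y0 / t
        denominator += b * y1 / t
    rr = numerator / denominator
    chi2 = (sum_a - sum_expected) ** 2 / sum_var
    se = math.sqrt(sum_var) / math.sqrt(numerator * denominator)
    return rr, chi2, rr * math.exp(-z * se), rr * math.exp(z * se)

# (a, b, Y1, Y0) = exposed cases, non-exposed cases,
#                  exposed person-years, non-exposed person-years
strata = [
    (300, 50, 75_000, 25_000),   # tobacco: yes
    (50, 75, 25_000, 75_000),    # tobacco: no
]

rr_mh, chi2_mh, lower, upper = mh_rate_analysis(strata)
print(f"RR_MH = {rr_mh:.1f}, chi-square = {chi2_mh:.1f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```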
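Risk ratios:

                Exposed   Non-exposed   Total
Cases              a           b          M1
Non-cases          c           d          M0
Total              N1          N0         T

$$RR_{MH} = \frac{\sum a_i N_{0i}/T_i}{\sum b_i N_{1i}/T_i}$$

$$\chi^2_{MH} = \frac{[\sum a_i - \sum N_{1i} M_{1i}/T_i]^2}{\sum M_{1i} M_{0i} N_{1i} N_{0i} / [T_i^2 (T_i - 1)]}$$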
An approximate standard error for the natural log of the risk ratio is:

$$SE = \frac{[\sum (M_{1i} N_{1i} N_{0i}/T_i^2 - a_i b_i/T_i)]^{0.5}}{[(\sum a_i N_{0i}/T_i)(\sum b_i N_{1i}/T_i)]^{0.5}}$$
An approximate 95% confidence interval for the risk ratio is then given by:
$$RR_{MH} \times e^{\pm 1.96\, SE}$$
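The chapter gives no numerical example for risk ratios, so the sketch below uses hypothetical cohort data (invented here purely for illustration) to show how the Mantel-Haenszel risk ratio and its approximate confidence interval might be computed.

```python
import math

def mh_risk_ratio_with_ci(strata, z=1.96):
    """Mantel-Haenszel risk ratio with an approximate confidence interval."""
    numerator = denominator = var_term = 0.0
    for a, b, n1, n0 in strata:
        m1, t = a + b, n1 + n0
        numerator += a * n0 / t
        denominator += b * n1 / t
        var_term += m1 * n1 * n0 / t ** 2 - a * b / t
    rr = numerator / denominator
    se = math.sqrt(var_term) / math.sqrt(numerator * denominator)
    return rr, se, rr * math.exp(-z * se), rr * math.exp(z * se)

# HYPOTHETICAL data, not from the chapter:
# (a, b, N1, N0) = exposed cases, non-exposed cases,
#                  exposed total, non-exposed total
strata = [
    (20, 10, 100, 100),
    (30, 30, 100, 200),
]

rr_mh, se, lower, upper = mh_risk_ratio_with_ci(strata)
print(f"RR_MH = {rr_mh:.2f}, SE = {se:.3f}, 95% CI: {lower:.2f} to {upper:.2f}")
```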
Standardization, in contrast to pooling, involves taking a weighted average of the rates in each stratum (e.g. age-group) before taking the ratio of the two standardized rates. Standardization has many advantages in descriptive epidemiology involving comparisons between countries, regions, ethnic groups or gender groups. However, pooling (when done appropriately) has some superior statistical properties when comparing exposed and non-exposed groups in a specific study.
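A minimal sketch of standardization, using the tobacco-specific rates from the alcohol example. The choice of standard weights is an assumption made here (the total person-years in each tobacco stratum); the chapter does not specify a standard population.

```python
def standardized_rate(rates, weights):
    """Weighted average of stratum-specific rates."""
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)

# Stratum-specific rates (tobacco: yes, tobacco: no)
rates_exposed = [0.00400, 0.00200]     # alcohol: yes
rates_unexposed = [0.00200, 0.00100]   # alcohol: no

# ASSUMPTION: standard weights = total person-years per tobacco stratum
standard_weights = [100_000, 100_000]

std_exposed = standardized_rate(rates_exposed, standard_weights)
std_unexposed = standardized_rate(rates_unexposed, standard_weights)
standardized_rate_ratio = std_exposed / std_unexposed

print(f"standardized rates: {std_exposed:.5f} vs {std_unexposed:.5f}")
print(f"standardized rate ratio = {standardized_rate_ratio:.1f}")
```

Both groups are averaged over the same weights, so the ratio of the standardized rates is no longer distorted by the different tobacco distributions of drinkers and non-drinkers.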
Summary of Stratified Analysis
If we are concerned about confounding by a factor such as age, gender or smoking, then we need to stratify on this factor (or on all factors simultaneously if there is more than one potential confounder) and calculate the exposure effect separately in each stratum.
If the effect is very different in different strata then we would report the findings separately for each stratum.
If the effect is similar in each stratum then we can obtain a summary estimate by taking a weighted average of the effect in each stratum. If the adjusted effect is different from the crude effect, this means that the crude effect was biased due to confounding.
Usually we need to adjust the findings for (i.e. stratify on) age, gender, and some other factors. If we have five age-groups and two gender-groups then we need to divide the data into ten age-gender-groups. If we have too many strata then we begin to get strata with zero marginal totals (e.g. with no cases or no controls). The analysis then begins to ‘break down’ and we have to consider using mathematical modelling.