
A short introduction to epidemiology

Chapter 9: Data analysis

Neil Pearce

Centre for Public Health Research

Massey University

Wellington, New Zealand

Chapter 9

Data analysis

• Basic principles

• Basic analyses

• Control of confounding

Basic principles

• Effect estimation

• Confidence intervals

• P-values

Testing and estimation

• The effect estimate provides an estimate of the effect (e.g. relative risk, risk difference) of exposure on the occurrence of disease

• The confidence interval provides a range of values within which it is plausible that the true effect may lie

• The p-value is the probability that differences as large as or larger than those observed could have arisen by chance if the null hypothesis (of no association between exposure and disease) is correct

• The principal aim of an individual study should be to estimate the size of the effect (using the effect estimate and confidence interval) rather than just to decide whether or not an effect is present (using the p-value)

Problems of significance testing

• The p-value depends on two factors: the size of the effect and the size of the study

• A very small difference may be statistically significant if the study is very large, whereas a very large difference may not be significant if the study is very small.

• The purpose of significance testing is to reach a decision based on a single study. However, decisions should be based on information from all available studies, as well as non-statistical considerations such as the plausibility and coherence of the effect in the light of current theoretical and empirical knowledge (see chapter 10).
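The dependence of the p-value on study size can be illustrated with a minimal sketch. The two tables below are hypothetical (they are not data from this chapter): the second is simply the first with every cell multiplied by ten, so the odds ratio is identical. The chi-square used is the one-degree-of-freedom statistic with fixed margins that this chapter applies to 2x2 tables; the function names are illustrative only.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square (1 df) with fixed margins for a single 2x2 table."""
    m1, m0 = a + b, c + d          # row totals (cases, controls)
    n1, n0 = a + c, b + d          # column totals (exposed, non-exposed)
    t = m1 + m0
    expected_a = n1 * m1 / t
    variance_a = m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
    return (a - expected_a) ** 2 / variance_a

def p_value_1df(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical data: same odds ratio, ten-fold difference in study size.
small_study = (12, 8, 8, 12)
large_study = tuple(10 * n for n in small_study)

for label, (a, b, c, d) in [("small", small_study), ("large", large_study)]:
    odds_ratio = (a * d) / (b * c)
    p = p_value_1df(chi_square_2x2(a, b, c, d))
    print(f"{label}: OR = {odds_ratio:.2f}, p = {p:.4f}")
```

Both studies estimate the same odds ratio, but only the larger one yields a small p-value, which is why the effect estimate and confidence interval are more informative than the p-value alone.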

Basic analyses

• Measures of occurrence

– Incidence proportion (risk)

– Incidence rate

– Incidence odds

• Measures of effect

– Risk ratio

– Rate ratio

– Odds ratio

Example:

                 Exposed (E)   Non-exposed (Ē)   Total
Cases (C)             a               b            M1
Controls (C̄)          c               d            M0
Total                 N1              N0           T

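Example: Smoking and Ovarian Cancer

                 Smokers (E)   Non-smokers (Ē)   Total
Cases                 24              58            82
Controls              36              40            76
Total                 60              98           158

$$ OR = \frac{a/b}{c/d} = \frac{ad}{bc} = \frac{24 \times 40}{58 \times 36} = 0.46 $$

$$ \chi^2 = \frac{[Obs(a) - Exp(a)]^2}{Var(a)} = \frac{(a - N_1 M_1/T)^2}{M_1 M_0 N_1 N_0 / [T^2 (T - 1)]} = \frac{(24 - 60 \times 82/158)^2}{82 \times 76 \times 60 \times 98 / (158 \times 158 \times 157)} = 5.45 $$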

This χ² is based on the assumptions that the marginal totals of the table (N1, N0, M1, M0) are fixed and that the proportion of cases who are exposed is the same as the proportion of controls who are exposed (i.e. that the overall proportion exposed, N1/T, applies to both cases and controls)

The natural logarithm of the odds ratio has (under a binomial model) an approximate standard error of:

$$ SE[\ln(OR)] = (1/a + 1/b + 1/c + 1/d)^{0.5} $$

An approximate 95% confidence interval for the odds ratio is then given by:

$$ OR \, e^{\pm 1.96 \, SE} $$
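The crude analysis can be reproduced with a short sketch. It assumes the cell counts of the smoking and ovarian cancer example are a = 24 and b = 58 (smoking and non-smoking cases) and c = 36 and d = 40 (smoking and non-smoking controls); the variable names are illustrative.

```python
import math

# Cell counts as read from the smoking and ovarian cancer example:
# a, b = exposed and non-exposed cases; c, d = exposed and non-exposed controls.
a, b, c, d = 24, 58, 36, 40

m1, m0 = a + b, c + d      # total cases, total controls
n1, n0 = a + c, b + d      # total exposed, total non-exposed
t = m1 + m0

odds_ratio = (a * d) / (b * c)

# Chi-square with fixed margins: [Obs(a) - Exp(a)]^2 / Var(a)
expected_a = n1 * m1 / t
variance_a = m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
chi_square = (a - expected_a) ** 2 / variance_a

# Standard error of ln(OR) and the approximate 95% confidence interval
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_lower = odds_ratio * math.exp(-1.96 * se_log_or)
ci_upper = odds_ratio * math.exp(1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, chi-square = {chi_square:.2f}")
print(f"95% CI: {ci_lower:.2f} to {ci_upper:.2f}")
```

The odds ratio and chi-square agree with the values of 0.46 and 5.45 given in the example, and the interval (roughly 0.24 to 0.89) excludes 1.0, consistent with a chi-square above 3.84.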

Control of confounding

There are two methods of calculating a summary effect estimate to control confounding:

• Pooling

• Standardisation

The unadjusted (crude) findings indicate that there is a strong association between smoking and ovarian cancer. Suppose, however, that we are concerned about the possibility that the effect of smoking is confounded by use of oral contraceptives (this would occur if oral contraceptive use caused ovarian cancer and was also associated with smoking). We then need to stratify the data into those who have used oral contraceptives and those who have not.

Example of pooling:

                      OC use: Yes                  OC use: No
              Smoking   Smoking              Smoking   Smoking
                Yes        No      Total       Yes        No      Total
Cases            15        50        65          9         8        17
Controls          4        12        16         32        28        60
Total            19        62        81         41        36        77

In those who have used oral contraceptives, the odds ratio for smoking is:

$$ OR = \frac{15 \times 12}{50 \times 4} = 0.90 $$

In those who have not used oral contraceptives, the odds ratio for smoking is:

$$ OR = \frac{9 \times 28}{8 \times 32} = 0.98 $$

Thus, the crude OR for smoking (0.46) was biased downwards, away from the null value of 1.0, by confounding by OC use. When we remove this problem (by stratifying on oral contraceptive use) the odds ratios increase and are close to 1.0.

In this example, the odds ratios are not exactly the same in each stratum. If they are very different (e.g. 1.0 in one stratum and 4.0 in the other stratum) then we would usually report the findings separately for each stratum. However, if the odds ratio estimates are reasonably similar then we usually wish to summarize our findings into a single summary odds ratio by taking a weighted average of the OR estimates in each stratum.

$$ OR = \frac{\sum_i W_i \, OR_i}{\sum_i W_i} $$

where ORi = OR in stratum i and Wi = weight given to stratum i

One obvious choice of weights would be to weight each stratum by the inverse of its variance (precision-based estimates). However, this method of obtaining a summary odds ratio yields estimates which are unstable and highly affected by small numbers in particular strata.

A better set of weights was developed by Mantel and Haenszel. These involve using the weights bi ci /Ti :

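$$ OR_{MH} = \frac{\sum_i W_i \, OR_i}{\sum_i W_i} = \frac{\sum_i (b_i c_i / T_i)(a_i d_i / b_i c_i)}{\sum_i b_i c_i / T_i} = \frac{\sum_i a_i d_i / T_i}{\sum_i b_i c_i / T_i} $$

Stratum 1 (OC use: yes): a = 15, b = 50, c = 4, d = 12, T = 81

Stratum 2 (OC use: no): a = 9, b = 8, c = 32, d = 28, T = 77

$$ OR_{MH} = \frac{15 \times 12/81 + 9 \times 28/77}{50 \times 4/81 + 8 \times 32/77} = 0.95 $$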

This set of weights yields summary odds ratio estimates which are very close to being statistically optimal (they are very close to the maximum likelihood estimates) and are very robust in that they are not unduly affected by small numbers in particular strata (provided that the strata do not have any zero marginal totals).
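The pooling calculation can be checked with a short sketch. It assumes the stratum counts of the oral contraceptive example are a = 15, b = 50, c = 4, d = 12 for OC users and a = 9, b = 8, c = 32, d = 28 for non-users; the function name is illustrative.

```python
# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls.
strata = [
    (15, 50, 4, 12),   # OC use: yes
    (9, 8, 32, 28),    # OC use: no
]

def mantel_haenszel_or(tables):
    """Summary odds ratio: sum(a*d/T) / sum(b*c/T) over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Stratum-specific odds ratios and their Mantel-Haenszel weights b*c/T
stratum_ors = [(a * d) / (b * c) for a, b, c, d in strata]
weights = [b * c / (a + b + c + d) for a, b, c, d in strata]

or_mh = mantel_haenszel_or(strata)
weighted_average = sum(w * o for w, o in zip(weights, stratum_ors)) / sum(weights)

print([round(o, 2) for o in stratum_ors], round(or_mh, 2))
```

The weighted average of the stratum-specific odds ratios and the ratio-of-sums form give the same summary value, which lies between the two stratum estimates.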

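We can calculate a corresponding chi-square:

$$ \chi^2_{MH} = \frac{[\sum Obs(a_i) - \sum Exp(a_i)]^2}{\sum Var(a_i)} = \frac{[\sum a_i - \sum N_{1i} M_{1i}/T_i]^2}{\sum M_{1i} M_{0i} N_{1i} N_{0i} / [T_i^2 (T_i - 1)]} $$

Stratum 1 (OC use: yes): a = 15, M1 = 65, M0 = 16, N1 = 19, N0 = 62, T = 81

Stratum 2 (OC use: no): a = 9, M1 = 17, M0 = 60, N1 = 41, N0 = 36, T = 77

$$ \chi^2_{MH} = \frac{[(15 - 19 \times 65/81) + (9 - 41 \times 17/77)]^2}{65 \times 16 \times 19 \times 62/(81 \times 81 \times 80) + 17 \times 60 \times 41 \times 36/(77 \times 77 \times 76)} = 0.016 $$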
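A minimal sketch of the Mantel-Haenszel chi-square follows, again assuming the stratum counts (a, b, c, d) of (15, 50, 4, 12) for OC users and (9, 8, 32, 28) for non-users. The p-value helper is an addition for illustration and uses the chi-square distribution with one degree of freedom.

```python
import math

# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls.
strata = [(15, 50, 4, 12), (9, 8, 32, 28)]

def mantel_haenszel_chi_square(tables):
    """[sum(a) - sum(Exp(a))]^2 / sum(Var(a)), summed over strata."""
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        m1, m0 = a + b, c + d      # cases, controls
        n1, n0 = a + c, b + d      # exposed, non-exposed
        t = m1 + m0
        observed += a
        expected += n1 * m1 / t
        variance += m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
    return (observed - expected) ** 2 / variance

chi2_mh = mantel_haenszel_chi_square(strata)
# Upper-tail p-value for 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2_mh / 2))

print(f"chi-square (MH) = {chi2_mh:.3f}, p = {p_value:.2f}")
```

After stratification the chi-square is close to zero, in contrast to the crude value of 5.45, which matches the stratum-specific odds ratios being close to 1.0.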

The natural logarithm of the Mantel-Haenszel odds ratio has (under a binomial model) an approximate standard error of:

$$ SE = \left[ \frac{\sum PR}{2R_+^2} + \frac{\sum (PS + QR)}{2R_+ S_+} + \frac{\sum QS}{2S_+^2} \right]^{0.5} $$

where: P = (ai + di)/Ti

Q = (bi + ci)/Ti

R = aidi/Ti

S = bici/Ti

R+ = ΣR

S+ = ΣS

An approximate 95% confidence interval for the odds ratio is then given by:

$$ OR_{MH} \, e^{\pm 1.96 \, SE} $$
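The standard error formula is easier to follow as code. The sketch below takes the formula as the square root of the three-term sum (the usual Robins-Greenland-Breslow form) and applies it to the OC example, assuming stratum counts (a, b, c, d) of (15, 50, 4, 12) and (9, 8, 32, 28); the function name and return layout are illustrative.

```python
import math

# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls.
strata = [(15, 50, 4, 12), (9, 8, 32, 28)]

def mh_or_with_ci(tables, z=1.96):
    """Mantel-Haenszel OR, SE of ln(OR_MH), and approximate CI limits."""
    sum_r = sum_s = sum_pr = sum_ps_qr = sum_qs = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        p, q = (a + d) / t, (b + c) / t
        r, s = a * d / t, b * c / t
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    or_mh = sum_r / sum_s
    variance = (sum_pr / (2 * sum_r ** 2)
                + sum_ps_qr / (2 * sum_r * sum_s)
                + sum_qs / (2 * sum_s ** 2))
    se = math.sqrt(variance)
    return or_mh, se, or_mh * math.exp(-z * se), or_mh * math.exp(z * se)

or_mh, se, lower, upper = mh_or_with_ci(strata)
print(f"OR_MH = {or_mh:.2f}, SE = {se:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With a single stratum the expression reduces to 1/a + 1/b + 1/c + 1/d, the variance used for the crude odds ratio, which is a useful check on an implementation.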

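Rate ratios:

                 Exposed (E)   Non-exposed (Ē)   Total
Cases                 a               b            M1
Person-years          Y1              Y0           T

                 Exposed (E)   Non-exposed (Ē)
Cases                350             125
Person-years      100,000         100,000
Rate              0.00350         0.00125

$$ RR = \frac{a/Y_1}{b/Y_0} = \frac{350/100{,}000}{125/100{,}000} = \frac{0.00350}{0.00125} = 2.8 $$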

Stratifying On Tobacco

                      Tobacco: Yes               Tobacco: No
                  Alcohol    Alcohol         Alcohol    Alcohol
                    Yes         No             Yes         No
Cases               300         50              50         75
Person-years     75,000     25,000          25,000     75,000
Rate            0.00400    0.00200         0.00200    0.00100

$$ RR = \frac{a/Y_1}{b/Y_0} = \frac{aY_0}{bY_1} $$

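The summary Mantel-Haenszel rate ratio involves taking the weights biY1i/Ti to yield:

$$ RR_{MH} = \frac{\sum a_i Y_{0i}/T_i}{\sum b_i Y_{1i}/T_i} $$

$$ RR_{MH} = \frac{300 \times 25{,}000/100{,}000 + 50 \times 75{,}000/100{,}000}{50 \times 75{,}000/100{,}000 + 75 \times 25{,}000/100{,}000} = 2.0 $$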
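The crude and summary rate ratios can be compared with a short sketch. It assumes the alcohol and tobacco example data are 300 cases in 75,000 person-years (alcohol) and 50 cases in 25,000 person-years (no alcohol) among tobacco users, and 50 cases in 25,000 and 75 cases in 75,000 person-years among non-users; the function names are illustrative.

```python
# Each stratum is (a, y1, b, y0): exposed cases, exposed person-years,
# non-exposed cases, non-exposed person-years.
strata = [
    (300, 75_000, 50, 25_000),   # tobacco: yes
    (50, 25_000, 75, 75_000),    # tobacco: no
]

def rate_ratio(a, y1, b, y0):
    """Ratio of the incidence rate in the exposed to that in the non-exposed."""
    return (a / y1) / (b / y0)

def mantel_haenszel_rate_ratio(tables):
    """Summary rate ratio: sum(a*Y0/T) / sum(b*Y1/T) over strata."""
    numerator = sum(a * y0 / (y1 + y0) for a, y1, b, y0 in tables)
    denominator = sum(b * y1 / (y1 + y0) for a, y1, b, y0 in tables)
    return numerator / denominator

# Crude rate ratio: collapse the strata before taking the ratio
crude_rr = rate_ratio(
    sum(s[0] for s in strata), sum(s[1] for s in strata),
    sum(s[2] for s in strata), sum(s[3] for s in strata),
)
stratum_rrs = [rate_ratio(*s) for s in strata]
rr_mh = mantel_haenszel_rate_ratio(strata)

print(round(crude_rr, 2), [round(r, 2) for r in stratum_rrs], round(rr_mh, 2))
```

The crude rate ratio of 2.8 falls to 2.0 in both tobacco strata and in the summary estimate, so part of the crude association with alcohol reflects confounding by tobacco.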

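The equivalent Mantel-Haenszel chi-square is:

$$ \chi^2_{MH} = \frac{[\sum a_i - \sum Exp(a_i)]^2}{\sum Var(a_i)} = \frac{[\sum a_i - \sum M_{1i} Y_{1i}/T_i]^2}{\sum M_{1i} Y_{1i} Y_{0i}/T_i^2} $$

This is very similar to the χ²MH for case-control studies, but it has some minor modifications to take account of the fact that we are using person-time data rather than binomial data.

$$ \chi^2_{MH} = \frac{[(300 - 350 \times 75{,}000/100{,}000) + (50 - 125 \times 25{,}000/100{,}000)]^2}{350 \times 75{,}000 \times 25{,}000/(100{,}000 \times 100{,}000) + 125 \times 25{,}000 \times 75{,}000/(100{,}000 \times 100{,}000)} = 35.5 $$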

An approximate standard error for the natural log of the rate ratio is:

$$ SE = \frac{[\sum M_{1i} Y_{1i} Y_{0i}/T_i^2]^{0.5}}{[(\sum a_i Y_{0i}/T_i)(\sum b_i Y_{1i}/T_i)]^{0.5}} $$

An approximate 95% confidence interval for the rate ratio is then given by:

$$ RR_{MH} \, e^{\pm 1.96 \, SE} $$
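The chi-square, standard error and confidence interval for the summary rate ratio share most of their terms, so they can be computed in one pass. The sketch below assumes the alcohol and tobacco strata are (cases, person-years) of (300, 75,000) and (50, 25,000) for tobacco users and (50, 25,000) and (75, 75,000) for non-users; the function name and return layout are illustrative.

```python
import math

# Each stratum is (a, y1, b, y0): exposed cases, exposed person-years,
# non-exposed cases, non-exposed person-years.
strata = [(300, 75_000, 50, 25_000), (50, 25_000, 75, 75_000)]

def mh_rate_ratio_summary(tables, z=1.96):
    """Return RR_MH, chi-square (MH), SE of ln(RR_MH), and CI limits."""
    observed = expected = variance = num = den = 0.0
    for a, y1, b, y0 in tables:
        m1, t = a + b, y1 + y0         # total cases, total person-years
        observed += a
        expected += m1 * y1 / t        # Exp(a) under the null hypothesis
        variance += m1 * y1 * y0 / t ** 2
        num += a * y0 / t
        den += b * y1 / t
    rr = num / den
    chi2 = (observed - expected) ** 2 / variance
    se = math.sqrt(variance) / math.sqrt(num * den)
    return rr, chi2, se, rr * math.exp(-z * se), rr * math.exp(z * se)

rr_mh, chi2_mh, se, lower, upper = mh_rate_ratio_summary(strata)
print(f"RR_MH = {rr_mh:.1f}, chi-square = {chi2_mh:.1f}, SE = {se:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The numerator of the standard error is the same variance sum that appears in the denominator of the chi-square, which is why the two are accumulated together.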

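Risk ratios:

                 Exposed (E)   Non-exposed (Ē)   Total
Cases                 a               b            M1
Non-cases             c               d            M0
Total                 N1              N0           T

$$ RR_{MH} = \frac{\sum a_i N_{0i}/T_i}{\sum b_i N_{1i}/T_i} $$

$$ \chi^2_{MH} = \frac{[\sum a_i - \sum N_{1i} M_{1i}/T_i]^2}{\sum M_{1i} M_{0i} N_{1i} N_{0i} / [T_i^2 (T_i - 1)]} $$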

An approximate standard error for the natural log of the risk ratio is:

$$ SE = \frac{[\sum (M_{1i} N_{1i} N_{0i}/T_i^2 - a_i b_i/T_i)]^{0.5}}{[(\sum a_i N_{0i}/T_i)(\sum b_i N_{1i}/T_i)]^{0.5}} $$

An approximate 95% confidence interval for the risk ratio is then given by:

$$ RR_{MH} \, e^{\pm 1.96 \, SE} $$
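No numerical example is given for risk ratios, so the sketch below uses entirely hypothetical cohort data (two strata, chosen so that the risk ratio is 2.0 in each stratum while the crude risk ratio is 2.5). It applies the Mantel-Haenszel risk ratio, chi-square and standard error formulas as written above, with the subtraction of ab/T taken inside the sum; the function name is illustrative.

```python
import math

# HYPOTHETICAL data, not from the chapter. Each stratum is (a, b, c, d):
# exposed cases, non-exposed cases, exposed non-cases, non-exposed non-cases.
strata = [(40, 10, 160, 90), (10, 10, 90, 190)]

def mh_risk_ratio_summary(tables, z=1.96):
    """Return RR_MH, chi-square (MH), SE of ln(RR_MH), and CI limits."""
    observed = expected = var_null = var_se = num = den = 0.0
    for a, b, c, d in tables:
        m1, m0 = a + b, c + d          # cases, non-cases
        n1, n0 = a + c, b + d          # exposed, non-exposed
        t = m1 + m0
        observed += a
        expected += n1 * m1 / t
        var_null += m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
        var_se += m1 * n1 * n0 / t ** 2 - a * b / t
        num += a * n0 / t
        den += b * n1 / t
    rr = num / den
    chi2 = (observed - expected) ** 2 / var_null
    se = math.sqrt(var_se) / math.sqrt(num * den)
    return rr, chi2, se, rr * math.exp(-z * se), rr * math.exp(z * se)

rr_mh, chi2_mh, se, lower, upper = mh_risk_ratio_summary(strata)

# Crude risk ratio from the collapsed table, for comparison
a, b, c, d = (sum(col) for col in zip(*strata))
crude_rr = (a / (a + c)) / (b / (b + d))

print(f"crude RR = {crude_rr:.2f}, RR_MH = {rr_mh:.2f}, SE = {se:.3f}")
```

In these made-up data the summary estimate recovers the common stratum-specific risk ratio of 2.0, whereas the crude estimate is inflated by the imbalance of exposure across strata.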

Standardization, in contrast to pooling, involves taking a weighted average of the rates in each stratum (e.g. age-group) before taking the ratio of the two standardized rates. Standardization has many advantages in descriptive epidemiology involving comparisons between countries, regions, ethnic groups or gender groups. However, pooling (when done appropriately) has some superior statistical properties when comparing exposed and non-exposed groups in a specific study.
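The difference in the order of operations can be shown with a minimal sketch of direct standardization. It reuses the alcohol and tobacco person-time data, (300, 75,000) and (50, 25,000) for tobacco users and (50, 25,000) and (75, 75,000) for non-users. The choice of standard population is an assumption made here for illustration: each tobacco stratum is weighted by its total person-years.

```python
# Each stratum is (a, y1, b, y0): exposed cases, exposed person-years,
# non-exposed cases, non-exposed person-years.
strata = [(300, 75_000, 50, 25_000), (50, 25_000, 75, 75_000)]

# ASSUMED standard population: total person-years in each stratum.
weights = [y1 + y0 for _, y1, _, y0 in strata]

def standardized_rate(rates, weights):
    """Weighted average of stratum-specific rates."""
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)

exposed_rates = [a / y1 for a, y1, _, _ in strata]
unexposed_rates = [b / y0 for _, _, b, y0 in strata]

# Average the rates first, then take the ratio of the two standardized rates.
standardized_rr = (standardized_rate(exposed_rates, weights)
                   / standardized_rate(unexposed_rates, weights))

print(round(standardized_rr, 2))
```

Because the rate ratio is the same in both strata here, the standardized rate ratio equals the pooled Mantel-Haenszel value of 2.0; when stratum-specific ratios differ, the two approaches can give different summaries and the result of standardization depends on the weights chosen.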

Summary of Stratified Analysis

If we are concerned about confounding by a factor such as age, gender, smoking then we need to stratify on this factor (or all factors simultaneously if there is more than one potential confounder) and calculate the exposure effect separately in each stratum.

If the effect is very different in different strata then we would report the findings separately for each stratum.

If the effect is similar in each stratum then we can obtain a summary estimate by taking a weighted average of the effect in each stratum.

If the adjusted effect is different from the crude effect this means that the crude effect was biased due to confounding.

Usually we need to adjust the findings for (i.e. stratify on) age, gender, and some other factors. If we have five age-groups and two gender-groups then we need to divide the data into ten age-gender-groups. If we have too many strata then we begin to get strata with zero marginal totals (e.g. with no cases or no controls). The analysis then begins to ‘break down’ and we have to consider using mathematical modelling.
