
Lecture 5: ANOVA and Correlation

Ani Manichaikul
amanicha@jhsph.edu

23 April 2007

1 / 62

Comparing Multiple Groups

Continuous data: comparing means

Analysis of variance

Binary data: comparing proportions

Pearson’s Chi-square tests for r × 2 tables

Independence

Goodness of Fit

Homogeneity

Categorical data: r × c tables

Pearson chi-square tests

Odds ratio and relative risk

2 / 62

ANOVA: Definition

Statistical technique for comparing means for multiple populations

Partitioning the total variation in a data set into components defined by specific sources

ANOVA = ANalysis Of VAriance

3 / 62

ANOVA: Concepts

Estimate group means

Assess magnitude of variation attributable to specific sources

Extension of 2-sample t-test to multiple groups

Population model

Sample model: estimates, standard errors

Partition of variability

4 / 62

Types of ANOVA

One-way ANOVA

One factor — e.g. smoking status

Two-way ANOVA

Two factors — e.g. gender and smoking status

Three-way ANOVA

Three factors — e.g. gender, smoking and beer

5 / 62

Emphasis

One-way ANOVA is an extension of the t-test to 3 or more samples

focus analysis on group differences

Two-way ANOVA (and higher) focuses on the interaction of factors

Does the effect due to one factor change as the level of another factor changes?

6 / 62

ANOVA Rationale I

Variation in all observations = Variation between each observation and its group mean + Variation between each group mean and the overall mean

In other words,

Total sum of squares = Within group sum of squares + Between groups sum of squares

7 / 62

ANOVA Rationale II

In shorthand:

SST = SSW + SSB

If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation between the observations within a group (SSW)
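As a quick numerical check of this partition (a sketch in Python; the three groups below are made-up data):

import numpy as np

# A minimal sketch of the sum-of-squares partition, using illustrative
# made-up data for three groups.
groups = [np.array([4.1, 5.0, 5.5]),
          np.array([6.2, 7.1, 6.8]),
          np.array([5.9, 5.4, 6.0])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

sst = ((all_obs - grand_mean) ** 2).sum()                         # total
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups

print(np.isclose(sst, ssw + ssb))  # True: SST = SSW + SSB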

8 / 62

ANOVA: One-Way

9 / 62

MSW

We can pool the estimates of σ² across groups and use an overall estimate for the population variance:

Variation within a group: σ²W = SSW / (N − k) = MSW

MSW is called the “within groups mean square”

10 / 62

MSB

We can also look at systematic variation among groups

Variation between groups: σ²B = SSB / (k − 1) = MSB

11 / 62

An ANOVA table

Suppose there are k groups (e.g. if smoking status has categories current, former, or never, then k = 3)

We calculate our test statistic using the sum of squares values as follows:
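For k groups and N total observations, the standard one-way ANOVA table has the form:

Source           Sum of Squares   df      Mean Square         F
Between groups   SSB              k − 1   MSB = SSB/(k − 1)   MSB/MSW
Within groups    SSW              N − k   MSW = SSW/(N − k)
Total            SST              N − 1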

12 / 62

Hypothesis testing with ANOVA

In performing ANOVA, we may want to ask: is there truly a difference in means across groups?

Formally, we can specify the hypotheses:

H0 : µ1 = µ2 = · · · = µk

Ha : at least one of the µi’s is different

The null hypothesis specifies a global relationship

If the result of the test is significant, then perform individual comparisons

13 / 62

Goal of the comparisons

Compare the two variability estimates, MSW and MSB

If Fobs = MSB/MSW = σ²B / σ²W is small,

then variability between groups is negligible compared to variation within groups

⇒ The grouping does not explain much variation in the data

14 / 62

The F-statistic

For our observations, we assume X ∼ N(µgp, σ²), where

µgp = E(X | gp) = β0 + β1 · I(group = 2) + β2 · I(group = 3) + · · ·

and I(group = i) is an indicator to denote whether or not each individual is in the ith group

Note: we have assumed the same variance σ² for all groups; it is important to check this assumption

Under these assumptions, we know the null distribution of the statistic F = MSB/MSW

The distribution is called an F-distribution

15 / 62

The F-distribution

Remember that a χ² distribution is always specified by its degrees of freedom

An F-distribution is any distribution obtained by taking the quotient of two independent χ² random variables, each divided by its degrees of freedom

When we specify an F-distribution, we must state two parameters, which correspond to the degrees of freedom of the two χ² distributions

If X1 ∼ χ²df1 and X2 ∼ χ²df2, we write:

(X1/df1) / (X2/df2) ∼ Fdf1,df2
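A small simulation sketch of this fact (illustrative Python, assuming NumPy and SciPy are available):

import numpy as np
from scipy import stats

# Simulation sketch: the ratio of two independent chi-square variables,
# each divided by its degrees of freedom, follows an F distribution.
rng = np.random.default_rng(1)
df1, df2 = 2, 116
x1 = rng.chisquare(df1, size=100_000)
x2 = rng.chisquare(df2, size=100_000)
ratio = (x1 / df1) / (x2 / df2)

# The simulated 95th percentile agrees with the theoretical F quantile
print(np.quantile(ratio, 0.95), stats.f.ppf(0.95, df1, df2))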

16 / 62

Back to the hypothesis test . . .

Knowing the null distribution of F = MSB/MSW,

we can define a decision rule to test the hypothesis for ANOVA:

Reject H0 if F ≥ Fα;k−1,N−k

Fail to reject H0 if F < Fα;k−1,N−k
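A minimal sketch of this decision rule in Python (the three samples are made up for illustration; scipy.stats.f_oneway computes Fobs and its p-value):

import numpy as np
from scipy import stats

groups = [np.array([4.1, 5.0, 5.5, 4.8]),
          np.array([6.2, 7.1, 6.8, 6.5]),
          np.array([5.9, 5.4, 6.0, 5.7])]

k = len(groups)
N = sum(len(g) for g in groups)
alpha = 0.05

f_obs, p_value = stats.f_oneway(*groups)       # F = MSB/MSW and its p-value
f_crit = stats.f.ppf(1 - alpha, k - 1, N - k)  # F_{alpha; k-1, N-k}

if f_obs >= f_crit:
    print(f"Reject H0: F = {f_obs:.2f} >= {f_crit:.2f}, p = {p_value:.4f}")
else:
    print(f"Fail to reject H0: F = {f_obs:.2f} < {f_crit:.2f}")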

17 / 62

ANOVA: F-tests I

18 / 62

ANOVA: F-tests II

19 / 62

Example: ANOVA for HDL

Study design: Randomized controlled trial

132 men randomized to one of:

Diet + exercise

Diet

Control

Follow-up one year later:

119 men remaining in study

Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups

20 / 62

Model for HDL outcomes

We model the means for each group as follows:

µc = E(HDL | gp = c) = mean change in control group

µd = E(HDL | gp = d) = mean change in diet group

µde = E(HDL | gp = de) = mean change in diet and exercise group

We could also write the model as

E(HDL | gp) = β0 + β1 · I(gp = d) + β2 · I(gp = de)

Recall that I(gp = d), I(gp = de) are 0/1 group indicators

21 / 62

HDL ANOVA Table

We obtain the following results from the HDL experiment:

22 / 62

HDL ANOVA results

F-test

H0 : µc = µd = µde (or H0 : β1 = β2 = 0)

Ha : at least one mean is different from the others

Test statistic

Fobs = 13

df1 = k − 1 = 3 − 1 = 2

df2 = N − k = 119 − 3 = 116

23 / 62

HDL ANOVA Conclusions

Rejection region: F > F0.05;2,116 = 3.07

Since Fobs = 13.0 > 3.07, we reject H0

We conclude that at least one of the group means is different from the others
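These numbers can be checked with SciPy:

from scipy import stats

# Rejection region and p-value for the HDL F-test
f_obs, df1, df2 = 13.0, 2, 116
print(stats.f.ppf(0.95, df1, df2))  # critical value, about 3.07
print(stats.f.sf(f_obs, df1, df2))  # p-value, well below 0.05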

24 / 62

Which groups are different?

We might proceed to make individual comparisons

Conduct two-sample t-tests for each pair of groups:

t = (θ̂ − θ0) / SE(θ̂) = (X̄i − X̄j − 0) / √(s²p/ni + s²p/nj)
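A sketch of this pairwise test in Python, assuming the pooled variance s²p is taken to be MSW = 0.028 with df = N − k = 116 (consistent with the comparisons on the next slides):

import numpy as np
from scipy import stats

def pooled_t(xbar_i, xbar_j, s2_pooled, n_i, n_j, df):
    """Two-sample t statistic using a pooled variance estimate."""
    se = np.sqrt(s2_pooled / n_i + s2_pooled / n_j)
    t = (xbar_i - xbar_j) / se
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
    return t, p

# Control vs Diet comparison from the HDL example
print(pooled_t(-0.05, 0.02, 0.028, 40, 40, df=116))  # t ≈ -1.87, p ≈ 0.06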

25 / 62

Multiple Comparisons

Performing individual comparisons requires multiple hypothesis tests

If α = 0.05 for each comparison, there is a 5% chance that each comparison will falsely be called significant

Overall, the probability of Type I error is elevated above 5%

Question: How can we address this multiple comparisons issue?

26 / 62

Bonferroni adjustment

A possible correction for multiple comparisons

Test each hypothesis at level α∗ = (α/3) = 0.0167

Adjustment ensures overall Type I error rate does not exceed α = 0.05

However, this adjustment may be too conservative

27 / 62

Multiple comparisons α

Hypothesis                        α∗ = α/3
H0 : µc = µd (or β1 = 0)          0.0167
H0 : µc = µde (or β2 = 0)         0.0167
H0 : µd = µde (or β1 − β2 = 0)    0.0167

Overall α = 0.05

28 / 62

HDL: Pairwise comparisons I

Control and Diet groups

H0 : µc = µd (or β1 = 0)

t = (−0.05 − 0.02) / √(0.028/40 + 0.028/40) = −1.87

p-value = 0.06

29 / 62

HDL: Pairwise comparisons II

Control and Diet + exercise groups

H0 : µc = µde (or β2 = 0)

t = (−0.05 − 0.14) / √(0.028/40 + 0.028/39) = −5.05

p-value = 4.4 × 10⁻⁷

30 / 62

HDL: Pairwise comparisons III

Diet and Diet + exercise groups

H0 : µd = µde (or β1 − β2 = 0)

t = (0.02 − 0.14) / √(0.028/40 + 0.028/39) = −3.19

p-value = 0.0014

31 / 62

Bonferroni corrected p-values

Hypothesis       p-value      adjusted p-value
H0 : µc = µd     0.06         0.18
H0 : µc = µde    4.4 × 10⁻⁷   1.3 × 10⁻⁶
H0 : µd = µde    0.0014       0.0042

Overall α = 0.05
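The adjustment amounts to multiplying each raw p-value by the number of comparisons, as a short sketch shows:

# Bonferroni adjustment: multiply each raw p-value by the number of
# comparisons m (capped at 1), equivalent to testing at level alpha/m.
raw_p = [0.06, 4.4e-7, 0.0014]  # pairwise p-values from the HDL example
m = len(raw_p)
adj_p = [min(1.0, p * m) for p in raw_p]
print(adj_p)  # [0.18, 1.32e-06, 0.0042], matching the table above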

Conclusion: Significant difference in HDL change for DE group compared to other groups

32 / 62

Two-way ANOVA

Uses the same idea as one-way ANOVA by partitioning variability

Allows us to look at interaction of factors

Does the effect due to one factor change as the level of another factor changes?

33 / 62

Example: Public health students’ medical expenditures

Study design: In an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students

Goal: determine how gender and smoking status affect total medical expenditures in this population

34 / 62

Example: Set-up

Y = Total medical expenditures

F = Indicator of Female: 1 if Gender = Female, 0 otherwise

S = Indicator of Smoking: 1 if smoked 100 cigarettes or more, 0 otherwise

35 / 62

Interaction model

We assume the model

Y ∼ N(µ, σ²)

where

µ = E(Y) = β0 + β1F + β2S + β3F · S

What are the interpretations of β0, β1, β2, and β3?

36 / 62

Two-way ANOVA: Interactions

Mean Model

µ = E(Y) = β0 + β1F + β2S + β3F · S

                        Smoker
                 No               Yes
Gender  Male     β0               β0 + β2
        Female   β0 + β1          β0 + β1 + β2 + β3

37 / 62

Mean Model

E(Expenditure | Male, non-smoker) = β0 + β1 · 0 + β2 · 0 + β3 · 0 = β0

E(Expenditure | Female, non-smoker) = β0 + β1 · 1 + β2 · 0 + β3 · 0 = β0 + β1

E(Expenditure | Male, smoker) = β0 + β1 · 0 + β2 · 1 + β3 · 0 = β0 + β2

E(Expenditure | Female, smoker) = β0 + β1 · 1 + β2 · 1 + β3 · 1 = β0 + β1 + β2 + β3
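A small sketch computing these cell means, plugging in the coefficient estimates reported on the "Two-way ANOVA I" slide below:

# Cell means implied by the interaction model
b0, b1, b2, b3 = 5049, 1784, 907, 6239

def mean_expenditure(female, smoker):
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker

print(mean_expenditure(0, 0))  # male non-smoker: 5049
print(mean_expenditure(1, 0))  # female non-smoker: 6833
print(mean_expenditure(0, 1))  # male smoker: 5956
print(mean_expenditure(1, 1))  # female smoker: 13979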

38 / 62

Medical Expenditures: ANOVA table

Source of Variation      Sum of Squares   df    Mean Square   F       p-value
Model (between groups)   1.7 × 10⁹        3     5.6 × 10⁸     28.11   < 0.001
Error (within groups)    3.9 × 10⁹        196   2.0 × 10⁷
Total                    5.6 × 10⁹        199

39 / 62

Medical Expenditures: Results

Overall model F-test:

H0 : β1 = β2 = β3 = 0

Ha : At least one group is different

Test statistic: Fobs = 28.11

df1 = k − 1 = 3

df2 = N − k = 196

p-value < 0.001

40 / 62

Medical Expenditures: Overall Conclusions

The medical expenditures are different in at least one of the groups

Now we can figure out which ones. . .

41 / 62

Medical Expenditures: Two-way ANOVA I

Table of coefficient estimates

Coefficient            Estimate   Standard Error
β0 (baseline)          5049       597
β1 (female effect)     1784       765
β2 (smoker effect)     907        1062
β3 (female × smoker)   6239       1422

42 / 62

Medical Expenditures: Two-way ANOVA II

Test statistics and confidence intervals

Coefficient   t      P > |t|   95% Confidence interval
β0            8.45   0.000     (3870, 6228)
β1            2.33   0.021     (276, 3292)
β2            0.85   0.394     (−1187, 3001)
β3            4.39   0.000     (3434, 9043)

43 / 62

Medical Expenditures: Group-wise Conclusions

In this population, an average male non-smoker spends about $5000 on medical costs per year

Males who smoked were estimated to have spent about $900 more than non-smokers, but this difference was not found to be statistically significant

Female non-smokers spent about $1700 more than their non-smoking male counterparts

Female smokers spent about $8900 (= β1 + β2 + β3) more than non-smoking males

44 / 62

Association and Correlation I

Association

Expresses the relationship between two variables

Can be measured in different ways, depending on the nature of the variables

For now, we’ll focus on continuous variables (e.g. height, weight)

Important note: association does not imply causation

45 / 62

Association and Correlation II

Describing the relationship between two continuous variables

Correlation analysis

Measures strength of relationship between two variables

Specifies direction of relationship

Regression analysis

Concerns prediction or estimation of an outcome variable, based on the value of another variable (or variables)

46 / 62

Correlation analysis

Plot the data (or have a computer do so)

Visually inspect the relationship between two continuous variables

Is there a linear relationship (correlation)?

Are there outliers?

Are the distributions skewed?

47 / 62

Correlation Coefficient I

Measures the strength and direction of the linear relationship between two variables X and Y

Population correlation coefficient:

ρ = cov(X, Y) / √(var(X) · var(Y)) = E[(X − µX)(Y − µY)] / √(E[(X − µX)²] · E[(Y − µY)²])

48 / 62

Correlation Coefficient II

The correlation coefficient, ρ, takes values between -1 and +1

-1: Perfect negative linear relationship

0: No linear relationship

+1: Perfect positive linear relationship

49 / 62

Correlation Coefficient III

Sample correlation coefficient:

Obtained by plugging sample estimates into the population correlation coefficient

r = sample cov(X, Y) / √(s²X · s²Y)

  = [Σ(Xi − X̄)(Yi − Ȳ)/(n − 1)] / √{[Σ(Xi − X̄)²/(n − 1)] · [Σ(Yi − Ȳ)²/(n − 1)]}

where each sum runs over i = 1, . . . , n
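A sketch checking this formula against NumPy’s built-in (the (n − 1) factors cancel between numerator and denominator; data simulated for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=0.8, size=50)

# Sample correlation computed from its definition
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

print(np.isclose(r_manual, np.corrcoef(x, y)[0, 1]))  # True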

50 / 62

Correlation Coefficient IV

Plot standardized Y versus standardized X

Observe an ellipse (elongated circle)

Correlation is the slope of the major axis

51 / 62

Correlation Notes

Other names for r:

Pearson correlation coefficient

Product-moment correlation coefficient

Characteristics of r:

Measures *linear* association

The value of r is independent of the units used to measure the variables

The value of r is sensitive to outliers

r² tells us what proportion of variation in Y is explained by the linear relationship with X

52 / 62

Several levels of correlation

53 / 62

Examples of the Correlation Coefficient I

Perfect positive correlation, r ≈ 1


54 / 62

Examples of the Correlation Coefficient II

Perfect negative correlation, r ≈ -1


55 / 62

Examples of the Correlation Coefficient III

Imperfect positive correlation, 0 < r < 1


56 / 62

Examples of the Correlation Coefficient IV

Imperfect negative correlation, −1 < r < 0


57 / 62

Examples of the Correlation Coefficient V

No relation, r ≈ 0


58 / 62

Examples of the Correlation Coefficient VI

Some relation but little *linear* relationship, r ≈ 0


59 / 62

Association and Causality

In general, association between two variables means there is some form of relationship between them

The relationship is not necessarily causal

Association does not imply causation, no matter how much we would like it to

Example: Hot days, ice cream, drowning

60 / 62

Sir Bradford Hill’s Criteria for Causality I

Strength: magnitude of association

Consistency of association: repeated observation of the association in different situations

Specificity: uniqueness of the association

Temporality: cause precedes effect

61 / 62

Sir Bradford Hill’s Criteria for Causality II

Biologic gradient: dose-response relationship

Biologic plausibility: known mechanisms

Coherence: makes sense based on other known facts

Experimental evidence: from designed (randomized) experiments

Analogy: with other known associations

62 / 62
