
STATISTICS WORKSHOP - 2

Contingency tables
Correlation
Analysis of variance

Why relations between variables are important

• The ultimate goal of every research project or scientific analysis is to find relations between variables.

• The philosophy of science teaches us that there is no other way of representing “meaning” except in terms of relations between some quantities or qualities; either way involves relations between variables.

• The advancement of science must always involve finding new relations between variables.

Qualitative Data (Contingency Table)

                         Variable 2
                Factor 1   Factor 2   Factor 3
Variable 1
  Factor A        n11        n12        n13
  Factor B        n21        n22        n23

Example: This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancer) and, for a set of 1000 markers, presence/absence of each marker in each patient. This would yield 1000 contingency tables of dimensions 6x2 (each marker by each cancer type).

Contingency Table: Question

Is there evidence in the data for association between the categorical variables?

For cross-classified data, the Pearson chi-square test for independence and Fisher's exact test can be used to test the null hypothesis that the row and column classification variables of the data's two-way contingency table are independent.

Chi-Square test

For a 2x2 table with counts a, b in the first row and c, d in the second:

χ² = Σ (observed − expected)² / expected, where expected = (row total × column total) / n

Odds Ratio (OR) = (ad) / (bc)

Relative Risk (RR) = [a/(a+b)] / [c/(c+d)] = a(c+d) / [c(a+b)]

Contingency Table

Chi-Square Test

Example

3,500 people were observed and recorded as snoring or not.

Is there an association between snoring and gender?

Contingency Table

SEX * SNORING Crosstabulation (Count)

                 SNORING
           YES      NO    Total
MALE      1029     698     1727
FEMALE     855     918     1773
Total     1884    1616     3500


Contingency Table

Chi-Square Tests

                            Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square          45.424(b)   1    .000
Continuity Correction(a)    44.968      1    .000
Likelihood Ratio            45.532      1    .000
Fisher's Exact Test                                        .000         .000
N of Valid Cases            3500

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 797.38.

Odds ratio = 1.58; 95% CI = 1.39 to 1.81
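These SPSS numbers can be reproduced with a short stdlib-only Python sketch (no SPSS or scipy assumed); the continuity-corrected and likelihood-ratio statistics are omitted here:

```python
import math

# Snoring x gender counts from the crosstabulation above
#          YES    NO
# MALE    1029   698
# FEMALE   855   918
a, b, c, d = 1029, 698, 855, 918
n = a + b + c + d
rows, cols = [a + b, c + d], [a + c, b + d]
obs = [[a, b], [c, d]]

# Pearson chi-square: sum of (observed - expected)^2 / expected
chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))
p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with df = 1

# Odds ratio with a 95% CI from the log-odds standard error
or_ = (a * d) / (b * c)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = (math.exp(math.log(or_) + s * 1.96 * se) for s in (-1, 1))

print(f"chi2 = {chi2:.3f}, p = {p:.2g}")            # chi2 = 45.424
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```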

Contingency Table

SEX * SMOKER Crosstabulation

Count

14 20 16 50

6 10 24 40

20 30 40 90

BOYS

GIRLS

SEX

Total

HEAVY LIGHT NON

SMOKER

Total

Is there evidence of differences in smoking pattern between the sexes?

Contingency Table

Chi-Square Tests

                      Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square    7.110(a)   2    .029
Likelihood Ratio      7.187      2    .028
N of Valid Cases      90

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.89.
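The same check works for a general r x c table; a minimal stdlib-Python sketch (for df = 2 the chi-square survival function is exactly exp(-x/2)):

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return stat, (len(rows) - 1) * (len(cols) - 1)

# Smoking pattern by sex: columns HEAVY, LIGHT, NON
stat, df = chi_square([[14, 20, 16],    # BOYS
                       [6, 10, 24]])    # GIRLS
p = math.exp(-stat / 2)  # valid for df = 2 only
print(f"chi2 = {stat:.3f}, df = {df}, p = {p:.3f}")  # chi2 = 7.110, df = 2, p = 0.029
```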

Measuring treatment differences with Y/N response

• For outcomes such as reduction in blood pressure there are obvious summaries of treatment effect such as the difference between the average of each group

• For yes/no outcomes like death or cure the choice of summary is not so obvious

               Dead
             Y        N      Risk of death
aspirin     804     7783       9.4%
placebo    1016     7584      11.8%
TOTAL      1820    15367

Relative Risk or Risk Ratio

• Relative risk or risk ratio: risk of death in aspirin group divided by risk in placebo group:

Relative Risk = 9.4% / 11.8% = 0.80

“mortality is reduced by 20%”

• Relative risk estimates are likely to generalise well from one population to another

Absolute Risk Difference

• Absolute risk difference is the proportion of deaths in the aspirin group minus the proportion in the placebo group

risk difference = 9.4% - 11.8% ≈ -2.4%

"about 2.4 lives saved for each 100 patients treated"

• Risk difference has a more direct clinical interpretation, especially when considering cost-effectiveness

Odds Ratio

• Odds ratio: the odds of death in the aspirin group divided by the odds in the placebo group

Odds Ratio = (804/7783) / (1016/7584) = 0.77

"reduction of 23% in the odds of death"

• The odds ratio has some purely mathematical advantages, but it is not much used in randomised studies
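The three summaries can be computed directly from the table (plain Python; the exact values differ slightly from the slide's, which are computed from rounded percentages):

```python
dead_a, alive_a = 804, 7783    # aspirin arm
dead_p, alive_p = 1016, 7584   # placebo arm

risk_a = dead_a / (dead_a + alive_a)   # 0.094 -> the 9.4% above
risk_p = dead_p / (dead_p + alive_p)   # 0.118 -> the 11.8% above

rr = risk_a / risk_p                            # relative risk, ~0.79 (slide: 0.80)
ard = risk_a - risk_p                           # risk difference, ~ -2.45 per 100
or_ = (dead_a / alive_a) / (dead_p / alive_p)   # odds ratio, ~0.77

print(f"RR = {rr:.2f}, risk difference = {100 * ard:.2f} per 100, OR = {or_:.2f}")
```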

Berkson’s Fallacy

• It is a treatment-seeking bias so called because Berkson indicated that individuals with more than one disorder are more likely to seek clinical services than are those with only one disorder. 

• This leads to an erroneously higher estimate of the prevalence of the association between these disorders than would be the case if each single disorder independently led the patient to seek care.

Berkson’s Fallacy

• 2784 individuals were surveyed to determine whether each subject suffered from disease A, disease B, or both. 257 of the 2784 subjects were hospitalised for their condition.

Hospitalised patients (n = 257):

                 Disease B
Disease A      Yes     No   Total
Yes              7     29      36
No              13    208     221
Total           20    237     257

P < 0.025: there is some association between having disease A and having disease B.

Whole survey population (n = 2784):

                 Disease B
Disease A      Yes     No   Total
Yes             22    171     193
No             202   2389    2591
Total          224   2560    2784

P > 0.1: there is no association between having disease A and having disease B.
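A sketch that reproduces this contrast (stdlib Python, using a Yates continuity-corrected chi-square for each 2x2 table, which is what puts the borderline population p-value just above 0.1):

```python
import math

def yates_chi2(a, b, c, d):
    """Continuity-corrected chi-square and two-sided p-value for a 2x2 table."""
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    obs = [[a, b], [c, d]]
    stat = sum((abs(obs[i][j] - rows[i] * cols[j] / n) - 0.5) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    return stat, math.erfc(math.sqrt(stat / 2))  # df = 1 survival function

_, p_hosp = yates_chi2(7, 29, 13, 208)       # hospitalised patients only
_, p_pop = yates_chi2(22, 171, 202, 2389)    # whole survey population

print(f"hospitalised: p = {p_hosp:.4f} (< 0.025, apparent association)")
print(f"population:   p = {p_pop:.4f} (> 0.1, no association)")
```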

Gene Association Studies Typically Wrong

• Evolution of the strength of an association as more information is accumulated. The strength of the association is shown as an estimate of the odds ratio (OR) without confidence intervals.

• a, Eight topics in which the results of the first study or studies differed beyond chance (P<0.05) when compared with the results of the subsequent studies.

• b, Eight topics in which the first study or studies did not claim formal statistical significance for the genetic association but formal significance was reached by the end of the meta-analysis.

• Each trajectory starts at the OR of the first study or studies. Updated cumulative OR estimates are obtained at the end of each subsequent year, summarizing all information to that time.

(Adapted from J.P.Ioannidis et al., Nature Genetics 29:306-9, 2001)

Studies of disease association

• Given the number of potentially identifiable genetic markers and the multitude of clinical outcomes to which these may be linked, the testing and validation of statistical hypotheses in genetic epidemiology is a task of unprecedented scale

Testing for equality of two proportions

Example: Two groups of genes
1. genes for transcription and translation
2. genes in the immune system

Question: Do they have similar purine-pyrimidine compositions?

The question is asking whether the percentage of purines (or pyrimidines) in group 1 is the same as the percentage of purines (or pyrimidines) in group 2.

To form the null and alternative hypotheses we can say:
G1 = the percentage of purines in group 1
G2 = the percentage of purines in group 2

H0: G1 = G2    H1: G1 ≠ G2 (i.e., G1 > G2 or G2 > G1)
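A pooled two-proportion z-test is one common way to test this H0; a minimal sketch (the purine counts below are hypothetical, invented for illustration, not from the slide):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# Hypothetical: 620 purines out of 1000 bases in group 1, 580/1000 in group 2
z, p = two_proportion_z(620, 1000, 580, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```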

Correlation

• Correlation can be used to summarise the amount of linear association between two continuous variables x and y.

• Let (x1, y1), (x2, y2), ..., (xn, yn) denote the data points.

• A scatter plot gives a "cloud" of points

[Scatter plots: three "clouds" of points illustrating positive correlation, negative correlation and no correlation]

Positive and Negative Association

• If the points are nearly in a straight line then knowing the value of one variable helps you to predict the value of the other.

• If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.

A simple correlation formula

• Suppose there are n points altogether and that n(A) is the number in region A, and similarly for n(B), n(C) and n(D)

• Give a value of 1/n to every point in A or C and -1/n to every point in B or D

• Define

Cor = [n(A) + n(C) − n(B) − n(D)] / n

• What are the properties of Cor?

[Diagram: the (X, Y) plane divided about the point of means (x̄, ȳ) into four quadrants: A (upper right), B (lower right), C (lower left), D (upper left)]
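The quadrant-count measure above can be sketched in a few lines (plain Python; quadrants are taken about the point of means, and points lying exactly on a mean line contribute 0):

```python
def quadrant_cor(points):
    """Crude correlation: +1/n for each point in quadrant A or C
    (both coordinates on the same side of their means), -1/n for B or D."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    score = 0
    for x, y in points:
        prod = (x - xbar) * (y - ybar)
        if prod > 0:        # quadrant A or C
            score += 1
        elif prod < 0:      # quadrant B or D
            score -= 1
    return score / n

# Any strictly increasing cloud puts every point in A or C, so Cor = 1
print(quadrant_cor([(1, 1), (2, 2), (3, 4), (4, 8)]))   # 1.0
print(quadrant_cor([(1, 8), (2, 4), (3, 2), (4, 1)]))   # -1.0
```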

The Pearson product moment correlation coefficient

• The formula for Cor works, but it is rather crude: very differently shaped clouds of points can all give Cor = 1.

• A better measure uses the deviations themselves: (x − x̄) and (y − ȳ) are positive or negative in the different regions, and so is their product (x − x̄)(y − ȳ).

• The sum Σ(xᵢ − x̄)(yᵢ − ȳ) will not, by itself, lie between -1 and 1. It depends on:
  • the scale of x and y
  • the number of points

Correlation formula

yxxy ss

yxr

),cov(

n

yyxxyx

i

n

ii ))((

),cov( 1

yx

i

n

ii

xy sns

yyxxr

))((1

22 )()(

)()(

yyxx

yyxx

Partial correlation: Correlation between 2 variables that controls for the effects of one or more other variables.

Rank Correlation: r_s = 1 − 6 Σᵢ dᵢ² / [n(n² − 1)], where dᵢ is the difference between the two ranks of the i-th pair and n is the number of pairs.

Pearson Correlation Coefficient

• A measure of linear association between two variables, denoted as r.

• Values of the correlation coefficient range from -1 to 1.

• The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships.

Interpretation of correlation

• r measures the extent of linear association between two continuous variables.

• Association does not imply causation - both variables may be affected by a third variable.

• If r = 0, there is no linear association between X and Y

• r does not indicate the extent of non-linear associations

• The value of r can be affected by outliers

Correlations Do Not Establish Causality

Example: When a gene is isolated that has some positive correlation with cancer, the claim often made is that it enhances susceptibility to the disease, not that it causes it.

Some misconceptions

• When the value of the correlation coefficient is large (small), the relation between the two variates is close to linear; e.g., when r = 0.9 or 0.95 the relation is nearly linear.

• When the value of the correlation coefficient is zero or near zero, the two variates have no or almost no functional relation.

• When the value of the correlation coefficient is positive (negative), the value of Y becomes larger (smaller) as a whole as the value of X becomes larger.

Example: Let (X, Y) take the values (1,−1), (2,−2), (3,−3), (4,−4), (5,20), each with probability 1/5. Then Cor(X, Y) = 0.62. Yet for the first four points Y decreases as X increases: even when the correlation coefficient between X and Y is positive, Y does not always increase as a whole as X increases.
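A quick numerical check of this counterexample (plain Python, Pearson r on the five equally likely points):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [-1, -2, -3, -4, 20]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - xbar) ** 2 for x in xs)
                * sum((y - ybar) ** 2 for y in ys))
r = num / den
print(round(r, 2))  # 0.62, positive even though Y falls on the first four points
```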

Examples

• Eg1. In Australia, total alcohol consumption and the number of ministers of religion have both increased over time and would be positively correlated, but the increase in one has not caused the increase in the other (both are related to the total population size).

• Eg2. In Japanese schoolchildren shoe size was reported to be correlated (positively) with scores on a test of mathematical ability.

• Eg3. Extracting informative genes with negative correlation for accurate cancer classification

Effectiveness of the first Cold-War arms agreement

• "Most important, the negative correlation between the mutation rate and the parental year of birth [among those born between 1950 and 1956] provides experimental evidence for change in human germline mutation rate with declining exposure to ionizing radiation and therefore shows that the Moscow treaty banning nuclear weapon tests in the atmosphere (August 1963) has been effective in reducing genetic risk to the affected population."

Example - Heights and weights of 6 female students

• The table below shows the heights and weights of 6 female students. How closely related are the heights and the weights?

Student    1    2    3    4    5    6
Height   167  170  160  152  157  160
Weight    60   64   57   46   55   50

The correlation coefficient = 0.904
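One way to check the 0.904, using the product-moment formula from earlier (stdlib Python):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs)
                    * sum((y - ybar) ** 2 for y in ys))
    return num / den

heights = [167, 170, 160, 152, 157, 160]
weights = [60, 64, 57, 46, 55, 50]
print(round(pearson_r(heights, weights), 3))  # 0.904
```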

Spearman Correlation Coefficient

• Commonly used nonparametric measure of correlation between two ordinal variables. For all of the cases, the values of each of the variables are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks.

Rank Correlation

• 10 students, arranged in alphabetical order, were ranked according to their achievements in both the laboratory and lecture sections of a biology course. Find the coefficient of rank correlation.

Lab:      8   3   9   2   7  10   4   6   1   5
Lecture:  9   5  10   1   8   7   3   4   2   6

Rank correlation = 0.8545
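The 0.8545 follows from the Spearman formula r_s = 1 − 6 Σd² / [n(n² − 1)] applied to the two rankings (plain Python):

```python
lab     = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
lecture = [9, 5, 10, 1, 8, 7, 3, 4, 2, 6]

n = len(lab)
d2 = sum((a - b) ** 2 for a, b in zip(lab, lecture))  # sum of squared rank differences
rs = 1 - 6 * d2 / (n * (n * n - 1))
print(round(rs, 4))  # 0.8545
```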

Thoughts…

Patterns often emerge before the reasons for them become apparent. - Vasant Dhar

If you do not expect, you cannot find the unexpected. - Heraclitus

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. - R.A. Fisher