
PhD course in Basic Biostatistics – Day 2
Erik Parner, Department of Biostatistics, Aarhus University©

Exercise 1.2+1.4 (Triglyceride)
Logarithms and exponentials

Two independent samples from normal distributions
The model, check of the model, estimation
Comparing the two means
Approximate confidence interval and test
Exact confidence interval and test using the t-distribution

Comparing two populations using a non-parametric test
The Wilcoxon-Mann-Whitney test

Type 1 and type 2 errors

Statistical power

Simple sample size calculations


Overview

Data to analyse | Type of analysis               | Unpaired/Paired | Type            | Day
----------------+--------------------------------+-----------------+-----------------+------
Continuous      | One sample mean                | Irrelevant      | Parametric      | Day 1
                |                                |                 | Nonparametric   | Day 3
                | Two sample mean                | Non-paired      | Parametric      | Day 2
                |                                |                 | Nonparametric   | Day 2
                |                                | Paired          | Parametric      | Day 3
                |                                |                 | Nonparametric   | Day 3
                | Regression                     | Non-paired      | Parametric      | Day 5
                | Several means                  | Non-paired      | Parametric      | Day 6
                |                                |                 | Nonparametric   | Day 6
Binary          | One sample mean                | Irrelevant      | Parametric      | Day 4
                | Two sample mean                | Non-paired      | Parametric      | Day 4
                |                                | Paired          | Parametric      | Day 4
                | Regression                     | Non-paired      | Parametric      | Day 7
Time to event   | One sample: Cumulative risk    | Irrelevant      | Nonparametric   | Day 8
                | Regression: Rate/hazard ratio  | Non-paired      | Semi-parametric | Day 8


Exercise 1.2+1.4 (Triglyceride)

Assuming that the triglyceride measurements follow a normal distribution gave invalid results: e.g. the PI did not have 2.5% below and above the two limits.

The triglyceride measurements may, however, be analyzed using a normal model on the log-transformed data. We then need to transform the results back to the original scale to obtain useful results on the triglyceride measurements.

The method presented on the next overheads relies on the fact that percentiles are preserved when transforming the data.


Exercise 1.2+1.4 (Triglyceride)

[Figure: histograms (density) of ln trigly and of trigly.]

On the log scale (ln trigly): PI (-1.54; -0.01); mean -0.77, CI (-0.81; -0.74).

Transformed back with exp (trigly): PI (0.21; 0.99); median 0.46, CI (0.44; 0.48).
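The back-transformation on this overhead can be checked numerically. The following is a minimal sketch in Python (not part of the course material, which uses Stata); the variable names are illustrative, and the only inputs are the log-scale limits quoted on the overhead.

```python
import math

# Log-scale results read off the overhead (ln trigly).
log_pi = (-1.54, -0.01)   # prediction interval on the log scale
log_mean = -0.77          # estimated mean (= median) on the log scale
log_ci = (-0.81, -0.74)   # confidence interval for the mean on the log scale

# exp is an increasing function, so interval end points map to end points
# and the median on the log scale maps to the median on the original scale.
pi = tuple(round(math.exp(v), 2) for v in log_pi)
median = round(math.exp(log_mean), 2)
ci = tuple(round(math.exp(v), 2) for v in log_ci)

print("PI on the original scale:", pi)
print("Median on the original scale:", median, "with CI", ci)
```

Note that exp of the log-scale mean estimates the median of the triglyceride measurements, not their mean.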


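Logarithmic and exponential transformations

Medians and percentiles are preserved when making a transformation of the data:

$$\exp(X) \le \exp(A) \iff X \le A \iff \log(X) \le \log(A)$$

[Figure: a distribution shown on the log scale and on the original scale, linked by log and exp; the points with 16% and 50% of the distribution to the right are mapped onto each other.]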


Logarithmic and exponential transformations

The basic properties of the logarithms and exponentials that we will use throughout the course:

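$$\log(a \cdot b) = \log(a) + \log(b), \qquad \log(a / b) = \log(a) - \log(b)$$

$$\exp(a + b) = \exp(a) \cdot \exp(b), \qquad \exp(a - b) = \exp(a) / \exp(b)$$

log turns a product into a sum; exp turns a sum into a product.

$$\log(a^b) = b \cdot \log(a), \qquad \exp(a \cdot b) = \exp(a)^b = \exp(b)^a$$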
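These rules are easy to verify numerically. A small sketch in Python, where the values of a and b are arbitrary positive numbers chosen only for illustration:

```python
import math

a, b = 2.5, 4.0  # arbitrary positive numbers, chosen for illustration only

# Each entry compares the left-hand and right-hand side of one rule.
checks = {
    "log(a*b) = log(a) + log(b)": math.isclose(math.log(a * b), math.log(a) + math.log(b)),
    "log(a/b) = log(a) - log(b)": math.isclose(math.log(a / b), math.log(a) - math.log(b)),
    "exp(a+b) = exp(a) * exp(b)": math.isclose(math.exp(a + b), math.exp(a) * math.exp(b)),
    "exp(a-b) = exp(a) / exp(b)": math.isclose(math.exp(a - b), math.exp(a) / math.exp(b)),
    "log(a**b) = b * log(a)": math.isclose(math.log(a ** b), b * math.log(a)),
    "exp(a*b) = exp(a)**b": math.isclose(math.exp(a * b), math.exp(a) ** b),
}

for rule, holds in checks.items():
    print(rule, "->", holds)
```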


Logarithms and the normal distribution

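Assume Y is the measurement and that log(Y) = X follows a normal distribution with mean = median = $\mu$ and standard deviation = $\sigma$. Then Y = exp(X) has:

$$\text{median}(Y) = \exp(\mu)$$

$$\text{mean}(Y) = \exp(\mu + 0.5\,\sigma^2)$$

$$\text{sd}(Y) = \text{mean}(Y) \cdot \sqrt{\exp(\sigma^2) - 1}$$

$$\text{cv}(Y) = \frac{\text{sd}(Y)}{\text{mean}(Y)} = \sqrt{\exp(\sigma^2) - 1}$$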
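As an illustration, these formulas can be evaluated for the triglyceride example. The sketch below is in Python; the function name is hypothetical, and the value sigma = 0.39 is an assumption backed out from the width of the log-scale prediction interval, (-0.01 - (-1.54)) / (2 * 1.96), rather than a number given on the overheads.

```python
import math

def lognormal_summaries(mu, sigma):
    """Summaries of Y = exp(X) when X is normal with mean mu and sd sigma."""
    median = math.exp(mu)
    mean = math.exp(mu + 0.5 * sigma ** 2)
    cv = math.sqrt(math.exp(sigma ** 2) - 1)
    sd = mean * cv
    return {"median": median, "mean": mean, "sd": sd, "cv": cv}

# mu is the log-scale mean from the triglyceride overhead;
# sigma is an assumed value derived from the log-scale PI width.
summaries = lognormal_summaries(mu=-0.77, sigma=0.39)
for name, value in summaries.items():
    print(f"{name}: {value:.3f}")
```

The mean of Y is always larger than the median of Y when sigma > 0, which reflects the right skewness of the distribution on the original scale.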


Logarithm and the normal distribution

If X has a normal distribution with mean = median = $\mu$ and standard deviation = $\sigma$, then

• a valid 95% CI for $\mu$ will transform into a valid 95% CI for the median of Y = exp(X)

• a valid 95% PI for X will transform into a valid 95% PI for Y = exp(X)


Body temperature versus gender

Scientific question: Do the two genders have different normal body temperatures?

Design: 130 participants were randomly sampled, 65 males and 65 females

Data: Measured temperature, gender

Summary of the data (the units are degrees Celsius):

--------------------------------------------------------------
   Gender | N(tempC)  mean(tempC)  sd(tempC)  med(tempC)
----------+---------------------------------------------------
     Male |       65     36.72615   .3882158        36.7
   Female |       65     36.88923   .4127359        36.9
--------------------------------------------------------------


Body temperature: Plotting the data

The data looks “fine” - a few outliers among females?

[Figure 2.1: two plots of temperature (C) by gender (Male, Female).]


Body temperature: Checking the normality in each group

[Figure 2.2: histograms (density) of temperature by gender, and normal Q-Q plots (observed temperature against inverse normal) for males and for females.]

Normality looks ok!


Body temperature: The model

A statistical model:

Two independent samples from normal distributions, i.e.

• the two samples are independent

and each is assumed to be a random sample from a normal distribution:

1. The observations are independent (knowing one observation will not alter the distribution of the others)

2. The observations come from the same distribution, e.g. they all have the same mean and variance.

3. This distribution is a normal distribution with unknown mean, $\mu_i$, and standard deviation, $\sigma_i$: $N(\mu_i, \sigma_i^2)$


Body temperature: Checking the assumptions

The first two: think about how the data were collected!

1. Independence between groups: the information comes from different individuals. Independence within groups: the data are from different individuals, so the assumption is probably ok.

2. In each group: The observations come from the same distribution. Here we can only speculate. Does the body temperature depend on known factors of interest, for example heart rate, time of day, etc.?


Body temperature: The estimates

The estimates are found as on day 1:

$$\hat\mu_M = 36.73\ (36.63; 36.82), \quad \hat\sigma_M = 0.388, \quad \text{sem}(\hat\mu_M) = 0.048$$

$$\hat\mu_F = 36.89\ (36.79; 36.99), \quad \hat\sigma_F = 0.413, \quad \text{sem}(\hat\mu_F) = 0.051$$

Observe that the width of the prediction interval is approximately

2 * 1.96 * 0.4 C = 1.6 C,

so there is a large variation in body temperature between individuals within each of the two groups.

We see that the average body temperature is higher among women.


Body temperature: Estimating the difference

Remember that the focus is on the difference between the two groups, meaning that we are interested in

$$\delta = \mu_F - \mu_M,$$

the unknown difference in mean body temperature. This is of course estimated by:

$$\hat\delta = \hat\mu_F - \hat\mu_M = 36.89 - 36.73 = 0.16$$

What about the precision of this estimate? What is the standard error of a difference?


The standard error of a difference

If we have two independent estimates and, like here, calculate the difference, then the standard error of the difference is given as

$$\text{se}(\hat\delta) = \text{se}(\hat\mu_F - \hat\mu_M) = \sqrt{\text{se}(\hat\mu_F)^2 + \text{se}(\hat\mu_M)^2}$$

We note that the standard error of a difference between two independent estimates is larger than both of the two standard errors.

In the body temperature data we get:

$$\text{se}(\hat\delta) = \sqrt{0.048^2 + 0.051^2} = 0.070$$

and an approx. 95% CI:

$$\hat\delta \pm 1.96 \cdot \text{se}(\hat\delta) = 0.163 \pm 1.96 \cdot 0.070 = (0.025; 0.301)$$
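The calculation can be reproduced in a few lines. This is a sketch in Python rather than the Stata used in the course; the variable names are illustrative, and the inputs are the unrounded group means and standard errors from the Stata output shown later in these notes.

```python
import math

# Group summaries (unrounded values from the Stata output).
mean_m, se_m = 36.72615, 0.0481522   # males
mean_f, se_f = 36.88923, 0.0511936   # females

# Difference in means and the standard error of a difference
# between two independent estimates.
diff = mean_f - mean_m
se_diff = math.sqrt(se_f ** 2 + se_m ** 2)

# Approximate 95% confidence interval based on the normal distribution.
lower = diff - 1.96 * se_diff
upper = diff + 1.96 * se_diff

print(f"difference = {diff:.3f}, se = {se_diff:.3f}")
print(f"approx. 95% CI: ({lower:.3f}; {upper:.3f})")
```

The standard error of the difference is larger than either of the two group standard errors, as noted above.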


Testing no difference in means

Here we are especially interested in the hypothesis that body temperature is the same for the two genders:

Hypothesis: $\delta = 0$

We can make an approx. test similar to day 1. We have

$$\hat\delta = 0.163\ (0.025; 0.301), \qquad \text{se}(\hat\delta) = 0.070$$

$$z_{obs} = \frac{\hat\delta - 0}{\text{se}(\hat\delta)} = \frac{0.163 - 0}{0.070} = 2.32$$

and find the p-value as

$$2 \cdot \Pr(\text{standard normal} \ge |z_{obs}|)$$

We get p=2.03%
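The approximate test can be sketched in Python using the standard library's `statistics.NormalDist`. The inputs are the unrounded difference and standard error from the Stata output shown later; using the rounded values 0.163 and 0.070 gives a slightly different z.

```python
from statistics import NormalDist

diff = 0.1630766     # estimated difference in means (female minus male)
se_diff = 0.070281   # standard error of the difference

# Test statistic for the hypothesis of no difference.
z_obs = (diff - 0) / se_diff

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))

print(f"z = {z_obs:.2f}, p = {100 * p_value:.2f}%")
```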


Exact inference for two independent normal samples

Just like in the one sample setting, it is possible to make exact inference – based on the t-distribution.

And again these are easily made by a computer.

Remember the model: Two independent samples from normal distributions with means and standard deviations $\mu_M, \sigma_M$ and $\mu_F, \sigma_F$.

Note, both the means and the standard deviations might be different in the two populations.

If one wants to make exact inference, then one has to make the additional assumption:

4. The standard deviations are the same: $\sigma_M = \sigma_F$


Exact inference for two independent normal samples

Testing the hypothesis: $\sigma_M = \sigma_F$

This is done by considering the ratio between the two estimated standard deviations:

$$F_{obs} = \left(\frac{\text{Largest observed standard deviation}}{\text{Smallest observed standard deviation}}\right)^2$$

A large value of this F-ratio is critical for the hypothesis.

The p-value = the probability of observing an F-ratio at least as large as we have observed, given the hypothesis is true!

The p-value is here found by using an F-distribution with $(n_{largest} - 1)$ and $(n_{smallest} - 1)$ degrees of freedom:

$$p\text{-value} = 2 \cdot \Pr\left(F(n_{largest} - 1;\ n_{smallest} - 1) \ge F_{obs}\right)$$


Exact inference for two independent normal samples

Testing the hypothesis: $\sigma_M = \sigma_F$

Here we have:

$$n_F = 65,\ \hat\sigma_F = 0.413 \qquad n_M = 65,\ \hat\sigma_M = 0.388$$

so

$$F_{obs} = \left(\frac{0.413}{0.388}\right)^2 = 1.063^2 = 1.13$$

The observed variance (sd²) is 13% higher among women.

But could this be explained by sampling variation? What is the p-value?

To find the p-value we consult an F-distribution with 64=(65-1) and 64=(65-1) degrees of freedom.

We get p-value = 63%

The difference in the observed standard deviation can be explained by sampling variation.

We accept that $\sigma_M = \sigma_F$! The fourth assumption is ok!


Exact inference for two independent normal samples

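We now have a common standard deviation: $\sigma = \sigma_F = \sigma_M$

This is estimated as a "weighted" average:

$$\hat\sigma = \sqrt{\frac{\hat\sigma_F^2 (n_F - 1) + \hat\sigma_M^2 (n_M - 1)}{(n_F - 1) + (n_M - 1)}} = \sqrt{\frac{0.413^2 (65 - 1) + 0.388^2 (65 - 1)}{(65 - 1) + (65 - 1)}} = 0.401$$

(This pooled estimate is not found in the Stata output.)

Based on this we can calculate a revised/updated standard error of the difference:

$$\text{se}(\hat\delta) = \hat\sigma \sqrt{\frac{1}{n_F} + \frac{1}{n_M}} = 0.401 \sqrt{\frac{1}{65} + \frac{1}{65}} = 0.070$$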
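The pooled standard deviation and the revised standard error can be computed directly from the group summaries. A minimal sketch in Python; the function names are hypothetical, and the standard deviations are the unrounded values from the Stata output.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Weighted average of the two variances, weighted by degrees of freedom."""
    pooled_var = (sd1 ** 2 * (n1 - 1) + sd2 ** 2 * (n2 - 1)) / ((n1 - 1) + (n2 - 1))
    return math.sqrt(pooled_var)

def se_difference(sd_common, n1, n2):
    """Standard error of the difference in means under a common sd."""
    return sd_common * math.sqrt(1 / n1 + 1 / n2)

sd_f, n_f = 0.4127359, 65   # females
sd_m, n_m = 0.3882158, 65   # males

sd_common = pooled_sd(sd_f, n_f, sd_m, n_m)
se_diff = se_difference(sd_common, n_f, n_m)
t_obs = 0.1630766 / se_diff  # estimated difference divided by its standard error

print(f"pooled sd = {sd_common:.3f}, se(diff) = {se_diff:.4f}, t = {t_obs:.2f}")
```

With equal group sizes the pooled variance is simply the average of the two variances.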


Exact inference for two independent normal samples

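Exact confidence intervals and p-values are found by using a t-distribution with $n_M + n_F - 2 = 65 + 65 - 2 = 128$ d.f.

$$\hat\delta = 0.163, \qquad \text{se}(\hat\delta) = 0.070$$

The exact 95% CI:

$$\hat\delta \pm t_{0.975} \cdot \text{se}(\hat\delta) = 0.163 \pm 1.98 \cdot 0.070 = (0.024; 0.302)$$

And the exact test:

$$H: \delta = 0, \qquad t_{obs} = \frac{\hat\delta - 0}{\text{se}(\hat\delta)} = \frac{0.163}{0.070} = 2.32$$

and find the p-value as

$$2 \cdot \Pr(t\text{-distribution} \ge |t_{obs}|)$$

We get p=2.2% (either from a table of the t-distribution, or from Stata)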


Stata: two-sample normal analysis

. cd "D:\Teaching\BasalBiostat\Lectures\Day2"

D:\Teaching\BasalBiostat\Lectures\Day2

. use normtemp.dta, clear

. * Checking the normality.

. qnorm tempC if sex==1, title("Male") name(plot2, replace)

. qnorm tempC if sex==2, title("Female") name(plot3, replace)

. graph combine plot2 plot3, name(plotright, replace) col(1)

The F-test and t-test are easily done in Stata (more details can be found in the file day2.do).


. sdtest tempC, by(sex)

Variance ratio test
---------------------------------------------------------------
   Group |  Obs      Mean  Std.Err.  Std.Dev.  [95% Conf.Interval]
---------+-----------------------------------------------------
    Male |   65  36.72615  .0481522  .3882158  36.62996  36.82235
  Female |   65  36.88923  .0511936  .4127359  36.78696   36.9915
---------+-----------------------------------------------------
combined |  130  36.80769  .0357326  .4074148  36.73699  36.87839
---------------------------------------------------------------
    ratio = sd(Male) / sd(Female)                    f = 0.8847
Ho: ratio = 1                       degrees of freedom = 64, 64

   Ha: ratio < 1         Ha: ratio != 1          Ha: ratio > 1
Pr(F < f) = 0.3128   2*Pr(F < f) = 0.6256   Pr(F > f) = 0.6872


. ttest tempC, by(sex)

Two-sample t test with equal variances
---------------------------------------------------------------
   Group |  Obs      Mean  Std.Err.  Std.Dev.  [95% Conf.Interval]
---------+-----------------------------------------------------
    Male |   65  36.72615  .0481522  .3882158  36.62996  36.82235
  Female |   65  36.88923  .0511936  .4127359  36.78696   36.9915
---------+-----------------------------------------------------
combined |  130  36.80769  .0357326  .4074148  36.73699  36.87839
---------+-----------------------------------------------------
    diff |      -.1630766   .070281            -.3021396 -.0240136
---------------------------------------------------------------
    diff = mean(Male) - mean(Female)                 t = -2.3204
Ho: diff = 0                        degrees of freedom = 128

   Ha: diff < 0          Ha: diff != 0           Ha: diff > 0
Pr(T < t) = 0.0110   Pr(|T| > |t|) = 0.0219   Pr(T > t) = 0.9890


Exact inference for two independent normal samples

What if you reject the hypothesis of the same sd in the two groups?

1. This indicates that the variation in the two groups differs! Think about why!

2. Often it is due to the fact that the assumption of normality is not satisfied. Maybe you would do better by making the statistical analysis on another scale, e.g. log.

3. If you still want to compare the means on the original scale you can make approximate inference based on the t-distribution (e.g. ttest tempC, by(sex) unequal )

4. If you only want to test the hypothesis that the two distributions are located in the same place, then you can use the non-parametric Wilcoxon-Mann-Whitney test (see later).


Body temperature example - formulations

Methods: Data were analyzed as two independent samples from normal distributions, with inference based on Student's t-distribution. The assumption of normality was checked by a Q-Q plot. Estimates are given with 95% confidence intervals.

Results: The mean body temperature was 36.9 (36.8; 37.0) C among women compared to 36.7 (36.6; 36.8) C among men. The mean was 0.16 (0.02; 0.30) C higher for females, and this was statistically significant (p=2.2%).

Conclusion: Based on this study we conclude that women have a small, but statistically significantly higher mean body temperature than men.


Example 7.2 Birth weight and heavy smoking

Scientific question: Do the smoking habits of the mother influence the birth weight of the child?

Design and data: (observational) The birth weights (kg) of children born by 14 heavy smokers and 15 non-smokers were recorded.

Summary of the data (the unit is kg):

------------------------------------------------------------------------
   Group |  Obs     Mean  Std. Err.  Std. Dev.  [95% Conf. Interval]
---------+--------------------------------------------------------------
Non-smok |   15    3.627      .0925      .3584      3.428      3.825
Heavy sm |   14    3.174      .1238      .4631      2.907      3.442

Already here we observe that the average birth weight is lower among heavy smokers: difference=452 g


Example 7.2 Birth weight and heavy smoking

Plot the data!

[Figure: two plots of birth weight by smoking habits (Non-smoker, Heavy smoker).]


Example 7.2 Birth weight and heavy smoking

[Figure: histograms (density) of birth weight by smoking habits, and normal Q-Q plots (observed birth weight against inverse normal) for non-smokers and for heavy smokers.]

Independence, same distribution and normality seem ok.


Example 7.2 Birth weight and heavy smoking: exact inference

Compare the standard deviations (using the computer):

$$F_{obs} = \left(\frac{0.4631}{0.3584}\right)^2 = 1.67, \qquad p = 35\% \text{ from } F(13, 14)$$

We accept that the two standard deviations are identical.

And again by computer we get:

Difference in mean birth weight: 0.452 (0.138; 0.767) kg

Hypothesis: no difference in mean birth weight. p=0.6%

Conclusion of the test: If there were no difference between the two groups, then it would be very unlikely to observe a difference as large as the one we have seen. We therefore reject the hypothesis.


The birth weight example - formulations

Methods - like the body temperature example: Data ……intervals.

Results: The mean birth weight was 3.627 (3.428; 3.825) kg among non-smokers compared to 3.174 (2.907; 3.442) kg among heavy smokers. The difference, 452 (138; 767) g, was statistically significant (p=0.6%).

Conclusion: Children born by heavy smokers have a birth weight that is statistically significantly smaller than that of children born by non-smokers. The study has only limited information on the precise size of the association.

Furthermore we have not studied the implications of the difference in birth weight or whether the difference could be explained by other factors, like eating habits……


Non-Parametric test: Wilcoxon-Mann-Whitney test

Until now we have only made statistical inference based on a parametric model.

E.g. we have focused on estimating the difference between two groups and supplying the estimate with a confidence interval.

We have also performed a statistical test of no difference based on the estimate and the standard error – a parametric test.

There are other types of tests – non-parametric tests – that are not based on a parametric model.

These tests are also based on models, but they are not parametric models.

We will here look at the Wilcoxon-Mann-Whitney test, which is the non-parametric analogue of the two sample t-test.


Non-Parametric test: Wilcoxon-Mann-Whitney test

The key feature of all non-parametric tests is that they are based on the ranks of the data and not the actual values.

      Heavy smokers              Non-smokers
Birth weight    Rank       Birth weight    Rank
    2.340        1             2.710        3
    2.380        2             3.310       10
    2.740        4             3.360       11
    2.860        5             3.410       12
    2.900        6             3.510       14
    3.180        7             3.540       16
    3.230        8             3.600       17.5
    3.270        9             3.610       19
    3.420       13             3.700       23
    3.530       15             3.730       24
    3.600       17.5           3.830       25
    3.650       20.5           3.890       26
    3.650       20.5           3.990       27
    3.690       22             4.080       28
                               4.130       29

Rank 1 is the smallest observation. The two observations of 3.600 are number 17 and 18, and share the rank 17.5.


Non-Parametric test: Wilcoxon-Mann-Whitney test

We can now add the ranks in one of the groups, here the heavy smokers:

Heavy smokers: observed rank sum = 150.5

Hypothesis: The birth weights among heavy smokers and non-smokers are the same.

Assuming the hypothesis is true, one can calculate the expected rank sum among the heavy smokers and the standard error of the observed rank sum, and calculate a test statistic:

$$z_{obs} = \frac{\text{Observed rank sum} - \text{Expected rank sum}}{\text{se}(\text{Observed rank sum})} = \frac{150.5 - 210}{22.91} = -2.597$$

The p-value is found as

$$2 \cdot \Pr(\text{standard normal} \ge |z_{obs}|)$$

P-value = 0.9%
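The rank-sum calculation can be traced step by step. The sketch below is in Python (the course itself uses Stata's `ranksum`); the helper `average_ranks` is a hypothetical name, the data are the 29 birth weights from the table above, and the variance uses the standard tie-corrected formula for the normal approximation, which is an assumption about how the standard error on the overhead was obtained.

```python
import math
from collections import Counter
from statistics import NormalDist

heavy = [2.340, 2.380, 2.740, 2.860, 2.900, 3.180, 3.230,
         3.270, 3.420, 3.530, 3.600, 3.650, 3.650, 3.690]
non = [2.710, 3.310, 3.360, 3.410, 3.510, 3.540, 3.600, 3.610,
       3.700, 3.730, 3.830, 3.890, 3.990, 4.080, 4.130]

def average_ranks(values):
    """Map each distinct value to its rank; tied values share the average rank."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Sorted positions i+1 .. j (1-based) hold the same value.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks

combined = heavy + non
rank_of = average_ranks(combined)
n1, n2 = len(heavy), len(non)
n_total = n1 + n2

rank_sum = sum(rank_of[x] for x in heavy)   # observed rank sum, heavy smokers
expected = n1 * (n_total + 1) / 2           # expected rank sum under the hypothesis

# Variance of the rank sum with the usual correction for ties.
tie_term = sum(t ** 3 - t for t in Counter(combined).values())
variance = n1 * n2 / 12 * ((n_total + 1) - tie_term / (n_total * (n_total - 1)))

z_obs = (rank_sum - expected) / math.sqrt(variance)
p_value = 2 * NormalDist().cdf(-abs(z_obs))

print(f"rank sum = {rank_sum}, expected = {expected}, variance = {variance:.2f}")
print(f"z = {z_obs:.3f}, p = {100 * p_value:.1f}%")
```

The sign of z depends on which group is summed: using the heavy smokers gives a negative z, while Stata reports the statistic for the first group (non-smokers) with the opposite sign and the same p-value.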


Non-Parametric test: Wilcoxon-Mann-Whitney test

We saw that the ranksum among heavy smokers was smaller than expected if there was no true difference between the two groups.

So small that, if the hypothesis were true, we would only observe such a discrepancy in about one out of 100 (p-val=0.9%) studies like this.

We reject the hypothesis!

Conclusion: Children born by heavy smokers have a statistically significantly lower birth weight than children born by non-smokers.

Remember that this depends on the sample size, the design, the statistical analysis...


Non-Parametric test: Wilcoxon-Mann-Whitney test

Some comments:

• There are two assumptions behind the test:

1. Independence between and within the groups.

2. Within each group: The observations come from the same distribution, e.g. they all have the same mean and variance.

• The test is designed to detect a shift in location in the two populations and not, for example, a difference in the variation in the two populations.

• You will only get a p-value: the possible difference in location is not quantified by an estimate with a confidence interval.

• As a test it is just as valid as the t-test!


Stata: Wilcoxon-Mann-Whitney test

. use bwsmoking.dta,clear

(Birth weight (kg) of 29 babies born to 14 heavy smokers and 15 non-smokers)

. ranksum bw, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |  obs   rank sum   expected
-------------+---------------------------------
  Non-smoker |   15      284.5        225
Heavy smoker |   14      150.5        210
-------------+---------------------------------
    combined |   29        435        435

unadjusted variance      525.00
adjustment for ties       -0.26
                     ----------
adjusted variance        524.74

Ho: bw(group==Non-smoker) = bw(group==Heavy smoker)
             z =   2.597
    Prob > |z| =   0.0094


Type 1 and type 2 errors

We will here return to the simple interpretation of a statistical test:

We test a hypothesis: $\delta = \delta_0$

We will make a

Type 1 error if we reject the hypothesis when it is true.

Type 2 error if we accept the hypothesis when it is false.

If we use a specific significance level, $\alpha$ (typically 5%), then we know:

$$\text{The risk of a Type 1 error} = \Pr(\text{reject the hypothesis given it is true}) = \Pr(\text{reject } \delta = \delta_0 \text{ given } \delta = \delta_0) = \alpha$$


Type 1 and type 2 errors

What about the risk of Type 2 error:

$$\Pr(\text{accept the hypothesis given it is not true}) = \Pr(\text{accept } \delta = \delta_0 \text{ given } \delta \ne \delta_0) =\ ?$$

This will depend on several things:

1. The statistical model and test we will be using

2. What is the true value of $\delta$?

3. The precision of the estimate. What is the sample size and standard deviation?

That is, the risk of Type 2 error, $\beta$, is not constant.

Often we consider the statistical power:

$$\Pr(\text{reject } \delta = \delta_0 \text{ given } \delta \ne \delta_0) = 1 - \beta$$


Statistical power – planning a study - testing for no difference

Suppose we are planning a new study of fish oil and its possible effect on diastolic blood pressure (DBP).

Assume we want to make a randomized trial with two groups of equal size and we will test the hypothesis of no difference. We believe that the true difference between groups in DBP is 5mmHg.

Furthermore we believe that the standard deviation in the increase in DBP is 9mmHg.

We plan to include 40 women in each group and analyze using a t-test.

What is the chance, that this study will lead to a statistically significant difference between the two groups, given the true difference is 5mmHg?


[Figure: power in % against the number of observations in each group (0 to 100), with one curve for each of sd=7, sd=8, sd=9 and sd=10. True difference = 5, test for no difference. Marked point: n=40, power=69%.]

Statistical power, when the true difference is 5 and sd = 7, 8, 9 or 10 and we test the hypothesis of no difference.


Statistical power – planning a study

We plan to include 40 women in each group and analyze using a t-test and the true difference is 5mmHg and sd=9mmHg

Power = 69%

That is, there is only a 69% chance that such a study will lead to a statistically significant result, given that the assumptions are true.

How many women should we include in each group if we want to have a power of 90%?

Based on the plot we see that approx. 69 or more women in each group will lead to a power of at least 90%.
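The power can also be approximated by hand. The sketch below, in Python, uses a normal approximation to the two-sample test (an assumption: the plot was presumably made with the t-distribution, which gives slightly lower power, so the numbers agree only approximately). The function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(delta, sd, n, alpha=0.05):
    """Approximate power of a two-sided test of no difference,
    with n observations in each group and a common sd (normal approximation)."""
    nd = NormalDist()
    se = sd * math.sqrt(2 / n)            # standard error of the difference
    z_crit = nd.inv_cdf(1 - alpha / 2)    # 1.96 for alpha = 5%
    shift = abs(delta) / se               # true difference in units of se
    # Probability of ending in either rejection region.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

power_40 = approx_power(delta=5, sd=9, n=40)
power_69 = approx_power(delta=5, sd=9, n=69)

print(f"n = 40 per group: power is about {100 * power_40:.0f}%")
print(f"n = 69 per group: power is about {100 * power_69:.0f}%")
```

The approximation gives about 70% for n=40, close to the 69% read off the plot, and about 90% for n=69.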


[Figure: power in % against the number of observations in each group (0 to 100), with one curve for each of sd=7, sd=8, sd=9 and sd=10. True difference = 5, test for no difference. Marked point: power=90%, n=69.]

Statistical power, when the true difference is 5 and sd = 7, 8, 9 or 10 and we test the hypothesis of no difference.


[Figure: power in % against the number of observations in each group (0 to 100), with one curve for each of sd=7, sd=8, sd=9 and sd=10. True difference = 10, test for no difference.]

The power increases as a function of the expected difference between the groups and decreases as a function of the variation (standard deviation) within the groups.


Power two unpaired normal samples

In general we have five quantities in play:

$\delta = \mu_1 - \mu_2$: The true difference between groups

$\sigma$: The standard deviation within each group

$\alpha$: The significance level (typically 5%)

$\beta$: The risk of type 2 error = 1 - the power

$n$: The sample size in each group

If we know four of these, then we can determine the last.

Typically, we know the first four and want to know the sample size,

or we know $\delta$, $\sigma$, $\alpha$ and $n$, and then we want to know the power.


Stata: power for two unpaired normal samples

Power calculations are done using the sampsi command:

. sampsi 0 5, sd1(9) sd2(9) alpha(0.05) power(0.90)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:
         alpha =   0.0500  (two-sided)
         power =   0.9000
            m1 =        0
            m2 =        5
           sd1 =        9
           sd2 =        9
         n2/n1 =     1.00

Estimated required sample sizes:
            n1 =       69
            n2 =       69

* In Stata 13
* power twomeans 0 5 , sd(9) alpha(0.05) power(0.90)


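By hand: power for two unpaired normal samples

If the sample size is not too small then it can be found by hand by using the formula:

$$n = 2 \cdot \left(\frac{\sigma}{\delta}\right)^2 \cdot f(\alpha, \beta)$$

Risk of type 2 error, β |  50%    20%    10%     5%
Statistical power       |  50%    80%    90%    95%
------------------------+---------------------------
f(5%, β)                |  3.8    7.8   10.5   13.0

If we assume $\alpha = 5\%$, $\delta = 5$, $\sigma = 9$ and $\beta = 10\%$, then

$$n = 2 \cdot \left(\frac{9}{5}\right)^2 \cdot f(5\%, 10\%) = 2 \cdot 1.8^2 \cdot 10.5 = 68$$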
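The hand formula is straightforward to code. The overhead does not say how f is defined; the sketch below, in Python, assumes the usual normal-approximation expression f(α, β) = (z_{1-α/2} + z_{1-β})², which reproduces the tabulated values. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def f(alpha, beta):
    """Assumed form of f: squared sum of the two standard normal quantiles."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(1 - beta)) ** 2

def n_per_group(delta, sd, alpha, beta):
    """Sample size in each group from the hand formula n = 2 (sd/delta)^2 f."""
    return 2 * (sd / delta) ** 2 * f(alpha, beta)

# Reproduce the table row for a 5% significance level.
table = [round(f(0.05, beta), 1) for beta in (0.50, 0.20, 0.10, 0.05)]
print("f(5%, beta):", table)

# The fish oil example: delta = 5, sd = 9, alpha = 5%, beta = 10%.
n_exact = n_per_group(delta=5, sd=9, alpha=0.05, beta=0.10)
print(f"n = {n_exact:.2f}, rounded up to {math.ceil(n_exact)} in each group")
```

Rounding the result up rather than to the nearest integer gives 69 in each group, in agreement with the Stata calculation.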


Comments on sample size calculations

• Most often done by computer (in Stata sampsi)

• There are many different formulas; see Kirkwood & Stern Table 35.1. We will only look at a few in this course.

• It is in general more relevant to test that the difference is larger than a specified value: a so-called superiority or non-inferiority study.

• Or to plan the study so that your study is expected to yield a confidence interval with a certain width.

• You need to know the true difference and you must have an idea of the variation within the groups. The latter you might find based on hospital records or in the literature.

• Sample size calculations after the study has been carried out (post hoc) are nonsense! The confidence interval will show how much information you have in the study.