statistical decision-making -...

39
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval Statistical Decision-Making Hong Li May. 3, 2016 1

Upload: others

Post on 03-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Statistical Decision-Making

Hong Li

May. 3, 2016

1

Page 2: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Objectives

Explain steps in hypothesis testing

Describe null and alternative hypothesis

Define p-value (both one-sided and two-sided)

Define Type-I and Type-II errors

Define power

Understand how to interpret confidence interval (CI) inthe context of hypothesis testing

2

Page 3: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Example: Avastin in Gastric Cancer (AVAGAST)

Paper

Bevacizumab in Combination with Chemotherapy AsFirst-Line Therapy in Advanced Gastric Cancer: ARandomized Double-Blind Placebo-Controlled Phase IIIStudy (Journal of Clincial Oncology Volume 29, Number30, Oct. 20, 2011, pp3968-3976)

Purpose

Evaluate the efficacy of adding bevacizumab tocapecitabine-cisplatin in the first-line treatment ofadvanced gastric cancer

Study design

Double-blind, placebo-controlled Phase III clinical trial

3

Page 4: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Example: Avastin in Gastric Cancer (AVAGAST),

cont.

Treatment: in addition to cisplatin and capecitabine

Treatment arm: BevacizumabControl arm: placebo

Sample size

774 in total with 387 in each arm

Primary endpoint

Overall survivalAssessed every 3 months

4

Page 5: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Example: Avastin in Gastric Cancer (AVAGAST),

cont.

Overall response rate (ORR)

Overall response rate was improved significantly with theaddition of bevacizumab (46% vs. 37.4% in the placebogroup; p-value=0.0315)

5

Page 6: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing

Statistical hypothesis is an assumption about apopulation parameter

Assumption may or may not be true

Hypothesis testing refers to the formal procedures used bystatisticians to accept or reject statistical hypotheses

6

Page 7: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing,cont.

There are two types of statistical hypotheses:

Null hypothesis (H0): usually the hypothesis that sampleobservations result purely from chanceAlternative hypothesis (HA): the hypothesis that sampleobservations are influenced by some non-random cause

7

Page 8: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing, cont.

State the hypotheses. This involves stating the null andalternative hypotheses. The hypotheses are stated in sucha way that they are mutually exclusive. That is, if one istrue, the other must be false

Formulate an analysis plan. The analysis plan describeshow to use sample data to evaluate the null hypothesis.The evaluation often focuses around a single test statistic

Analyze sample data. Find the value of the test statistic(mean score, proportion, t statistic, z-score, etc.)described in the analysis plan

Reject the null or fail to reject the null, depending on thep-value’s extremeness

8

Page 9: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

What Is Population Parameter?

Statistician differentiate between that which is beingestimated and the estimate itself

That which is being estimated is called a populationparameter and we (typically) assume it has a true butunknown value

In this example, there are two population parameters

True (but unknown) ORR for the population of patientsrepresented by those in the study receivingbevacizumab-pbevTrue (but unknown) ORR for the population of patientsrepresented by those in the study receiving placebo-pplac

9

Page 10: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing: Assume a Null Condition

Statistical testing starts with a statement of the nullhypothesis

H0 : pbev = pplac

This can also be written

H0 : pbev − pplac = 0

10

Page 11: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing: State Alternative Hypothesis

Alternative hypothesis is the state of nature we will acceptif the evidence in the data suggests the null is implausible

HA : pbev > pplac

This can also be written

HA : pbev − pplac > 0

This is called a one-tailed or one-sided test

Note that we do not actually specify a value forpbev − pplac under the alternative. We simply state thedirection of the effect (> 0, < 0, or 6= 0)

11

Page 12: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing: Formulate Analysis Plan

Intutively, it makes sense that any test for a comparisonof ORRs should be constructed from their estimates,namely p̂bev and p̂plac

From the results stated in the paper

p̂bev = 46% and p̂plac = 37.4%

or equivalently

p̂bev − p̂plac = 8.6%

Question: are these estimates meaningfully different, or isthe magnitude of their difference a chance occurrence-theluck of the draw?

12

Page 13: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Hypothesis Testing: Formulate Analysis Plan

(cont.)

The test is constructed based on the data (p̂bev − p̂plac)

Goal of test: determine how unlikely it is to observe adifference in the estimated ORRs at least as large as theone we’ve observed, if in fact there really is no differencein ORRs

Allows us to make a probabilistic statement about thelikelihood of the observed difference in ORRs beingattributable to chance and not to the intervention

Requires understanding the sampling distribution ofp̂bev − p̂plac

13

Page 14: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Sampling Distribution

Sampling distribution of a statistic is the distribution ofthat statistic when derived from a random sample of sizen

Example: the test is constructed under the nullhypothesis

Sampling distirbution as a histogram of all possiblevalues of p̂bev − p̂plac you could observed if in fact thetrue ORRs, pbev and pplac are equal

14

Page 15: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Example based on ORRs

pbev = 46% and pplac = 37.4%

H0: pbev = pplac

Under H0, the best estimate of the true underlying ORR(for either arm) is a pooled estimate

Results from the paper

p̂bev = 143/311 = 0.46

p̂plac = 111/297 = 0.374

The pooled estimate

p̂pooled = (143 + 111)/(311 + 297) = 0.418

15

Page 16: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Repeated Sampling under H0

Simulate 10000 data sets with sample sizes of 311 and 297,and assume pbev = pplac = ppooled = 0.418. For each sample,compute pbev and pplac and construct their difference.

16

Page 17: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Histogram of Resulting Differences

17

Page 18: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

What Is P-value?

Question: How can we determine ”likely” or ”unlikely” bydetermining the probability-assuming the null hypothesiswere true- of observing a more extreme test statistic inthe direction of the alternative hypothesis than the oneobserved

Anwser: we estimate the probability of observing adifference in ORRs as or more extreme than the one weobserved in our study. This probability is called a p-value

The p-value is defined as the probability, under theassumption of null hypothesis, of obtaining a result equalto or more extreme than what was actually observed

18

Page 19: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

P-value for Our Example

19

Page 20: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

P-value for Our Example (cont.)

Approximate the theoretical distribution of the differences(blue line overlaing the histogram) using a normaldistribution

P-value is the red-shaded area in the plot

Example: the p-value is approximately 0.016

This probability represents the likelihood of obtaining asample proportion that is at least as extreme as oursample proportion in right tail of the distribution bychance if the null hypothesis is trueThe likelihood of observing a difference of 8.6% (or oneeven larger) by chance alone is less than 2%

20

Page 21: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Other Alternatives

Another one-tailed test would be

HA : pbev < pplac

P-value would be approximated by the relative area to theleft of the observed difference.

21

Page 22: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Other Alternatives (cont.)

22

Page 23: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

P-value in the Paper

A two-tailed or two-sided test is given by the alternativehypothesis

HA : pbev 6= pplac

P-value is approximated by the sum of the relative areasto the left of -0.086 and to the right of 0.086. Theauthors performed a two-sided test and report a p-valueof 0.0315.

23

Page 24: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

P-value in the Paper (cont.)

24

Page 25: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Definition of P-value

The probability of observing a result at least as extremeas the one you observed if the null is true

P value is not the probability of making a mistake byrejecting a true null hypothesis

25

Page 26: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Make Decision

Make a judgement about the null hypothesis based onp-value

Small p-value: observing the outcome or one moreextreme due to chance alone is highly improbable

Reject the null hypothesis in favor of the alternative (callfinding statistically significant)Common practice: p-value to be less than 0.05

P-value is not small: observing the outcome or one moreextreme due to chance alone is probable

Fail to reject the nullUnable to rule out random chance as a plausibleexplanation for any observed difference

26

Page 27: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Actions and Errors

Two types of errors one can commit in conducting astatistical test

27

Page 28: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Actions and Errors (cont.)

Type I error

The probability of rejecting the null given that the null istrue, denoted by α

Type II error

The probability of failing to reject the null given that thealternative is true, denoted by β

28

Page 29: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Power

It is the probability of correctly rejecting the null in favorof the alternative. A well-designed study will strike abalance between acceptable levels of Type I error (usually0.05) and power (often set at a minimum of 0.8).

29

Page 30: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Power (cont.)

The power of a hypothesis test is affected by three factors

Sample size

Other things being equal, the greater the sample size,the greater the power of the test

Significance level (α)

The higher the significance level, the higher the power ofthe testIncrease the significance level, reduce the region ofacceptance. Less likely to accept the null hypothesiswhen it is false; i.e., less likely to make a Type II error.Hence, the power of the test is increased

30

Page 31: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Power (cont.)

The power of a hypothesis test is affected by three factors

The ”true” value of the parameter being tested

The greater the difference between the ”true” value of aparameter and the value specified in the null hypothesis,the greater the power of the test

31

Page 32: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Common Endpoints to Compare Groups

Difference in means

Compares a continuous variable between two groupsNull value is 0two-sample t-test or Wilcoxon rank sum test(independent groups)paired t-test or Wilcoxon signed rank test (pairedgroups)

Difference in proportions

Compares a categorical variable between two groupsNull value is 0Chi-square test or Fisher’s exact test (independentgroups)McNemar’s test (paired groups)

32

Page 33: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Common Endpoints to Compare Groups (cont.)

Odds Ratio (OR)

OR = odds of event E for subjects in group A divided byodds of event E for subjects in group BCompares a binary variable between groups A and BOR > 0Null value is 1In our example, OR = 0.46/(1−0.46)

0.374/(1−0.374)

.= 1.43

There is a 43% increase in the odds of responsecomparing patients randomized to bevacizumab topatients randomized to placebo

33

Page 34: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Common Endpoints to Compare Groups (cont.)

Hazard Ratio (HR)

HR = hazard of death in group A divided by hazard ofdeath in group B (‘hazard’ = ‘risk’)Compares a time-to-event variable between groups Aand BHR > 0Null value is 1Log rank test or Cox proportional hazards regression

34

Page 35: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Putting Some of This together

”On the basis of a systematic literature review, it was assumedthat median OS in the placebo group would be 10 months.The study was powered to test the hypothesis that theaddition of bevacizumab would improve median OS to 12.8months, equivalent to a hazard ratio (HR) of 0.78, 509 deathswere necessary to ensure 80% power for a two-sided log-ranktest at a significance level of 0.05”

35

Page 36: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Confidence Interval (CI)

CI is a type of interval estimate of a population parameter

Estimated range of values that provides a measure ofuncertainty associated with an estimated parameter

The wider the interval the greater the uncertainty in theestimate

95% CI means

If you were able to replicate the exact same study aninfinite number of times, 95% of the resulting CIs wouldcontain the true parameter of interest

36

Page 37: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

CI and Hypothesis Testing

”PFS was prolonged significantly in the bevacizumabgroup compared with the placebo group (HR, 0.8; 95%CI, 0.67 to 0.93; p-value=0.037)”

The best estimate of the true HR is 0.80, but our dataare consistent with a HR ranging from 0.67 to 0.93CI does not contain 1 which means we have even greaterfaith that the benefit observed is real and not just achance occurrence

37

Page 38: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

CI and Hypothesis Testing (cont.)

S be the statistic used to perform a test that comparestwo groups. For example, S could be a difference insample means, a difference in sample proportions, anestimated OR, or an estimated HR. If the test issignificant at α-level 0.05, then the 95% CI onstructedwith S will not contain the relevant null value.

38

Page 39: Statistical Decision-Making - MUSCpeople.musc.edu/.../statistical_decision_making_5_3_2016.pdfStatistical Decision-Making Hong Li May. 3, 2016 1 OutlineAVAGASTHypothesis TestingP-valueErrorsCon

Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval

Subgroup Analysis of Overall Survival in the

Intent-to-Treat Population

39