
Assessment of Audit Methodologies for Bias Evaluation of Tumor Progression in Oncology Clinical Trials (Running title: Evaluation of Audit Methodologies)

Jenny J Zhang Ph.D.1, Lijun Zhang Ph.D.1, Huanyu Chen Ph.D.1, Anthony J Murgo M.D.3, Lori E Dodd Ph.D.2, Richard Pazdur M.D.3, Rajeshwari Sridhara Ph.D.1*

1U.S. Food and Drug Administration, CDER/OTS/OB/DBV, Silver Spring, MD 20993

2Biostatistics Research Branch, National Institute of Allergy and Infectious Disease, NIH, Bethesda, MD 20892

3U.S. Food and Drug Administration, CDER/OND/OHOP, Silver Spring, MD 20993

*Corresponding author: [email protected]; phone: (301) 796-1759; fax: (301) 796-9733

Disclaimer: This article reflects the views of the authors and should not be construed to

represent the FDA’s views or policies.

Conflicts of interest: None.

Downloaded from clincancerres.aacrjournals.org on July 4, 2018. © 2013 American Association for Cancer Research.

Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited. Author Manuscript Published OnlineFirst on March 26, 2013; DOI: 10.1158/1078-0432.CCR-12-3364

http://clincancerres.aacrjournals.org/


Abstract

As progression-free survival (PFS) has become increasingly used as the primary endpoint

in oncology phase 3 trials, the Food and Drug Administration (FDA) has generally

required a complete-case blinded independent central review (BICR) of PFS to assess and

reduce potential bias in the investigator or local site evaluation (LE). However, recent

publications and FDA analyses have shown a high correlation between LE and BICR

assessments of the PFS treatment effect, which questions whether complete-case BICR is

necessary. One potential alternative is to use BICR as an audit tool to detect evaluation

bias in the LE. In this paper, the performance characteristics of two audit methods

proposed in the literature are evaluated on 26 prospective, randomized phase 3

registration trials in non-hematologic malignancies. The results support that a BICR

audit to assess potential bias in the LE is a feasible approach. However, implementation

and logistical challenges need further consideration and discussion.

Keywords

Blinded independent central review (BICR); Progression-free survival (PFS); Audit

methodology; Bias





Introduction

Progression-free survival (PFS) is defined as the time from randomization to either

disease progression or death, whichever occurs first. When PFS is the primary efficacy

endpoint of a clinical trial, the Food and Drug Administration (FDA) and other regulatory

authorities have generally required a blinded independent central review (BICR) under

the assumption that the investigator or local site evaluation (LE) could potentially be

biased due to the subjectivity in the measurement and interpretation of PFS. However,

this approach may lead to a greater than 30% disagreement at the patient-level between

the BICR and LE assessments and/or among the independent reviewers themselves.

These disagreements have been attributed to a variety of reasons, including selection of

different target lesions by the reviewers (1). In addition, since treatment is generally

changed after LE-determined progression resulting in no further protocol-specified

progression assessments before the BICR is conducted, missing data and informative

censoring are limitations of BICR-determined PFS analyses that may result in biased

treatment effect estimates (2). Informative censoring results when patients declared to

have progressive disease by the LE are censored by the BICR due to lack of further tumor

assessments after LE-determined progression. Note that LE assessments may include

local radiographic reads and/or other clinical assessments. BICR radiologists are not part

of the clinical trial investigation, typically do not have information about clinical

assessments, and are blinded to treatment assignment.

The role of BICR was first examined by Dodd and colleagues in 2008 in six phase 3

oncology trials (2), where differences between LE and BICR did not result in different





conclusions about treatment efficacy, despite relatively high discrepancy rates at the

patient-level. The Pharmaceutical Research and Manufacturers of America (PhRMA)

PFS Working Group performed a meta-analysis of 27 trials, which showed that while

discrepancies in determining the progression dates were observed on average in 50% of

patients, the relative treatment effect as measured by a hazard ratio (HR) was similar

when assessed by either LE or BICR (3). To further confirm these results, the FDA

conducted a meta-analysis of 28 randomized, phase 3 trials submitted for review in

consideration of approval across 9 non-hematologic malignant tumor types with BICR-

and LE-assessed PFS results reported (4). Note that there is some overlap of the trials

included in the FDA analysis with that of the analyses in Dodd et al. 2008 (2) and Amit et

al. 2011 (3).

The FDA analysis showed that there existed a high degree of association between LE and

BICR estimates of the PFS treatment effect (r = 0.954 [95% confidence interval (CI):

0.908, 0.977]). The overall ratio of hazard ratios (BICR vs. LE) was 1.03 (95% CI: 0.99,

1.07), indicating only a 3% difference between the two evaluations. Throughout, a

hazard ratio less than 1 indicates a treatment effect in favor of the experimental therapy.

Subgroup analyses of blinded versus open-label trials, interim versus final analysis results,

and first line versus subsequent line indications all showed similarly high correlations (4).

Although an inherent measurement error exists in the reading of radiographic scans and

disagreements between reviewers at the patient-level are commonly observed, regulatory

considerations for drug approval are based on the relative treatment effect at the





population level. Given that the FDA meta-analysis corroborated the high degree of

association between LE and BICR PFS treatment effects purported in recent publications,

complete-case BICR may not be necessary in many oncology trials. These findings

motivate the exploration of a random sample-based BICR audit as an alternative method

for bias evaluation to assess whether or not consistency of the treatment effect can be

concluded.

The idea behind the audit strategy is to increase our confidence in the LE results of PFS

by conducting a BICR in a random sample of patients. The main savings of such a

strategy lies in the situation where there is no actual bias in the LE results and only a

partial BICR audit is needed to confirm that fact. Other potential benefits include: a

reduction in trial complexity and burden to investigators, and avoidance of some missing

data issues. Currently, two methods have been proposed in the literature for a random

sample-based BICR audit.

In 2011, Dodd and colleagues (5) proposed a two-stage procedure based on a hazard ratio

estimator to evaluate the consistency of treatment effect (as measured by a hazard ratio)

between the BICR audited assessments and the LE assessments that is more efficient than

the standard estimator based on the audit subset alone. Amit and colleagues (3) proposed

a procedure based on differential discordance rates of LE versus BICR between the

treatment and control arms. These two approaches will be respectively referred to herein

as Method A and Method B. The objective of this paper is to evaluate the performance

characteristics of these two audit methods.





Methods

Brief summaries of the two audit methods are given below. We refer the reader to the

original publications by Dodd et al. 2011 (5) and Amit et al. 2011 (3) for more details.

The goal of Method A is to provide assurance that the LE PFS treatment effect estimate is

unlikely to be substantially biased. Thus, a BICR audit should only be considered when

the LE hazard ratio indicates a clinically meaningful and statistically significant effect in

favor of the experimental arm. The method is a two-stage procedure. First, the

BICR-based hazard ratio is estimated on the audited subset. Audited subjects are selected as a

simple random sample from all subjects in the study. If the hazard ratio is confirmed to

be significant, then the process concludes. If not, then a complete-case BICR is

conducted. A more efficient estimator of the BICR hazard ratio simultaneously

incorporates information from the patient-level LE data on all cases and the retrospective

random sample BICR audit cases. A formula to estimate the audit size is also provided,

which depends on the effect size and the minimum important difference (MID). The

MID is a threshold value (e.g. HR = 1) used in the proposed two-stage testing procedure.

The upper bound of the confidence interval of the BICR hazard ratio estimate is

compared to the MID to determine whether consistency of the PFS treatment effect has

been verified (5).
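The decision rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple Wald interval on the log hazard ratio with z = 1.96, it takes the hazard ratio estimates and their standard errors as given inputs, and it does not implement the more efficient estimator of Dodd et al. (5). The function names and the numerical inputs are hypothetical.

```python
import math

def upper_ci_bound(hr, se_log_hr, z=1.96):
    """Upper Wald bound for a hazard ratio, computed on the log scale (assumed form)."""
    return math.exp(math.log(hr) + z * se_log_hr)

def consistency_verified(hr, se_log_hr, mid=1.0, z=1.96):
    """True when the upper confidence bound falls below the MID."""
    return upper_ci_bound(hr, se_log_hr, z) < mid

def two_stage_audit(audit_estimate, full_estimate, mid=1.0):
    """Stage 1 checks the audit-based estimate; if it fails, stage 2 uses the
    complete-case BICR estimate. Each estimate is a (hr, se_log_hr) pair."""
    if consistency_verified(*audit_estimate, mid=mid):
        return "partial audit", True
    return "complete-case BICR", consistency_verified(*full_estimate, mid=mid)

# Hypothetical inputs: a clear effect stops at stage 1, a borderline one does not.
print(two_stage_audit((0.70, 0.10), (0.72, 0.08)))
print(two_stage_audit((0.90, 0.10), (0.93, 0.14)))
```

In this sketch the MID defaults to HR = 1, the example threshold given in the text; a stricter MID would simply be passed as `mid`.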

The basis of Method B is to use differential discordance as a measure to detect evaluation

bias. Two measures are defined in Table 1: the early discrepancy rate (EDR) is the





frequency that the LE declares progression earlier than BICR, and the late discrepancy

rate (LDR) is the frequency that the LE declares progression later than BICR. The

differential discordance for each measure is the difference between the rate on the

experimental arm and that on the control arm. A differential discordance beyond a

certain threshold is suggestive of bias being present in the LE assessment, although the

method does not quantify the uncertainty in estimation of the differential discordance. To

implement the method, a threshold of differential discordance that triggers complete-case

BICR is required; increasing the threshold decreases the sensitivity of the method to

detect bias, while improving its specificity to rule it out. Threshold values ranging from

0.075 to 0.100 and BICR audit sizes of 100-160 patients were recommended based on

simulation studies in Amit et al. 2011 (3). A negative differential discordance for EDR

and/or a positive differential discordance for LDR are indicative of bias in the LE result

in favor of the experimental arm, which would trigger a complete-case audit. A negative

differential discordance for EDR means a higher rate of LE progressions being called

earlier than BICR on the control arm, and a positive differential discordance for LDR

means a higher rate of LE progressions being called later than BICR on the experimental

arm (3).

EDR and LDR are calculated as

$\mathrm{EDR} = \dfrac{b + a_3}{a + b}$ and $\mathrm{LDR} = \dfrac{c + a_2}{b + c + a_2 + a_3}$

using the cell values from Table 1.

The following hypothetical example illustrates the calculations. In the experimental arm

of a study, a total of 90 patients were assessed to have PD

by both BICR and LE (cell ‘a’ in Table 1), of which 50 patients had agreement between





BICR and LE on the timing and occurrence of PD (a1), 25 patients had LE declaring PD

later than BICR (a2), and 15 patients had LE declaring PD earlier than BICR (a3). A

total of 30 patients were assessed to have PD by LE but not BICR (cell ‘b’ in Table 1),

and 25 patients were assessed to have PD by BICR but not LE (cell ‘c’ in Table 1).

Seventy patients did not have PD by either LE or BICR assessments (cell ‘d’ in Table 1).

According to the formulas in Table 1, the corresponding EDR and LDR for the

experimental arm are (30+15)/(50+25+15+30) = 45/120 = 0.375 and

(25+25)/(30+25+25+15) = 50/95 = 0.526, respectively. Similar calculations are needed

for the control arm. The differential discordance for EDR/LDR would be the difference

between arms for each rate.
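The calculation above can be reproduced with a short sketch. The experimental-arm counts are those of the hypothetical example in the text; the control-arm counts, the class name, and the function names are assumptions introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ArmCounts:
    a1: int  # LE and BICR agree on timing and occurrence of PD
    a2: int  # LE declares PD later than BICR
    a3: int  # LE declares PD earlier than BICR
    b: int   # PD by LE but not BICR
    c: int   # PD by BICR but not LE
    d: int   # no PD by either assessment (not used in EDR or LDR)

def edr(n):
    """Early discrepancy rate: (b + a3) / (a + b), with a = a1 + a2 + a3."""
    return (n.b + n.a3) / (n.a1 + n.a2 + n.a3 + n.b)

def ldr(n):
    """Late discrepancy rate: (c + a2) / (b + c + a2 + a3)."""
    return (n.c + n.a2) / (n.b + n.c + n.a2 + n.a3)

def differential_discordance(experimental, control):
    """Experimental-arm rate minus control-arm rate, for EDR and for LDR."""
    return edr(experimental) - edr(control), ldr(experimental) - ldr(control)

experimental = ArmCounts(a1=50, a2=25, a3=15, b=30, c=25, d=70)  # example from the text
control = ArmCounts(a1=50, a2=20, a3=20, b=30, c=25, d=70)       # hypothetical control arm

print(round(edr(experimental), 3), round(ldr(experimental), 3))  # 0.375 0.526
print(differential_discordance(experimental, control))
```

With these hypothetical control-arm counts, the differential discordance is negative for EDR and positive for LDR, the direction that the method treats as suggestive of bias favoring the experimental arm when it exceeds the chosen threshold.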

The performance evaluation of these two proposed audit methods is based on 26

randomized, superiority phase 3 trials submitted for review in consideration of approval

in non-hematologic malignancies across 9 tumor types (Table 2). Note that these 26 trials

are a subset of the 28 trials included in the FDA meta-analysis reported in Zhang and

colleagues (4) with adequate data for evaluation to determine whether a random sample-

based BICR audit is a viable alternative to a complete-case BICR with respect to its

ability to detect bias in the LE. As a result of several trials having multiple cohorts or

multiple treatment arms, the number of analysis units or randomized comparisons was

greater than the number of trials (i.e. 31 instead of 26). Trials with multiple treatment

arms were weighted accordingly to account for correlation.





Table 3 provides a summary of the performance evaluation strategy and summary

measures of the two audit methods. Since all trials evaluated had a complete-case BICR

conducted, random sample audits are performed 10,000 times for each trial to assess the

performance of Method A. Summary measures of the mean audit size, percentage of

complete-case audits, and percentage of audits confirming the LE result (i.e., where

consistency of the PFS treatment effect is concluded) out of the 10,000 simulated

replicates were obtained. Note that these replicate audits are conducted for performance

evaluation purposes only; in practice, these procedures would be implemented a single

time.

For Method B, our evaluations calculated the differential discordance for both the early

and late discrepancy rates (EDR and LDR). The audit size was fixed at 160 patients, the

maximum audit size recommended by Amit et al. 2011 (3); analyses using other audit

sizes were also performed and gave similar results. For each trial, a random sample of

160 patients was drawn and the differential discordances for EDR and LDR were

calculated. To assess performance characteristics, the random samples were repeatedly

drawn 10,000 times; summary measures of the mean differential discordance values and

the percentage of times a complete-case audit was recommended (i.e. bias was detected)

out of the 10,000 replicates were obtained for each study. Under Method B, a trial was

considered to have a biased LE result, and thus a complete-case audit was recommended,

if the differential discordance for EDR or LDR passed the specified threshold value.
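The replicate-audit evaluation of Method B can be sketched as follows. This is a minimal illustration, not the code used for the analyses: the patient-level data are synthetic, each patient is represented by a hypothetical (arm, cell) pair using the cells of Table 1, a rate with an empty denominator is arbitrarily set to 0, and all function names are assumptions.

```python
import random
from collections import Counter

def rates(cells):
    """EDR and LDR for one arm from a list of Table 1 cell labels."""
    n = Counter(cells)
    edr_den = n["a1"] + n["a2"] + n["a3"] + n["b"]
    ldr_den = n["b"] + n["c"] + n["a2"] + n["a3"]
    edr = (n["b"] + n["a3"]) / edr_den if edr_den else 0.0  # assumed convention
    ldr = (n["c"] + n["a2"]) / ldr_den if ldr_den else 0.0  # assumed convention
    return edr, ldr

def audit_triggers(sample, threshold):
    """True if DD in EDR < -threshold or DD in LDR > threshold."""
    edr_e, ldr_e = rates([c for arm, c in sample if arm == "exp"])
    edr_c, ldr_c = rates([c for arm, c in sample if arm == "ctl"])
    return (edr_e - edr_c) < -threshold or (ldr_e - ldr_c) > threshold

def trigger_fraction(patients, audit_size=160, threshold=0.100, n_rep=10_000, seed=0):
    """Fraction of replicate audits that recommend a complete-case audit."""
    rng = random.Random(seed)
    hits = sum(audit_triggers(rng.sample(patients, audit_size), threshold)
               for _ in range(n_rep))
    return hits / n_rep

def make_arm(arm, counts):
    """Expand cell counts into one (arm, cell) record per patient."""
    return [(arm, cell) for cell, k in counts.items() for _ in range(k)]

# Synthetic trial with identical discordance patterns in both arms (no true bias).
balanced = {"a1": 50, "a2": 25, "a3": 15, "b": 30, "c": 25, "d": 70}
trial = make_arm("exp", balanced) + make_arm("ctl", balanced)
print(trigger_fraction(trial, n_rep=1000))
```

Because the two synthetic arms are identical, any nonzero fraction printed here reflects sampling variability in the 160-patient audit alone, which is the kind of variability the replicate evaluation is meant to expose.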

Results





Table 2 summarizes the level of LE-BICR discordance between treatment arms across the

26 trials, divided into disagreements on censoring status and the timing of progression

given a 7-day window. We see that the discordance rates are around 20% across arms for

both categories. Table 4 presents the performance measures for the two audit methods

for all evaluated trials; these results will be discussed and described in the figures below

and summarized in Table 5. Note that since the audit size for Method B was fixed at 160

patients for all studies, no mean audit size was reported.

Figure 1 assesses Method A by looking into the relationship between the mean audit size

for each trial and the upper bound of the 95% confidence interval (CI) of the LE hazard

ratio estimate. The cluster of circles at mean audit size of 100% are those trials for which

complete-case BICR audits were needed in all 10,000 replicates. Those trials all had

upper 95% CI bounds of the LE hazard ratio greater than 0.90. This means that, as

expected, trials with borderline or non-significant LE results would be recommended

complete-case BICR audits.

For all the other trials, the mean audit size decreases with the upper bound. This means

that trials with larger, more significant LE results would obtain the most savings in terms

of needing a much smaller audit size (most are below 50%). This general relationship

between the mean audit size and the upper CI bound holds true across tumor types

(results not shown).





Figure 2 assesses Method B by looking into the relationship between the HR ratio of

BICR versus LE and the differential discordance for the early or late discrepancy rate

(EDR or LDR). Note that a HR ratio greater than 1 implies an overestimate of the

treatment effect by the LE. Recall that EDR is the frequency that the LE declares

progression earlier than BICR. As explained previously, a negative differential

discordance for EDR is suggestive of bias in the LE result favoring the experimental arm.

In support of this rationale, we see that the differential discordance for EDR decreases as

the HR ratio increases. This means that, as more LE progressions are being called earlier

than BICR on the control arm, the difference in BICR and LE hazard ratio estimates also

increases. The reverse relationship is seen for the late discrepancy rate (LDR), since

LDR captures discrepancies in the opposite direction from EDR. This general relationship between the HR ratio of

BICR versus LE and the differential discordance for EDR/LDR holds true across tumor

types (results not shown).

Table 5 summarizes measures from both methods by categorizing the trials with respect

to their LE hazard ratio estimate. Of the 12 trials with a large observed LE-assessed PFS

treatment effect (HR ≤ 0.5), the median across studies of the mean audit sizes over the

10,000 replicates (in short, the median mean audit size) from Method A was 35% and the

LE was confirmed in all 10,000 replicate audits for all 12 trials (i.e. consistency of the

treatment effect was concluded for all replicates). This indicates very little (if any) loss in

power from the two-stage procedure of Method A. Of the same 12 trials, the mean across

studies of the proportion of times a complete-case audit was recommended over the

10,000 replicates from Method B was 43% and 37% for threshold values of 0.075 and





0.100, respectively. A complete-case audit is recommended for a study if the differential

discordance (DD) in EDR is less than the negative of the threshold value or the DD in

LDR is greater than the threshold value.
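The summary figures quoted for the large-effect category can be checked against Table 4. The sketch below assumes that this category consists of the 12 rows of Table 4 with an LE hazard ratio of 0.5 or less and that the rows are combined without weighting; the variable names are hypothetical.

```python
from statistics import mean, median

# Rows of Table 4 with LE HR <= 0.5:
# (study, mean audit size %, % CC audits at 0.100, % CC audits at 0.075)
large_effect = [
    ("2", 29, 48.4, 57.7),
    ("6", 30, 4.9, 8.0),
    ("7", 30, 30.8, 38.9),
    ("8", 35, 23.4, 30.2),
    ("10", 35, 17.4, 24.1),
    ("12", 35, 19.8, 29.0),
    ("14", 29, 98.2, 99.0),
    ("18", 46, 53.4, 62.2),
    ("20", 55, 0.0, 0.0),
    ("21", 40, 77.4, 84.7),
    ("23", 35, 10.4, 16.0),
    ("24", 40, 56.5, 65.5),
]

median_mean_audit = median(row[1] for row in large_effect)  # Method A
mean_cc_0100 = mean(row[2] for row in large_effect)         # Method B, threshold 0.100
mean_cc_0075 = mean(row[3] for row in large_effect)         # Method B, threshold 0.075

print(median_mean_audit, round(mean_cc_0100), round(mean_cc_0075))  # 35.0 37 43
```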

For smaller treatment effects (HR > 0.75), only 2 of the 8 trials resulted in all 10,000

replicate audits confirming the LE for Method A and the median mean audit size was

100%. It should be noted that, for most trials, either all 10,000 replicate audits confirmed

the LE or none did. Whereas there is an intuitive trend in decreased savings (larger

median mean audit sizes) using Method A as the observed LE treatment effect becomes

smaller, no such trend was seen for Method B as the mean proportion of times a

complete-case audit is recommended stayed fairly constant across the HR categories.

Case Studies

Carcinoid Study

One trial with evaluation bias present was the carcinoid trial (study 26 in Table 4), which

was discussed by the Oncologic Drug Advisory Committee (ODAC) in April of 2011 (6).

This was a phase 3, randomized (1:1), placebo-controlled study of everolimus for the

treatment of patients with unresectable or metastatic carcinoid tumor. The primary

endpoint was PFS by BICR. At the second interim analysis, an unprecedented

discordance of the PFS treatment effect was observed between the LE and BICR. The

LE PFS result (HR= 0.78, p = 0.003) crossed the efficacy boundary of p = 0.010 while

the BICR PFS result (HR= 0.93, p = 0.233) crossed the futility boundary of p = 0.175.

The boundary p-values are the significance levels to conclude either efficacy or futility,





while preserving the type I and II errors, respectively. Clearly, some bias was present in

this trial. The HR ratio was 1.19. Given that the LE and BICR gave such divergent

views of efficacy, it was of particular interest how the two audit methods would perform

for this study.
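As a worked check of the HR ratio quoted for this study, taking the HR ratio to be the BICR hazard ratio divided by the LE hazard ratio (the convention indicated by the "BICR / LE" column heading of Table 4):

```latex
\text{HR ratio} \;=\; \frac{\widehat{\mathrm{HR}}_{\mathrm{BICR}}}{\widehat{\mathrm{HR}}_{\mathrm{LE}}}
\;=\; \frac{0.93}{0.78} \;\approx\; 1.19
```

A value above 1 means that the LE estimate is more favorable to the experimental arm than the BICR estimate.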

Table 6 summarizes the discordance between arms with respect to censoring status,

progression time, and censoring time. We see some discrepancies between the two arms.

For Method A, 100% of the 10,000 replicates resulted in complete-case audits with 0%

being able to verify the consistency of the treatment effect. For Method B, however, only

32.7% and 22.1% of the 10,000 replicates recommended a complete-case audit for

threshold values of 0.075 and 0.100, respectively, to support the conclusion that bias may

be present. Note that the fixed audit size of 160 was 37% of the total number of patients.

Soft Tissue Sarcoma (STS) Study

To illustrate the potential savings in audit size, another case study is presented (study 22

in Table 4), which was discussed by the Oncologic Drug Advisory Committee (ODAC)

in March of 2012 (7). This was a phase 3, randomized (1:1), placebo-controlled,

maintenance trial of ridaforolimus in 711 patients with soft tissue sarcoma (STS). The

final LE and BICR PFS hazard ratio estimates were 0.72 and 0.76, respectively, with a

HR ratio of 1.06. Table 6 summarizes the discordance between arms for this study,

which appears fairly balanced. For Method A, only 28% of the 10,000 replicates

resulted in complete-case audits with 100% verifying consistency of the treatment effect.

The mean audit size was 48%, a savings of over 50% in audit size.





For Method B, the fixed audit size of 160 was 23% of the total number of patients. Of

the 10,000 replicates, 51.7% and 41.4% recommended a complete-case audit for

threshold values of 0.075 and 0.100, respectively, to support the conclusion that bias may

be present. This was largely due to increased LDR discrepancies, as 2.0% and 41.4% of

the 10,000 replicates resulted in differential discordance values for EDR and LDR,

respectively, that exceeded a threshold of 0.100. Thus, there is quite a bit of variability in the

differential discordance for LDR making it difficult to determine whether or not bias was

present in this study using Method B.

Discussion

Although measurement error is inherent in the reading of radiographic scans, regulatory

considerations for drug approval are based on the relative treatment effect at the

population level. Given that multiple publications (2-4) have corroborated the strong

correlation between LE- and BICR-assessed PFS treatment effect estimates, there is a

need for the exploration of alternative strategies to detect bias in the LE. The results of

the analyses presented herein support that a random sample-based BICR audit is a viable

alternative to a complete-case BICR, and may be a more efficient strategy for bias

evaluation of the LE. Note that, although there is general agreement, in one study

reported here (study 26 in Table 4), there were divergent conclusions depending on the

choice of LE and BICR, justifying a continued need for BICR to provide assurance about

the LE-based treatment effect.





Method A seems to perform well in most situations; i.e. it seems able to distinguish

between trials with and without bias present. The savings with respect to audit size varies

from case to case, depending on the study size and magnitude of the observed LE PFS

treatment effect; larger effect sizes and/or studies typically require smaller audit

proportions. Method B is intuitively appealing, but needs further evaluation, particularly

with respect to determination of the appropriate threshold value. Potential reasons for its

variable performance that need further evaluation include loss of important information

due to dichotomization and ignoring patients who were censored by both the LE and

BICR (cell “d” in Table 1) in the definition of EDR and LDR. Method B counts

discordances but not how far apart they are. For example, with the LDR, the LE could

call PD right after BICR or a long time after. If the late discrepancies occurred in the

control arm many visits after BICR, this would produce more significant bias in the HR

estimate. On a similar note, we could have a relatively small number of late

discrepancies, but if they occur very late, then this would produce greater bias in the HR

estimate.

Although real-time BICRs are becoming more prominent with advances in modern

digital technology, none of the studies included in the analyses presented herein had

real-time BICRs. It should be noted that real-time BICR would ameliorate bias concerns

due to informative censoring, but does not alleviate potential bias due to the subjectivity

inherent to the PFS endpoint.

We acknowledge that the studies included in this analysis are limited to registration trials

that all had complete-case BICR and may not be representative of the population of all





clinical trials conducted. However, they are representative of the clinical trials that are

generally submitted to the FDA for regulatory consideration and for which BICRs are

usually required.

An ODAC meeting was held in July 2012 to discuss whether the current practice of

complete-case BICR should be replaced by a random sample-based BICR audit, based on

the information and analyses presented herein (8). All committee members agreed that a

random sample-based BICR audit should be considered; however, the potential merits

must be viewed in tandem with the potential limitations and challenges. They also

advised against the complete elimination of BICR, which could jeopardize the integrity

of the LE.

Although these analyses have demonstrated that a BICR audit to assess potential bias in

the LE is a feasible approach, the logistics of how the audit should proceed need further

discussion and consideration. The method of selecting the random sample audit needs to

be determined; a simple random sample audit may not be sufficient, for example, to

ensure representation of all study sites. Efforts should also be made to minimize any

additional burden that the audit may cause the investigator or sponsor without

compromising the integrity of the study. Selection of the actual audit strategy to

implement within a trial may need to be determined on a case-by-case basis. While we

focused on two methodologies, we expect that other approaches will become available

for consideration in the future.





References

1. Ford R, Schwartz L, Dancey J, Dodd LE, Eisenhauer EA, Gwyther S et al. Lessons

learned from independent central review. European Journal of Cancer 2009; 45: 268-274.

2. Dodd LE, Korn EL, Freidlin B, Jaffe CC, Rubinstein LV, Dancey J et al. Blinded

independent central review of progression-free survival in phase III clinical trials:

important design element or unnecessary expense? Journal of Clinical Oncology. 2008;

26: 3791-3796.

3. Amit O, Mannino F, Stone AM, Bushnell W, Denne J, Helterbrand J et al. Blinded

independent central review of progression in cancer clinical trials: results from a meta-

analysis. European Journal of Cancer. 2011; 47: 1772-1778.

4. Zhang JJ, Chen H, He K, Tang S, Justice R, Keegan P et al. Evaluation of blinded

independent central review of tumor progression in oncology clinical trials: A meta-

analysis. Drug Information Journal 2013;47: 167-74.

5. Dodd LE, Korn EL, Freidlin B, Gray R, Bhattacharya S. An audit strategy for

progression-free survival. Biometrics. 2011;67:1092-1099.

6. Oncologic Drug Advisory Committee – Everolimus for Carcinoid Tumors. April 12,

2011 Meeting.

http://www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/OncologicDrugsAdvisoryCommittee/ucm235829.htm

7. Oncologic Drug Advisory Committee – Ridaforolimus for STS. March 20, 2012

Meeting.







8. Oncologic Drug Advisory Committee – Evaluation of Radiologic Review of PFS in

Non-Hematologic Malignancies. July 24, 2012 Meeting.



Figure captions:

Figure 1. Relationship between mean audit size from Method A and CI upper bound of

LE hazard ratio

Figure 2. Relationship between HR ratio (BICR vs. LE) and differential discordance in

EDR/LDR from Method B




Table 1: Definition of EDR and LDR from Amit et al. 2011

|           | BICR: PD         | BICR: No PD |
|-----------|------------------|-------------|
| LE: PD    | a = a1 + a2 + a3 | b           |
| LE: No PD | c                | d           |

a1: number of agreements on timing and occurrence of PD
a2: number of times LE declares PD later than BICR
a3: number of times LE declares PD earlier than BICR
PD: progressive disease
Adapted from European Journal of Cancer, Amit, et al Copyright (2011), with permission from Elsevier (3).



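Table 2. Summary of Study Characteristics

| Study characteristics                            | Meta-analysis trials (N = 26) |
|--------------------------------------------------|-------------------------------|
| Tumor type: MBC                                  | 6                             |
| Tumor type: RCC                                  | 7                             |
| Tumor type: MCRC                                 | 4                             |
| Tumor type: Other^a                              | 9                             |
| Design: A vs. B / A+B vs. A / Placebo or BSC^b   | 4 / 6 / 16                    |
| Open-label / Double-blind                        | 11 / 15                       |
| Interim / Final analysis                         | 9 / 17                        |
| 1:1 / 2:1 randomization                          | 18 / 8                        |
| 1st / subsequent^c / maintenance line            | 12 / 12 / 3                   |
| Sample size: Median                              | 716.5                         |
| Sample size: Min, Max                            | 171, 1286                     |

| Discordance measure                                              | Control (N = 31) | Experimental (N = 31) |
|------------------------------------------------------------------|------------------|-----------------------|
| % LE-BICR discordance on censoring status: Mean (SD)             | 23.8% (9.7%)     | 23.0% (7.7%)          |
| % LE-BICR discordance on censoring status: Median                | 23.7%            | 24.5%                 |
| % LE-BICR discordance on censoring status: Min, Max              | 9.1%, 43.3%      | 11.1%, 36.9%          |
| % LE-BICR discordance on timing of PD (7-day window): Mean (SD)  | 21.5% (8.1%)     | 20.4% (8.7%)          |
| % LE-BICR discordance on timing of PD (7-day window): Median     | 21.4%            | 20.2%                 |
| % LE-BICR discordance on timing of PD (7-day window): Min, Max   | 8.4%, 38.7%      | 6.1%, 41.4%           |

^a Includes trials in non-small cell lung cancer (3), pancreatic neuroendocrine tumors (2), soft tissue sarcoma (2), gastrointestinal stromal tumor (1), ovarian cancer (1), carcinoid tumors (1); ^b BSC = best supportive care; ^c One trial (#12 in Table 4) included both 1st and 2nd line patients and is double counted here; Abbreviations: MBC = metastatic breast cancer; RCC = renal cell carcinoma; MCRC = metastatic colorectal cancer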



Table 3. Performance Evaluation Summary Measures of Audit Methods (for Each Study)

|                     | Method A (NCI Method)                               | Method B (PhRMA Method)                                     |
|---------------------|-----------------------------------------------------|-------------------------------------------------------------|
| Evaluation strategy | Conduct 10,000 random sample audits of varying size | Conduct 10,000 random sample audits of fixed size (N = 160) |
| Summary measures    | Mean audit size                                     | Mean differential discordance for EDR                       |
|                     | % complete-case (CC) audits^1                       | Mean differential discordance for LDR                       |
|                     | % audits confirming LE result                       | % complete-case (CC) audit^1 recommended                    |

^1 A complete-case (CC) audit is an audit with size = 100%



Table 4: Evaluation Results for Method A and Method B

| Study | N | Tumor type¹ | LE HR (95% CI) | CC BICR HR (95% CI) | HR Ratio (BICR / LE) | Method A: % CC audits² | Method A: Mean audit size² | Method A: % replicate audits confirming LE³ | Method B: % CC audits (0.100)⁴ | Method B: % CC audits (0.075)⁵ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 752 | MBC | 0.79 (0.68, 0.91) | 0.74 (0.64, 0.87) | 0.94 | 0% | 73% | 100% | 14.8% | 23.2% |
| 2 | 722 | MBC | 0.49 (0.40, 0.59) | 0.54 (0.44, 0.67) | 1.10 | 5% | 29% | 100% | 48.4% | 57.7% |
| 3-HER2- | 952 | MBC | 0.89 (0.76, 1.04) | 0.96 (0.78, 1.20) | 1.08 | 100% | 100% | 0% | 26.1% | 37.1% |
| 3-HER2+ | 219 | MBC | 0.72 (0.54, 0.97) | 0.67 (0.45, 0.99) | 0.93 | 100% | 100% | 100% | 21.2% | 39.6% |
| 4-Anth | 622 | MBC | 0.66 (0.54, 0.81) | 0.79 (0.63, 1.00) | 1.20 | 78% | 86% | 36% | 66.0% | 75.3% |
| 4-Cap | 615 | MBC | 0.67 (0.56, 0.82) | 0.70 (0.56, 0.87) | 1.04 | 32% | 55% | 100% | 29.6% | 38.7% |
| 5 | 762 | MBC | 0.81 (0.68, 0.95) | 0.86 (0.72, 1.04) | 1.06 | 100% | 100% | 0% | 48.0% | 58.0% |
| 6 | 724 | MBC | 0.44 (0.36, 0.55) | 0.35 (0.27, 0.46) | 0.80 | 0% | 30% | 100% | 4.9% | 8.0% |
| 7 | 769 | RCC | 0.44 (0.35, 0.54) | 0.45 (0.37, 0.56) | 1.02 | 0% | 30% | 100% | 30.8% | 38.9% |
| 8 | 750 | RCC | 0.41 (0.33, 0.52) | 0.41 (0.31, 0.53) | 1.00 | 0% | 35% | 100% | 23.4% | 30.2% |
| 9-25mg | 416 | RCC | 0.70 (0.58, 0.86) | 0.68 (0.55, 0.85) | 0.97 | 2% | 61% | 100% | 15.6% | 25.6% |
| 9-15mg | 417 | RCC | 0.75 (0.61, 0.92) | 0.76 (0.62, 0.94) | 1.01 | 100% | 100% | 100% | 3.7% | 7.6% |
| 10 | 416 | RCC | 0.33 (0.26, 0.42) | 0.33 (0.26, 0.43) | 1.00 | 0% | 35% | 100% | 17.4% | 24.1% |
| 11 | 649 | RCC | 0.62 (0.51, 0.75) | 0.59 (0.47, 0.74) | 0.95 | 21% | 41% | 100% | 19.4% | 27.3% |
| 12 | 435 | RCC | 0.43 (0.34, 0.54) | 0.41 (0.32, 0.54) | 0.95 | 0% | 35% | 100% | 19.8% | 29.0% |
| 13 | 723 | RCC | 0.68 (0.56, 0.82) | 0.68 (0.56, 0.83) | 1.00 | 14% | 49% | 100% | 24.5% | 33.7% |
| 14 | 463 | MCRC | 0.39 (0.32, 0.48) | 0.55 (0.45, 0.67) | 1.41 | 5% | 29% | 100% | 98.2% | 99.0% |
| 15-Oxal | 812 | MCRC | 1.35 (1.09, 1.67) | 1.38 (1.08, 1.77) | 1.02 | 100% | 100% | 0% | 45.5% | 54.7% |
| 16-WT | 656 | MCRC | 0.81 (0.67, 0.98) | 0.80 (0.66, 0.98) | 0.99 | 100% | 100% | 100% | 25.6% | 34.0% |
| 16-Mu | 527 | MCRC | 1.15 (0.95, 1.40) | 1.22 (1.00, 1.50) | 1.06 | 100% | 100% | 0% | 16.4% | 26.8% |
| 17-WT | 597 | MCRC | 0.71 (0.58, 0.87) | 0.75 (0.62, 0.92) | 1.06 | 16% | 68% | 100% | 24.5% | 33.4% |
| 17-Mu | 589 | MCRC | 0.82 (0.67, 0.99) | 0.90 (0.74, 1.10) | 1.10 | 100% | 100% | 0% | 64.3% | 74.6% |
| 18 | 663 | NSCLC | 0.50 (0.41, 0.60) | 0.63 (0.52, 0.76) | 1.26 | 32% | 46% | 100% | 53.4% | 62.2% |
| 19 | 884 | NSCLC | 0.71 (0.61, 0.82) | 0.71 (0.60, 0.83) | 1.00 | 29% | 46% | 100% | 17.4% | 24.8% |
| 20 | 171 | PNET | 0.42 (0.26, 0.66) | 0.31 (0.18, 0.54) | 0.74 | 1% | 55% | 100% | 0.0% | 0.0% |
| 21 | 410 | PNET | 0.38 (0.29, 0.48) | 0.40 (0.30, 0.54) | 1.05 | 0% | 40% | 100% | 77.4% | 84.7% |
| 22 | 711 | STS | 0.72 (0.61, 0.85) | 0.76 (0.64, 0.90) | 1.06 | 28% | 48% | 100% | 41.4% | 51.7% |
| 23 | 369 | STS | 0.35 (0.28, 0.45) | 0.31 (0.24, 0.41) | 0.89 | 0% | 35% | 100% | 10.4% | 16.0% |
| 24 | 312 | GIST | 0.29 (0.20, 0.40) | 0.32 (0.23, 0.45) | 1.10 | 0% | 40% | 100% | 56.5% | 65.5% |
| 25 | 645 | Ovarian | 0.69 (0.58, 0.82) | 0.79 (0.65, 0.96) | 1.14 | 67% | 80% | 100% | 55.5% | 66.0% |
| 26 | 429 | Carcinoid | 0.78 (0.62, 0.98) | 0.93 (0.71, 1.22) | 1.19 | 100% | 100% | 0% | 22.1% | 32.7% |

Abbreviations: Anth = anthracycline; Cap = capecitabine; Oxal = oxaliplatin; WT = wild type; Mu = mutant; CC = complete-case; DD = differential discordance.
¹MBC = metastatic breast cancer, RCC = renal cell carcinoma, MCRC = metastatic colorectal cancer, NSCLC = non-small cell lung cancer, PNET = pancreatic neuroendocrine tumors, STS = soft tissue sarcoma, GIST = gastrointestinal stromal tumor.
²Over the 10,000 replicates per study.
³% of 10,000 audit replicates (whether partial or CC) per study in which consistency of the PFS treatment effect is concluded (i.e., the LE result is confirmed).
⁴% of 10,000 replicate audits per study for which a CC audit is recommended (i.e., differential discordance (DD) in EDR < -0.100 or DD in LDR > 0.100).
⁵% of 10,000 replicate audits per study for which a CC audit is recommended (i.e., DD in EDR < -0.075 or DD in LDR > 0.075).
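The Method B decision rule stated in the Table 4 footnotes (a CC audit is recommended when the DD in EDR falls below the negative threshold or the DD in LDR exceeds the threshold) can be sketched as follows. This is a minimal illustration only: the function names and the replicate values are hypothetical and are not taken from the study data.

```python
def cc_audit_recommended(dd_edr, dd_ldr, threshold=0.100):
    """Return True if a complete-case (CC) audit would be recommended.

    Rule as stated in the Table 4 footnotes:
    DD in EDR < -threshold, or DD in LDR > threshold.
    """
    return dd_edr < -threshold or dd_ldr > threshold


def percent_cc_recommended(replicates, threshold):
    """Percentage of replicate audits for which a CC audit is recommended."""
    flagged = sum(cc_audit_recommended(e, l, threshold) for e, l in replicates)
    return 100.0 * flagged / len(replicates)


# Hypothetical (DD in EDR, DD in LDR) pairs for four replicate audits.
# These numbers are invented for illustration, not study results.
example_replicates = [(-0.12, 0.02), (-0.08, 0.05), (0.01, 0.09), (0.00, 0.11)]

for t in (0.100, 0.075):
    pct = percent_cc_recommended(example_replicates, t)
    print(f"threshold = {t:.3f}: CC audit recommended in {pct:.0f}% of replicates")
```

Because the 0.075 threshold is less stringent, it can only flag the same replicates or more, which is consistent with the 0.075 column of Table 4 never falling below the 0.100 column.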

Table 5: Method A vs. Method B

| LE hazard ratio | N¹ | Method A: Median mean audit size² (min, max) | Method A: All replicate audits confirmed LE³ | Method B: Mean of % CC audit recommended⁴, Threshold = 0.100 | Method B: Mean of % CC audit recommended⁴, Threshold = 0.075 |
|---|---|---|---|---|---|
| ≤ 0.50 | 12 | 35% (29%, 55%) | 12 (100%) | 37% (0%, 98%) | 43% (0%, 99%) |
| 0.50 – 0.75 | 11 | 61% (41%, 100%) | 10 (91%) | 29% (4%, 66%) | 39% (8%, 75%) |
| > 0.75 | 8 | 100% (73%, 100%) | 2 (25%) | 33% (15%, 64%) | 43% (23%, 75%) |

¹N = number of studies.
²Median across studies of the mean over 10,000 replicates for each study.
³Number of studies for which all 10,000 replicate audits concluded consistency of the PFS treatment effect (i.e., confirmed the LE result).
⁴Mean across studies of the % of 10,000 replicate audits for which a complete-case (CC) audit is recommended; a CC audit is recommended for a study if the differential discordance (DD) in EDR is less than the negative of the threshold value or the DD in LDR is greater than the threshold value.
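As a cross-check, the Method A columns of Table 5 can be recomputed from the per-study values in Table 4. The sketch below transcribes the LE hazard ratio, mean audit size, and % of replicate audits confirming LE from Table 4. The two Table 4 entries printed without a percent sign are read as percentages. The boundary handling (0.50 placed in the lowest group, 0.75 in the middle group) is an assumption inferred from the group counts, and the helper names are illustrative.

```python
from statistics import median

# (study, LE HR, mean audit size %, % replicate audits confirming LE), from Table 4
TABLE4 = [
    ("1", 0.79, 73, 100), ("2", 0.49, 29, 100), ("3-HER2-", 0.89, 100, 0),
    ("3-HER2+", 0.72, 100, 100), ("4-Anth", 0.66, 86, 36), ("4-Cap", 0.67, 55, 100),
    ("5", 0.81, 100, 0), ("6", 0.44, 30, 100), ("7", 0.44, 30, 100),
    ("8", 0.41, 35, 100), ("9-25mg", 0.70, 61, 100), ("9-15mg", 0.75, 100, 100),
    ("10", 0.33, 35, 100), ("11", 0.62, 41, 100), ("12", 0.43, 35, 100),
    ("13", 0.68, 49, 100), ("14", 0.39, 29, 100), ("15-Oxal", 1.35, 100, 0),
    ("16-WT", 0.81, 100, 100), ("16-Mu", 1.15, 100, 0), ("17-WT", 0.71, 68, 100),
    ("17-Mu", 0.82, 100, 0), ("18", 0.50, 46, 100), ("19", 0.71, 46, 100),
    ("20", 0.42, 55, 100), ("21", 0.38, 40, 100), ("22", 0.72, 48, 100),
    ("23", 0.35, 35, 100), ("24", 0.29, 40, 100), ("25", 0.69, 80, 100),
    ("26", 0.78, 100, 0),
]


def le_hr_group(le_hr):
    """Assign an LE hazard ratio to a Table 5 row (boundaries are an assumption)."""
    if le_hr <= 0.50:
        return "<= 0.50"
    if le_hr <= 0.75:
        return "0.50 - 0.75"
    return "> 0.75"


def summarize(rows):
    """Group studies by LE HR and compute the Method A summaries of Table 5."""
    groups = {}
    for _, le_hr, size, confirm in rows:
        groups.setdefault(le_hr_group(le_hr), []).append((size, confirm))
    summary = {}
    for name, vals in groups.items():
        sizes = [s for s, _ in vals]
        summary[name] = {
            "n": len(vals),
            "median_size": median(sizes),
            "min_size": min(sizes),
            "max_size": max(sizes),
            # studies where the reported % confirming LE is 100%
            "all_confirmed": sum(1 for _, c in vals if c == 100),
        }
    return summary


SUMMARY = summarize(TABLE4)
for name in ("<= 0.50", "0.50 - 0.75", "> 0.75"):
    s = SUMMARY[name]
    print(f"{name}: N={s['n']}, median mean audit size={s['median_size']}% "
          f"({s['min_size']}%, {s['max_size']}%), all confirmed={s['all_confirmed']}")
```

Under these assumptions, the group counts, median (min, max) audit sizes, and numbers of studies with all replicates confirming LE agree with the Method A columns of Table 5.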

Table 6: Discordance in Case Studies

| Discordance | Case Study 1: Placebo (N = 213) | Case Study 1: Everolimus (N = 216) | Case Study 2: Placebo (N = 364) | Case Study 2: Ridaforolimus (N = 347) |
|---|---|---|---|---|
| Censoring status | 38% | 26% | 14% | 16% |
| PD time¹ | 15% | 18% | 21% | 27% |
| Censoring time¹ | 1.4% | 3.7% | 0.8% | 2.0% |

¹PD/censoring discordant outside of 7-day window.

Figure 1: Mean audit size (%) (y-axis) versus 95% CI upper bound of LE hazard ratio (x-axis); r = 0.797 (0.617, 0.898).
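The interval reported with the Figure 1 correlation, r = 0.797 (0.617, 0.898), can be approximated with a Fisher z-transformation. The sketch below assumes a 95% interval and n = 31 data points (the number of rows in Table 4). Neither the method nor n is stated alongside the figure, so both are assumptions, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                            # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)                  # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    return math.tanh(z - q * se), math.tanh(z + q * se)


# n = 31 is an assumption (number of rows in Table 4)
lower, upper = fisher_z_interval(0.797, 31)
print(f"r = 0.797, approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

Under these assumptions the computed interval is approximately (0.617, 0.898), which agrees with the values shown in the figure. That agreement is suggestive but does not confirm the method actually used.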




Figure 2: HR ratio (BICR/LE) (y-axis) versus mean differential discordance (x-axis), in two panels. Early discrepancy rate (EDR) panel: mean differential discordance in EDR, r = -0.658 (-0.821, -0.396). Late discrepancy rate (LDR) panel: mean differential discordance in LDR, r = 0.827 (0.669, 0.914). Reference lines in both panels: HR ratio = 1, Threshold = 0.075, Threshold = 0.1.

Jenny J Zhang, Lijun Zhang, Huanyu Chen, et al. Assessment of Audit Methodologies for Bias Evaluation of Tumor Progression in Oncology Clinical Trials. Clin Cancer Res. Published OnlineFirst March 26, 2013. doi: 10.1158/1078-0432.CCR-12-3364
