TRANSCRIPT
Studying the Impact of Tests
Jon Deeks, Professor of Health Statistics
University of Birmingham
Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award
Answering policy questions about the use of diagnostic tests
Should GPs refer patients with low back pain for X-ray and/or MRI?
Should patients with dyspeptic symptoms receive serology tests for H. pylori, endoscopy, or empirical therapy?
Standard hierarchy for HTA of tests (Fryback and Thornbury 1991)
1. Technical quality of the test
2. Diagnostic accuracy
3. Change in diagnostic thinking
4. Change in patient management
5. Change in patient outcomes
6. Societal costs and benefits
Studies on the Diagnostic Evaluation Pathway
Analytical validity: reliability (repeatability and reproducibility); measurement accuracy
Diagnostic validity: diagnostic accuracy; comparative/incremental diagnostic accuracy
Impact: change in diagnostic yield; change in management; change in patient outcomes
Economic evaluation
HTA policy on evaluating tests (up until 2004)
“the emphasis of the HTA programme is to assess the effect on patient management and outcomes … improvements in diagnostic accuracy, whilst relevant, are not the primary interest of this commissioned research programme”
Studies on the Diagnostic Evaluation Pathway
Analytical validity: reliability (repeatability and reproducibility); measurement accuracy
Diagnostic validity: diagnostic accuracy; incremental diagnostic accuracy
Impact: change in diagnostic yield; change in management; change in patient outcomes
Economic evaluation
[Focus of HTA programme: impact and economic evaluation]
Outline of talk
Trials of diagnostic evaluations
Problems: what is being evaluated? statistical power; study validity; outcomes
Pragmatic suggestions: when are trials really needed? alternative trial designs; alternatives for assessing comparative accuracy
More research is needed
RCT to assess patient outcomes
[Flow diagram: population → sample → randomise to active or control → outcome in each arm]
Diagnostic RCT
[Flow diagram: population → sample → randomise to TEST or control → outcome in each arm]
RCT 1: X-ray at first GP presentation for low back pain. HTA 2000(4): 20
[Flow diagram: GP attendees aged 16-64 yrs with LBP, excluded if 'flu or a previous consultation for LBP in the last 4 weeks. Sample N=153, randomised to referral for X-ray (N=73) or no X-ray referral (N=80). Outcomes assessed at 6 weeks (N=67 and N=59) and at 1 year (N=58 and N=50).]
PRIMARY outcomes: Roland score, HADS, SF-36, EuroQol
SECONDARY outcomes: time off work, therapists, medication, satisfaction
RESULTS
At 6 weeks: SF-36 mental health and vitality subscales (P<.05)
At 12 months: SF-36 mental health subscale (P<.05)
RCT 2: X-ray for GP presentation with low back pain of >6 weeks. HTA 2001(5): 30
[Flow diagram: GP attendees aged 20-55 with a first episode of LBP of between 6 weeks and 6 months duration, excluded if 'red flags'. Sample N=421, randomised to referral for X-ray (N=210) or no X-ray referral (N=211). Outcomes assessed at 3 months (N=203 and N=199) and at 9 months (N=199 and N=195).]
PRIMARY outcome: Roland score
SECONDARY outcomes: pain (VAS), pain (diary), pain (any), time off work, medication, EuroQol, satisfaction, belief in X-ray, therapists, consultations
RESULTS
At 3 months: proportion reporting LBP (P<.05)
At 9 months: none
What is being evaluated?
[Diagram: Medical Test → Information → Decision → Action → Patient Outcome. Diagnostic accuracy relates to the information produced, diagnostic yield to the decisions made, and management to the actions taken; test harms and placebo effects act on patient outcome directly. An RCT combines all of these effects.]
What is being evaluated?
Conditions for a test to be of diagnostic benefit:
the test is more accurate
interpretation of test results is rational and consistent
management is rational and consistent
treatment is effective
Conditions for a trial to be informative:
rules for interpretation of test results are described
the management protocol is described
No such descriptions were given in the example trials. Applying their results requires faith that the behaviour of your patients and clinicians is the same as in the trial.
What is being evaluated?
If no difference is observed…
Is the test no more accurate?
Are clinicians not correctly interpreting test results?
Are management decisions inconsistent or inappropriate?
Is the treatment ineffective?
None of these questions can be answered.
If one element changes, the results of the trial become redundant.
Statistical Power
RCT 1: a reduction in the proportion with pain at 2 weeks from 40% to 30% could be detected with 300 patients, with 80% power at 5% significance.
RCT 2: a difference of 1.5 on the Roland score could be detected with 388 patients, with 90% power at 5% significance (sd=4.5; standardised difference = 1.5/4.5 = 0.33).
These sample size calculations are suitable for a trial of treatment vs placebo, not a trial of test+treatment (a sketch of the calculation follows).
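As a rough illustration of the scale of RCT 2's calculation, here is a minimal sketch in Python of the standard normal-approximation formula for a two-arm trial detecting a standardised mean difference; the published figure of 388 may reflect adjustments not reproduced here.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(std_diff, power=0.90, alpha=0.05):
    """Patients per arm to detect a standardised mean difference
    (two-sided test, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / std_diff ** 2)

# RCT 2: difference of 1.5 on the Roland score with sd = 4.5,
# i.e. a standardised difference of 0.33, 90% power, 5% significance
print(2 * n_per_arm(0.33))  # total N, approximately 386 (quoted as 388)
```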
Diagnostic Accuracy of Clinical Judgement
[2×2 table: columns Serious (requires intervention) vs Minor (requires no intervention); cells TP and FN under Serious, FP and TN under Minor.]
Diagnostic Accuracy of Clinical Judgement + X-ray
[The same 2×2 table for clinical judgement plus X-ray: TP and FN among Serious cases (requiring intervention), FP and TN among Minor cases (requiring no intervention).]
Comparison of Diagnostic Accuracy
[Overlay of the two 2×2 tables: among Serious cases (requiring intervention) the strategies share All TP and All FN, with Discrepant A where they disagree; among Minor cases (requiring no intervention) they share All FP and All TN, with Discrepant B where they disagree.]
Benefit can only occur in those whose diagnosis changes. Where can differences arise?
Discrepant A could benefit if the intervention is effective.
Discrepant B could benefit if the intervention is harmful.
All others have no benefit, as there is no change in their intervention.
Sample size must take into account:
prevalence of the treatable condition
detection rate (sensitivity) with the control test
detection rate (sensitivity) with the new test
treatment rate if the control test is negative (assume zero)
treatment rate if the new test is positive (assume 100%)
outcome for the treatable condition if untreated
treatment effect
Sample size for detecting treatment effects
Sample size for treatment vs control: 300-400.
Sample size must be inflated according to the proportion in the discrepant cells (particularly A); see the sketch after this slide.
If 20% have serious disease and the gain in sensitivity is 20 percentage points, 4% will be in Discrepant A: increase N 25-fold (N=7,500-10,000).
If 10% have serious disease and the gain in sensitivity is 10 percentage points, 1% will be in Discrepant A: increase N 100-fold (N=30,000-40,000).
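A minimal sketch of the inflation arithmetic just described (function and variable names are illustrative, not from the talk):

```python
from math import ceil

def test_trial_n(treatment_trial_n, prevalence, sensitivity_gain):
    """Inflate a treatment-trial sample size for a test+treatment
    trial: only Discrepant A patients (detected by the new test but
    missed by the control test) can show any treatment benefit."""
    p_discrepant_a = prevalence * sensitivity_gain
    return ceil(treatment_trial_n / p_discrepant_a)

# 20% serious disease, 20-point sensitivity gain -> 4% in Discrepant A
print(test_trial_n(300, 0.20, 0.20), test_trial_n(400, 0.20, 0.20))  # 7500 10000
# 10% serious disease, 10-point sensitivity gain -> 1% in Discrepant A
print(test_trial_n(300, 0.10, 0.10), test_trial_n(400, 0.10, 0.10))  # 30000 40000
```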
Sample size for detecting differences in accuracy
Sample size depends on whether the sample all receive both tests, or are randomised to one of the tests.
Sample sizes for a difference in sensitivity (see the sketch below):
If 20% have serious disease, to detect a 20% difference in sensitivity (from 70% to 90%) with 80% power and alpha 0.05: paired cohort design N=116 [68-136]; parallel cohort design N=232.
If 10% have serious disease, to detect a 10% difference in sensitivity (from 80% to 90%) with 80% power and alpha 0.05: paired cohort design N=706 [271-814]; parallel cohort design N=1411.
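For the paired design, the number of diseased patients required depends not only on the two sensitivities but on how often the tests disagree, which is presumably why a range is quoted. The following is a hedged sketch using a standard McNemar-type formula: it illustrates the paired logic and why the answer spans a range, but it will not reproduce the slide's exact figures, which depend on the method and assumptions used there.

```python
from math import ceil, sqrt
from statistics import NormalDist

def paired_cohort_n(sens_diff, discordance, prevalence,
                    power=0.80, alpha=0.05):
    """Whole-cohort size (everyone gets both tests) to detect a
    difference in sensitivity, via a McNemar-type formula on the
    diseased subgroup. `discordance` is the assumed proportion of
    diseased patients on whom the two tests disagree."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    diseased = ((z_a * sqrt(discordance)
                 + z_b * sqrt(discordance - sens_diff ** 2)) ** 2
                / sens_diff ** 2)
    return ceil(diseased / prevalence)

# Sensitivity 70% vs 90%, 20% prevalence: discordance among the
# diseased can range from 0.20 (minimum) to 0.40 (maximum),
# giving a range of cohort sizes for the same comparison
print(paired_cohort_n(0.20, 0.20, 0.20),
      paired_cohort_n(0.20, 0.40, 0.20))
```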
Sample size for detecting differences in diagnoses and management
Sample size is based on the accuracy sample size, inflated according to:
For diagnostic impact: diagnosis rate if the control test is negative; diagnosis rate if the new test is positive*
For therapeutic impact: treatment rate if the control test is negative; treatment rate if the new test is positive*
* subject to "learning effects"
Validity Concerns
Blinding: participants and outcome assessors are rarely blind in diagnostic trials, so trials may be more susceptible to measuring the preconceived notions of participants and the expectations of trialists.
Drop-out: lack of blinding can induce differential drop-out, and there are more stages at which drop-out can occur.
Compliance: lack of blinding and the complexity of the strategies can reduce compliance.
What outcomes? The problem is multi-multi-factorial.
Assessing the effect of a single intervention for a single disease requires multiple outcomes.
Tests are used to differentiate between multiple diseases and disease states.
A trial should therefore assess all the important outcomes for the multiple diseases within the differential diagnosis, but trials usually focus on one condition.
Summary of problems
Diagnostic trials are…
rarely done
assess effects of the "test+treatment package"
uninformative about the value of the test
often underpowered
at risk of bias
may not assess all relevant outcomes
may be more likely to detect "placebo" effects than benefits of better diagnoses
may not represent future impact on treatment and diagnostic decisions
Key issues
Trials only need be done in limited circumstances.
Only patients in the discrepant cell are informative.
Audit and feedback studies are better for assessing and changing clinicians' behaviour than trials.
More good comparative studies of test accuracy are required.
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
Categories of test attributes:
The new test is safer or less costly
The new test is more specific (excludes more cases of non-disease)
The new test is more sensitive (detects more cases of disease)
If an RCT of treatments exists, when do we still need to undertake an RCT of test+treatment?
Lord, S. J. et al. Ann Intern Med 2006;144:850-855
Trial evidence versus linked evidence of test accuracy and treatment efficacy
Lord, S. J. et al. Ann Intern Med 2006;144:850-855
Assessing new tests using evidence of test accuracy, given that treatment is effective for cases detected by the old test
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
If the new test has similar sensitivity:
Trials of test+treatment are not required.
Reductions in harm or cost are benefits.
Improved specificity can only be a benefit.
Decision models can be used to analyse the trade-offs between positive and negative benefits, as in the sketch below.
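A minimal sketch of the kind of decision model meant here, trading extra true negatives against false positives under assumed utility values (all numbers below are illustrative assumptions, not from Lord et al.):

```python
def expected_utility(prevalence, sens, spec, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per patient of a test-and-treat strategy."""
    return (prevalence * (sens * u_tp + (1 - sens) * u_fn)
            + (1 - prevalence) * ((1 - spec) * u_fp + spec * u_tn))

# Assumed utilities: treated disease 0.9, missed disease 0.4,
# unnecessary treatment 0.7, correctly untreated 1.0
u = dict(u_tp=0.9, u_fn=0.4, u_fp=0.7, u_tn=1.0)

old_test = expected_utility(0.20, 0.70, 0.95, **u)
new_test = expected_utility(0.20, 0.70, 0.99, **u)  # same sensitivity, more specific
print(f"gain from improved specificity: {new_test - old_test:+.4f}")  # > 0
```

Because a true negative is valued at least as highly as a false positive, the gain term cannot be negative, which is the "improved specificity can only be a benefit" point.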
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
If the new test has improved sensitivity:
The value of using the test depends on the treatment response in the extra cases detected.
A trial is still not needed if:
inclusion in the treatment trial was based on the reference standard used for assessing test accuracy
the test is evaluated in a treatment trial as a predictor of response
the new cases represent the same spectrum or subtype of disease
treatment response is known to be similar across the spectrum or subtypes of disease
Alternative Diagnostic RCT
[Flow diagram: population → sample → everyone receives both the clinical diagnosis and the X-ray, and each patient is classified Serious or Minor by each strategy. Where the two agree, patients receive the agreed management (intervene if Serious, do not intervene if Minor). Where the results are discrepant, patients are randomised to intervene vs do not intervene, and outcomes in the two randomised arms are compared.]
Alternative Diagnostic RCT
Everybody gets all tests; randomise only those with discrepant results.
Benefits:
assesses diagnostic yield and the resultant patient outcomes
less follow-up is required (see the sketch after this list)
if a reference standard is included for a random sample, comparative diagnostic accuracy can also be assessed
Downsides:
more tests are undertaken
problems arise when test material is limited
does not assess test harms or other direct effects
it may not be ethical to randomise treatment
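A rough sketch of the "less follow-up" point: reusing the earlier illustrative discrepant-cell arithmetic (discrepant proportion = prevalence × sensitivity gain), both designs must test a large cohort, but the alternative design randomises and follows up only the discrepant patients.

```python
from math import ceil

def follow_up_burden(treatment_trial_n, prevalence, sensitivity_gain):
    """Cohort tested vs patients followed up in the alternative
    design, where only discrepant results are randomised."""
    p_discrepant = prevalence * sensitivity_gain
    n_tested = ceil(treatment_trial_n / p_discrepant)
    return n_tested, treatment_trial_n

tested, followed = follow_up_burden(300, 0.20, 0.20)
print(f"test {tested} patients; randomise and follow up only {followed}")
# -> test 7500 patients; randomise and follow up only 300
```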
Assessing clinicians' behaviours
Informative trials require documentation and standardisation of decision-making, which is particularly difficult when the comparison group is standard practice.
Behaviour observed in a trial may not be representative: future behaviour will depend on the trial results, and learning curves may affect compliance:
becoming acquainted with a test
ascertaining how best to use it
gaining confidence in its findings
allowing it to replace other investigations
Diagnostic Before-and-After Studies
Design:
doctors' assessments of diagnostic, prognostic and required management decisions are recorded
the result of the new test is made available
doctors' changes in diagnostic, prognostic and required management decisions are noted
(a reference standard is applied)
Application: assessment of an additional test only; assessment of diagnostic yield and management.
Concerns: the new test is assessed independently of other tests; doctors' processes may not reflect standard clinical practice; learning effects.
Conclusions
1. We have much to learn about the best way of studying diagnostic tests.
2. Test+treatment trials are difficult to undertake, are prone to bias, and often require unattainable sample sizes.
3. Good comparative studies of test accuracy, combined in decision models with evidence from trials of treatments, may in many circumstances provide the necessary evidence for policy decisions.
4. Good comparative studies of test accuracy should be commissioned more readily.
Defects and Disasters in Evaluations of the Impact of Diagnostic Tests
Jon Deeks, Professor of Health Statistics
University of Birmingham
Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award