TRANSCRIPT
Studying the Impact of Tests
Jon Deeks, Professor of Health Statistics
University of Birmingham
Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award
Answering policy questions about the use of diagnostic tests
Should GPs refer patients with low back pain for X-ray and/or MRI?
Should patients with dyspeptic symptoms receive serology tests for H. pylori, endoscopy, or empirical therapy?
Standard hierarchy for HTA of tests (Fryback and Thornbury 1991)
1. Technical quality of the test
2. Diagnostic accuracy
3. Change in diagnostic thinking
4. Change in patient management
5. Change in patient outcomes
6. Societal costs and benefits
Studies on the Diagnostic Evaluation Pathway
Analytical validity: reliability (repeatability and reproducibility); measurement accuracy
Diagnostic validity: diagnostic accuracy; comparative/incremental diagnostic accuracy
Impact: change in diagnostic yield; change in management; change in patient outcomes
Economic evaluation
HTA policy on evaluating tests (up until 2004)
“the emphasis of the HTA programme is to assess the effect on patient management and outcomes … improvements in diagnostic accuracy, whilst relevant, are not the primary interest of this commissioned research programme”
Studies on the Diagnostic Evaluation Pathway
Analytical validity: reliability (repeatability and reproducibility); measurement accuracy
Diagnostic validity: diagnostic accuracy; incremental diagnostic accuracy
Impact: change in diagnostic yield; change in management; change in patient outcomes
Economic evaluation
[Focus of HTA programme: impact and economic evaluation]
Outline of talk
Trials of diagnostic evaluations
Problems: what is being evaluated? statistical power; study validity; outcomes
Pragmatic suggestions: when are trials really needed? alternative trial designs; alternatives for assessing comparative accuracy
More research is needed
RCT to assess patient outcomes
[Flow diagram: population → sample → randomise to active or control → outcome in each arm]
Diagnostic RCT
[Flow diagram: population → sample → randomise to TEST or control → outcome in each arm]
RCT 1: X-ray at first GP presentation for low back pain. HTA 2000(4): 20
[Flow diagram: GP attendees aged 16-64 yrs with LBP, excluded if 'flu or a previous consultation for LBP in the last 4 weeks. Sample N=153, randomised to referral for X-ray (N=73) or no X-ray referral (N=80). Outcomes assessed at 6 weeks (N=67 and N=59) and at 1 year (N=58 and N=50).]
PRIMARY outcomes: Roland score, HADS, SF-36, EuroQol
SECONDARY outcomes: time off work, therapists, medication, satisfaction
RESULTS
At 6 weeks: SF-36 mental health and vitality subscales (P<.05)
At 12 months: SF-36 mental health subscale (P<.05)
RCT 2: X-ray for GP presentation with low back pain of >6 weeks. HTA 2001(5): 30
[Flow diagram: GP attendees aged 20-55 with a first episode of LBP of between 6 weeks and 6 months duration, excluded if 'red flags'. Sample N=421, randomised to referral for X-ray (N=210) or no X-ray referral (N=211). Outcomes assessed at 3 months (N=203 and N=199) and at 9 months (N=199 and N=195).]
PRIMARY outcome: Roland score
SECONDARY outcomes: pain (VAS), pain (diary), pain (any), time off work, medication, EuroQol, satisfaction, belief in X-ray, therapists, consultations
RESULTS
At 3 months: proportion reporting LBP (P<.05)
At 9 months: none
What is being evaluated?
[Diagram: Medical Test → Information → Decision → Action → Patient Outcome. Diagnostic accuracy relates to the information produced, diagnostic yield to the decisions made, and management to the actions taken; test harms and placebo effects act on patient outcome directly. An RCT combines all of these effects.]
What is being evaluated?
Conditions for a test to be of diagnostic benefit:
the test is more accurate
interpretation of test results is rational and consistent
management is rational and consistent
treatment is effective
Conditions for a trial to be informative:
rules for interpretation of test results are described
the management protocol is described
No such descriptions were given in the example trials. Applying their results requires faith that the behaviour of your patients and clinicians is the same as in the trial.
What is being evaluated?
If no difference is observed…
Is the test no more accurate?
Are clinicians not correctly interpreting test results?
Are management decisions inconsistent or inappropriate?
Is the treatment ineffective?
None of these questions can be answered.
If one element changes, the results of the trial become redundant.
Statistical Power
RCT 1: a reduction in the proportion with pain at 2 weeks from 40% to 30% could be detected with 300 patients, with 80% power at 5% significance.
RCT 2: a difference of 1.5 on the Roland score could be detected with 388 patients, with 90% power at 5% significance (sd=4.5; standardised difference = 1.5/4.5 = 0.33).
These sample size calculations are suitable for a trial of treatment vs placebo, not a trial of test+treatment (a sketch of the calculation follows).
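As a rough illustration of the scale of RCT 2's calculation, here is a minimal sketch in Python of the standard normal-approximation formula for a two-arm trial detecting a standardised mean difference; the published figure of 388 may reflect adjustments not reproduced here.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(std_diff, power=0.90, alpha=0.05):
    """Patients per arm to detect a standardised mean difference
    (two-sided test, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / std_diff ** 2)

# RCT 2: difference of 1.5 on the Roland score with sd = 4.5,
# i.e. a standardised difference of 0.33, 90% power, 5% significance
print(2 * n_per_arm(0.33))  # total N, approximately 386 (quoted as 388)
```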
Diagnostic Accuracy of Clinical Judgement
[2×2 table: columns Serious (requires intervention) vs Minor (requires no intervention); cells TP and FN under Serious, FP and TN under Minor.]
Diagnostic Accuracy of Clinical Judgement + X-ray
[The same 2×2 table for clinical judgement plus X-ray: TP and FN among Serious cases (requiring intervention), FP and TN among Minor cases (requiring no intervention).]
Comparison of Diagnostic Accuracy
[Overlay of the two 2×2 tables: among Serious cases (requiring intervention) the strategies share All TP and All FN, with Discrepant A where they disagree; among Minor cases (requiring no intervention) they share All FP and All TN, with Discrepant B where they disagree.]
Benefit can only occur in those whose diagnosis changes. Where can differences arise?
Discrepant A could benefit if the intervention is effective.
Discrepant B could benefit if the intervention is harmful.
All others have no benefit, as there is no change in their intervention.
Sample size must take into account:
prevalence of the treatable condition
detection rate (sensitivity) with the control test
detection rate (sensitivity) with the new test
treatment rate if the control test is negative (assume zero)
treatment rate if the new test is positive (assume 100%)
outcome for the treatable condition if untreated
treatment effect
Sample size for detecting treatment effects
Sample size for treatment vs control: 300-400.
Sample size must be inflated according to the proportion in the discrepant cells (particularly A); see the sketch after this slide.
If 20% have serious disease and the gain in sensitivity is 20 percentage points, 4% will be in Discrepant A: increase N 25-fold (N=7,500-10,000).
If 10% have serious disease and the gain in sensitivity is 10 percentage points, 1% will be in Discrepant A: increase N 100-fold (N=30,000-40,000).
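A minimal sketch of the inflation arithmetic just described (function and variable names are illustrative, not from the talk):

```python
from math import ceil

def test_trial_n(treatment_trial_n, prevalence, sensitivity_gain):
    """Inflate a treatment-trial sample size for a test+treatment
    trial: only Discrepant A patients (detected by the new test but
    missed by the control test) can show any treatment benefit."""
    p_discrepant_a = prevalence * sensitivity_gain
    return ceil(treatment_trial_n / p_discrepant_a)

# 20% serious disease, 20-point sensitivity gain -> 4% in Discrepant A
print(test_trial_n(300, 0.20, 0.20), test_trial_n(400, 0.20, 0.20))  # 7500 10000
# 10% serious disease, 10-point sensitivity gain -> 1% in Discrepant A
print(test_trial_n(300, 0.10, 0.10), test_trial_n(400, 0.10, 0.10))  # 30000 40000
```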
Sample size for detecting differences in accuracy
Sample size depends on whether the sample all receive both tests, or are randomised to one of the tests.
Sample sizes for a difference in sensitivity (see the sketch below):
If 20% have serious disease, to detect a 20% difference in sensitivity (from 70% to 90%) with 80% power and alpha 0.05: paired cohort design N=116 [68-136]; parallel cohort design N=232.
If 10% have serious disease, to detect a 10% difference in sensitivity (from 80% to 90%) with 80% power and alpha 0.05: paired cohort design N=706 [271-814]; parallel cohort design N=1411.
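For the paired design, the number of diseased patients required depends not only on the two sensitivities but on how often the tests disagree, which is presumably why a range is quoted. The following is a hedged sketch using a standard McNemar-type formula: it illustrates the paired logic and why the answer spans a range, but it will not reproduce the slide's exact figures, which depend on the method and assumptions used there.

```python
from math import ceil, sqrt
from statistics import NormalDist

def paired_cohort_n(sens_diff, discordance, prevalence,
                    power=0.80, alpha=0.05):
    """Whole-cohort size (everyone gets both tests) to detect a
    difference in sensitivity, via a McNemar-type formula on the
    diseased subgroup. `discordance` is the assumed proportion of
    diseased patients on whom the two tests disagree."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    diseased = ((z_a * sqrt(discordance)
                 + z_b * sqrt(discordance - sens_diff ** 2)) ** 2
                / sens_diff ** 2)
    return ceil(diseased / prevalence)

# Sensitivity 70% vs 90%, 20% prevalence: discordance among the
# diseased can range from 0.20 (minimum) to 0.40 (maximum),
# giving a range of cohort sizes for the same comparison
print(paired_cohort_n(0.20, 0.20, 0.20),
      paired_cohort_n(0.20, 0.40, 0.20))
```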
Sample size for detecting differences in diagnoses and management
Sample size is based on the accuracy sample size, inflated according to:
For diagnostic impact: diagnosis rate if the control test is negative; diagnosis rate if the new test is positive*
For therapeutic impact: treatment rate if the control test is negative; treatment rate if the new test is positive*
* subject to "learning effects"
Validity Concerns
Blinding: participants and outcome assessors are rarely blind in diagnostic trials, so trials may be more susceptible to measuring the preconceived notions of participants and the expectations of trialists.
Drop-out: lack of blinding can induce differential drop-out, and there are more stages at which drop-out can occur.
Compliance: lack of blinding and the complexity of the strategies can reduce compliance.
What outcomes? The problem is multi-multi-factorial.
Assessing the effect of a single intervention for a single disease requires multiple outcomes.
Tests are used to differentiate between multiple diseases and disease states.
A trial should therefore assess all the important outcomes for the multiple diseases within the differential diagnosis, but trials usually focus on one condition.
Summary of problems
Diagnostic trials are…
rarely done
assess effects of the "test+treatment package"
uninformative about the value of the test
often underpowered
at risk of bias
may not assess all relevant outcomes
may be more likely to detect "placebo" effects than benefits of better diagnoses
may not represent future impact on treatment and diagnostic decisions
Key issues
Trials only need be done in limited circumstances.
Only patients in the discrepant cell are informative.
Audit and feedback studies are better for assessing and changing clinicians' behaviour than trials.
More good comparative studies of test accuracy are required.
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
Categories of test attributes:
The new test is safer or less costly
The new test is more specific (excludes more cases of non-disease)
The new test is more sensitive (detects more cases of disease)
If an RCT of treatments exists, when do we still need to undertake an RCT of test+treatment?
Lord, S. J. et al. Ann Intern Med 2006;144:850-855
Trial evidence versus linked evidence of test accuracy and treatment efficacy
Lord, S. J. et al. Ann Intern Med 2006;144:850-855
Assessing new tests using evidence of test accuracy, given that treatment is effective for cases detected by the old test
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
If the new test has similar sensitivity:
Trials of test+treatment are not required.
Reductions in harm or cost are benefits.
Improved specificity can only be a benefit.
Decision models can be used to analyse the trade-offs between positive and negative benefits, as in the sketch below.
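A minimal sketch of the kind of decision model meant here, trading extra true negatives against false positives under assumed utility values (all numbers below are illustrative assumptions, not from Lord et al.):

```python
def expected_utility(prevalence, sens, spec, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per patient of a test-and-treat strategy."""
    return (prevalence * (sens * u_tp + (1 - sens) * u_fn)
            + (1 - prevalence) * ((1 - spec) * u_fp + spec * u_tn))

# Assumed utilities: treated disease 0.9, missed disease 0.4,
# unnecessary treatment 0.7, correctly untreated 1.0
u = dict(u_tp=0.9, u_fn=0.4, u_fp=0.7, u_tn=1.0)

old_test = expected_utility(0.20, 0.70, 0.95, **u)
new_test = expected_utility(0.20, 0.70, 0.99, **u)  # same sensitivity, more specific
print(f"gain from improved specificity: {new_test - old_test:+.4f}")  # > 0
```

Because a true negative is valued at least as highly as a false positive, the gain term cannot be negative, which is the "improved specificity can only be a benefit" point.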
When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5
If the new test has improved sensitivity:
The value of using the test depends on the treatment response in the extra cases detected.
A trial is still not needed if:
inclusion in the treatment trial was based on the reference standard used for assessing test accuracy
the test is evaluated in a treatment trial as a predictor of response
the new cases represent the same spectrum or subtype of disease
treatment response is known to be similar across the spectrum or subtypes of disease
Alternative Diagnostic RCT
[Flow diagram: population → sample → everyone receives both the clinical diagnosis and the X-ray, and each patient is classified Serious or Minor by each strategy. Where the two agree, patients receive the agreed management (intervene if Serious, do not intervene if Minor). Where the results are discrepant, patients are randomised to intervene vs do not intervene, and outcomes in the two randomised arms are compared.]
Alternative Diagnostic RCT
Everybody gets all tests; randomise only those with discrepant results.
Benefits:
assesses diagnostic yield and the resultant patient outcomes
less follow-up is required (see the sketch after this list)
if a reference standard is included for a random sample, comparative diagnostic accuracy can also be assessed
Downsides:
more tests are undertaken
problems arise when test material is limited
does not assess test harms or other direct effects
it may not be ethical to randomise treatment
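A rough sketch of the "less follow-up" point: reusing the earlier illustrative discrepant-cell arithmetic (discrepant proportion = prevalence × sensitivity gain), both designs must test a large cohort, but the alternative design randomises and follows up only the discrepant patients.

```python
from math import ceil

def follow_up_burden(treatment_trial_n, prevalence, sensitivity_gain):
    """Cohort tested vs patients followed up in the alternative
    design, where only discrepant results are randomised."""
    p_discrepant = prevalence * sensitivity_gain
    n_tested = ceil(treatment_trial_n / p_discrepant)
    return n_tested, treatment_trial_n

tested, followed = follow_up_burden(300, 0.20, 0.20)
print(f"test {tested} patients; randomise and follow up only {followed}")
# -> test 7500 patients; randomise and follow up only 300
```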
Assessing clinicians' behaviours
Informative trials require documentation and standardisation of decision-making, which is particularly difficult when the comparison group is standard practice.
Behaviour observed in a trial may not be representative: future behaviour will depend on the trial results, and learning curves may affect compliance:
becoming acquainted with a test
ascertaining how best to use it
gaining confidence in its findings
allowing it to replace other investigations
Diagnostic Before-and-After Studies
Design:
doctors' assessments of diagnostic, prognostic and required management decisions are recorded
the result of the new test is made available
doctors' changes in diagnostic, prognostic and required management decisions are noted
(a reference standard is applied)
Application: assessment of an additional test only; assessment of diagnostic yield and management.
Concerns: the new test is assessed independently of other tests; doctors' processes may not reflect standard clinical practice; learning effects.
Conclusions
1. We have much to learn about the best way of studying diagnostic tests.
2. Test+treatment trials are difficult to undertake, are prone to bias, and often require unattainable sample sizes.
3. Good comparative studies of test accuracy, combined in decision models with evidence from trials of treatments, may in many circumstances provide the necessary evidence for policy decisions.
4. Good comparative studies of test accuracy should be commissioned more readily.
Defects and Disasters in Evaluations of the Impact of Diagnostic Tests
Jon Deeks, Professor of Health Statistics
University of Birmingham
Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award