

Systematic Reviews of Studies Quantifying the Accuracy of Diagnostic Tests and Markers

Johannes B. Reitsma,1* Karel G.M. Moons,1 Patrick M.M. Bossuyt,2 and Kristian Linnet3

Systematic reviews of diagnostic accuracy studies allow calculation of pooled estimates of accuracy with increased precision and examination of differences in accuracy between tests or subgroups of studies. Recently, several advances have been made in the methods used in performing systematic reviews of diagnostic test accuracy studies, most notably in how to assess the methodological quality of primary diagnostic test accuracy studies by use of the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies 2) instrument and how to develop sound statistical models for metaanalysis of the paired measures of test accuracy (bivariate metaregression model of sensitivity and specificity). This article provides an overview of the different steps within a diagnostic systematic review and highlights these advances, illustrated with empirical data. The potential benefits of some recent developments in the areas of network metaanalysis and individual patient data metaanalysis for diagnostic tests are also discussed.

© 2012 American Association for Clinical Chemistry

There is growing awareness that proper evaluation of diagnostic tests, including biochemical tests, is a requirement for making informed decisions regarding the approval of these tests and recommendations for their use (1). In response, more and more primary evaluation studies of diagnostic tests are being performed and reported in the literature. In view of the increasing number of such studies, healthcare professionals more frequently turn to systematic reviews when seeking the best evidence about diagnostic tests.

In this series of 4 reports, various approaches to the evaluation of diagnostic tests and markers are covered, ranging from single-test accuracy studies [report 1 (2)], to multiple-test studies [report 2 (3)], to the evaluation of tests by their impact on patient outcomes and cost-effectiveness [report 4 (4)]. Each category of study has unique aspects that must be considered in the performance of systematic reviews of these kinds of studies. The focus of this third report in the series is on the methodology used for performing systematic reviews of diagnostic accuracy studies. This report will help readers of systematic reviews of diagnostic studies to judge key aspects of such reviews that may affect the validity (risk of bias) or applicability (generalizability) of the results of a review.

The reasons for performing a review of accuracy studies of diagnostic tests, (bio)markers, or even multivariable diagnostic models incorporating several diagnostic tests or markers are similar to those for performing a review of (randomized) studies on therapeutic interventions:

• Provide a transparent overview of all relevant studies, highlighting differences in design and conduct.

• Perform statistical pooling of estimates of the diagnostic accuracy of a test from individual studies to increase precision.

• Use the increased statistical power gained by pooled estimates to generate or confirm hypotheses about differences in accuracy of a test between clinical subgroups, to examine the impact of study features on the accuracy of a test, and to compare accuracy between different tests.

However, compared with systematic reviews of therapeutic interventions, reviews of diagnostic test accuracy carry additional complexities, particularly as related to the joint interest in 2 measures of accuracy per study (sensitivity and specificity) and the existence of specific forms of biases in diagnostic research (5, 6). Several advances in review methodology have been made in recent years to tackle these complexities (7, 8, 9). An important milestone confirming the relevance and maturity in methods of diagnostic reviews is the uptake of reviews about the accuracy of diagnostic tests in the Cochrane Collaboration database (9, 10).

Here we describe the key steps in performing a systematic review and metaanalysis of diagnostic test accuracy studies, thereby highlighting recent advances in methodology. We illustrate this with a recently published systematic review on diagnostic decision rules, alone or in combination with a D-dimer assay, for the diagnosis of pulmonary embolism [see Appendix 1 (11)]. In the final section, some recent developments are discussed.

1 Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, the Netherlands; 2 Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, the Netherlands; 3 Section of Forensic Chemistry, Department of Forensic Medicine, Faculty of Health Sciences, University of Copenhagen, Denmark.

* Address correspondence to this author at: Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, P.O. Box 85500, 3508 GA Utrecht, the Netherlands. Fax 31-887555485; e-mail j.b.reitsma-2@umcutrecht.nl.

Received January 17, 2012; accepted June 11, 2012. Previously published online at DOI: 10.1373/clinchem.2012.182568.

Clinical Chemistry 58:11 1534-1545 (2012)

Key Steps in a Diagnostic Accuracy Review

Table 1 lists the different steps within a diagnostic review and highlights the key issues within each step. These steps are similar to those of a review of randomized therapeutic studies, but it is probably fair to say that each step in a diagnostic review is more complicated.

I. FRAMING THE REVIEW QUESTION

How the question to be answered is framed is a critical step in any review, but the framing process is more complex in the context of diagnostic test accuracy (DTA)4 reviews. The reason is that tests can have different roles and be applied at different points in the diagnostic work-up of patients (12). As described in the second report of this series, the diagnostic process in clinical practice typically consists of multiple tests being applied in a specific order. Such ordered lists of steps are also known as diagnostic pathways or work-ups. On the basis of test results obtained at each step, reassurance can be provided to patients owing to their low test-based likelihood of a serious condition, treatment can be offered if the likelihood of disease is high enough, or further testing may be required when there is remaining uncertainty. Therefore, the composition of the patient population changes along the diagnostic pathway. Such population changes that occur on the basis of earlier test results are likely to change the accuracy of a subsequent test, as explained in the second report of this series. One example would be the ability of positron-emission tomography-computed tomography (PET-CT) scanning to detect distant metastasis in patients with esophageal cancer who have been scheduled for a major operation with curative intent. This major operation is not worthwhile when distant metastases are present. PET-CT has been evaluated in different types of studies, reflecting different clinical scenarios in which this test can be applied. For example, one group of studies has examined whether PET-CT can detect additional metastasis in patients in whom ultrasound and MRI did not find any metastasis. This is an example of an add-on question: will the addition of a new test correct previous errors by finding additional cases? Any remaining metastases are likely to be small or otherwise difficult to detect, and PET-CT scanning for the detection of these metastases later in the process is more difficult than when PET-CT is evaluated earlier in the diagnostic pathway as a possible alternative to MRI.

4 Nonstandard abbreviations: DTA, diagnostic test accuracy; PET, positron emission tomography; IPD, individual patient data.

Table 1. Main steps in a systematic review of diagnostic test accuracy studies.

I. Framing the review question
   Key issues: target condition; intended role of the index test(s); population of interest.

II. Searching and locating studies
   Key issues: multiple databases; multiple search terms related to target condition and index test; no search filters.

III. Assessment of methodological quality
   Key issues: to assess risk of bias and sources of variation; QUADAS-2 checklist.

IV. Metaanalyzing diagnostic accuracy data
   Key issues: descriptive figures (forest plots and ROC plot); use of hierarchical random-effects models (hierarchical summary ROC approach; bivariate metaregression of sensitivity and specificity); study-level covariates to examine differences in accuracy between index tests or subgroups.

V. Interpreting results and drawing conclusions
   Key issues: precision and variability across studies of relevant accuracy measures; absolute numbers of true-positive, false-positive, true-negative, and false-negative findings derived from summary accuracy measures.

Because the accuracy of a test is likely to differ depending on its place in the diagnostic pathway, specifying the intended role and placement of the index test in the diagnostic pathway or clinical context is the first and critical step for any DTA review. A clear statement in the review of the potential role of the index test (e.g., the intended change in the diagnostic pathway) will facilitate the interpretation of the results of the review (12). Highlighting the intended change in the diagnostic pathway is helpful in determining the appropriate patient population, in selecting the right comparator test (if applicable), and in choosing and interpreting accuracy measures. A modification of the PICO (population, intervention, comparison, outcome) system used in therapeutic studies can also be helpful in framing the question in diagnostic reviews, with the following elements:

• P (Population): the patients in whom the test will be applied in practice. Important elements: setting, presenting symptoms, prior testing.

• I (Index test(s)): the test under evaluation.

• C (Comparator test(s)): relevant if there is an interest in comparing the accuracy of different index tests.

• O (Outcome): the target condition and how the final diagnosis will be made (i.e., the reference standard).

In our example of clinical decision rules for pulmonary embolism (see Appendix 1), the key point of interest is whether these rules (alone or in combination with a D-dimer test) can be used to select patients who do not require further invasive or costly testing. This type of intended role of a test has been referred to as triage. Therefore, the number of patients with negative test results but with a final diagnosis of pulmonary embolism is of key interest (i.e., missed cases).
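For a triage test, the clinically relevant output of a review can be expressed as absolute numbers. As a minimal sketch (the prevalence, sensitivity, and specificity below are invented for illustration, not figures taken from the review), the expected number of missed cases in a cohort can be computed as:

```python
# Expected absolute numbers when a triage test is used to rule out disease.
# Prevalence, sensitivity, and specificity are illustrative values only.

def triage_counts(n_patients, prevalence, sensitivity, specificity):
    """Return expected (TP, FN, TN, FP) in a cohort of n_patients."""
    diseased = n_patients * prevalence
    non_diseased = n_patients - diseased
    tp = diseased * sensitivity          # correctly identified cases
    fn = diseased - tp                   # missed cases: the key quantity in triage
    tn = non_diseased * specificity      # safely ruled out, spared further testing
    fp = non_diseased - tn               # ruled in unnecessarily
    return tp, fn, tn, fp

tp, fn, tn, fp = triage_counts(1000, prevalence=0.20, sensitivity=0.95, specificity=0.50)
print(f"Per 1000 patients: {fn:.0f} missed cases, {tn:.0f} spared further testing")
```

Presenting such absolute numbers per 1000 tested patients makes the consequences of a given summary sensitivity and specificity concrete for readers of the review.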

II. SEARCHING FOR AND LOCATING STUDIES

Identifying all relevant studies is a key objective in any systematic review. Searching for diagnostic accuracy studies proves to be more difficult than for intervention studies because there are no specific search terms for diagnostic test accuracy studies such as "randomized clinical trial" for therapeutic intervention studies (13). Search strategies in diagnostic reviews are generally based on combining different sets of terms related to (a) the test(s) under evaluation and (b) the clinical condition of interest. Both MeSH terms and free-text words describing the index test and condition of interest should be used in the search. The articles of these 2 sets can then be combined using the Boolean "AND" operator. The use of filters to limit a set of articles to diagnostic accuracy studies is not recommended, because the use of these filters can cause a meaningful number of relevant studies to be missed (i.e., in situations in which the filter is not sensitive enough) or, in the case of highly sensitive filters, can lead to a situation in which hardly any articles are eliminated from those that need to be screened (14).
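The two-set strategy described above can be sketched as follows: synonyms within each set are combined with "OR", and the two sets are then combined with "AND". The terms and PubMed-style tags below are hypothetical placeholders, not the actual strategy used in any review:

```python
# Sketch of combining two term sets with Boolean operators.
# Terms and field tags are illustrative; a real strategy would use the
# controlled vocabulary (e.g., MeSH) of each database plus free-text synonyms.

condition_terms = ['"pulmonary embolism"[MeSH]', '"pulmonary embolism"', 'PE']
test_terms = ['"d-dimer"', '"decision rule"', '"wells score"']

def build_query(set_a, set_b):
    """OR the synonyms within each set, then AND the two sets together."""
    block_a = "(" + " OR ".join(set_a) + ")"
    block_b = "(" + " OR ".join(set_b) + ")"
    return block_a + " AND " + block_b

print(build_query(condition_terms, test_terms))
```

In practice, each database (MEDLINE, EMBASE, and any topic-specific sources) needs its own adaptation of the strategy, because field tags and controlled vocabularies differ between databases.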

Limiting the search to a single database, for example MEDLINE, is generally not considered adequate for systematic reviews (15). Relying solely on MEDLINE may result in the retrieval of a set of reports unrepresentative of all reports that would have been identified through a comprehensive search of several sources. EMBASE is a logical additional source to be searched because it also covers all areas of healthcare (16-18). Many more specific databases exist that can be useful additional sources depending on the topic of the review. The IFCC database may be a useful extra source because it contains diagnostic reviews of tests or markers used in the domain of clinical chemistry (website: www.ifcc.org).

In our pulmonary embolism example, both MEDLINE and EMBASE have been searched using multiple alternative search terms for "pulmonary embolism" combined with multiple search words that can indicate studies reporting results from diagnostic studies. The full search strategy is available as an appendix on the website of the Annals of Internal Medicine (11).

III. QUALITY ASSESSMENT

Quality assessment in diagnostic accuracy reviews focuses on 2 different but related concepts. Assessing both these concepts is important because they may explain why findings differ between studies. The first dimension to consider is whether the results of a study may be biased. Similarly to therapeutic intervention studies, diagnostic accuracy studies have key features in the design and execution that can produce incorrect results within a study. This is often described as "internal validity." Examples of threats to internal validity are the use of an inappropriate reference standard, studies in which not all patients are verified by the reference standard to determine the presence or absence of the target condition (partial verification), or knowledge of the outcome of the reference standard when the results of the index test are interpreted.

Even if there are no apparent flaws directly leading to bias, a diagnostic accuracy study may generate results that are not applicable to the particular question that the review tries to answer. The patients in the study may not be similar to those in whom the test is used, the test may be used at a different point in the care pathway, or the test may be used in a different way than in practice. This refers to the issue of external validity, or generalizability of results.

In 2003, the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) checklist was published as a generic tool for the quality assessment of primary studies within a diagnostic review (19). The QUADAS tool consists of 14 items covering risk of bias, sources of variation, and reporting quality. Each item is rated "yes," "no," or "unclear," where yes indicates the absence of problems. Recently, the QUADAS tool has been revised on the basis of comments from users and experts in the field (8). The items of the revised QUADAS-2 checklist are shown in Table 2. Reviewers are encouraged to add additional items that are relevant for the particular review that they are carrying out.

Quality assessment of included studies takes time and effort because reporting is often incomplete or unclear. Furthermore, several quality items require a subjective judgment of the assessor, for example when judging whether the spectrum of patients in a study matches that of the intended population defined in the review question. Given these difficulties, the strong advice is that at least 2 persons should independently perform the quality assessment. These persons should have relevant knowledge of both the methodological issues in diagnostic accuracy studies and the clinical topic area.

Results of quality assessment can be presented in tables and graphs. Tables can be used to document all features of the included studies, including the QUADAS-2 items. Such a table takes up a lot of space and does not provide a useful, succinct summary for the reader. These tables are often reported as supplemental material on the websites of journals.

Table 2. Risk of bias and applicability judgments in QUADAS-2.a

Domain: Patient selection
- Description: Describe methods of patient selection. Describe included patients (previous testing, presentation, intended use of index test, and setting).
- Signaling questions (yes, no, or unclear): Was a consecutive or random sample of patients enrolled? Was a case-control design avoided? Did the study avoid inappropriate exclusions?
- Risk of bias (high, low, or unclear): Could the selection of patients have introduced bias?
- Concerns about applicability (high, low, or unclear): Are there concerns that the included patients do not match the review question?

Domain: Index test
- Description: Describe the index test and how it was conducted and interpreted.
- Signaling questions: Were the index test results interpreted without knowledge of the results of the reference standard? If a threshold was used, was it prespecified?
- Risk of bias: Could the conduct or interpretation of the index test have introduced bias?
- Concerns about applicability: Are there concerns that the index test, its conduct, or its interpretation differ from the review question?

Domain: Reference standard
- Description: Describe the reference standard and how it was conducted and interpreted.
- Signaling questions: Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test?
- Risk of bias: Could the reference standard, its conduct, or its interpretation have introduced bias?
- Concerns about applicability: Are there concerns that the target condition as defined by the reference standard does not match the review question?

Domain: Flow and timing
- Description: Describe any patients who did not receive the index tests or reference standard or who were excluded from the 2 × 2 table (refer to flow diagram). Describe the interval and any interventions between index tests and the reference standard.
- Signaling questions: Was there an appropriate interval between index tests and reference standard? Did all patients receive a reference standard? Did all patients receive the same reference standard? Were all patients included in the analysis?
- Risk of bias: Could the patient flow have introduced bias?
- Concerns about applicability: (no applicability judgment for this domain)

a Reproduced with permission from Whiting et al. (8).

Two graphical summaries are recommended for presenting the results of the quality assessment. The methodological quality graph presents, for each quality assessment item, the percentage of included studies in which the item was rated "yes," "no," and "unclear" in a stacked bar chart. This type of graph provides the reader with a quick overview of the study quality within the whole review. The methodological quality graph of our pulmonary embolism example is given in Fig. 1, which shows that potential areas of concern are differential verification (e.g., the use of a different reference standard for different groups of patients in a study) and the large proportion of studies providing no data on whether uninterpretable test results were present.
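The percentages behind such a stacked bar chart are simple tallies over the per-study ratings. A minimal sketch (the items and ratings below are made up, not data from the review):

```python
# Tallying quality-assessment ratings into the percentages shown in a
# methodological quality graph. Items and ratings are invented examples.
from collections import Counter

ratings = {  # item -> one rating per included study
    "Consecutive or random sample": ["yes", "yes", "no", "unclear", "yes"],
    "Blinded index test reading":   ["yes", "unclear", "unclear", "no", "no"],
}

def quality_percentages(per_item_ratings):
    """Return {item: {rating: percentage}}, one stacked bar per item."""
    out = {}
    for item, votes in per_item_ratings.items():
        counts = Counter(votes)
        out[item] = {r: 100.0 * counts[r] / len(votes) for r in ("yes", "no", "unclear")}
    return out

for item, pct in quality_percentages(ratings).items():
    print(item, pct)
```

Each item's three percentages sum to 100 and form one bar of the stacked chart.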

A systematic review provides an opportunity to investigate how features of study design, execution, and reporting may have an impact on study findings. One way is to give a narrative summary of the quality assessment and discuss how susceptible the results are to particular biases. Another approach is to do a sensitivity analysis in which studies that fail to meet some standard of quality are excluded. Metaregression allows direct examination of the impact of specific individual quality items on diagnostic accuracy (see next section).

IV. METAANALYSIS AND PRESENTATION OF POOLED DIAGNOSTIC TEST ACCURACY RESULTS

Metaanalysis is the use of statistical techniques to combine the results from a set of individual studies. We can use metaanalysis to obtain summaries of the results of relevant included studies, such as an estimate of the mean diagnostic accuracy of a test or marker, the statistical uncertainty around this mean expressed with 95% CIs, and the variability of individual study findings around mean estimates. Metaanalytical regression models can statistically compare the accuracy of 2 or more different tests and examine how test accuracy varies with specific study characteristics.

In the metaanalysis of diagnostic accuracy studies, the focus is on 2 statistical measures of diagnostic accuracy: the sensitivity of the test (the proportion of patients with the target disease who have an abnormal test result) and the specificity of the test (the proportion of patients without the target disease who have a normal test result). Statistical methods for diagnostic test accuracy have to deal with 2 outcomes simultaneously (i.e., sensitivity and specificity) rather than a single outcome measure (e.g., a relative risk or odds ratio), as is the case for reviews of therapeutic interventions (5). The diagnostic metaanalytical models have to allow for the trade-off between sensitivity and specificity that can arise because studies may vary in the threshold value used to define test positives and test negatives [also see report 1 in our series (20)]. Another feature of diagnostic reviews is the many potential sources of variation in test accuracy results between studies. Examining factors that can (partly) explain variation in these results and the use of a random-effects model are key features of a DTA review.
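Both proportions, with their CIs, come directly from each study's 2 × 2 table. A minimal sketch using a hypothetical table and the Wilson score interval (one common choice for proportion CIs; other interval methods are also used in practice):

```python
# Sensitivity and specificity with 95% CIs from one study's 2x2 table.
# The 2x2 counts below are hypothetical.
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def accuracy_from_2x2(tp, fp, fn, tn):
    """Return (sensitivity, its CI) and (specificity, its CI)."""
    sens = tp / (tp + fn)   # proportion of diseased with abnormal result
    spec = tn / (tn + fp)   # proportion of non-diseased with normal result
    return (sens, wilson_ci(tp, tp + fn)), (spec, wilson_ci(tn, tn + fp))

(sens, sens_ci), (spec, spec_ci) = accuracy_from_2x2(tp=90, fp=20, fn=10, tn=180)
print(f"sensitivity {sens:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
print(f"specificity {spec:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```

These per-study pairs of estimates and CIs are exactly the inputs to the forest plots and ROC plots described below, and to the hierarchical models used for pooling.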

DESCRIPTIVE STATISTICS

The first step in the analysis is to visualize the results from the individual studies within a review. There are 2 types of figures that can be used: forest plots of sensitivity and specificity, and plots of these measures in ROC space.

Forest plots display the estimates of sensitivity and specificity of each study, the corresponding CIs, and the underlying raw numbers in a paired way (Fig. 2). These plots give a visual impression of the variation in results between studies, an indication of the precision with which sensitivity and specificity have been measured in each study, the presence of outliers, and a sense of the mean values of sensitivity and specificity.

Plotting the pairs of sensitivity and specificity estimates from separate studies in the ROC space provides additional insight regarding the variation of results between studies, in particular whether sensitivity and specificity are negatively correlated (Fig. 3). The x axis of the ROC plot displays the (1 - specificity) values obtained in the studies in the review, and the y axis shows the corresponding sensitivity. The rising diagonal line indicates values of sensitivity and specificity belonging to a test that is not informative, i.e., the chances of a positive test result are identical for patients with and without the target disease. Better (i.e., more informative) tests will have higher values of both sensitivity and specificity and are therefore located more toward the top-left corner of the ROC space. If there is a trade-off (i.e., negative correlation) between sensitivity and specificity, a shoulder-like pattern in the ROC space will emerge. This pattern will be comparable to the pattern that arises in a single study of a test that produces a continuous result in which the threshold has been varied. Lowering the threshold will then increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, will generate the shoulder-like pattern in the ROC space.

Fig. 1. Quality assessment of included studies performed with the original QUADAS checklist in a recently published systematic review on diagnostic decision rules used alone or in combination with a D-dimer assay for the diagnosis of pulmonary embolism. Reproduced with permission from Lucassen et al. (11).

Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al. (11).
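The threshold-driven trade-off described above can be simulated directly: drawing test values for diseased and non-diseased patients from two overlapping distributions (illustrative normal distributions here, not data from any study) and lowering the positivity threshold raises sensitivity while lowering specificity:

```python
# Simulating the sensitivity-specificity trade-off behind the
# shoulder-like ROC pattern. Distributions are illustrative only.
import random

random.seed(1)
diseased = [random.gauss(2.0, 1.0) for _ in range(5000)]      # higher test values
non_diseased = [random.gauss(0.0, 1.0) for _ in range(5000)]  # lower test values

def sens_spec(threshold):
    """A result above the threshold counts as test-positive."""
    sens = sum(x > threshold for x in diseased) / len(diseased)
    spec = sum(x <= threshold for x in non_diseased) / len(non_diseased)
    return sens, spec

for t in (2.0, 1.0, 0.0):  # progressively lower thresholds
    sens, spec = sens_spec(t)
    print(f"threshold {t:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Plotting the resulting (1 - specificity, sensitivity) pairs traces out the shoulder-like curve; between-study threshold variation produces the same pattern across studies in a review.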

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules as well as across different rules (Fig. 3).

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses-Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and therefore it is no longer recommended for evaluating differences between summary ROC curves between tests or examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses-Littenberg approach, 2 more rigorous statistical approaches have since been developed. These are the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both models are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlations, as well as the precision of these estimates within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the 2 models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, or provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
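Fitting the full bivariate or hierarchical summary ROC model requires general mixed-model software. As a deliberately simplified sketch of the random-effects idea only, the following pools logit-transformed sensitivities with a DerSimonian-Laird estimator; note that this univariate shortcut handles one measure at a time and ignores the sensitivity-specificity correlation that the bivariate model is designed to capture (the study counts are invented):

```python
# Simplified univariate random-effects pooling of logit sensitivities
# (DerSimonian-Laird). NOT the bivariate model: it ignores specificity
# and the between-study correlation. Study counts are hypothetical.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def pool_random_effects(tp_fn_pairs):
    """tp_fn_pairs: per-study (true positives, false negatives)."""
    # 0.5 continuity correction keeps logits finite for zero cells
    y = [logit((tp + 0.5) / (tp + fn + 1)) for tp, fn in tp_fn_pairs]
    v = [1 / (tp + 0.5) + 1 / (fn + 0.5) for tp, fn in tp_fn_pairs]
    w = [1 / vi for vi in v]                         # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]             # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return inv_logit(pooled)

studies = [(90, 10), (45, 15), (70, 5)]  # hypothetical (TP, FN) per study
print(f"pooled sensitivity: {pool_random_effects(studies):.2f}")
```

The logit scale keeps pooled proportions inside (0, 1); the bivariate model applies the same transformation jointly to sensitivity and specificity and estimates their between-study covariance as well.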

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for this variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models thatallow flexibility in examining sources of heterogeneityby including study-level covariates This feature pro-vides the option of formally comparing the results ofstudies with a specific feature (eg partial verification)with the results of studies that have avoided partial ver-ification In the same way we can examine whether theaccuracy results from studies examining test A are dif-ferent from studies examining test B Limiting thecomparison of tests to studies with a cross-over designin which both index tests have been applied in the samepatient may be a preferred approach These so-called

Fig 3 Pairs of sensitivity and specificity values fromstudies examining 3 different rules for the diagnosisof pulmonary embolism (A Wells rule with a cutoffvalue 2 B Wells rule with a cutoff value of 4 CGestalt) Adapted from Lucassen et al (11 )

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 58:11 (2012) 1541

paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females, which would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that (as expected) mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 than for studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity by including it as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in (24).

In this case, the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).

V INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as ex-

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review.

• Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.

• Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take the possible correlations into account.

• Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased, often exaggerated, results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

• Variation by clinical subgroups: Examine stratified results or summaries at a study level.

• Unexplained variation: It is likely that remaining variation beyond chance will be present in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.


pressed in the review question, and the precision and variability in accuracy results.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity or specificity, or both, are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and the choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
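For instance, using the summary estimates for the Wells rule with a cutoff value of 2 from the example review (sensitivity 84%, specificity 58%) and an assumed prevalence of 15%, the projection can be sketched as follows (the function name and the choice of prevalence are illustrative):

```python
def cohort_counts(sensitivity, specificity, prevalence, n=1000):
    """Project summary accuracy estimates onto a hypothetical cohort of n patients."""
    diseased = prevalence * n
    healthy = n - diseased
    tp = sensitivity * diseased          # correctly detected cases
    fn = diseased - tp                   # missed cases
    tn = specificity * healthy           # correctly ruled out
    fp = healthy - tn                    # false alarms
    return {k: round(v) for k, v in zip(("TP", "FN", "TN", "FP"), (tp, fn, tn, fp))}

# Wells rule, cutoff value of 2: summary sensitivity 84%, specificity 58%
print(cohort_counts(0.84, 0.58, 0.15))
# -> {'TP': 126, 'FN': 24, 'TN': 493, 'FP': 357}
```

The 24 false negatives per 1000 patients make concrete why a triage test with this sensitivity cannot be used on its own.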

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity and the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates lower than 2% (Table 4). This frequency has been considered sufficiently low, and therefore such strategies have been implemented in

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism, the impact of prevalence on sensitivity and specificity, and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Subgroup (no. of studies)                    Sensitivity (95% CI)    Specificity (95% CI)
Type of rule
  Wells, cutoff value of 2 (n = 19)          84% (78–89%)            58% (52–65%)
  Wells, cutoff value of 4 (n = 11)          60% (49–69%)            80% (75–84%)
  Gestalt (n = 15)                           85% (78–90%)            51% (39–63%)
  P value, Wells 2 vs Wells 4                P < 0.001               P < 0.001
  P value, Wells 2 vs Gestalt                P = 0.96                P = 0.31
  P value, Wells 4 vs Gestalt                P < 0.001               P < 0.001
Impact of prevalence within Wells 2 studies
  Prevalence 5%                              67% (58–75%)            72% (65–79%)
  Prevalence 15%                             85% (80–89%)            58% (52–63%)
  Prevalence 30%                             91% (88–94%)            47% (40–55%)
  P value for trend                          P < 0.001               P < 0.001

Adding D-dimer testing to rule               Failure rate (95% CI)   Efficiency (95% CI)
  Wells 4 with quantitative D-dimer (n = 4)  0.5% (0.2–0.9%)         39% (30–48%)
  Wells 2 with qualitative D-dimer (n = 5)   0.9% (0.5–1.7%)         40% (32–49%)


practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios, several alternative tests are available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test therefore have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of intervention research to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle for generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in turn generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules, without and with D-dimer assay, for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100/min, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, to refrain from further testing if both tests (rule and D-dimer assay) are negative. Further details and more rules can be found in the original review (11).
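As an illustration of how such a points-based rule works, the scoring can be sketched as follows. The item names and weights shown here are the commonly cited version of the Wells PE score and should be verified against the original publication; the function and the example patient are hypothetical:

```python
# Commonly cited Wells items and point weights (verify against the original rule).
WELLS_ITEMS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_score(findings):
    """Sum the points for all items present in a patient (a set of item names)."""
    return sum(points for item, points in WELLS_ITEMS.items() if item in findings)

# Hypothetical patient: tachycardia and a previous DVT
score = wells_score({"heart_rate_over_100", "previous_dvt_or_pe"})
print(score)  # 3.0; with a cutoff value of 4 this would count as a negative result
```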


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test result from a rule, in combination with a negative D-dimer test result, is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromized patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.

23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.

24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.

25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.

26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.


These steps are illustrated with an example from a recently published systematic review on diagnostic decision rules, alone or in combination with a D-dimer assay, for the diagnosis of pulmonary embolism; see Appendix 1 (11). In the final section, some recent developments are discussed.

Key Steps in a Diagnostic Accuracy Review

Table 1 lists the different steps within a diagnostic review and highlights the key issues within each step. These steps are similar to those of a review of randomized therapeutic studies, but it is probably fair to say that each step in a diagnostic review is more complicated.

I FRAMING THE REVIEW QUESTION

How the question to be answered is framed is a critical step in any review, but the framing process is more complex in the context of diagnostic test accuracy

(DTA)4 reviews. The reason is that tests can have different roles and be applied at different points in the diagnostic work-up of patients (12). As described in the second report of this series, the diagnostic process in clinical practice typically consists of multiple tests being applied in a specific order. Such ordered lists of steps are also known as diagnostic pathways or work-ups. On the basis of the test results obtained at each step, reassurance can be provided to patients owing to their low test-based likelihood of a serious condition, treatment can be offered if the likelihood of disease is high enough, or further testing may be required when there is remaining uncertainty. Therefore, the composition of the patient population changes along the diagnostic pathway. Such population changes that occur on the basis of earlier test results are likely to change the accuracy of a subsequent test, as explained in the second report of this series. One example is the ability of positron-emission tomography-computed tomography (PET-CT) scanning to detect distant metastases in patients with esophageal cancer who have been scheduled for a major operation with curative intent. This major operation is not worthwhile when distant metastases are present. PET-CT has been evaluated in different types of studies, reflecting different clinical scenarios in which this test can be applied. For example, one group of studies has examined whether PET-CT can detect additional metastases in patients in whom ultrasound and MRI did not find any metastases. This is an example of an add-on question: will the addition of a new test correct previous errors by finding additional cases? Any remaining metastases are likely to be small or otherwise difficult to detect, and PET-CT scanning for the detection of these metastases later in the process is more difficult than when PET-CT is evaluated earlier in the diagnostic pathway as a possible alternative to MRI.

Because the accuracy of a test is likely to differ depending on its place in the diagnostic pathway, specifying the intended role and placement of the index test in the diagnostic pathway or clinical context is the first and critical step for any DTA review. A clear statement in the review of the potential role of the index test (e.g., the intended change in the diagnostic pathway) will facilitate the interpretation of the results of the review (12). Highlighting the intended change in the diagnostic pathway is helpful in determining the appropriate patient population, in ensuring that the right comparator test (if applicable) is selected, and in choosing and interpreting accuracy measures. A modification of the PICO (population, intervention, comparison, out-

4 Nonstandard abbreviations: DTA, diagnostic test accuracy; PET, positron emission tomography; IPD, individual patient data.

Table 1. Main steps in a systematic review of diagnostic test accuracy studies.

I. Framing the review question
   • Target condition
   • Intended role of the index test(s)
   • Population of interest

II. Searching and locating studies
   • Multiple databases
   • Multiple search terms related to target condition and index test
   • No search filters

III. Assessment of methodological quality
   • To assess risk of bias and sources of variation
   • QUADAS-2 checklist

IV. Metaanalyzing diagnostic accuracy data
   • Descriptive figures: forest plots and ROC plot
   • Use of hierarchical random-effects models: hierarchical summary ROC approach; bivariate metaregression of sensitivity and specificity
   • Study-level covariates to examine differences in accuracy between index tests or subgroups

V. Interpreting results and drawing conclusions
   • Precision and variability across studies of relevant accuracy measures
   • Absolute numbers of true-positive, false-positive, true-negative, and false-negative findings derived from summary accuracy measures


come) system used in therapeutic studies can also be helpful in framing the question in diagnostic reviews, with the following elements:

• P (Population): the patients in whom the test will be applied in practice. Important elements: setting, presenting symptoms, prior testing.
• I (Index test(s)): the test under evaluation.
• C (Comparator test(s)): relevant if there is an interest in comparing the accuracy of different index tests.
• O (Outcome): the target condition and how the final diagnosis will be made (i.e., the reference standard).

In our example of clinical decision rules for pulmonary embolism (see Appendix 1), the key point of interest is whether these rules (alone or in combination with a D-dimer test) can be used to select patients who do not require further invasive or costly testing. This type of intended role of a test has been referred to as triage. Therefore, the number of patients with negative test results but with a final diagnosis of pulmonary embolism (i.e., missed cases) is of key interest.

II SEARCHING FOR AND LOCATING STUDIES

Identifying all relevant studies is a key objective in any systematic review. Searching for diagnostic accuracy studies proves to be more difficult than for intervention studies because there are no specific search terms for diagnostic test accuracy studies, such as "randomized clinical trial" for therapeutic intervention studies (13). Search strategies in diagnostic reviews are generally based on combining different sets of terms related to (a) the test(s) under evaluation and (b) the clinical condition of interest. Both MeSH terms and free-text words describing the index test and condition of interest should be used in the search. The articles of these 2 sets can then be combined using the Boolean "AND" operator. The use of filters to limit a set of articles to diagnostic accuracy studies is not recommended, because the use of these filters can cause a meaningful number of relevant studies to be missed (i.e., in situations in which the filter is not sensitive enough) or, in the case of highly sensitive filters, can lead to a situation in which hardly any articles are eliminated from those that need to be screened (14).
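As a sketch, such a 2-set strategy can be assembled mechanically. The term sets below are illustrative only and are far smaller than a real search strategy; field tags follow the common PubMed convention:

```python
# Hypothetical term sets for the pulmonary embolism example; real strategies
# combine many more synonyms, MeSH terms, and free-text variants.
condition_terms = ['"pulmonary embolism"[MeSH Terms]',
                   '"pulmonary embolism"[Title/Abstract]']
test_terms = ['"decision rule"[Title/Abstract]',
              '"d-dimer"[Title/Abstract]']

# Each concept is a set of synonyms joined with OR; the 2 concepts are then
# intersected with the Boolean AND operator.
query = "({}) AND ({})".format(" OR ".join(condition_terms),
                               " OR ".join(test_terms))
print(query)
```

The same structure (condition set AND test set, no methodological filter) transfers directly to EMBASE with that database's own field syntax.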

Limiting the search to a single database, for example MEDLINE, is generally not considered adequate for systematic reviews (15). Relying solely on MEDLINE may result in the retrieval of a set of reports unrepresentative of all reports that would have been identified through a comprehensive search of several sources. EMBASE is a logical additional source to be searched because it also covers all areas of healthcare (16–18). Many more specific databases exist that can be useful additional sources, depending on the topic of the review. The IFCC database may be a useful extra source because it contains diagnostic reviews of tests or markers used in the domain of clinical chemistry (website: www.ifcc.org).

In our pulmonary embolism example, both MEDLINE and EMBASE were searched, using multiple alternative search terms for "pulmonary embolism" combined with multiple search words that can indicate studies reporting results from diagnostic studies. The full search strategy is available as an appendix on the website of the Annals of Internal Medicine (11).

III QUALITY ASSESSMENT

Quality assessment in diagnostic accuracy reviews focuses on 2 different but related concepts. Assessing both these concepts is important because they may explain why findings differ between studies. The first dimension to consider is whether the results of a study may be biased. Similarly to therapeutic intervention studies, diagnostic accuracy studies have key features in their design and execution that can produce incorrect results within a study. This is often described as "internal validity." Examples of threats to internal validity are the use of an inappropriate reference standard, studies in which not all patients are verified by the reference standard to determine the presence or absence of the target condition (partial verification), or knowledge of the outcome of the reference standard when the results of the index test are interpreted.
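A simple invented numeric example shows how partial verification distorts accuracy estimates when test-negative patients are verified less often than test-positive patients (all counts and the 10% verification fraction are hypothetical):

```python
# Hypothetical complete data: true sensitivity 0.80, true specificity 0.90
tp, fn, fp, tn = 80, 20, 90, 810

# Partial verification: all index-test-positive patients are verified,
# but only 10% of index-test-negative patients (fn and tn) are sent on
# to the reference standard.
verified_fn = 0.10 * fn   # only 2 of the 20 missed cases are verified
verified_tn = 0.10 * tn   # only 81 of the 810 true negatives are verified

apparent_sens = tp / (tp + verified_fn)
apparent_spec = verified_tn / (verified_tn + fp)

print(round(apparent_sens, 2))  # sensitivity appears inflated (~0.98 vs 0.80)
print(round(apparent_spec, 2))  # specificity appears deflated (~0.47 vs 0.90)
```

The direction and size of the bias depend on how verification relates to the index test result, which is why QUADAS-2 probes this in its flow-and-timing domain.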

Even if there are no apparent flaws directly leading to bias, a diagnostic accuracy study may generate results that are not applicable to the particular question that the review tries to answer. The patients in the study may not be similar to those in whom the test is used, the test may be used at a different point in the care pathway, or the test may be used in a different way than in practice. This refers to the issue of external validity, or generalizability, of results.

In 2003, the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) checklist was published as a generic tool for the quality assessment of primary studies within a diagnostic review (19). The QUADAS tool consists of 14 items covering risk of bias, sources of variation, and reporting quality. Each item is rated "yes," "no," or "unclear," where "yes" indicates the absence of problems. Recently, the QUADAS tool has been revised on the basis of comments from users and experts in the field (8). The items of the revised QUADAS-2 checklist are shown in Table 2. Reviewers are encouraged to add items that are relevant for the particular review that they are carrying out.

Quality assessment of included studies takes time and effort because reporting is often incomplete or unclear. Furthermore, several quality items require a subjective judgment by the assessor, for example when


judging whether the spectrum of patients in a study matches that of the intended population defined in the review question. Given these difficulties, the strong advice is that at least 2 persons should independently perform the quality assessment. These persons should have relevant knowledge of both the methodological issues in diagnostic accuracy studies and the clinical topic area.

Results of quality assessment can be presented in tables and graphs. Tables can be used to document all features of the included studies, including the QUADAS-2 items. Such a table takes up a lot of space and does not provide a useful, succinct summary for the reader; these tables are therefore often reported as supplemental material on the websites of journals.

Table 2. Revised QUADAS-2 checklist: risk of bias and applicability judgments in QUADAS-2.a

Domain: Patient selection
• Description: Describe methods of patient selection. Describe included patients (previous testing, presentation, intended use of index test, and setting).
• Signaling questions (yes, no, or unclear): Was a consecutive or random sample of patients enrolled? Was a case-control design avoided? Did the study avoid inappropriate exclusions?
• Risk of bias (high, low, or unclear): Could the selection of patients have introduced bias?
• Concerns about applicability (high, low, or unclear): Are there concerns that the included patients do not match the review question?

Domain: Index test
• Description: Describe the index test and how it was conducted and interpreted.
• Signaling questions: Were the index test results interpreted without knowledge of the results of the reference standard? If a threshold was used, was it prespecified?
• Risk of bias: Could the conduct or interpretation of the index test have introduced bias?
• Concerns about applicability: Are there concerns that the index test, its conduct, or its interpretation differ from the review question?

Domain: Reference standard
• Description: Describe the reference standard and how it was conducted and interpreted.
• Signaling questions: Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test?
• Risk of bias: Could the reference standard, its conduct, or its interpretation have introduced bias?
• Concerns about applicability: Are there concerns that the target condition as defined by the reference standard does not match the review question?

Domain: Flow and timing
• Description: Describe any patients who did not receive the index test(s) or reference standard or who were excluded from the 2 × 2 table (refer to flow diagram). Describe the interval and any interventions between index test(s) and the reference standard.
• Signaling questions: Was there an appropriate interval between index test(s) and reference standard? Did all patients receive a reference standard? Did all patients receive the same reference standard? Were all patients included in the analysis?
• Risk of bias: Could the patient flow have introduced bias?

a Reproduced with permission from Whiting et al. (8).


Two graphical summaries are recommended for presenting the results of the quality assessment. The methodological quality graph presents, for each quality assessment item, the percentage of included studies in which the item was rated "yes," "no," or "unclear" in a stacked bar chart. This type of graph provides the reader with a quick overview of the study quality within the whole review. The methodological quality graph of our pulmonary embolism example is given in Fig. 1, which shows that potential areas of concern are differential verification (e.g., the use of a different reference standard for different groups of patients in a study) and the large proportion of studies providing no data on whether uninterpretable test results were present.

A systematic review provides an opportunity to investigate how features of study design, execution, and reporting may have an impact on study findings. One way is to give a narrative summary of the quality assessment and discuss how susceptible the results are to particular biases. Another approach is to do a sensitivity analysis in which studies that fail to meet some standard of quality are excluded. Metaregression allows direct examination of the impact of specific individual quality items on diagnostic accuracy (see next section).
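The exclusion-based sensitivity analysis can be sketched in a few lines: represent each included study as a record carrying its 2 × 2 counts and its QUADAS-2 judgments, then re-pool after dropping studies rated at high risk of bias. This is a didactic sketch only; the counts and field names are invented, and the unweighted mean is a deliberate simplification of the weighted models discussed later.

```python
# Sketch of a quality-based sensitivity analysis: exclude studies judged
# "high" risk of bias in any QUADAS-2 domain, then recompute a (simple,
# unweighted) mean sensitivity. All counts and field names are invented.
studies = [
    {"tp": 45, "fn": 5,  "bias": {"patient_selection": "low",  "index_test": "low"}},
    {"tp": 30, "fn": 10, "bias": {"patient_selection": "high", "index_test": "low"}},
    {"tp": 28, "fn": 2,  "bias": {"patient_selection": "low",  "index_test": "unclear"}},
]

def low_risk(study):
    """True if no QUADAS-2 domain was rated 'high' risk of bias."""
    return all(rating != "high" for rating in study["bias"].values())

def mean_sensitivity(subset):
    return sum(s["tp"] / (s["tp"] + s["fn"]) for s in subset) / len(subset)

all_sens = mean_sensitivity(studies)
restricted = [s for s in studies if low_risk(s)]
restricted_sens = mean_sensitivity(restricted)
print(len(restricted), round(all_sens, 3), round(restricted_sens, 3))  # 2 0.861 0.917
```

In a real review the restricted set would be re-pooled with the bivariate model rather than an unweighted mean.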

IV METAANALYSIS AND PRESENTATION OF POOLED DIAGNOSTIC TEST ACCURACY RESULTS

Metaanalysis is the use of statistical techniques to combine the results from a set of individual studies. We can use metaanalysis to obtain summaries of the results of relevant included studies, such as an estimate of the mean diagnostic accuracy of a test or marker, the statistical uncertainty around this mean expressed with 95% CIs, and the variability of individual study findings around mean estimates. Metaanalytical regression models can statistically compare the accuracy of 2 or more different tests and examine how test accuracy varies with specific study characteristics.

In the metaanalysis of diagnostic accuracy studies, the focus is on 2 statistical measures of diagnostic accuracy: the sensitivity of the test (the proportion of patients with the target disease who have an abnormal test result) and the specificity of the test (the proportion of patients without the target disease who have a normal test result). Statistical methods for diagnostic test accuracy have to deal with 2 outcomes simultaneously (i.e., sensitivity and specificity) rather than a single outcome measure (e.g., a relative risk or odds ratio), as is the case for reviews of therapeutic interventions (5). The diagnostic metaanalytical models have to allow for the trade-off between sensitivity and specificity that can arise because studies may vary in the threshold value used to define test positives and test negatives [also see report 1 in our series (20)]. Another feature of diagnostic reviews is the many potential sources of variation in test accuracy results between studies. Examining factors that can (partly) explain variation in these results and the use of a random-effects model are key features of a diagnostic test accuracy (DTA) review.
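As a concrete reminder of the 2 measures, a minimal sketch computing sensitivity and specificity from a single study's 2 × 2 table (the counts are invented for illustration):

```python
# Sensitivity and specificity from a 2x2 diagnostic table.
# Counts are invented for illustration.
tp, fn = 90, 10   # diseased patients: abnormal vs normal index test result
fp, tn = 30, 70   # non-diseased patients: abnormal vs normal result

sensitivity = tp / (tp + fn)  # proportion of diseased with abnormal result
specificity = tn / (tn + fp)  # proportion of non-diseased with normal result
print(sensitivity, specificity)  # 0.9 0.7
```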

DESCRIPTIVE STATISTICS

The first step in the analysis is to visualize the results from the individual studies within a review. There are 2 types of figures that can be used: forest plots of sensitivity and specificity, and plots of these measures in ROC space.

Forest plots display the estimates of sensitivity and specificity of each study, the corresponding CIs, and the underlying raw numbers in a paired way (Fig. 2). These plots give a visual impression of the variation in results between studies, an indication of the precision with which sensitivity and specificity have been measured in each study, the presence of outliers, and a sense of the mean values of sensitivity and specificity.
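The per-study quantities shown in such a forest plot follow directly from each study's 2 × 2 counts. A sketch using the Wilson score interval, one common choice for a proportion CI (primary reviews may use exact or other intervals; the counts below are invented):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (e.g., sensitivity)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# One invented study: 40 of 50 diseased patients test positive.
sens = 40 / 50
lo, hi = wilson_ci(40, 50)
print(f"sensitivity {sens:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```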

Plotting the pairs of sensitivity and specificity estimates from separate studies in the ROC space provides additional insight regarding the variation of results between studies, in particular whether sensitivity and specificity are negatively correlated (Fig. 3). The x axis of the ROC plot displays the (1 − specificity) obtained in the studies in the review, and the y axis shows the corresponding sensitivity. The rising diagonal line indicates values of sensitivity and specificity belonging to a test that is not informative, i.e., the chances for a positive test result are identical for patients with and without the target disease. Better (i.e., more informative) tests will have higher values of both sensitivity and

Fig. 1. Quality assessment of included studies performed with the original QUADAS checklist in a recently published systematic review on diagnostic decision rules used alone or in combination with a D-dimer assay for the diagnosis of pulmonary embolism. Reproduced with permission from Lucassen et al. (11).


Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and Gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al. (11).


specificity and are therefore located more toward the top-left corner of the ROC space. If there is a trade-off (i.e., negative correlation) between sensitivity and specificity, a shoulder-like pattern in the ROC space will emerge. This pattern will be comparable to the pattern that arises in a single study of a test that produces a continuous result in which the threshold has been varied. Lowering the threshold will then increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same



time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, will generate the shoulder-like pattern in the ROC space.
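This threshold mechanism is easy to demonstrate by simulation: model a continuous test as two overlapping normal distributions (distributions chosen purely for illustration, not from the review) and compute sensitivity and specificity at two cutoffs.

```python
import random
random.seed(0)

# Simulated continuous test: diseased scores ~ N(2, 1), non-diseased ~ N(0, 1).
# Purely illustrative distributions.
diseased = [random.gauss(2, 1) for _ in range(10_000)]
healthy = [random.gauss(0, 1) for _ in range(10_000)]

def sens_spec(threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    sens = sum(x >= threshold for x in diseased) / len(diseased)
    spec = sum(x < threshold for x in healthy) / len(healthy)
    return sens, spec

sens_hi, spec_hi = sens_spec(1.5)   # higher threshold
sens_lo, spec_lo = sens_spec(0.5)   # lower threshold
# Lowering the threshold raises sensitivity and lowers specificity.
print(sens_lo > sens_hi, spec_lo < spec_hi)  # True True
```

Plotting such (1 − specificity, sensitivity) pairs over a range of thresholds traces exactly the shoulder-like ROC pattern described above.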

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules as well as across different rules (Fig. 3).

METAANALYSIS OF DIAGNOSTIC ACCURACY DATA

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses–Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and therefore it is no longer recommended for evaluating differences between summary ROC curves between tests or examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses–Littenberg approach, 2 more rigorous statistical approaches have since been developed. These are the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both models are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlations, as well as the precision of these estimates

within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the 2 models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, or provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
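A full bivariate model is normally fitted with dedicated software (e.g., the R packages mada or lme4, or SAS PROC NLMIXED). Purely as an illustration of the building blocks, the sketch below pools logit-transformed sensitivities with a DerSimonian–Laird random-effects estimate. It handles only one of the two outcomes and ignores the joint modelling of sensitivity and specificity and their correlation that the bivariate model performs, so treat it as a didactic simplification, not the recommended method. The counts are invented.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def pool_logit_sens(studies):
    """DerSimonian-Laird random-effects pooling of logit(sensitivity).

    `studies` is a list of (TP, FN) tuples; 0.5 is added to every cell
    (continuity correction) so zero cells do not break the transform.
    """
    y, v = [], []
    for tp, fn in studies:
        tp, fn = tp + 0.5, fn + 0.5
        y.append(logit(tp / (tp + fn)))
        v.append(1 / tp + 1 / fn)           # within-study variance, logit scale
    w = [1 / vi for vi in v]                # fixed-effect (inverse-variance) weights
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    k = len(y)
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / denom)  # between-study variance estimate
    w_star = [1 / (vi + tau2) for vi in v]  # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return expit(mu)                        # back-transform to a proportion

# Invented per-study (TP, FN) counts.
pooled = pool_logit_sens([(45, 5), (80, 20), (30, 10), (55, 5)])
print(round(pooled, 3))
```

The same two-stage logic, applied jointly to logit(sensitivity) and logit(specificity) with a between-study covariance term, is what the bivariate model estimates in one likelihood.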

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A are different from studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be a preferred approach. These so-called

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).


paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females, which would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that (as expected) mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 compared with studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity in a study by including it as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity. An overview of these potential reasons is given in (24).

In this case the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).

V INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as ex-

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review.

• Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.
• Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take the possible correlations into account.
• Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased results, often producing more exaggerated results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.
• Variation by clinical subgroups: Examine stratified results or summaries at a study level.
• Unexplained variation: It is likely that remaining variation beyond chance will be present in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.


pressed in the review question, and the precision and variability in accuracy results.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity or specificity, or both, are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
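Such a hypothetical-cohort calculation is simple arithmetic. The sketch below uses the Wells (cutoff of 2) summary estimates from Table 4 and an assumed prevalence of 15%, chosen here purely for illustration:

```python
# Expected test results in a hypothetical cohort of 1000 patients,
# from summary sensitivity/specificity and an assumed prevalence.
cohort = 1000
prevalence = 0.15          # assumed for illustration
sensitivity = 0.84         # Wells cutoff-of-2 summary estimate (Table 4)
specificity = 0.58

diseased = cohort * prevalence
healthy = cohort - diseased
tp = sensitivity * diseased          # correctly detected cases
fn = diseased - tp                   # missed cases
tn = specificity * healthy           # correctly reassured patients
fp = healthy - tn                    # false alarms
print(round(tp), round(fn), round(tn), round(fp))  # 126 24 493 357
```

Repeating the calculation for the competing test makes the trade-off in false negatives versus false positives explicit.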

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should therefore be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of having pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates that were lower than 2% (Table 4). This frequency has been considered sufficiently low, and therefore such strategies have been implemented in

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism, the impact of prevalence on sensitivity and specificity, and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule (no. of studies): sensitivity (95% CI); specificity (95% CI)
• Wells, cutoff value of 2 (n = 19): 84% (78–89); 58% (52–65)
• Wells, cutoff value of 4 (n = 11): 60% (49–69); 80% (75–84)
• Gestalt (n = 15): 85% (78–90); 51% (39–63)
• P value, Wells 2 vs Wells 4: P < 0.001; P < 0.001
• P value, Wells 2 vs Gestalt: P = 0.96; P = 0.31
• P value, Wells 4 vs Gestalt: P < 0.001; P < 0.001

Impact of prevalence within Wells 2 studies: sensitivity (95% CI); specificity (95% CI)
• Prevalence 5%: 67% (58–75); 72% (65–79)
• Prevalence 15%: 85% (80–89); 58% (52–63)
• Prevalence 30%: 91% (88–94); 47% (40–55)
• P value for trend: P < 0.001; P < 0.001

Adding D-dimer testing to rule (no. of studies): failure rate (95% CI); efficiency (95% CI)
• Wells 4 with quantitative D-dimer (n = 4): 0.5% (0.2–0.9); 39% (30–48)
• Wells 2 with qualitative D-dimer (n = 5): 0.9% (0.5–1.7); 40% (32–49)


practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.
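The failure rate and efficiency of such a rule-plus-D-dimer strategy follow directly from a cohort's counts. The sketch below uses invented numbers, not data from the review:

```python
# Failure rate and efficiency of a triage strategy in which patients
# negative on both the decision rule AND the D-dimer get no further
# testing. Counts are invented for illustration.
total_patients = 1000
double_negative = 400         # negative rule AND negative D-dimer
pe_among_double_negative = 4  # PE ultimately diagnosed in that group

failure_rate = pe_among_double_negative / double_negative
efficiency = double_negative / total_patients
print(f"failure rate {failure_rate:.1%}, efficiency {efficiency:.0%}")
# failure rate 1.0%, efficiency 40%
```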

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios there are several alternative tests available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of interventions to combine both direct and indirect comparisons within a single statistical model to allow for ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without a D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule to refrain from further testing if both tests (rule + D-dimer assay) are negative. Further details and more rules can be found in the original review (11).

Review

1544 Clinical Chemistry 5811 (2012)

REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test from a rule in combination with a negative D-dimer test result is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromised patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.

23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.

24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.

25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.

26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.


The PICO (Population, Intervention, Comparison, Outcome) system used in therapeutic studies can also be helpful in framing the question in diagnostic reviews, with the following elements:

• P, Population: the patients in whom the test will be applied in practice. Important elements: setting, presenting symptoms, prior testing.
• I, Index test(s): the test under evaluation.
• C, Comparator test(s): relevant if there is an interest in comparing the accuracy of different index tests.
• O, Outcome: the target condition and how the final diagnosis will be made (i.e., the reference standard).

In our example of clinical decision rules for pulmonary embolism (see Appendix 1), the key point of interest is whether these rules (alone or in combination with a D-dimer test) can be used to select patients who do not require further invasive or costly testing. This type of intended role of a test has been referred to as triage. Therefore, the number of patients with negative test results but with a final diagnosis of pulmonary embolism (i.e., missed cases) is of key interest.

II SEARCHING FOR AND LOCATING STUDIES

Identifying all relevant studies is a key objective in any systematic review. Searching for diagnostic accuracy studies proves to be more difficult than for intervention studies because there are no specific search terms for diagnostic test accuracy studies, such as "randomized clinical trial" for therapeutic intervention studies (13). Search strategies in diagnostic reviews are generally based on combining different sets of terms related to (a) the test(s) under evaluation and (b) the clinical condition of interest. Both MeSH terms and free-text words describing the index test and condition of interest should be used in the search. The articles of these 2 sets can then be combined using the Boolean "AND" operator. The use of filters to limit a set of articles to diagnostic accuracy studies is not recommended, because the use of these filters can cause a meaningful number of relevant studies to be missed (i.e., in situations in which the filter is not sensitive enough) or, in the case of highly sensitive filters, can lead to a situation in which hardly any articles are eliminated from those that need to be screened (14).
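The two-set strategy amounts to OR-ing the terms within each set and AND-ing the sets together; a sketch of that query construction (the terms and field tags below are illustrative placeholders, not the review's actual strategy):

```python
# Build a Boolean search string combining a test-related term set and a
# condition-related term set with AND. Terms are invented placeholders.
test_terms = ['"d-dimer"[MeSH Terms]', "d-dimer[Title/Abstract]"]
condition_terms = ['"pulmonary embolism"[MeSH Terms]',
                   "pulmonary embolism[Title/Abstract]"]

def or_block(terms):
    """Join synonyms for one concept with OR, in parentheses."""
    return "(" + " OR ".join(terms) + ")"

query = or_block(test_terms) + " AND " + or_block(condition_terms)
print(query)
```

The same structure carries over to other databases (e.g., EMBASE), with database-specific field tags substituted for the PubMed-style ones shown here.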

Limiting the search to a single database, for example MEDLINE, is generally not considered adequate for systematic reviews (15). Relying solely on MEDLINE may result in the retrieval of a set of reports unrepresentative of all reports that would have been identified through a comprehensive search of several sources. EMBASE is a logical additional source to be searched because it also covers all areas of healthcare (16-18). Many more specific databases exist that can be useful additional sources, depending on the topic of the review. The IFCC database may be a useful extra source because it contains diagnostic reviews of tests or markers used in the domain of clinical chemistry (website: www.ifcc.org).

In our pulmonary embolism example, both MEDLINE and EMBASE were searched, using multiple alternative search terms for "pulmonary embolism" combined with multiple search words that can indicate studies reporting results from diagnostic studies. The full search strategy is available as an appendix on the website of the Annals of Internal Medicine (11).

III. QUALITY ASSESSMENT

Quality assessment in diagnostic accuracy reviews focuses on 2 different but related concepts. Assessing both concepts is important because they may explain why findings differ between studies. The first dimension to consider is whether the results of a study may be biased. Similarly to therapeutic intervention studies, diagnostic accuracy studies have key features in their design and execution that can produce incorrect results within a study; this is often described as "internal validity." Examples of threats to internal validity are the use of an inappropriate reference standard, studies in which not all patients are verified by the reference standard to determine the presence or absence of the target condition (partial verification), or knowledge of the outcome of the reference standard when the results of the index test are interpreted.

Even if there are no apparent flaws directly leading to bias, a diagnostic accuracy study may generate results that are not applicable to the particular question that the review tries to answer. The patients in the study may not be similar to those in whom the test is used, the test may be used at a different point in the care pathway, or the test may be used in a different way than in practice. This refers to the issue of external validity, or generalizability of results.

In 2003, the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) checklist was published as a generic tool for the quality assessment of primary studies within a diagnostic review (19). The QUADAS tool consists of 14 items covering risk of bias, sources of variation, and reporting quality. Each item is rated "yes," "no," or "unclear," where "yes" indicates the absence of problems. Recently, the QUADAS tool was revised on the basis of comments from users and experts in the field (8). The items of the revised QUADAS-2 checklist are shown in Table 2. Reviewers are encouraged to add items that are relevant for the particular review that they are carrying out.

Quality assessment of included studies takes time and effort because reporting is often incomplete or unclear. Furthermore, several quality items require a subjective judgment by the assessor, for example when judging whether the spectrum of patients in a study matches that of the intended population defined in the review question. Given these difficulties, the strong advice is that at least 2 persons independently perform the quality assessment. These persons should have relevant knowledge of both the methodological issues in diagnostic accuracy studies and the clinical topic area.

Results of quality assessment can be presented in tables and graphs. Tables can be used to document all features of the included studies, including the QUADAS-2 items. Such a table takes up a lot of space and does not provide a useful, succinct summary for the reader. These tables are often reported as supplemental material on the websites of journals.

Table 2. Risk of bias and applicability judgments in QUADAS-2 (the revised QUADAS-2 checklist)a

Domain: Patient selection
  Description: Describe methods of patient selection. Describe included patients (previous testing, presentation, intended use of index test, and setting).
  Signaling questions (yes/no/unclear): Was a consecutive or random sample of patients enrolled? Was a case-control design avoided? Did the study avoid inappropriate exclusions?
  Risk of bias (high/low/unclear): Could the selection of patients have introduced bias?
  Concerns about applicability (high/low/unclear): Are there concerns that the included patients do not match the review question?

Domain: Index test
  Description: Describe the index test and how it was conducted and interpreted.
  Signaling questions: Were the index test results interpreted without knowledge of the results of the reference standard? If a threshold was used, was it prespecified?
  Risk of bias: Could the conduct or interpretation of the index test have introduced bias?
  Concerns about applicability: Are there concerns that the index test, its conduct, or its interpretation differ from the review question?

Domain: Reference standard
  Description: Describe the reference standard and how it was conducted and interpreted.
  Signaling questions: Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test?
  Risk of bias: Could the reference standard, its conduct, or its interpretation have introduced bias?
  Concerns about applicability: Are there concerns that the target condition as defined by the reference standard does not match the review question?

Domain: Flow and timing
  Description: Describe any patients who did not receive the index test(s) or reference standard or who were excluded from the 2 × 2 table (refer to flow diagram). Describe the interval and any interventions between index test(s) and the reference standard.
  Signaling questions: Was there an appropriate interval between index test(s) and reference standard? Did all patients receive a reference standard? Did all patients receive the same reference standard? Were all patients included in the analysis?
  Risk of bias: Could the patient flow have introduced bias?

a Reproduced with permission from Whiting et al (8).


Two graphical summaries are recommended for presenting the results of the quality assessment. The methodological quality graph presents, for each quality assessment item, the percentage of included studies in which the item was rated "yes," "no," or "unclear" in a stacked bar chart. This type of graph provides the reader with a quick overview of study quality within the whole review. The methodological quality graph of our pulmonary embolism example is given in Fig. 1, which shows that potential areas of concern are differential verification (e.g., the use of a different reference standard for different groups of patients in a study) and the large proportion of studies providing no data on whether uninterpretable test results were present.

A systematic review provides an opportunity to investigate how features of study design, execution, and reporting may have an impact on study findings. One way is to give a narrative summary of the quality assessment and discuss how susceptible the results are to particular biases. Another approach is to perform a sensitivity analysis in which studies that fail to meet some standard of quality are excluded. Metaregression allows direct examination of the impact of specific individual quality items on diagnostic accuracy (see next section).

IV. METAANALYSIS AND PRESENTATION OF POOLED DIAGNOSTIC TEST ACCURACY RESULTS

Metaanalysis is the use of statistical techniques to combine the results from a set of individual studies. We can use metaanalysis to obtain summaries of the results of relevant included studies, such as an estimate of the mean diagnostic accuracy of a test or marker, the statistical uncertainty around this mean (expressed with 95% CIs), and the variability of individual study findings around the mean estimates. Metaanalytical regression models can statistically compare the accuracy of 2 or more different tests and examine how test accuracy varies with specific study characteristics.

In the metaanalysis of diagnostic accuracy studies, the focus is on 2 statistical measures of diagnostic accuracy: the sensitivity of the test (the proportion of patients with the target disease who have an abnormal test result) and the specificity of the test (the proportion of patients without the target disease who have a normal test result). Statistical methods for diagnostic test accuracy have to deal with these 2 outcomes simultaneously rather than with a single outcome measure (e.g., a relative risk or odds ratio), as is the case for reviews of therapeutic interventions (5). The diagnostic metaanalytical models have to allow for the trade-off between sensitivity and specificity that can arise because studies may vary in the threshold value used to define test positives and test negatives [also see report 1 in our series (20)]. Another feature of diagnostic reviews is the many potential sources of variation in test accuracy results between studies. Examining factors that can (partly) explain variation in these results and the use of a random-effects model are key features of a diagnostic test accuracy (DTA) review.
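For reference, the 2 accuracy measures are computed from a study's 2 × 2 table as follows (the counts are hypothetical, not from the review):

```python
# Sensitivity and specificity from the 2x2 table of a single diagnostic
# accuracy study. Cell counts are illustrative only.
def accuracy_measures(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from 2x2 cell counts."""
    sensitivity = tp / (tp + fn)  # diseased patients with abnormal result
    specificity = tn / (tn + fp)  # non-diseased patients with normal result
    return sensitivity, specificity

sens, spec = accuracy_measures(tp=90, fp=40, fn=10, tn=160)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Each included study contributes one such pair, and it is these pairs that the metaanalytical models described below combine.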

DESCRIPTIVE STATISTICS

The first step in the analysis is to visualize the results from the individual studies within a review. There are 2 types of figures that can be used: forest plots of sensitivity and specificity, and plots of these measures in ROC space.

Forest plots display the estimates of sensitivity and specificity of each study, the corresponding CIs, and the underlying raw numbers in a paired way (Fig. 2). These plots give a visual impression of the variation in results between studies, an indication of the precision with which sensitivity and specificity have been measured in each study, the presence of outliers, and a sense of the mean values of sensitivity and specificity.

Plotting the pairs of sensitivity and specificity estimates from separate studies in ROC space provides additional insight regarding the variation of results between studies, in particular whether sensitivity and specificity are negatively correlated (Fig. 3). The x axis of the ROC plot displays the (1 - specificity) values obtained in the studies in the review, and the y axis shows the corresponding sensitivity. The rising diagonal line indicates values of sensitivity and specificity belonging to a test that is not informative, i.e., the chances of a positive test result are identical for patients with and without the target disease. Better (e.g., more informative) tests will have higher values of both sensitivity and specificity and are therefore located more toward the top-left corner of the ROC space. If there is a trade-off (e.g., negative correlation) between sensitivity and specificity, a shoulder-like pattern will emerge in the ROC space. This pattern is comparable to the pattern that arises in a single study of a test with a continuous result in which the threshold has been varied: lowering the threshold will increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, generates the shoulder-like pattern in the ROC space.

Fig. 1. Quality assessment of included studies, performed with the original QUADAS checklist, in a recently published systematic review on diagnostic decision rules used alone or in combination with a D-dimer assay for the diagnosis of pulmonary embolism. Reproduced with permission from Lucassen et al (11).

Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and Gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al (11).

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules and across the different rules (Fig. 3).
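To make the construction of such a plot concrete: each study contributes one point at (1 - specificity, sensitivity), and points above the rising diagonal perform better than chance. A minimal sketch with made-up study results:

```python
# Convert (sensitivity, specificity) pairs from hypothetical studies into
# ROC-space coordinates and flag whether each point lies above the
# non-informative diagonal (sensitivity > 1 - specificity).
studies = [  # (sensitivity, specificity); illustrative values only
    (0.95, 0.40),  # low threshold: high sensitivity, low specificity
    (0.85, 0.55),
    (0.70, 0.75),  # high threshold: lower sensitivity, higher specificity
]

for sens, spec in studies:
    x, y = 1 - spec, sens  # ROC-space coordinates
    print(f"point=({x:.2f}, {y:.2f}) above_diagonal={y > x}")
```

The deliberately anticorrelated values mimic the shoulder-like pattern described above: as sensitivity falls from study to study, specificity rises.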

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses-Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and therefore it is no longer recommended for evaluating differences between the summary ROC curves of tests or for examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses-Littenberg approach, 2 more rigorous statistical approaches have since been developed: the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlation, as well as the precision of these estimates within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, and provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
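As a rough illustration of the weighting idea only (not the full model), one can logit-transform each study's sensitivity and specificity and pool each outcome with inverse-variance weights. This is a deliberately simplified fixed-effect sketch with hypothetical counts; the actual bivariate random-effects model additionally estimates between-study variances and their correlation and is fitted with dedicated software:

```python
import math

# Simplified sketch of pooling one accuracy measure on the logit scale with
# inverse-variance weights. Fixed-effect approximation for illustration only;
# the bivariate random-effects model described above jointly models both
# measures, their between-study variances, and their correlation.
def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def pool_logit(events, totals):
    """Inverse-variance pooled proportion on the logit scale.
    events/totals: per-study counts (e.g., TP and TP+FN for sensitivity)."""
    num = den = 0.0
    for e, n in zip(events, totals):
        p = e / n
        var = 1 / (n * p * (1 - p))  # approximate variance of logit(p)
        num += logit(p) / var
        den += 1 / var
    return inv_logit(num / den)

# Hypothetical per-study counts (not from the review):
tp, diseased = [45, 80, 28], [50, 95, 35]         # sensitivity data
tn, nondiseased = [120, 60, 140], [200, 90, 210]  # specificity data
print(f"pooled sensitivity ~ {pool_logit(tp, diseased):.2f}")
print(f"pooled specificity ~ {pool_logit(nn := tn, nondiseased):.2f}" if False else
      f"pooled specificity ~ {pool_logit(tn, nondiseased):.2f}")
```

Working on the logit scale keeps pooled proportions inside (0, 1) and makes larger, more precise studies count for more, which is the weighting behavior the hierarchical models formalize.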

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for this variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A differ from those of studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be a preferred approach. These so-called paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al (11).

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al (11).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females; this would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the analysis. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that, as expected, mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 than for studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity in that study, by including prevalence as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in (24).

In this case, the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).
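The idea of entering prevalence as a study-level covariate can be sketched as a weighted regression of logit(sensitivity) on study prevalence. The counts below are hypothetical, and this simple weighted least-squares fit only illustrates the covariate idea; the published analysis used the full bivariate metaregression model:

```python
import math

# Sketch of a study-level metaregression: regress logit(sensitivity) on
# study prevalence with inverse-variance weights. Illustrative data only.
# Each tuple: (true positives, number of diseased patients, prevalence).
studies = [(40, 50, 0.05), (66, 75, 0.15), (55, 60, 0.30), (27, 30, 0.35)]

def logit(p):
    return math.log(p / (1 - p))

# Accumulate weighted sums for the line y = a + b * prevalence
sw = swx = swy = swxx = swxy = 0.0
for tp, n, prev in studies:
    p = tp / n
    w = n * p * (1 - p)  # inverse of the approximate variance of logit(p)
    y = logit(p)
    sw += w
    swx += w * prev
    swy += w * y
    swxx += w * prev * prev
    swxy += w * prev * y

b = (swxy - swx * swy / sw) / (swxx - swx * swx / sw)  # slope
a = (swy - b * swx) / sw                               # intercept
print(f"intercept = {a:.2f}, slope = {b:.2f}")
```

A positive slope here corresponds to the finding described in the text: sensitivity rising with study prevalence.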

V. INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability of the accuracy results.

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review

Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.

Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take these possible correlations into account.

Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased results, often producing more exaggerated results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

Variation by clinical subgroups: Examine stratified results or summaries at the study level.

Unexplained variation: It is likely that variation beyond chance will remain in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity, specificity, or both are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and the choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive and false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
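Using the Wells-rule summary estimates from Table 4 and an assumed prevalence of 15% purely for illustration, the 1000-patient calculation looks like this:

```python
# Project summary accuracy estimates onto a hypothetical cohort of 1000
# patients. The prevalence value is an assumption for illustration; the
# sensitivity/specificity pairs are the Table 4 summary estimates for the
# Wells rule at cutoff values of 2 and 4.
def cohort_counts(sens, spec, prevalence, n=1000):
    """Return (TP, FN, FP, TN) counts for a cohort of n patients."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = sens * diseased
    fn = diseased - tp
    tn = spec * healthy
    fp = healthy - tn
    return round(tp), round(fn), round(fp), round(tn)

for name, sens, spec in [("Wells 2", 0.84, 0.58), ("Wells 4", 0.60, 0.80)]:
    tp, fn, fp, tn = cohort_counts(sens, spec, prevalence=0.15)
    print(f"{name}: TP={tp} FN={fn} FP={fp} TN={tn}")
```

Laying the counts side by side makes the trade-off explicit: the lower cutoff misses fewer cases (fewer FN) at the price of many more false positives.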

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and will depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates lower than 2% (Table 4). This frequency has been considered sufficiently low, and therefore such strategies have been implemented in

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism; the impact of prevalence on sensitivity and specificity; and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule (no. of studies)                  Sensitivity (95% CI), %   Specificity (95% CI), %
  Wells, cutoff value of 2 (n = 19)            84 (78-89)                58 (52-65)
  Wells, cutoff value of 4 (n = 11)            60 (49-69)                80 (75-84)
  Gestalt (n = 15)                             85 (78-90)                51 (39-63)
  P value, Wells 2 vs Wells 4                  P < 0.001                 P < 0.001
  P value, Wells 2 vs Gestalt                  P = 0.96                  P = 0.31
  P value, Wells 4 vs Gestalt                  P < 0.001                 P < 0.001

Impact of prevalence within Wells 2 studies
  Prevalence 5%                                67 (58-75)                72 (65-79)
  Prevalence 15%                               85 (80-89)                58 (52-63)
  Prevalence 30%                               91 (88-94)                47 (40-55)
  P value for trend                            P < 0.001                 P < 0.001

Adding D-dimer testing to rule                 Failure rate (95% CI), %  Efficiency (95% CI), %
  Wells 4 with quantitative D-dimer (n = 4)    0.5 (0.2-0.9)             39 (30-48)
  Wells 2 with qualitative D-dimer (n = 5)     0.9 (0.5-1.7)             40 (32-49)


practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.
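For a concrete reading of these 2 measures: failure rate is the proportion of PE cases among patients ruled out by the combined strategy, and efficiency is the proportion of all patients ruled out. With illustrative counts:

```python
# Failure rate and efficiency of a triage strategy (illustrative counts).
# "n_negatives" are patients negative on both the decision rule and the
# D-dimer, who therefore receive no further testing; "n_missed" is the
# subset of those negatives who nevertheless had pulmonary embolism.
def triage_metrics(n_patients, n_negatives, n_missed):
    failure_rate = n_missed / n_negatives  # PE among untested negatives
    efficiency = n_negatives / n_patients  # fraction spared further testing
    return failure_rate, efficiency

fr, eff = triage_metrics(n_patients=1000, n_negatives=400, n_missed=4)
print(f"failure rate = {fr:.1%}, efficiency = {eff:.0%}")
```

Note that the 2 measures pull in opposite directions: a strategy that rules out more patients gains efficiency but risks a higher failure rate.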

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike the situation for randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios, several alternative tests are available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test then have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of intervention research to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than the published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with a D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if it is left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without a D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100/min, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, and further testing is withheld if both tests (rule and D-dimer assay) are negative. Further details and more rules can be found in the original review (11).


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test result from a rule, in combination with a negative D-dimer test result, is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13-21.
2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292-301.
3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408-17.
4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].
5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982-90.
6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189-202.
7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889-97.
8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36.
9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).
10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromized patients. Cochrane Database Syst Rev 2008:CD007394.
11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448-60.
12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089-92.
13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444-9.
14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234-40.
15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168-78.
16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).
17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357-64.
18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1-115.
19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.
20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292-301.

21 Moses LE Shapiro D Littenberg B Combiningindependent studies of a diagnostic test into asummary ROC curve data-analytic approachesand some additional considerations Stat Med1993121293ndash316

22 Rutter CM Gatsonis CA A hierarchical regressionapproach to meta-analysis of diagnostic test ac-curacy evaluations Stat Med 2001202865ndash84

23 Harbord RM Deeks JJ Egger M Whiting PSterne JA A unification of models for meta-analysis of diagnostic accuracy studies Biostatis-tics 20078239ndash51

24 Leeflang MM Bossuyt PM Irwig L Diagnostictest accuracy may vary with prevalence implica-tions for evidence-based diagnosis J Clin Epide-miol 2009625ndash12

25 Li T Puhan MA Vedula SS Singh S Dickersin Kthe Ad Hoc Network Meta-analysis MethodsMeeting Working Group Network meta-analysis-highly attractive but more methodological re-search is needed BMC Med 2011 27979

26 Khan KS Bachmann LM ter Riet G Systematicreviews with individual patient data meta-analysis to evaluate diagnostic tests Eur J ObstetGynecol Reprod Biol 2003108121ndash5

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1545

judging whether the spectrum of patients in a study matches that of the intended population defined in the review question. Given these difficulties, the strong advice is that at least 2 persons should independently perform the quality assessment. These persons should have relevant knowledge of both the methodological issues in diagnostic accuracy studies and the clinical topic area.

Results of the quality assessment can be presented in tables and graphs. Tables can be used to document all features of the included studies, including the QUADAS-2 items. Such a table takes up a lot of space, however, and does not provide a useful, succinct summary for the reader. These tables are therefore often reported as supplemental material on the websites of journals.

Table 2. Revised QUADAS-2 checklist.

Table 1. Risk of bias and applicability judgments in QUADAS-2 (a)

Domain 1: Patient selection
Description: Describe methods of patient selection. Describe included patients (previous testing, presentation, intended use of index test, and setting).
Signaling questions (yes/no/unclear): Was a consecutive or random sample of patients enrolled? Was a case-control design avoided? Did the study avoid inappropriate exclusions?
Risk of bias (high/low/unclear): Could the selection of patients have introduced bias?
Concerns about applicability (high/low/unclear): Are there concerns that the included patients do not match the review question?

Domain 2: Index test
Description: Describe the index test and how it was conducted and interpreted.
Signaling questions (yes/no/unclear): Were the index test results interpreted without knowledge of the results of the reference standard? If a threshold was used, was it prespecified?
Risk of bias (high/low/unclear): Could the conduct or interpretation of the index test have introduced bias?
Concerns about applicability (high/low/unclear): Are there concerns that the index test, its conduct, or its interpretation differ from the review question?

Domain 3: Reference standard
Description: Describe the reference standard and how it was conducted and interpreted.
Signaling questions (yes/no/unclear): Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test?
Risk of bias (high/low/unclear): Could the reference standard, its conduct, or its interpretation have introduced bias?
Concerns about applicability (high/low/unclear): Are there concerns that the target condition as defined by the reference standard does not match the review question?

Domain 4: Flow and timing
Description: Describe any patients who did not receive the index test or reference standard or who were excluded from the 2 × 2 table (refer to flow diagram). Describe the interval and any interventions between index test and reference standard.
Signaling questions (yes/no/unclear): Was there an appropriate interval between index test and reference standard? Did all patients receive a reference standard? Did all patients receive the same reference standard? Were all patients included in the analysis?
Risk of bias (high/low/unclear): Could the patient flow have introduced bias?

(a) Reproduced with permission from Whiting et al (8).

Systematic Reviews of Diagnostic Accuracy Studies: Review. Clinical Chemistry 58:11 (2012).

Two graphical summaries are recommended for presenting the results of the quality assessment. The methodological quality graph presents, for each quality assessment item, the percentage of included studies in which the item was rated "yes," "no," or "unclear" in a stacked bar chart. This type of graph provides the reader with a quick overview of the study quality within the whole review. The methodological quality graph of our pulmonary embolism example is given in Fig. 1, which shows that potential areas of concern are differential verification (e.g., the use of different reference standards for different groups of patients within a study) and the large proportion of studies providing no data on whether uninterpretable test results were present.

A systematic review provides an opportunity to investigate how features of study design, execution, and reporting may have an impact on study findings. One way is to give a narrative summary of the quality assessment and discuss how susceptible the results are to particular biases. Another approach is to perform a sensitivity analysis in which studies that fail to meet some standard of quality are excluded. Metaregression allows direct examination of the impact of specific individual quality items on diagnostic accuracy (see next section).

IV. METAANALYSIS AND PRESENTATION OF POOLED DIAGNOSTIC TEST ACCURACY RESULTS

Metaanalysis is the use of statistical techniques to combine the results from a set of individual studies. We can use metaanalysis to obtain summaries of the results of relevant included studies, such as an estimate of the mean diagnostic accuracy of a test or marker, the statistical uncertainty around this mean (expressed with 95% CIs), and the variability of individual study findings around the mean estimates. Metaanalytical regression models can statistically compare the accuracy of 2 or more different tests and examine how test accuracy varies with specific study characteristics.

In the metaanalysis of diagnostic accuracy studies, the focus is on 2 statistical measures of diagnostic accuracy: the sensitivity of the test (the proportion of patients with the target disease who have an abnormal test result) and the specificity of the test (the proportion of patients without the target disease who have a normal test result). Statistical methods for diagnostic test accuracy therefore have to deal with 2 outcomes simultaneously (i.e., sensitivity and specificity) rather than a single outcome measure (e.g., a relative risk or odds ratio), as is the case for reviews of therapeutic interventions (5). The diagnostic metaanalytical models also have to allow for the trade-off between sensitivity and specificity that can arise because studies may vary in the threshold value used to define test positives and test negatives [also see report 1 in our series (20)]. Another feature of diagnostic reviews is the many potential sources of variation in test accuracy results between studies. Examining factors that can (partly) explain variation in these results and the use of a random-effects model are key features of a diagnostic test accuracy (DTA) review.
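As a minimal illustration of these paired accuracy measures, the snippet below computes sensitivity and specificity from a single study's 2 × 2 table (the counts are hypothetical, not taken from any study in the review):

```python
def accuracy_measures(tp, fn, fp, tn):
    """Sensitivity and specificity from a 2x2 diagnostic table."""
    sensitivity = tp / (tp + fn)  # abnormal result among diseased patients
    specificity = tn / (tn + fp)  # normal result among non-diseased patients
    return sensitivity, specificity

sens, spec = accuracy_measures(tp=90, fn=10, fp=30, tn=70)
print(sens, spec)  # 0.9 0.7
```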

DESCRIPTIVE STATISTICS

The first step in the analysis is to visualize the results from the individual studies within a review. There are 2 types of figures that can be used: forest plots of sensitivity and specificity, and plots of these measures in ROC space.

Forest plots display the estimates of sensitivity and specificity of each study, the corresponding CIs, and the underlying raw numbers in a paired way (Fig. 2). These plots give a visual impression of the variation in results between studies, an indication of the precision with which sensitivity and specificity have been measured in each study, the presence of outliers, and a sense of the mean values of sensitivity and specificity.
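The per-study inputs for such a paired forest plot can be computed directly from the 2 × 2 counts. A sketch follows; the Wilson score interval is our choice here for illustration (published reviews may use exact binomial intervals instead), and the study counts are hypothetical:

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion k/n."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# One row per study: (tp, fn, fp, tn)
studies = [(45, 5, 20, 80), (30, 10, 15, 95), (60, 4, 40, 60)]
for tp, fn, fp, tn in studies:
    sens, sens_ci = tp / (tp + fn), wilson_ci(tp, tp + fn)
    spec, spec_ci = tn / (tn + fp), wilson_ci(tn, tn + fp)
    print(f"sens {sens:.2f} ({sens_ci[0]:.2f}-{sens_ci[1]:.2f})  "
          f"spec {spec:.2f} ({spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```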

Fig. 1. Quality assessment of included studies, performed with the original QUADAS checklist, in a recently published systematic review on diagnostic decision rules used alone or in combination with a D-dimer assay for the diagnosis of pulmonary embolism. Reproduced with permission from Lucassen et al (11).

Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and Gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al (11).

Plotting the pairs of sensitivity and specificity estimates from separate studies in ROC space provides additional insight regarding the variation of results between studies, in particular whether sensitivity and specificity are negatively correlated (Fig. 3). The x axis of the ROC plot displays the (1 − specificity) values obtained in the studies in the review, and the y axis shows the corresponding sensitivity. The rising diagonal line indicates values of sensitivity and specificity belonging to a test that is not informative, i.e., the chances of a positive test result are identical for patients with and without the target disease. Better (i.e., more informative) tests will have higher values of both sensitivity and specificity and are therefore located more toward the top-left corner of the ROC space. If there is a trade-off (i.e., negative correlation) between sensitivity and specificity, a shoulder-like pattern in the ROC space will emerge. This pattern will be comparable to the pattern that arises in a single study of a test that produces a continuous result in which the threshold has been varied. Lowering the threshold will then increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, will generate the shoulder-like pattern in the ROC space.

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules and across different rules (Fig. 3).

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses–Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and it is therefore no longer recommended for evaluating differences between the summary ROC curves of different tests or for examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses–Littenberg approach, 2 more rigorous statistical approaches have since been developed: the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlation, as well as the precision of these estimates within each study (i.e., weighting of studies). Although the starting points of these 2 models are different, they are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, or provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
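To make the structure of the bivariate model concrete, the sketch below fits a simplified version of it by maximum likelihood: each study's (logit sensitivity, logit specificity) pair is treated as a draw from a bivariate normal distribution whose covariance combines between-study variation (with a correlation term) and the within-study sampling variances. This is a didactic approximation, not the exact binomial-likelihood model used in practice; the 2 × 2 counts are hypothetical, and a real analysis would use dedicated metaanalysis software.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical per-study 2x2 counts: (tp, fn, fp, tn); replace with real data.
studies = [(45, 5, 20, 80), (30, 10, 15, 95), (60, 4, 40, 60),
           (25, 12, 10, 70), (50, 8, 30, 90)]

def logit_pairs(studies):
    """Per-study (logit sensitivity, logit specificity) and their approximate
    within-study variances (delta method, 0.5 continuity correction)."""
    ys, vs = [], []
    for tp, fn, fp, tn in studies:
        tp, fn, fp, tn = (x + 0.5 for x in (tp, fn, fp, tn))
        ys.append([np.log(tp / fn), np.log(tn / fp)])
        vs.append([1 / tp + 1 / fn, 1 / tn + 1 / fp])
    return np.array(ys), np.array(vs)

def neg_loglik(theta, ys, vs):
    """Bivariate normal model: y_i ~ N(mu, Sigma_between + S_i)."""
    mu = theta[:2]
    t1, t2 = np.exp(theta[2]), np.exp(theta[3])  # between-study SDs (>0)
    rho = np.tanh(theta[4])                      # between-study correlation
    nll = 0.0
    for y, v in zip(ys, vs):
        cov = np.array([[t1 ** 2 + v[0], rho * t1 * t2],
                        [rho * t1 * t2, t2 ** 2 + v[1]]])
        d = y - mu
        nll += 0.5 * (np.log(np.linalg.det(cov)) + d @ np.linalg.solve(cov, d))
    return nll

ys, vs = logit_pairs(studies)
res = minimize(neg_loglik, x0=[1.0, 1.0, -1.0, -1.0, 0.0], args=(ys, vs),
               method="Nelder-Mead", options={"maxiter": 5000})
sens, spec = expit(res.x[0]), expit(res.x[1])
print(f"summary sensitivity = {sens:.2f}, summary specificity = {spec:.2f}")
```

Back-transforming the fitted means with the inverse logit gives the summary sensitivity and specificity; the estimated correlation captures the threshold-driven trade-off discussed above.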

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for this variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A differ from those of studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be the preferred approach. These so-called paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al (11).

Fig. 4. ROC plot showing the summary estimates of sensitivity and specificity and the corresponding 95% confidence ellipses for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al (11).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females; this would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power, and an increased risk of finding false-positive associations when many covariates are examined, are concerns when diagnostic reviews are conducted. The number of studies within a review is the key limitation on examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that, as expected, mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 compared with studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity by including it as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in reference (24).

In this case, the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was indeed associated with higher sensitivity and lower specificity (Table 4).
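A deliberately simplified, univariate sketch of such a metaregression (the actual analysis includes the covariate within the bivariate model itself): regress the study-level logit(sensitivity) on prevalence, weighting each study by its inverse within-study variance. All counts and prevalences below are hypothetical:

```python
import numpy as np

# Hypothetical per-study data: (tp, fn, prevalence of disease in the study).
data = [(20, 10, 0.05), (40, 9, 0.12), (55, 8, 0.20), (70, 6, 0.28), (90, 5, 0.35)]

y, w, x = [], [], []
for tp, fn, prev in data:
    tp, fn = tp + 0.5, fn + 0.5          # continuity correction
    y.append(np.log(tp / fn))            # logit sensitivity
    w.append(1.0 / (1 / tp + 1 / fn))    # inverse within-study variance
    x.append(prev)

# Weighted least squares via sqrt-weight scaling of the design matrix.
X = np.column_stack([np.ones(len(x)), x])
W = np.sqrt(np.array(w))
beta, *_ = np.linalg.lstsq(X * W[:, None], np.array(y) * W, rcond=None)
print(f"slope of logit(sensitivity) on prevalence: {beta[1]:.2f}")
```

A positive slope would correspond to the pattern reported in the example review (higher sensitivity at higher prevalence), although with only a handful of studies such an estimate is imprecise.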

V. INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability in accuracy results.

Table 3. Causes of variation in sensitivity and specificity results between primary studies within a review.

• Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.

• Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take these possible correlations into account.

• Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased, often exaggerated, results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

• Variation by clinical subgroups: Examine stratified results or summaries at a study level.

• Unexplained variation: It is likely that variation beyond chance will remain in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity, specificity, or both are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and the choice of reference standard) are held constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher for one test, or the entire summary ROC curve of one test lies to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with the different correct and incorrect test results, based on the summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
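This hypothetical-cohort calculation is simple arithmetic. The sketch below applies it using the summary estimates for the Wells rule with a cutoff value of 2 (sensitivity 84%, specificity 58%, as reported in Table 4) and an assumed prevalence of 15%:

```python
def counts_per_1000(sens, spec, prev, n=1000):
    """Expected TP/FN/FP/TN counts in a hypothetical cohort of n patients."""
    diseased = n * prev
    healthy = n - diseased
    return {
        "true positives": round(diseased * sens),
        "false negatives": round(diseased * (1 - sens)),
        "false positives": round(healthy * (1 - spec)),
        "true negatives": round(healthy * spec),
    }

# Wells rule, cutoff value of 2 (summary estimates from Table 4), prevalence 15%
counts = counts_per_1000(sens=0.84, spec=0.58, prev=0.15)
print(counts)
```

Here 24 of 1000 patients would be false negatives (missed cases) and 357 would be false positives, making the consequences of each error type explicit for the reader.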

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and will depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients do not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (the failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates lower than 2% (Table 4). This frequency has been considered sufficiently low, and such strategies have therefore been implemented in practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism, the impact of prevalence on sensitivity and specificity, and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule: Sensitivity, % (95% CI); Specificity, % (95% CI)
• Wells, cutoff value of 2 (n = 19 studies): 84 (78–89); 58 (52–65)
• Wells, cutoff value of 4 (n = 11 studies): 60 (49–69); 80 (75–84)
• Gestalt (n = 15 studies): 85 (78–90); 51 (39–63)
• P value, Wells 2 vs Wells 4: P < 0.001; P < 0.001
• P value, Wells 2 vs Gestalt: P = 0.96; P = 0.31
• P value, Wells 4 vs Gestalt: P < 0.001; P < 0.001

Impact of prevalence within Wells 2 studies: Sensitivity, % (95% CI); Specificity, % (95% CI)
• Prevalence 5%: 67 (58–75); 72 (65–79)
• Prevalence 15%: 85 (80–89); 58 (52–63)
• Prevalence 30%: 91 (88–94); 47 (40–55)
• P value for trend: P < 0.001; P < 0.001

Adding D-dimer testing to rule: Failure rate, % (95% CI); Efficiency, % (95% CI)
• Wells 4 with quantitative D-dimer (n = 4 studies): 0.5 (0.2–0.9); 39 (30–48)
• Wells 2 with qualitative D-dimer (n = 5 studies): 0.9 (0.5–1.7); 40 (32–49)

As with any other type of review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike for randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios, several alternative tests are available, which leads to the key question: which test is best? Direct comparisons of tests (head-to-head comparisons in the same patients, using a cross-over design or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test therefore have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of interventions to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than the published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and in determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in turn generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with a D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if it is left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures, which can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without a D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, with further testing withheld if both tests (rule and D-dimer assay) are negative. Further details and more rules can be found in the original review (11).


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test result from a rule, in combination with a negative D-dimer test result, is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW), K.G.M. Moons; the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromised patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22 Rutter CM Gatsonis CA A hierarchical regressionapproach to meta-analysis of diagnostic test ac-curacy evaluations Stat Med 2001202865ndash84

23 Harbord RM Deeks JJ Egger M Whiting PSterne JA A unification of models for meta-analysis of diagnostic accuracy studies Biostatis-tics 20078239ndash51

24 Leeflang MM Bossuyt PM Irwig L Diagnostictest accuracy may vary with prevalence implica-tions for evidence-based diagnosis J Clin Epide-miol 2009625ndash12

25 Li T Puhan MA Vedula SS Singh S Dickersin Kthe Ad Hoc Network Meta-analysis MethodsMeeting Working Group Network meta-analysis-highly attractive but more methodological re-search is needed BMC Med 2011 27979

26 Khan KS Bachmann LM ter Riet G Systematicreviews with individual patient data meta-analysis to evaluate diagnostic tests Eur J ObstetGynecol Reprod Biol 2003108121ndash5

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1545

Two graphical summaries are recommended for presenting the results of the quality assessment. The methodological quality graph presents, for each quality assessment item, the percentage of included studies in which the item was rated "yes," "no," or "unclear" in a stacked bar chart. This type of graph provides the reader with a quick overview of the study quality within the whole review. The methodological quality graph of our pulmonary embolism example is given in Fig. 1, which shows that potential areas of concern are differential verification (e.g., the use of a different reference standard for different groups of patients in a study) and the large proportion of studies providing no data on whether uninterpretable test results were present.

A systematic review provides an opportunity to investigate how features of study design, execution, and reporting may have an impact on study findings. One way is to give a narrative summary of the quality assessment and discuss how susceptible the results are to particular biases. Another approach is to do a sensitivity analysis in which studies that fail to meet some standard of quality are excluded. Metaregression allows direct examination of the impact of specific individual quality items on diagnostic accuracy (see next section).

IV. METAANALYSIS AND PRESENTATION OF POOLED DIAGNOSTIC TEST ACCURACY RESULTS

Metaanalysis is the use of statistical techniques to combine the results from a set of individual studies. We can use metaanalysis to obtain summaries of the results of relevant included studies, such as an estimate of the mean diagnostic accuracy of a test or marker, the statistical uncertainty around this mean expressed with 95% CIs, and the variability of individual study findings around mean estimates. Metaanalytical regression models can statistically compare the accuracy of 2 or more different tests and examine how test accuracy varies with specific study characteristics.

In the metaanalysis of diagnostic accuracy studies, the focus is on 2 statistical measures of diagnostic accuracy: the sensitivity of the test (the proportion of patients with the target disease who have an abnormal test result) and the specificity of the test (the proportion of patients without the target disease who have a normal test result). Statistical methods for diagnostic test accuracy have to deal with 2 outcomes simultaneously (i.e., sensitivity and specificity), rather than a single outcome measure (e.g., a relative risk or odds ratio), as is the case for reviews of therapeutic interventions (5). The diagnostic metaanalytical models have to allow for the trade-off between sensitivity and specificity that can arise because studies may vary in the threshold value used to define test positives and test negatives [also see report 1 in our series (20)]. Another feature of diagnostic reviews is the many potential sources of variation in test accuracy results between studies. Examining factors that can (partly) explain variation in these results and the use of a random-effects model are key features of a diagnostic test accuracy (DTA) review.
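The 2 accuracy measures can be computed directly from a study's 2x2 table. A minimal sketch, using hypothetical counts rather than data from the review:

```python
def sens_spec(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 diagnostic table."""
    sensitivity = tp / (tp + fn)  # proportion of diseased patients with a positive test
    specificity = tn / (tn + fp)  # proportion of non-diseased patients with a negative test
    return sensitivity, specificity

# Hypothetical study: 90 true positives, 10 false negatives,
# 30 false positives, 70 true negatives
sens, spec = sens_spec(tp=90, fp=30, fn=10, tn=70)
print(round(sens, 2), round(spec, 2))  # 0.9 0.7
```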

DESCRIPTIVE STATISTICS

The first step in the analysis is to visualize the results from the individual studies within a review. There are 2 types of figures that can be used: forest plots of sensitivity and specificity, and plots of these measures in ROC space.

Forest plots display the estimates of sensitivity and specificity of each study, the corresponding CIs, and the underlying raw numbers in a paired way (Fig. 2). These plots give a visual impression of the variation in results between studies, an indication of the precision with which sensitivity and specificity have been measured in each study, the presence of outliers, and a sense of the mean values of sensitivity and specificity.
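The per-study point estimate and 95% CI shown in a forest plot can be computed from the raw counts. A sketch using the Wilson score interval, one common choice for proportions (the review does not specify which interval method its forest plots used):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (e.g., sensitivity)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical study: 45 of 50 diseased patients test positive
lo, hi = wilson_ci(45, 50)
print(f"sensitivity = {45/50:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```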

Fig. 1. Quality assessment of included studies performed with the original QUADAS checklist in a recently published systematic review on diagnostic decision rules used alone or in combination with a D-dimer assay for the diagnosis of pulmonary embolism. Reproduced with permission from Lucassen et al. (11).

Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and Gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al. (11).

Plotting the pairs of sensitivity and specificity estimates from separate studies in ROC space provides additional insight into the variation of results between studies, in particular whether sensitivity and specificity are negatively correlated (Fig. 3). The x axis of the ROC plot displays the (1 − specificity) values obtained in the studies in the review, and the y axis shows the corresponding sensitivity. The rising diagonal line indicates values of sensitivity and specificity belonging to a test that is not informative, i.e., the chances of a positive test result are identical for patients with and without the target disease. Better (i.e., more informative) tests have higher values of both sensitivity and specificity and are therefore located more toward the top-left corner of the ROC space. If there is a trade-off (i.e., a negative correlation) between sensitivity and specificity, a shoulder-like pattern in the ROC space will emerge. This pattern is comparable to the pattern that arises in a single study of a test producing a continuous result in which the threshold has been varied. Lowering the threshold will increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, generates the shoulder-like pattern in the ROC space.

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules and across different rules (Fig. 3).
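The threshold trade-off described above can be illustrated with a small simulation, assuming hypothetical Gaussian marker distributions for diseased and non-diseased patients: lowering the positivity threshold raises sensitivity and lowers specificity.

```python
import random

random.seed(1)
# Hypothetical continuous marker, with higher values in diseased patients
diseased = [random.gauss(2.0, 1.0) for _ in range(10000)]
healthy = [random.gauss(0.0, 1.0) for _ in range(10000)]

def sens_spec_at(threshold):
    """Sensitivity and specificity when 'positive' means marker >= threshold."""
    sens = sum(x >= threshold for x in diseased) / len(diseased)
    spec = sum(x < threshold for x in healthy) / len(healthy)
    return sens, spec

for t in (0.5, 1.0, 1.5):
    s, sp = sens_spec_at(t)
    print(f"threshold {t}: sensitivity {s:.2f}, specificity {sp:.2f}")
```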

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses–Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and it is therefore no longer recommended for evaluating differences between the summary ROC curves of different tests or for examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses–Littenberg approach, 2 more rigorous statistical approaches have since been developed: the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlation, as well as the precision of these estimates within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the 2 models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, and provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for this variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A differ from those of studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be a preferred approach. These so-called paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females; this would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that (as expected) mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 compared with studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity in a study by including it as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in (24).

In this case, the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).
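A crude, unweighted sketch of using prevalence as a study-level covariate is given below, with hypothetical data; the actual analysis used the bivariate metaregression model, which additionally weights studies by precision and models sensitivity and specificity jointly.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical studies: (prevalence, sensitivity)
studies = [(0.05, 0.68), (0.08, 0.74), (0.15, 0.84), (0.22, 0.88), (0.30, 0.91)]

x = [prev for prev, _ in studies]
y = [logit(se) for _, se in studies]

# Ordinary least-squares slope of logit(sensitivity) on prevalence:
# a simple stand-in for adding prevalence as a covariate in the model
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

print(f"slope on logit scale: {slope:.2f}")  # positive: sensitivity rises with prevalence
```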

V. INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability in accuracy results.

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review.

Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.

Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take the possible correlations into account.

Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased, often exaggerated, results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

Variation by clinical subgroups: Examine stratified results or summaries at the study level.

Unexplained variation: It is likely that variation beyond chance will remain in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity, specificity, or both are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors, such as the population and the choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test lies to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive and false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and will depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (the failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates that were lower than 2% (Table 4). This frequency has been considered sufficiently low, and such strategies have therefore been implemented in practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism, the impact of prevalence on sensitivity and specificity, and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Subgroup (no. of studies)                      Sensitivity (95% CI)   Specificity (95% CI)
Type of rule
  Wells, cutoff value of 2 (n = 19)            84% (78%–89%)          58% (52%–65%)
  Wells, cutoff value of 4 (n = 11)            60% (49%–69%)          80% (75%–84%)
  Gestalt (n = 15)                             85% (78%–90%)          51% (39%–63%)
  P value, Wells 2 vs Wells 4                  P < 0.001              P < 0.001
  P value, Wells 2 vs Gestalt                  P = 0.96               P = 0.31
  P value, Wells 4 vs Gestalt                  P < 0.001              P < 0.001
Impact of prevalence within Wells 2 studies
  Prevalence 5%                                67% (58%–75%)          72% (65%–79%)
  Prevalence 15%                               85% (80%–89%)          58% (52%–63%)
  Prevalence 30%                               91% (88%–94%)          47% (40%–55%)
  P value for trend                            P < 0.001              P < 0.001

Adding D-dimer testing to rule                 Failure rate (95% CI)  Efficiency (95% CI)
  Wells 4 with quantitative D-dimer (n = 4)    0.5% (0.2%–0.9%)       39% (30%–48%)
  Wells 2 with qualitative D-dimer (n = 5)     0.9% (0.5%–1.7%)       40% (32%–49%)
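The failure rate and efficiency of such a rule-out strategy follow from simple counts. A sketch with hypothetical numbers (not the pooled counts from the review):

```python
def rule_out_metrics(n_total, n_both_negative, n_missed_pe):
    """Failure rate and efficiency of a 'rule AND D-dimer negative' strategy.

    n_both_negative: patients negative on both the decision rule and the D-dimer
    n_missed_pe: of those, patients who nevertheless had pulmonary embolism
    """
    failure_rate = n_missed_pe / n_both_negative  # proportion of ruled-out patients with PE
    efficiency = n_both_negative / n_total        # proportion spared further testing
    return failure_rate, efficiency

# Hypothetical cohort: 1000 suspected patients, 400 negative on both tests,
# of whom 3 turned out to have pulmonary embolism
fr, eff = rule_out_metrics(1000, 400, 3)
print(f"failure rate {fr:.2%}, efficiency {eff:.0%}")  # failure rate 0.75%, efficiency 40%
```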

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios, several alternative tests are available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test therefore have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of intervention research to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than the published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with a D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if it is left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without a D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100 beats/min, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, with further testing withheld if both tests (rule and D-dimer assay) are negative. Further details and more rules can be found in the original review (11).
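As an illustration only, a Wells-style score might be computed as below. The item names and weights are commonly cited values for the Wells PE rule, not taken from this review, and the exact inequality used at each cutoff is not reproduced in this text, so both are assumptions.

```python
# Commonly cited Wells items and weights (illustrative only; consult the
# original publication before any clinical use)
WELLS_ITEMS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_score(findings):
    """Sum the weights of the items present in a patient."""
    return sum(WELLS_ITEMS[item] for item in findings)

patient = ["heart_rate_over_100", "previous_dvt_or_pe"]
score = wells_score(patient)  # 3.0
# Dichotomize at the two cutoffs used in the review (inequality assumed)
print("positive at cutoff 2:", score > 2)  # True
print("positive at cutoff 4:", score > 4)  # False
```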


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test from a rule, in combination with a negative D-dimer test result, is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared. Consultant or Advisory Role: None declared. Stock Ownership: None declared. Honoraria: None declared. Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW), K.G.M. Moons; the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615). Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM, on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromized patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.

23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.

24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.

25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.

26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1545

Fig. 2. Paired forest plot of sensitivity (A) and specificity (B) and the corresponding 95% CIs from studies examining the diagnostic accuracy of the Wells rule with a cutoff value of 2, the Wells rule with a cutoff value of 4, and Gestalt for the diagnosis of pulmonary embolism. Studies within a rule are sorted by prevalence. Adapted from Lucassen et al. (11).

Clinical Chemistry 58:11 (2012)

specificity and are therefore located more toward the top-left corner of the ROC space. If there is a tradeoff (e.g., negative correlation) between sensitivity and specificity, a shoulder-like pattern in the ROC space will emerge. This pattern will be comparable to the pattern that arises in a single study of a test that produces a continuous result in which the threshold has been varied. Lowering the threshold will then increase the likelihood of a positive test result in patients with the target disease, thereby increasing sensitivity, while at the same



time it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, will generate the shoulder-like pattern in the ROC space.

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules and across different rules (Fig. 3).
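The threshold trade-off described above can be sketched with simulated data. The sketch below is purely illustrative (hypothetical Gaussian score distributions, not data from the review): it computes sensitivity and specificity of a continuous test at progressively lower positivity thresholds.

```python
import random

random.seed(1)

# Hypothetical continuous test scores: diseased patients score higher on average.
diseased = [random.gauss(2.0, 1.0) for _ in range(10_000)]
nondiseased = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def sens_spec(threshold):
    """Sensitivity and specificity when 'score >= threshold' counts as positive."""
    sens = sum(s >= threshold for s in diseased) / len(diseased)
    spec = sum(s < threshold for s in nondiseased) / len(nondiseased)
    return sens, spec

# Lowering the threshold raises sensitivity and lowers specificity,
# tracing the shoulder-like pattern described in the text.
for t in (1.5, 1.0, 0.5):
    sens, spec = sens_spec(t)
    print(f"threshold {t}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Plotting each (1 − specificity, sensitivity) pair would trace the ROC pattern that a metaanalysis of studies using different thresholds implicitly samples.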

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses–Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and it is therefore no longer recommended for evaluating differences in summary ROC curves between tests or for examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses–Littenberg approach, 2 more rigorous statistical approaches have since been developed: the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlation, as well as the precision of these estimates within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the 2 models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, or provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
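To make the ingredients of these models concrete, the sketch below pools logit-transformed sensitivities and specificities with DerSimonian–Laird random-effects weights. It is a deliberate simplification with invented 2×2 counts: each measure is pooled separately here, whereas the bivariate model of refs. (5, 22) estimates the pair and their between-study correlation jointly, usually with dedicated metaanalysis software rather than a hand-rolled routine.

```python
import math

# Hypothetical 2x2 counts per study: (TP, FN, FP, TN); illustrative only.
studies = [
    (45, 5, 20, 80),
    (30, 10, 15, 95),
    (60, 15, 40, 140),
    (25, 3, 10, 62),
    (80, 20, 30, 170),
]

def logit_and_var(events, nonevents):
    """Logit proportion with 0.5 continuity correction, plus its within-study variance."""
    a, b = events + 0.5, nonevents + 0.5
    return math.log(a / b), 1 / a + 1 / b

def dersimonian_laird(effects):
    """Random-effects pooling of (estimate, within-study variance) pairs."""
    y = [e for e, _ in effects]
    v = [s for _, s in effects]
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)        # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return pooled, tau2

sens_logits = [logit_and_var(tp, fn) for tp, fn, fp, tn in studies]
spec_logits = [logit_and_var(tn, fp) for tp, fn, fp, tn in studies]

inv_logit = lambda x: 1 / (1 + math.exp(-x))
pooled_sens, _ = dersimonian_laird(sens_logits)
pooled_spec, _ = dersimonian_laird(spec_logits)
print(f"summary sensitivity {inv_logit(pooled_sens):.2f}, "
      f"summary specificity {inv_logit(pooled_spec):.2f}")
```

The within-study variances on the logit scale supply the study weighting, and tau² captures variation beyond chance; the full bivariate model adds one more parameter, the correlation between the 2 random effects.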

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A are different from studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be a preferred approach. These so-called paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females, which would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that, as expected, mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 than for studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity by including prevalence as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in (24).

In this case the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results, and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).
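As a toy illustration of prevalence as a study-level covariate (invented counts; the review's actual analysis used the bivariate metaregression model, which handles sensitivity and specificity jointly), one can regress logit(sensitivity) on prevalence with inverse-variance weights:

```python
import math

# Hypothetical per-study data: (prevalence, true positives, false negatives).
studies = [
    (0.05, 20, 10), (0.08, 25, 10), (0.12, 40, 12),
    (0.18, 55, 12), (0.25, 70, 10), (0.30, 90, 9),
]

def wls_slope(x, y, w):
    """Weighted least-squares slope of y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

prev = [p for p, _, _ in studies]
logit_sens = [math.log((tp + 0.5) / (fn + 0.5)) for _, tp, fn in studies]
weights = [1.0 / (1.0 / (tp + 0.5) + 1.0 / (fn + 0.5)) for _, tp, fn in studies]

slope = wls_slope(prev, logit_sens, weights)
# A positive slope reproduces the direction reported in the review:
# higher prevalence, higher sensitivity.
print(f"change in logit(sensitivity) per unit of prevalence: {slope:.2f}")
```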

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review.

Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision by which sensitivity and specificity have been measured in each study.

Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take the possible correlations into account.

Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased, often exaggerated, results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

Variation by clinical subgroups: Examine stratified results or summaries at a study level.

Unexplained variation: It is likely that remaining variation beyond chance will be present in diagnostic test accuracy (DTA) reviews. The advanced models use random effects to incorporate variation beyond chance.

Interpreting Results and Drawing Conclusions

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability in accuracy results.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity or specificity, or both, are higher for one test than the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
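Such a calculation is easy to sketch. The example below applies the Table 4 summary estimates for the Wells rule with a cutoff value of 2 to a hypothetical cohort of 1000 patients, with an assumed prevalence of 15%:

```python
def cohort_counts(sensitivity, specificity, prevalence, n=1000):
    """Expected numbers of correct and incorrect results in a cohort of n patients."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity          # true positives
    fn = diseased - tp                   # false negatives (missed cases)
    tn = healthy * specificity           # true negatives
    fp = healthy - tn                    # false positives
    return {"TP": round(tp), "FN": round(fn), "FP": round(fp), "TN": round(tn)}

print(cohort_counts(sensitivity=0.84, specificity=0.58, prevalence=0.15))
# → {'TP': 126, 'FN': 24, 'FP': 357, 'TN': 493}
```

Repeating the calculation for a competing test makes the trade-off between extra missed cases and extra false alarms explicit.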

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and will depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism.

Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients do not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates that were lower than 2% (Table 4). This frequency has been considered sufficiently low, and therefore such strategies have been implemented in practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism; the impact of prevalence on sensitivity and specificity; and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule (no. of studies): Sensitivity (95% CI); Specificity (95% CI)
Wells, cutoff value of 2 (n = 19): 84% (78–89); 58% (52–65)
Wells, cutoff value of 4 (n = 11): 60% (49–69); 80% (75–84)
Gestalt (n = 15): 85% (78–90); 51% (39–63)
Wells-2 vs Wells-4: P < 0.001; P < 0.001
Wells-2 vs Gestalt: P = 0.96; P = 0.31
Wells-4 vs Gestalt: P < 0.001; P < 0.001

Impact of prevalence within Wells-2 studies: Sensitivity (95% CI); Specificity (95% CI)
Prevalence 5%: 67% (58–75); 72% (65–79)
Prevalence 15%: 85% (80–89); 58% (52–63)
Prevalence 30%: 91% (88–94); 47% (40–55)
P value for trend: P < 0.001; P < 0.001

Adding D-dimer testing to rule: Failure rate (95% CI); Efficiency (95% CI)
Wells-4 with quantitative D-dimer (n = 4): 0.5% (0.2–0.9); 39% (30–48)
Wells-2 with qualitative D-dimer (n = 5): 0.9% (0.5–1.7); 40% (32–49)
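The 2 strategy measures reported in Table 4 follow directly from a cross-classification of rule and D-dimer results; the sketch below uses invented counts (not the pooled values of the review) to show the definitions:

```python
def strategy_metrics(n_double_negative, n_pe_missed, n_total):
    """Failure rate and efficiency of a 'rule AND D-dimer both negative' rule-out strategy."""
    failure_rate = n_pe_missed / n_double_negative  # PE diagnoses among ruled-out patients
    efficiency = n_double_negative / n_total        # proportion spared further testing
    return failure_rate, efficiency

# Hypothetical: of 1000 suspected patients, 400 are negative on both the
# decision rule and the D-dimer; 3 of those 400 turn out to have PE.
fr, eff = strategy_metrics(n_double_negative=400, n_pe_missed=3, n_total=1000)
print(f"failure rate {fr:.2%}, efficiency {eff:.0%}")
```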

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios there are several alternative tests available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test therefore have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of interventions to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than the published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with D-dimer assay for the diagnosis of pulmonary embolism.

Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100/min, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, with further testing withheld if both tests (rule + D-dimer assay) are negative. Further details and more rules can be found in the original review (11).


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative result from a rule, in combination with a negative D-dimer test result, is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromised patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.

23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.

24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.

25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.

26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.


mates of sensitivity and specificity and a reasonable esti-mate of the expected prevalence

The intended role of a test is also helpful in structur-ing the interpretation of results In triage questions thenumber of missed cases (eg false-negative test results) isthe key concern so sensitivity or the negative predictivevalue are the key accuracy measures The desired mini-mum level for these measures will still be a subjectivechoice and depend on the condition at hand In our ex-ample most experts will agree that the clinical decisionrule should not miss more than 5 of the patients withpulmonary embolism so therefore sensitivity should be atleast 95 From the results of the rules alone it is clear thata large part of the confidence ellipse and even the sum-mary estimate of sensitivity do not meet this criterion (Ta-ble 4) This observation leads to a firm conclusion thatclinical decisions alone are not suited for use in the triageof patients suspected to have pulmonary embolismTherefore D-dimer results have been added to the triageof patients suspected for pulmonary embolism In thisscenario patients will not undergo further testing if boththe clinical decision rule AND the D-dimer are negativeThe proportion of patients who had negative results forboth tests but who had a final diagnosis of pulmonaryembolism (failure rate) has been metaanalyzed Adding aqualitative D-dimer to the clinical decision rule led tofailure rates that were lower than 2 (Table 4) Thisfrequency has been considered sufficiently low andtherefore such strategies have been implemented in

Table 4 Mean (95 CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonaryembolism the impact of prevalence on sensitivity and specificity and failure rate and efficiency of a strategy in

which patients with a low probability of disease and a negative D-dimer receive no further testing

Subgroup (no of studies) Sensitivity (95 CI) Specificity (95 CI)

Type of rule

Wells cutoff value of 2 (n 19) 84 (78ndash89) 58 (52ndash65)

Wells cutoff value of 4 (n 11) 60 (49ndash69) 80 (75ndash84)

Gestalt (n 15) 85 (78ndash90) 51 (39ndash63)

P value Wells 2 vs Wells 4 P 0001 P 0001

P value Wells 2 vs Gestalt P 096 P 031

P value Wells 4 vs Gestalt P 0001 P 0001

Impact prevalence within Wells 2 studies

Prevalence 5 67 (58ndash75) 72 (65ndash79)

Prevalence 15 85 (80ndash89) 58 (52ndash63)

Prevalence 30 91 (88ndash94) 47 (40ndash55)

P value for trend P 0001 P 0001

Adding D-dimer testing to rule Failure rate (95 CI) Efficiency (95 CI)

Wells 4 with quantitative D-dimer (n 4) 05 (02ndash09) 39 (30ndash48)

Wells 2 with qualitative D-dimer (n 5) 09 (05ndash17) 40 (32ndash49)

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1543

practice The efficiencies of such strategies arearound 40 meaning that in 40 of the patients nofurther testing is required

Similar to any other review there is the threat ofpublication bias in DTA reviews (18 ) Publication biasoccurs when studies containing less favorable resultsare less likely to be published Summary results basedon published findings will then generate an overopti-mistic picture of the accuracy of a test Unfortunatelylittle information exists regarding the presence andmagnitude of publication bias in diagnostic accuracystudies Unlike randomized trials there are no registriesfor protocols of diagnostic accuracy studies

Recent Developments

In this section we highlight some recent developmentsthat are relevant for diagnostic accuracy reviews of bio-chemical tests and markers

NETWORK METAANALYSIS

In many diagnostic scenarios there are several alterna-tive tests available which leads to the key questionwhich test is the best Direct comparisons of tests(head-to-head comparison in the same patients by useof a cross-over design or a parallel randomized design)offer the most valid study design but are not alwaysavailable in the literature Systematic reviews focusingon more than one diagnostic test have to incorporateindirect comparisons (accuracy of different tests as-sessed in different populations) Network metaanaly-ses have been developed in the field of intervention tocombine both direct and indirect comparisons within asingle statistical model to allow for ranking of the avail-able treatments (25 ) In addition these models provideestimates of heterogeneity and inconsistency of effectsSuch network metaanalyses would be a welcome addi-tion for ranking and selecting the best test among sev-eral alternatives

IPD METAANALYSIS

IPD metaanalyses use individual patient data ratherthan published summary results of a study In an IPDmetaanalysis there is more flexibility and more statisti-cal power to examine how patientsrsquo characteristics af-fect diagnostic test accuracy (subgroup analyses or ef-fect modification) IPD metaanalysis also offers moreflexibility in handling differences in thresholds for pos-itivity for continuous index test results and for deter-mining the optimal cutoff value (26 )

Concluding Remarks

Many improvements have been made in the method-ology of performing systematic reviews of the accuracy

of diagnostic tests and multivariable diagnostic mod-els Methods have been improved for locating diagnos-tic accuracy studies for assessing the risk for bias andsources of variation and for developing advanced andflexible models to metaanalyze 2 possible correlatedoutcomes However the biggest obstacle for generatinghigh-quality clinically useful diagnostic reviews is thepoor methodological quality of the existing body ofdiagnostic accuracy studies reported in the literatureFortunately interest in the methods for the evaluationof diagnostic tests has grown considerably in the lastdecade Higher-quality and more informative primarystudies will in return generate more informative diag-nostic reviews

Appendix 1

Accuracy of diagnostic decision rules without and withD-dimer assay for the diagnosis of pulmonary embo-lism Pulmonary embolism (PE) is an important condi-tion for physicians to consider because case fatality ishigh if left untreated However diagnosing PE in sus-pected patients is challenging because signs and symp-toms are often nonspecific Physicians constantly facethe dilemma of not wanting to miss a PE while at thesame time wanting to avoid performing too many un-necessary additional diagnostic procedures that can beexpensive burdensome and possibly harmful Diag-nostic strategies in suspected patients therefore focuson identifying patients in whom PE can be safely ruledout on the basis of findings from the patient history andphysical examination Many different diagnostic deci-sion rules for excluding PE on the basis of symptomsand signs with or without D-dimer assay have beendeveloped and validated but there remains uncertaintyas to whether these different rules differ in their accu-racy in a meaningful way In this example we focus on 3rules

bull Wells rule using a cutoff value of 2 for defining apositive (abnormal) test result

bull Wells rule using a cutoff value of 4bull Gestalt rule

In the Wells rules points are scored when certainsigns and symptoms (eg heart rate 100 previousdeep venous thrombosis) are present resulting in a to-tal score In the Gestalt rule physicians provide anoverall empirical assessment of the likelihood of pul-monary embolism being present after examination of apatient To safely exclude pulmonary embolism aD-dimer test can be added to the clinical rule to refrainfrom further testing if both tests (rule D-dimer assay)are negative Further details and more rules can befound in the original review (11 )

Review

1544 Clinical Chemistry 5811 (2012)

REVIEW AIMS

To determine and compare the diagnostic accuracy of 3different clinical decision rules Wells-2 (n 19 stud-ies) Wells-4 (n 11 studies) and Gestalt rule (n 15studies)

To examine whether a negative test from a rule incombination with a negative D-dimer test result is a safeand efficient strategy for excluding PE without referralfor further burdening and invasive imaging

Author Contributions All authors confirmed they have contributed tothe intellectual content of this paper and have met the following 3 re-quirements (a) significant contributions to the conception and design

acquisition of data or analysis and interpretation of data (b) draftingor revising the article for intellectual content and (c) final approval ofthe published article

Authorsrsquo Disclosures or Potential Conflicts of Interest Upon man-uscript submission all authors completed the author disclosure formDisclosures andor potential conflicts of interest

Employment or Leadership None declaredConsultant or Advisory Role None declaredStock Ownership None declaredHonoraria None declaredResearch Funding The Netherlands Organisation for Health Re-search and Development (ZonMW) KGM Moons the Nether-lands Organisation for Scientific Research (projects 91208004 and91810615)Expert Testimony None declared

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.
2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.
3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.
4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].
5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.
6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.
7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM, on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.
8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.
9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).
10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromized patients. Cochrane Database Syst Rev 2008:CD007394.
11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.
12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.
13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.
14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.
15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.
16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).
17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.
18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.
19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.
20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.
21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.
22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.
23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.
24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.
25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.
26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.

Systematic Reviews of Diagnostic Accuracy Studies (Review)
Clinical Chemistry 58:11 (2012) 1545

it increases the risk of a false-positive result in patients without the target disease, thereby lowering specificity. This trade-off, or negative correlation, will generate the shoulder-like pattern in ROC space.

The ROC plot of our 3 clinical decision rules for pulmonary embolism clearly indicates the presence of negative correlations, both within rules and across different rules (Fig. 3).

Metaanalysis of Diagnostic Accuracy Data

Metaanalyses of studies reporting sensitivities and specificities have often used the Moses–Littenberg linear regression approach (21) to obtain a summary ROC curve. It has become clear that this approach has statistical shortcomings (5, 22), and it is therefore no longer recommended for evaluating differences between summary ROC curves between tests or examining the impact of covariates on accuracy.

To overcome the shortcomings of the Moses–Littenberg approach, 2 more rigorous statistical approaches have since been developed: the hierarchical summary ROC approach and the bivariate random-effects model (5, 22). Both are hierarchical random-effects models that take into account the between-study variation in sensitivities and specificities (i.e., random effects) and their possible correlation, as well as the precision of these estimates within a study (i.e., weighting of studies). Although the starting points of these 2 models are different, the 2 models are mathematically equivalent (23). Both models can produce summary estimates of sensitivity and specificity, produce a statistically sound summary ROC line, or provide 95% confidence ellipses around the mean values of sensitivity and specificity (Fig. 4).
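To make the bivariate approach concrete, below is a minimal Python sketch of the approximate normal-normal version of the bivariate random-effects model: logit-transformed sensitivities and specificities with their within-study variances treated as known, and the 5 model parameters (2 means, 2 between-study SDs, 1 correlation) estimated by maximum likelihood. The 2x2 counts are invented for illustration; a full analysis would use the exact binomial within-study likelihoods.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import multivariate_normal

# Hypothetical per-study 2x2 counts (TP, FN, TN, FP) -- illustrative only.
studies = np.array([
    [45, 8, 60, 40],
    [30, 3, 25, 20],
    [80, 12, 150, 90],
    [22, 5, 40, 35],
    [60, 6, 70, 55],
])

# Logit sensitivity/specificity with a 0.5 continuity correction, plus
# approximate within-study variances of the logits.
tp, fn, tn, fp = (studies[:, i] + 0.5 for i in range(4))
y = np.column_stack([np.log(tp / fn), np.log(tn / fp)])
v = np.column_stack([1 / tp + 1 / fn, 1 / tn + 1 / fp])

def neg_log_lik(theta):
    """Bivariate normal-normal likelihood: random effects + sampling error."""
    mu = theta[:2]
    t1, t2 = np.exp(theta[2:4])      # between-study SDs, kept positive
    rho = np.tanh(theta[4])          # correlation, kept inside (-1, 1)
    ll = 0.0
    for yi, vi in zip(y, v):
        cov = np.array([[t1**2 + vi[0], rho * t1 * t2],
                        [rho * t1 * t2, t2**2 + vi[1]]])
        ll += multivariate_normal.logpdf(yi, mean=mu, cov=cov)
    return -ll

res = minimize(neg_log_lik, x0=[1.0, 1.0, 0.0, 0.0, 0.0], method="Nelder-Mead")
sens, spec = expit(res.x[:2])        # back-transform the summary logits
print(f"summary sensitivity = {sens:.2f}, summary specificity = {spec:.2f}")
```

The fitted means can then be plotted in ROC space with their 95% confidence ellipse, as in Fig. 4; packages such as R's mada implement the same model with richer output.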

EXAMINATION OF SOURCES OF VARIATION AND DIFFERENCES IN ACCURACY BETWEEN TESTS

Results from individual studies often vary within a review. There are several possible causes for this variation, which can be categorized according to the groups shown in Table 3.

Both advanced models are regression models that allow flexibility in examining sources of heterogeneity by including study-level covariates. This feature provides the option of formally comparing the results of studies with a specific feature (e.g., partial verification) with the results of studies that have avoided partial verification. In the same way, we can examine whether the accuracy results from studies examining test A differ from those of studies examining test B. Limiting the comparison of tests to studies with a cross-over design, in which both index tests have been applied in the same patients, may be a preferred approach. These so-called

Fig. 3. Pairs of sensitivity and specificity values from studies examining 3 different rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).

Fig. 4. ROC plot showing the summary estimate of sensitivity and specificity and the corresponding 95% confidence ellipse for 3 different clinical decision rules for the diagnosis of pulmonary embolism (A, Wells rule with a cutoff value of 2; B, Wells rule with a cutoff value of 4; C, Gestalt). Adapted from Lucassen et al. (11).


paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies), which may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females, which would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power, or an increased risk of finding false-positive associations when many covariates are examined, are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.
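As a deliberately simplified stand-in for a study-level covariate in the bivariate metaregression, the sketch below pools logit sensitivity within two hypothetical subgroups of studies (e.g., with vs without a design deficiency) by fixed-effect inverse-variance weighting and tests the subgroup difference with a z-test. The per-study sensitivities, sample sizes, and the fixed-effect simplification are all illustrative assumptions, not the review's method or data; the bivariate random-effects model handles both logits and between-study variation jointly.

```python
import numpy as np
from scipy.special import logit
from scipy.stats import norm

def pooled_logit(sens, n_diseased):
    """Fixed-effect inverse-variance pooling of logit sensitivity."""
    tp = sens * n_diseased
    fn = n_diseased - tp
    var = 1 / tp + 1 / fn            # approximate variance of logit(sens)
    w = 1 / var
    return np.sum(w * logit(sens)) / np.sum(w), 1 / np.sum(w)

# Hypothetical per-study sensitivities and numbers of diseased patients.
m1, v1 = pooled_logit(np.array([0.92, 0.88, 0.95]), np.array([60.0, 80, 50]))
m2, v2 = pooled_logit(np.array([0.78, 0.82, 0.75]), np.array([70.0, 55, 90]))

# z-test for the study-level covariate (subgroup) effect on logit sensitivity.
z = (m1 - m2) / np.sqrt(v1 + v2)
p = 2 * norm.sf(abs(z))
print(f"difference in logit sensitivity = {m1 - m2:.2f}, P = {p:.4f}")
```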

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that (as expected) mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 than for studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity, by including prevalence as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity. An overview of these potential reasons is given in (24).

In this case, the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).

INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability in the accuracy results.

Table 3. Causes of variation in sensitivity and specificity results between primary studies within a review.

Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision by which sensitivity and specificity have been measured in each study.

Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take these possible correlations into account.

Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased, often exaggerated, results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

Variation by clinical subgroups: Examine stratified results or summaries at a study level.

Unexplained variation: It is likely that variation beyond chance will remain in diagnostic test accuracy (DTA) reviews. The advanced models use random effects to incorporate variation beyond chance.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity or specificity, or both, are higher for one test than for the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on these studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left of and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
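As an illustration of the hypothetical-cohort approach, the sketch below applies the summary estimates for the Wells rule with a cutoff value of 2 (sensitivity 84%, specificity 58%; Table 4) to 1000 patients; the 15% prevalence is an assumed, illustrative value.

```python
# Hypothetical cohort of 1000 patients; summary sensitivity/specificity from
# Table 4 (Wells, cutoff 2) and an assumed prevalence of 15%.
n, prev, sens, spec = 1000, 0.15, 0.84, 0.58

diseased = n * prev                  # expected patients with PE
tp = sens * diseased                 # PE correctly identified
fn = diseased - tp                   # PE missed (the triage concern)
tn = spec * (n - diseased)           # correctly ruled out
fp = (n - diseased) - tn             # flagged for unnecessary further testing

print(f"TP={tp:.0f}, FN={fn:.0f}, TN={tn:.0f}, FP={fp:.0f}")
# prints: TP=126, FN=24, TN=493, FP=357
```

Repeating the calculation for a competing test at the same prevalence makes the trade-off between extra false positives and extra false negatives explicit.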

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and will depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected of pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates lower than 2% (Table 4). This frequency has been considered sufficiently low, and such strategies have therefore been implemented in practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Table 4. Mean values (95% CI) of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism; the impact of prevalence on sensitivity and specificity; and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule (no. of studies)                Sensitivity, % (95% CI)   Specificity, % (95% CI)
  Wells, cutoff value of 2 (n = 19)          84 (78–89)                58 (52–65)
  Wells, cutoff value of 4 (n = 11)          60 (49–69)                80 (75–84)
  Gestalt (n = 15)                           85 (78–90)                51 (39–63)
  P value, Wells 2 vs Wells 4                P < 0.001                 P < 0.001
  P value, Wells 2 vs Gestalt                P = 0.96                  P = 0.31
  P value, Wells 4 vs Gestalt                P < 0.001                 P < 0.001

Impact of prevalence within Wells 2 studies
  Prevalence 5%                              67 (58–75)                72 (65–79)
  Prevalence 15%                             85 (80–89)                58 (52–63)
  Prevalence 30%                             91 (88–94)                47 (40–55)
  P value for trend                          P < 0.001                 P < 0.001

Adding D-dimer testing to rule               Failure rate, % (95% CI)  Efficiency, % (95% CI)
  Wells 4 with quantitative D-dimer (n = 4)  0.5 (0.2–0.9)             39 (30–48)
  Wells 2 with qualitative D-dimer (n = 5)   0.9 (0.5–1.7)             40 (32–49)
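The failure-rate and efficiency definitions can be checked with a small worked example; the cohort size and counts below are invented round numbers in the spirit of Table 4, not data from the review.

```python
# Hypothetical cohort: rule and D-dimer both negative in 400 of 1000 patients,
# 4 of whom nevertheless had PE (counts are illustrative).
n = 1000
both_negative = 400                  # no further testing performed
missed_pe = 4                        # PE despite a double-negative result

efficiency = both_negative / n               # fraction spared further testing
failure_rate = missed_pe / both_negative     # PE among those "ruled out"
print(f"efficiency = {efficiency:.0%}, failure rate = {failure_rate:.1%}")
# prints: efficiency = 40%, failure rate = 1.0%
```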

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for the protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios, there are several alternative tests available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of intervention research to combine both direct and indirect comparisons within a single statistical model and to allow ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than the published summary results of a study. In an IPD metaanalysis, there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and in determining the optimal cutoff value (26).

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle to generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if it is left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures, which can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without a D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example, we focus on 3 rules:

bull Wells rule using a cutoff value of 2 for defining apositive (abnormal) test result

bull Wells rule using a cutoff value of 4bull Gestalt rule

In the Wells rules points are scored when certainsigns and symptoms (eg heart rate 100 previousdeep venous thrombosis) are present resulting in a to-tal score In the Gestalt rule physicians provide anoverall empirical assessment of the likelihood of pul-monary embolism being present after examination of apatient To safely exclude pulmonary embolism aD-dimer test can be added to the clinical rule to refrainfrom further testing if both tests (rule D-dimer assay)are negative Further details and more rules can befound in the original review (11 )

Review

1544 Clinical Chemistry 5811 (2012)

REVIEW AIMS

To determine and compare the diagnostic accuracy of 3different clinical decision rules Wells-2 (n 19 stud-ies) Wells-4 (n 11 studies) and Gestalt rule (n 15studies)

To examine whether a negative test from a rule incombination with a negative D-dimer test result is a safeand efficient strategy for excluding PE without referralfor further burdening and invasive imaging

Author Contributions All authors confirmed they have contributed tothe intellectual content of this paper and have met the following 3 re-quirements (a) significant contributions to the conception and design

acquisition of data or analysis and interpretation of data (b) draftingor revising the article for intellectual content and (c) final approval ofthe published article

Authorsrsquo Disclosures or Potential Conflicts of Interest Upon man-uscript submission all authors completed the author disclosure formDisclosures andor potential conflicts of interest

Employment or Leadership None declaredConsultant or Advisory Role None declaredStock Ownership None declaredHonoraria None declaredResearch Funding The Netherlands Organisation for Health Re-search and Development (ZonMW) KGM Moons the Nether-lands Organisation for Scientific Research (projects 91208004 and91810615)Expert Testimony None declared

References

1 Lijmer JG Leeflang M Bossuyt PM Proposals fora phased evaluation of medical tests Med DecisMaking 200929E13ndash21

2 Linnet K Bossuyt PM Moons KG Reitsma JBQuantifying the accuracy of a diagnostic test ormarker Clin Chem 2012581292ndash301

3 Moons KG de Groot JA Linnet K Reitsma JBBossuyt PM Quantifying the added value of adiagnostic test or marker Clin Chem 2012581408ndash17

4 Bossuyt PM Reitsma JB Linnet K Moons KGBeyond diagnostic accuracy the clinical utility ofdiagnostic tests Clin Chem [Epub ahead of print2012 Jun 22]

5 Reitsma JB Glas AS Rutjes AW Scholten RJBossuyt PM Zwinderman AH Bivariate analysisof sensitivity and specificity produces informativesummary measures in diagnostic reviews J ClinEpidemiol 200558982ndash90

6 Whiting P Rutjes AW Reitsma JB Glas ASBossuyt PM Kleijnen J Sources of variation andbias in studies of diagnostic accuracy a system-atic review Ann Intern Med 2004140189ndash202

7 Leeflang MM Deeks JJ Gatsonis C Bossuyt PMon behalf of the Cochrane Diagnostic Test Accu-racy Working Group Systematic reviews of diag-nostic test accuracy Ann Intern Med 2008149889ndash97

8 Whiting PF Rutjes AW Westwood ME Mallett SDeeks JJ Reitsma JB et al QUADAS-2 a revisedtool for the quality assessment of diagnostic ac-curacy studies Ann Intern Med 2011155529ndash36

9 Diagnostic test accuracy working group httpsrdtacochraneorg (Accessed August 2012)

10 Leeflang MM Debets-Ossenkopp YJ Visser CEScholten RJ Hooft L Bijlmer HA et al Galacto-mannan detection for invasive aspergillosis in

immunocompromized patients Cochrane Data-base Syst Rev 2008CD007394

11 Lucassen W Geersing GJ Erkens PM Reitsma JBMoons KG Buller H van Weert HC Clinical de-cision rules for excluding pulmonary embolism ameta-analysis Ann Intern Med 2011155448ndash60

12 Bossuyt PM Irwig L Craig J Glasziou P Com-parative accuracy assessing new tests againstexisting diagnostic pathways BMJ 20063321089ndash92

13 Doust JA Pietrzak E Sanders S Glasziou PPIdentifying studies for systematic reviews of di-agnostic tests was difficult due to the poor sen-sitivity and precision of methodologic filters andthe lack of information in the abstract J ClinEpidemiol 200558444ndash9

14 Leeflang MM Scholten RJ Rutjes AW ReitsmaJB Bossuyt PM Use of methodological searchfilters to identify diagnostic accuracy studies canlead to the omission of relevant studies J ClinEpidemiol 200659234ndash40

15 Savoie I Helmer D Green CJ Kazanjian A Be-yond Medline reducing bias through extendedsystematic review search Int J Technol AssessHealth Care 200319168ndash78

16 Fraser C Mowatt G Siddiqui R Burr J Searchingfor diagnostic test accuracy studies an applica-tion to screening for open angle glaucoma (OAG)[Abstract] Cochrane Colloquium AbstractsJournal 2006 httpwwwimbiuni-freiburgdeOJSccaindexphpjournalccaamppagearticleampopviewamppath[]1980 (Accessed October2012)

17 Whiting P Westwood M Burke M Sterne JGlanville J Systematic reviews of test accuracyshould search a range of databases to identifyprimary studies J Clin Epidemiol 200861

357ndash 6418 Song F Eastwood AJ Gilbody S Duley L Sutton

AJ Publication and related biases Health TechnolAssess 200041ndash115

19 Whiting P Rutjes AW Reitsma JB Bossuyt PMKleijnen J The development of QUADAS a toolfor the quality assessment of studies of diagnos-tic accuracy included in systematic reviews BMCMed Res Methodol 2003325

20 Linnet K Bossuyt PMM Moons KGM Reitsma JBQuantifying the accuracy of a diagnostic test ormarker Clin Chem 2012581292ndash301

21 Moses LE Shapiro D Littenberg B Combiningindependent studies of a diagnostic test into asummary ROC curve data-analytic approachesand some additional considerations Stat Med1993121293ndash316

22 Rutter CM Gatsonis CA A hierarchical regressionapproach to meta-analysis of diagnostic test ac-curacy evaluations Stat Med 2001202865ndash84

23 Harbord RM Deeks JJ Egger M Whiting PSterne JA A unification of models for meta-analysis of diagnostic accuracy studies Biostatis-tics 20078239ndash51

24 Leeflang MM Bossuyt PM Irwig L Diagnostictest accuracy may vary with prevalence implica-tions for evidence-based diagnosis J Clin Epide-miol 2009625ndash12

25 Li T Puhan MA Vedula SS Singh S Dickersin Kthe Ad Hoc Network Meta-analysis MethodsMeeting Working Group Network meta-analysis-highly attractive but more methodological re-search is needed BMC Med 2011 27979

26 Khan KS Bachmann LM ter Riet G Systematicreviews with individual patient data meta-analysis to evaluate diagnostic tests Eur J ObstetGynecol Reprod Biol 2003108121ndash5

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1545

paired comparisons provide more valid evidence than results generated from unpaired studies (separate studies) that may reflect other underlying differences in design and conduct between studies (i.e., confounding factors).

Diagnostic accuracy can vary between clinical subgroups. Examining such differences in a systematic review is problematic if primary studies do not report stratified results for these subgroups. In the absence of stratified results, researchers have to use study-level summaries of the covariate representing the clinical subgroup. Such summaries have limited power for detecting differences in accuracy between clinical subgroups. As an example, if reviewers are interested in whether the accuracy of a test varies between men and women, they could use the percentage of males in each study as a study-level covariate in their model. A study-level covariate reduces the power to find differences between males and females; this would clearly be the case if all included studies had similar percentages of males. Even if a clear difference in accuracy existed between males and females in each study, it would remain undetected in a regression model based on the percentage of males. Individual patient data (IPD) metaanalysis provides more power and flexibility to examine variation in accuracy between clinical subgroups.

Just as for any regression model used for examining covariates, clear boundaries exist that define what can be done, or what is sensible, given the sample size of the study. Insufficient statistical power and an increased risk of finding false-positive associations when many covariates are examined are concerns when diagnostic reviews are conducted. The number of different studies within a review is the key limitation for examining covariates.

The results from the bivariate model comparing the 3 different rules are summarized in Fig. 4 and Table 4.

These results show that (as expected) mean sensitivity is significantly higher for the Wells studies using a cutoff value of 2 compared with studies using a cutoff value of 4, but at the same time specificity is significantly lower. Such differences are expected when lowering the threshold for positivity. The results from the Gestalt studies are comparable with those of the Wells studies using a cutoff value of 2, although there appears to be more heterogeneity in the reported specificities of the Gestalt studies (Fig. 2).

In the example review on pulmonary embolism, the authors examined whether the prevalence in a study had an impact on the levels of sensitivity and specificity by including it as a covariate in the bivariate metaregression model. There are several reasons why prevalence might be associated with sensitivity and specificity; an overview of these potential reasons is given in (24).

In this case the authors hypothesized that differences in prevalence could be seen as a proxy for differences in case mix between studies. In studies with lower prevalence, more patients may be in an early stage of the disease, which would hamper detection and lead to more false-negative results and hence lower sensitivity. In this review, increased prevalence was associated with higher sensitivity and lower specificity (Table 4).
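To make the idea of a study-level covariate concrete, the sketch below fits one strand of such a metaregression (logit sensitivity regressed on study-level prevalence with inverse-variance weights) on made-up study-level data. It is a simplified fixed-effect illustration only; a full analysis would model sensitivity and specificity jointly with random effects, as in the bivariate model.

```python
import numpy as np

# Hypothetical study-level data (true positives, false negatives, prevalence).
# Not taken from the review; chosen so that sensitivity rises with prevalence.
studies = [
    (45, 15, 0.08),
    (88, 12, 0.18),
    (120, 9, 0.30),
    (60, 16, 0.12),
    (150, 10, 0.33),
]

tp = np.array([s[0] for s in studies], dtype=float)
fn = np.array([s[1] for s in studies], dtype=float)
prev = np.array([s[2] for s in studies])

logit_sens = np.log(tp / fn)       # logit of per-study sensitivity
var = 1.0 / tp + 1.0 / fn          # approximate within-study variance
w = 1.0 / var                      # inverse-variance weights

# Weighted least squares: logit(sens) = a + b * prevalence
X = np.column_stack([np.ones_like(prev), prev])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * logit_sens))
a, b = beta
print(f"slope for prevalence: {b:.2f}")  # b > 0 mirrors the review's finding
```

A positive slope here plays the role of the association between prevalence and sensitivity reported in Table 4.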

V INTERPRETING RESULTS AND DRAWING CONCLUSIONS

This is the part of the review process in which all the results of the different steps within a systematic review have to be combined to answer the review question(s) at hand. Key ingredients include the methodological quality of the evidence, whether the included studies examined the same intended role of the test as expressed in the review question, and the precision and variability in accuracy results.

Table 3. Causes for variation in sensitivity and specificity results between primary studies within a review

Chance variation: The majority of diagnostic accuracy studies are moderate to small in sample size. Considerable variation by chance can then be expected, especially for sensitivity when the prevalence is low. The advanced models properly take into account the precision with which sensitivity and specificity have been measured in each study.

Differences in threshold: Explicit or implicit differences in thresholds for positivity between studies will lead to differences in sensitivity and specificity in opposite directions, creating negative correlations. The advanced models take the possible correlations into account.

Bias: Deficiencies in the design and conduct of diagnostic studies can lead to biased results, often producing more exaggerated results. Advanced models can examine the impact of deficiencies in design by including study-level covariates.

Variation by clinical subgroups: Examine stratified results or summaries at a study level.

Unexplained variation: It is likely that remaining variation beyond chance will be present in DTA reviews. The advanced models use random effects to incorporate variation beyond chance.

Reviews with a comparative question (e.g., is test A better than test B at a specific point in the diagnostic pathway?) can directly examine whether sensitivity or specificity, or both, are higher for one test than the other. A distinction should be made between primary studies directly comparing the 2 index tests in the same patients (direct evidence) and studies examining only one of these index tests (indirect evidence). Direct evidence is preferred because important factors that may have an impact on accuracy (i.e., potential confounding factors such as the population and choice of reference standard) will be constant when the index tests are compared. If sufficient studies with direct evidence are available, the main analysis or any sensitivity analyses should focus on the studies providing direct evidence.

If both sensitivity and specificity are higher, or the entire summary ROC curve for one test is to the left and above that of the other test, the conclusion is straightforward. If sensitivity is higher for one test and specificity for the other, or if the summary ROC curves of the 2 tests cross, it is important to examine and weigh the potential negative consequences associated with false-positive or false-negative test results. One way to provide this insight is to subject a hypothetical cohort of 1000 patients to both tests and calculate the number of patients with different correct and incorrect test results, based on summary estimates of sensitivity and specificity and a reasonable estimate of the expected prevalence.
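The hypothetical-cohort calculation described above is simple arithmetic; the sketch below tabulates correct and incorrect results for 1000 patients. The summary estimates and prevalence used here are illustrative assumptions, not results quoted from any particular study.

```python
# Expected test results in a hypothetical cohort of 1000 patients,
# given summary sensitivity/specificity and an assumed prevalence.

def cohort_counts(sens, spec, prev, n=1000):
    diseased = n * prev
    healthy = n - diseased
    return {
        "TP": round(diseased * sens),         # correctly detected
        "FN": round(diseased * (1 - sens)),   # missed cases
        "TN": round(healthy * spec),          # correctly ruled out
        "FP": round(healthy * (1 - spec)),    # false alarms
    }

# Compare two hypothetical tests at an assumed 15% prevalence.
a = cohort_counts(sens=0.84, spec=0.58, prev=0.15)
b = cohort_counts(sens=0.60, spec=0.80, prev=0.15)
print(a)  # {'TP': 126, 'FN': 24, 'TN': 493, 'FP': 357}
print(b)  # {'TP': 90, 'FN': 60, 'TN': 680, 'FP': 170}
```

Laying the two rows side by side makes the trade-off explicit: the more sensitive test misses 36 fewer cases but generates 187 more false alarms in this cohort.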

The intended role of a test is also helpful in structuring the interpretation of results. In triage questions, the number of missed cases (i.e., false-negative test results) is the key concern, so sensitivity or the negative predictive value are the key accuracy measures. The desired minimum level for these measures will still be a subjective choice and depend on the condition at hand. In our example, most experts will agree that the clinical decision rule should not miss more than 5% of the patients with pulmonary embolism, so sensitivity should be at least 95%. From the results of the rules alone, it is clear that a large part of the confidence ellipse, and even the summary estimate of sensitivity, do not meet this criterion (Table 4). This observation leads to a firm conclusion that clinical decision rules alone are not suited for use in the triage of patients suspected to have pulmonary embolism. Therefore, D-dimer results have been added to the triage of patients suspected for pulmonary embolism. In this scenario, patients will not undergo further testing if both the clinical decision rule AND the D-dimer are negative. The proportion of patients who had negative results for both tests but who had a final diagnosis of pulmonary embolism (failure rate) has been metaanalyzed. Adding a qualitative D-dimer to the clinical decision rule led to failure rates that were lower than 2% (Table 4). This frequency has been considered sufficiently low, and therefore such strategies have been implemented in practice. The efficiencies of such strategies are around 40%, meaning that in 40% of the patients no further testing is required.

Table 4. Mean (95% CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonary embolism, the impact of prevalence on sensitivity and specificity, and the failure rate and efficiency of a strategy in which patients with a low probability of disease and a negative D-dimer receive no further testing.

Type of rule (no. of studies)                 Sensitivity (95% CI)    Specificity (95% CI)
  Wells, cutoff value of 2 (n = 19)           84% (78-89)             58% (52-65)
  Wells, cutoff value of 4 (n = 11)           60% (49-69)             80% (75-84)
  Gestalt (n = 15)                            85% (78-90)             51% (39-63)
  P value, Wells 2 vs Wells 4                 P < 0.001               P < 0.001
  P value, Wells 2 vs Gestalt                 P = 0.96                P = 0.31
  P value, Wells 4 vs Gestalt                 P < 0.001               P < 0.001

Impact of prevalence within Wells 2 studies
  Prevalence 5%                               67% (58-75)             72% (65-79)
  Prevalence 15%                              85% (80-89)             58% (52-63)
  Prevalence 30%                              91% (88-94)             47% (40-55)
  P value for trend                           P < 0.001               P < 0.001

Adding D-dimer testing to rule (no. of studies)    Failure rate (95% CI)    Efficiency (95% CI)
  Wells 4 with quantitative D-dimer (n = 4)        0.5% (0.2-0.9)           39% (30-48)
  Wells 2 with qualitative D-dimer (n = 5)         0.9% (0.5-1.7)           40% (32-49)

Similar to any other review, there is the threat of publication bias in DTA reviews (18). Publication bias occurs when studies containing less favorable results are less likely to be published. Summary results based on published findings will then generate an overoptimistic picture of the accuracy of a test. Unfortunately, little information exists regarding the presence and magnitude of publication bias in diagnostic accuracy studies. Unlike randomized trials, there are no registries for protocols of diagnostic accuracy studies.

Recent Developments

In this section we highlight some recent developments that are relevant for diagnostic accuracy reviews of biochemical tests and markers.

NETWORK METAANALYSIS

In many diagnostic scenarios there are several alternative tests available, which leads to the key question: which test is the best? Direct comparisons of tests (head-to-head comparison in the same patients by use of a cross-over design, or a parallel randomized design) offer the most valid study design but are not always available in the literature. Systematic reviews focusing on more than one diagnostic test have to incorporate indirect comparisons (accuracy of different tests assessed in different populations). Network metaanalyses have been developed in the field of interventions to combine both direct and indirect comparisons within a single statistical model and to allow for ranking of the available treatments (25). In addition, these models provide estimates of heterogeneity and inconsistency of effects. Such network metaanalyses would be a welcome addition for ranking and selecting the best test among several alternatives.
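The core idea behind combining direct and indirect evidence can be shown with a toy calculation on a log scale (all numbers below are hypothetical): if tests A and B have each been compared with a common comparator C in separate studies, an indirect estimate of A vs B is the difference of the two direct estimates, with the variances adding; a network metaanalysis then pools direct and indirect estimates by inverse variance.

```python
import math

# Toy indirect comparison (hypothetical numbers, on a log accuracy scale).
# d_AC: difference of test A vs comparator C (estimate, variance)
# d_BC: difference of test B vs comparator C
d_AC, var_AC = 0.80, 0.04
d_BC, var_BC = 0.30, 0.05

# Indirect estimate of A vs B; variances of the two comparisons add.
d_AB_ind = d_AC - d_BC
var_AB_ind = var_AC + var_BC

# If a direct head-to-head estimate of A vs B also exists, pool the
# direct and indirect evidence by inverse-variance weighting.
d_AB_dir, var_AB_dir = 0.45, 0.06
w_dir, w_ind = 1 / var_AB_dir, 1 / var_AB_ind
d_AB_pooled = (w_dir * d_AB_dir + w_ind * d_AB_ind) / (w_dir + w_ind)
se_pooled = math.sqrt(1 / (w_dir + w_ind))

print(round(d_AB_ind, 2), round(var_AB_ind, 2))  # 0.5 0.09
print(round(d_AB_pooled, 3), round(se_pooled, 3))
```

Real network metaanalysis models additionally estimate heterogeneity and check whether the direct and indirect estimates are consistent before pooling them; this sketch shows only the arithmetic skeleton.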

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).
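As a toy illustration of the last point: with individual patient data one can scan candidate cutoffs of a continuous marker and pick, for example, the value maximizing Youden's J (sensitivity + specificity - 1). The data below are simulated, and Youden's J is just one of several possible optimality criteria.

```python
import random

# Simulated individual patient data for a continuous marker:
# diseased patients tend to have higher values (toy data only).
random.seed(1)
diseased = [random.gauss(2.0, 1.0) for _ in range(200)]
healthy = [random.gauss(0.0, 1.0) for _ in range(400)]

def youden_j(cutoff):
    sens = sum(x >= cutoff for x in diseased) / len(diseased)
    spec = sum(x < cutoff for x in healthy) / len(healthy)
    return sens + spec - 1

# Scan every observed value as a candidate cutoff.
candidates = sorted(diseased + healthy)
best = max(candidates, key=youden_j)
print(f"optimal cutoff ~ {best:.2f}, J = {youden_j(best):.2f}")
```

With summary data only, this scan is impossible: each published study reports sensitivity and specificity at whatever cutoff its authors chose.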

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle for generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in turn generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result
• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule, to refrain from further testing if both tests (rule and D-dimer assay) are negative. Further details and more rules can be found in the original review (11).
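For illustration, the score-and-dichotomize logic of such a rule can be sketched as follows. The item weights shown are the commonly published Wells items for PE; the strict inequality at the cutoff is an assumption made here for illustration, so consult the original rule before any clinical use.

```python
# Illustrative scoring of a Wells-style rule for pulmonary embolism.
WELLS_ITEMS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_score(findings):
    """Sum the weights of the findings that are present."""
    return sum(WELLS_ITEMS[f] for f in findings)

def classify(score, cutoff):
    """Dichotomize: a score above the cutoff counts as 'PE likely'."""
    return "PE likely" if score > cutoff else "PE unlikely"

s = wells_score(["heart_rate_over_100", "previous_dvt_or_pe"])  # 3.0
print(classify(s, cutoff=2))  # PE likely
print(classify(s, cutoff=4))  # PE unlikely
```

The same total score lands on different sides of the two cutoffs, which is exactly why the Wells-2 and Wells-4 variants trade sensitivity against specificity.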


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and the Gestalt rule (n = 15 studies).

To examine whether a negative test from a rule in combination with a negative D-dimer test result is a safe and efficient strategy for excluding PE, without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.

2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.

4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].

5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.

6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM, on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.

8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.

9. Diagnostic test accuracy working group. http://srdta.cochrane.org/ (Accessed August 2012).

10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromised patients. Cochrane Database Syst Rev 2008:CD007394.

11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.

12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.

14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.

15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.

16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).

17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.

18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.

19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.

21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.

22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.

23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.

24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.

25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.

26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.

Clinical Chemistry 58:11 (2012)

pressed in the review question and the precision andvariability in accuracy results

Reviews with a comparative question (eg is test Abetter than test B at a specific point in the diagnosticpathway) can directly examine whether sensitivity orspecificity or both are higher for one test than the otherA distinction should be made between primary studiesdirectly comparing the 2 index tests in the same patient(direct evidence) and studies examining only one ofthese index tests (indirect evidence) Direct evidence ispreferred because important factors that may have animpact on accuracy (ie potential confounding factorssuch as the population and choice of reference stan-dard) will be constant when the index tests are com-pared If sufficient studies with direct evidence areavailable the main analysis or any sensitivity analysesshould focus on these studies providing directevidence

If both sensitivity and specificity are higher or theentire summary ROC curve for one test is to the left andabove that of the other test the conclusion is straightfor-ward If sensitivity is higher for one test and specificity forthe other or if the summary ROC curves of the 2 testscross it is important to examine and weigh the potentialnegative consequences associated with false-positive orfalse-negative test results One way to provide this insightis to subject a hypothetical cohort of 1000 patients to bothtests and calculate the number of patients with differentcorrect and incorrect test results based on summary esti-

mates of sensitivity and specificity and a reasonable esti-mate of the expected prevalence

The intended role of a test is also helpful in structur-ing the interpretation of results In triage questions thenumber of missed cases (eg false-negative test results) isthe key concern so sensitivity or the negative predictivevalue are the key accuracy measures The desired mini-mum level for these measures will still be a subjectivechoice and depend on the condition at hand In our ex-ample most experts will agree that the clinical decisionrule should not miss more than 5 of the patients withpulmonary embolism so therefore sensitivity should be atleast 95 From the results of the rules alone it is clear thata large part of the confidence ellipse and even the sum-mary estimate of sensitivity do not meet this criterion (Ta-ble 4) This observation leads to a firm conclusion thatclinical decisions alone are not suited for use in the triageof patients suspected to have pulmonary embolismTherefore D-dimer results have been added to the triageof patients suspected for pulmonary embolism In thisscenario patients will not undergo further testing if boththe clinical decision rule AND the D-dimer are negativeThe proportion of patients who had negative results forboth tests but who had a final diagnosis of pulmonaryembolism (failure rate) has been metaanalyzed Adding aqualitative D-dimer to the clinical decision rule led tofailure rates that were lower than 2 (Table 4) Thisfrequency has been considered sufficiently low andtherefore such strategies have been implemented in

Table 4 Mean (95 CI) values of sensitivity and specificity for 3 different clinical decision rules for pulmonaryembolism the impact of prevalence on sensitivity and specificity and failure rate and efficiency of a strategy in

which patients with a low probability of disease and a negative D-dimer receive no further testing

Subgroup (no of studies) Sensitivity (95 CI) Specificity (95 CI)

Type of rule

Wells cutoff value of 2 (n 19) 84 (78ndash89) 58 (52ndash65)

Wells cutoff value of 4 (n 11) 60 (49ndash69) 80 (75ndash84)

Gestalt (n 15) 85 (78ndash90) 51 (39ndash63)

P value Wells 2 vs Wells 4 P 0001 P 0001

P value Wells 2 vs Gestalt P 096 P 031

P value Wells 4 vs Gestalt P 0001 P 0001

Impact prevalence within Wells 2 studies

Prevalence 5 67 (58ndash75) 72 (65ndash79)

Prevalence 15 85 (80ndash89) 58 (52ndash63)

Prevalence 30 91 (88ndash94) 47 (40ndash55)

P value for trend P 0001 P 0001

Adding D-dimer testing to rule Failure rate (95 CI) Efficiency (95 CI)

Wells 4 with quantitative D-dimer (n 4) 05 (02ndash09) 39 (30ndash48)

Wells 2 with qualitative D-dimer (n 5) 09 (05ndash17) 40 (32ndash49)

Systematic Reviews of Diagnostic Accuracy Studies Review

Clinical Chemistry 5811 (2012) 1543

practice The efficiencies of such strategies arearound 40 meaning that in 40 of the patients nofurther testing is required

Similar to any other review there is the threat ofpublication bias in DTA reviews (18 ) Publication biasoccurs when studies containing less favorable resultsare less likely to be published Summary results basedon published findings will then generate an overopti-mistic picture of the accuracy of a test Unfortunatelylittle information exists regarding the presence andmagnitude of publication bias in diagnostic accuracystudies Unlike randomized trials there are no registriesfor protocols of diagnostic accuracy studies

Recent Developments

In this section we highlight some recent developmentsthat are relevant for diagnostic accuracy reviews of bio-chemical tests and markers

NETWORK METAANALYSIS

In many diagnostic scenarios there are several alterna-tive tests available which leads to the key questionwhich test is the best Direct comparisons of tests(head-to-head comparison in the same patients by useof a cross-over design or a parallel randomized design)offer the most valid study design but are not alwaysavailable in the literature Systematic reviews focusingon more than one diagnostic test have to incorporateindirect comparisons (accuracy of different tests as-sessed in different populations) Network metaanaly-ses have been developed in the field of intervention tocombine both direct and indirect comparisons within asingle statistical model to allow for ranking of the avail-able treatments (25 ) In addition these models provideestimates of heterogeneity and inconsistency of effectsSuch network metaanalyses would be a welcome addi-tion for ranking and selecting the best test among sev-eral alternatives

IPD METAANALYSIS

IPD metaanalyses use individual patient data rather than published summary results of a study. In an IPD metaanalysis there is more flexibility and more statistical power to examine how patients' characteristics affect diagnostic test accuracy (subgroup analyses or effect modification). IPD metaanalysis also offers more flexibility in handling differences in thresholds for positivity for continuous index test results and for determining the optimal cutoff value (26).
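For example, with pooled individual patient data the paired sensitivity and specificity can be recomputed at any candidate threshold of a continuous marker, and a common cutoff chosen, for instance by maximizing the Youden index (sensitivity + specificity - 1). A small illustrative sketch with invented values (a real IPD analysis would also account for between-study heterogeneity rather than naively pooling patients):

```python
def youden_scan(values_diseased, values_healthy, thresholds):
    """Return the threshold maximizing the Youden index, given raw
    marker values for diseased and healthy patients (higher = abnormal)."""
    best = None
    for t in thresholds:
        sens = sum(v >= t for v in values_diseased) / len(values_diseased)
        spec = sum(v < t for v in values_healthy) / len(values_healthy)
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j, sens, spec)
    return best  # (cutoff, Youden index, sensitivity, specificity)

# Invented pooled IPD: e.g., a quantitative marker with disease present/absent.
diseased = [520, 780, 610, 940, 450, 700]
healthy = [180, 220, 310, 260, 150, 400, 90, 510]
cutoff, j, sens, spec = youden_scan(diseased, healthy, [300, 400, 500, 600])
```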

Concluding Remarks

Many improvements have been made in the methodology of performing systematic reviews of the accuracy of diagnostic tests and multivariable diagnostic models. Methods have been improved for locating diagnostic accuracy studies, for assessing the risk of bias and sources of variation, and for developing advanced and flexible models to metaanalyze 2 possibly correlated outcomes. However, the biggest obstacle for generating high-quality, clinically useful diagnostic reviews is the poor methodological quality of the existing body of diagnostic accuracy studies reported in the literature. Fortunately, interest in the methods for the evaluation of diagnostic tests has grown considerably in the last decade. Higher-quality and more informative primary studies will in return generate more informative diagnostic reviews.

Appendix 1

Accuracy of diagnostic decision rules without and with D-dimer assay for the diagnosis of pulmonary embolism. Pulmonary embolism (PE) is an important condition for physicians to consider because case fatality is high if left untreated. However, diagnosing PE in suspected patients is challenging because signs and symptoms are often nonspecific. Physicians constantly face the dilemma of not wanting to miss a PE while at the same time wanting to avoid performing too many unnecessary additional diagnostic procedures that can be expensive, burdensome, and possibly harmful. Diagnostic strategies in suspected patients therefore focus on identifying patients in whom PE can be safely ruled out on the basis of findings from the patient history and physical examination. Many different diagnostic decision rules for excluding PE on the basis of symptoms and signs, with or without D-dimer assay, have been developed and validated, but there remains uncertainty as to whether these different rules differ in their accuracy in a meaningful way. In this example we focus on 3 rules:

• Wells rule using a cutoff value of 2 for defining a positive (abnormal) test result

• Wells rule using a cutoff value of 4
• Gestalt rule

In the Wells rules, points are scored when certain signs and symptoms (e.g., heart rate >100, previous deep venous thrombosis) are present, resulting in a total score. In the Gestalt rule, physicians provide an overall empirical assessment of the likelihood of pulmonary embolism being present after examination of a patient. To safely exclude pulmonary embolism, a D-dimer test can be added to the clinical rule to refrain from further testing if both tests (rule + D-dimer assay) are negative. Further details and more rules can be found in the original review (11).
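As a sketch of how such a points-based rule and the rule-out logic fit together, the following uses item weights as commonly published for the Wells PE score. The weights, item names, and cutoff here are illustrative assumptions and should be checked against the original publications; they are not the review's exact definitions:

```python
# Item weights as commonly published for the Wells PE score (assumed here
# for illustration only; verify against the original publications).
WELLS_ITEMS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_score(findings):
    """Sum the points for each finding that is present."""
    return sum(WELLS_ITEMS[item] for item, present in findings.items() if present)

def pe_ruled_out(score, d_dimer_negative, cutoff=4.0):
    """Rule-out strategy: a score at or below the cutoff (here the
    'Wells-4' variant) combined with a negative D-dimer result excludes
    PE without further imaging; otherwise the patient is referred."""
    return score <= cutoff and d_dimer_negative

# Hypothetical patient: tachycardia and a previous DVT, D-dimer negative.
score = wells_score({"heart_rate_over_100": True, "previous_dvt_or_pe": True})
safe_to_stop = pe_ruled_out(score, d_dimer_negative=True)
```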


REVIEW AIMS

To determine and compare the diagnostic accuracy of 3 different clinical decision rules: Wells-2 (n = 19 studies), Wells-4 (n = 11 studies), and Gestalt rule (n = 15 studies).

To examine whether a negative test from a rule in combination with a negative D-dimer test result is a safe and efficient strategy for excluding PE without referral for further burdensome and invasive imaging.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: The Netherlands Organisation for Health Research and Development (ZonMW); K.G.M. Moons, the Netherlands Organisation for Scientific Research (projects 91208004 and 91810615).
Expert Testimony: None declared.

References

1. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29:E13–21.
2. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.
3. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408–17.
4. Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem [Epub ahead of print 2012 Jun 22].
5. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982–90.
6. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.
7. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97.
8. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36.
9. Diagnostic test accuracy working group. http://srdta.cochrane.org (Accessed August 2012).
10. Leeflang MM, Debets-Ossenkopp YJ, Visser CE, Scholten RJ, Hooft L, Bijlmer HA, et al. Galactomannan detection for invasive aspergillosis in immunocompromized patients. Cochrane Database Syst Rev 2008:CD007394.
11. Lucassen W, Geersing GJ, Erkens PM, Reitsma JB, Moons KG, Buller H, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011;155:448–60.
12. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.
13. Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. J Clin Epidemiol 2005;58:444–9.
14. Leeflang MM, Scholten RJ, Rutjes AW, Reitsma JB, Bossuyt PM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol 2006;59:234–40.
15. Savoie I, Helmer D, Green CJ, Kazanjian A. Beyond Medline: reducing bias through extended systematic review search. Int J Technol Assess Health Care 2003;19:168–78.
16. Fraser C, Mowatt G, Siddiqui R, Burr J. Searching for diagnostic test accuracy studies: an application to screening for open angle glaucoma (OAG) [Abstract]. Cochrane Colloquium Abstracts Journal 2006. http://www.imbi.uni-freiburg.de/OJS/cca/index.php?journal=cca&page=article&op=view&path[]=1980 (Accessed October 2012).
17. Whiting P, Westwood M, Burke M, Sterne J, Glanville J. Systematic reviews of test accuracy should search a range of databases to identify primary studies. J Clin Epidemiol 2008;61:357–64.
18. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess 2000;4:1–115.
19. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.
20. Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012;58:1292–301.
21. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.
22. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001;20:2865–84.
23. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8:239–51.
24. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12.
25. Li T, Puhan MA, Vedula SS, Singh S, Dickersin K; the Ad Hoc Network Meta-analysis Methods Meeting Working Group. Network meta-analysis: highly attractive but more methodological research is needed. BMC Med 2011;9:79.
26. Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121–5.
