Clin Perinatol 30 (2003) 343–350
Benchmarking techniques to improve neonatal
care: uses and abuses
Michele C. Walsh, MD, MS a,b,*
a Department of Pediatrics, Case Western Reserve University, USA
b Neonatal Intensive Care Unit, Rainbow Babies & Children’s Hospital, University Hospitals of
Cleveland, 11000 Euclid Avenue, Mailstop 6010, Cleveland, OH 44106–6010, USA
In neonatology, as in health care in general, there is a desire to compare
outcomes across institutions. The goals of these comparisons are diverse and
range from a desire to reduce costs of care to improving patient outcomes. A new
goal of these comparisons is to reward institutions with practices that enhance
patient safety, such as the efforts launched by the Leapfrog Group and
recently by the Centers for Medicare and Medicaid Services. The pioneering work
by Wennberg et al [1–4] was among the first to highlight the differences in
practice among institutions and regions of the country. Through careful dissection
of large patient care claims data sets, he and his colleagues were able to
demonstrate that variations in the frequency of operations, such as cardiac
catheterization, coronary artery bypass operations, and prostatectomy, could not
be explained by differences in patient characteristics. Instead the differences were
explained by differences in physicians’ preference for treatments. Some of these
practice variations are driven by uncertainty (ie, the lack of sound evidence for
any specific course of treatment). Benchmarking is an approach that capitalizes
on these variations in practice and associated outcome differences in an attempt to
reduce the uncertainty about the effects of specific treatment strategies.
Horbar [5] has proposed a theoretical framework to explain variations in
practice (Fig. 1). In his model, as the strength of evidence increases, agreement
on standard practice increases and the expected variation in a practice should
decrease, and in fact any variation may be considered inappropriate. Conversely,
when the evidence is weak, variation is expected to increase. Horbar and the
Vermont-Oxford Network have shown, however, that differences exist even in
the use of practices that have been shown to be highly effective, such as the
administration of antenatal corticosteroids and the early administration of
surfactant.

Fig. 1. Variations in practice can be explained by the interaction of the strength of the evidence
together with the degree of agreement. When the evidence is strong, variation in practice is
inappropriate. When evidence is weak, variation is expected and even desirable. (Modified from
concepts presented by JD Horbar, Variations in Neonatal Care, Hot Topics in Neonatology, 2001;
with permission.)

0095-5108/03/$ – see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/S0095-5108(03)00016-2
* Corresponding author. Neonatal Intensive Care Unit, Rainbow Babies & Children’s Hospital,
University Hospitals of Cleveland, 11000 Euclid Avenue, Mailstop 6010, Cleveland, OH 44106–6010.
E-mail address: [email protected]

As predicted by the model, other less well studied practices in
neonatology vary widely among institutions. Differences have been shown to
exist in strategies for managing chronic lung disease and persistent pulmonary
hypertension, and the use of inotropes, pain medication, analgesic agents, and
blood transfusions [6–9]. This inherent variation in practice constitutes a natural
albeit uncontrolled experiment and allows one to explore the impact of different
practices on outcomes. One tool that is useful in exploiting this variation is
termed ‘‘benchmarking.’’
Uses of benchmarking
Benchmarking, a technique developed in the business world in the 1980s, is
the process of comparing outcomes together with a detailed examination of the
processes responsible for the outcomes (Box 1). In benchmarking, one identifies
an institution as the leader in some outcome, studies in detail the processes of
their care, and then emulates their practices with the goal of improving one’s own
outcomes. For example, a business in the hospitality industry that wishes to
improve the satisfaction of its customers might study the practices of industry
leaders, such as the Ritz Carlton Hotel or Disney World. The practices that
employees in those outstanding businesses use are called ‘‘best practices’’ and
are then adopted by businesses that wish to improve.
Box 1. Uses of benchmarking techniques

• Comparisons of care processes and outcomes among like populations using data collected with similar definitions
• Breaking down provincial tendencies of isolated health care to demonstrate that care is not always given in the accustomed manner
• Integration of practice variation with evidence-based medicine to accelerate the process of change
• Restoring equipoise that allows a critical examination of practices, setting the stage for change
• Generating hypotheses that can be tested in clinical trials

The terminology of best practices elicits a strong negative reaction in some
physicians. They argue that such terminology within health care implies that the
practices have been exhaustively researched and proved to be superior. They
argue further that the practice of benchmarking runs counter to the movement that
seeks sound evidence for medical practices. In fact, benchmarking is a tool that
complements evidence-based medicine. Evidence-based medical practices are
based on rigorous scientific standards. Unfortunately, most practices in neo-
natology have not been rigorously studied and evidence of effectiveness does not
exist. Indeed, every physician is faced daily with deciding between alternative
treatment strategies that are essentially unstudied. Hundreds of decisions (when
to begin feedings, how rapidly to advance feedings, goals for PaCO2 and O2
saturation, and so forth) must be made in the absence of high-quality evidence to
guide these decisions. Plesk et al [10] have argued that it might not be possible to
conduct randomized trials in some areas because of the nature of the intervention,
the large number of potential treatments and variations on treatments needing
study, or the rapid pace of technologic change. Benchmarking is one tool that
may be used to learn from the existing natural experiment that is produced by
variations in practice among institutions. To reduce the distrust that the use of the
term ‘‘best practices’’ evokes in many physicians, Plesk et al [10] have proposed
the somewhat awkward but more acceptable and accurate term ‘‘potentially better
practices.’’ Plesk et al [10] state that this terminology ‘‘suggests that we are
looking for improvement ideas that have both the logical appeal and the
experience in practice that suggest that they might be beneficial if implemented
in our institution.’’
Several limitations of benchmarking must be recognized. Benchmarking is
only appropriate in areas where definitive scientific evidence of the superiority of
a treatment does not exist. One must also recognize that one is studying only a
subset of the potentially better practices in use; still better practices may
remain undiscovered. As with any practice
change, the potential for unanticipated adverse consequences exists and must be
monitored carefully. Benchmarking is one component of a continuous quality
improvement cycle of research, diligent evaluation of one’s own practices and
outcomes, formal visits to better performing sites, implementation of change, and
measurement of expected improvement.
Examples of the use of benchmarking as an improvement tool in health care
are numerous. One of the first and best documented benchmarking examples is
that of the New England Cardiovascular Collaborative [11,12]. This collabora-
tive group included 23 cardiothoracic surgeons from seven centers in Maine,
New Hampshire, and Vermont. Comparison of risk-adjusted mortality rates, from
a registry of coronary artery bypass graft surgery, convinced participants that
differences in mortality rates across institutions could not be explained by
differences in severity of illness, but were caused by unknown, and probably
subtle, differences in aspects of patient care [11]. The group designed a three-
part intervention to reduce mortality associated with coronary artery bypass graft
surgery [12]. The intervention consisted of feedback of outcome data by center,
training in continuous quality techniques, and site visits to other medical centers.
The main outcome measure was a comparison of the observed and expected
mortality ratio in the postintervention period. Mortality at six of the seven
centers declined significantly in the postintervention period for an overall 24%
reduction in mortality. The only center whose mortality did not decline was the
best-performing center before the intervention. Comparisons with historical controls
are notoriously prone to biases toward improvement over time. Comparison of
the New England Cardiovascular Collaborative with the national Health Care
Financing Administration data, however, showed the decline in mortality was
significantly better than that experienced at other centers in the nation during the
same period. Other potential sources of bias include selection bias (centers that
are motivated to improve might implement other changes that improve outcome)
and the Hawthorne effect (outcomes tend to get better when they are observed
more carefully).
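The collaborative's main outcome measure, the ratio of observed to expected mortality, can be sketched in a few lines. The counts and risk estimates below are hypothetical illustrations, not the collaborative's data; the published analysis used a validated risk-adjustment model.

```python
# Sketch of an observed/expected (O/E) mortality ratio, the outcome
# measure described above. All numbers here are invented.

def oe_ratio(observed_deaths, predicted_risks):
    """O/E ratio: observed deaths divided by the sum of each
    patient's model-predicted probability of death."""
    expected = sum(predicted_risks)
    return observed_deaths / expected

# Hypothetical center: 12 deaths among 400 patients whose risk model
# predicted these probabilities of death (one entry per patient).
risks = [0.02] * 300 + [0.10] * 80 + [0.30] * 20  # expected deaths = 20
print(round(oe_ratio(12, risks), 3))  # a ratio below 1.0 means fewer
                                      # deaths than the case mix predicts
```

A ratio above 1.0 would instead flag more deaths than the severity of illness predicts, which is what prompted the collaborative's process investigation.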
In neonatology, the Vermont Oxford Collaborative for Quality Improvement
has conducted two demonstration projects using benchmarking methodology.
One group of six neonatal intensive care units (NICUs) focused on reducing
nosocomial infection, whereas another group of four NICUs focused on reducing
the rate of bronchopulmonary dysplasia [13]. Preliminary analyses have demon-
strated significant improvement in both outcomes between 1994, the year before
the beginning of the project, and 1996, the year after implementation of the
‘‘potentially better practices’’ had begun. The overall rate of nosocomial infection
at the six NICUs in the infection subgroup declined from 26.3% in 1994 to
20.9% in 1996 (P = .007). The rate of supplemental oxygen administration at
36 weeks for infants weighing 501 to 1000 g decreased from 43.5% in 1994 to
31.5% in 1996 (P = .03) at the four NICUs in the bronchopulmonary dysplasia
subgroup. There was significant variation among NICUs with respect to whether
improvement occurred and the magnitude of improvement achieved in each
subgroup. The improvements observed at the NICUs in these subgroups were
significantly larger than were the changes observed at the 66 other North
American Vermont Oxford Network centers not participating in the NICU
project. Benchmarking is a relatively new approach in medicine and further
study with more rigorous designs is needed to demonstrate the effectiveness and
validity of this approach.
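The P values quoted above come from comparing proportions between the two periods. A minimal two-proportion z-test (pooled-variance normal approximation) sketches that arithmetic; the denominators of 1000 infants per period are invented for illustration and are not the network's actual cohort sizes or statistical methods.

```python
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions,
    using the pooled-variance normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided tail
    return z, p_value

# Infection rates quoted in the text (26.3% in 1994, 20.9% in 1996),
# with hypothetical denominators of 1000 infants per period.
z, p = two_proportion_z(263, 1000, 209, 1000)
print(round(z, 2), round(p, 4))
```

With these illustrative denominators the difference is already significant at conventional levels; the published P = .007 reflects the network's actual sample sizes and analysis.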
Abuses of benchmarking
Techniques of benchmarking can be abused if the basic principles on which
the method is based are violated. Benchmarking begins with an assumption that
patient populations are similar, and that the outcomes studied are clearly defined
and measured in the same way among institutions. Benchmarking proceeds with
a careful evaluation of care processes associated with desired outcomes. A
common but incorrect methodology that is labeled benchmarking takes the
shortcut of using claims data, such as the Medicare database, or information drawn
from hospital charge data, to compare outcomes among institutions. Such efforts
are fraught with error because inherently different patient populations are
compared; they are also not true benchmarking because they examine outcomes only
and do not provide detailed data on the processes of care.
A common failure in the use of claims-based data is that such data are often
generated from the coding of medical records using ICD-9 codes and diagnosis-
related groups. These coding systems were designed for adult patients and do not
function well for pediatric patients in general and neonates in particular. Errors in
coding are common. Before any such data are used for comparison, neo-
natologists are wise to validate the data generated against other data sources,
such as an existing database or unit logbook to ensure that all patients are
captured and that values generated pass scrutiny at least at this level. For
example, if the ICD-9 or diagnosis-related group report for a large NICU
enumerates five patients per year with a birthweight less than 750 g with an
average length of stay of 15 days, neonatologists immediately recognize that
these numbers are erroneous. More subtle errors can only be detected by detailed
inspection of an internal database to confirm or refute the ICD-9 and diagnosis-
related group data reported. All such data must be verified before they are used as a
basis for comparison among institutions.
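A sanity check of this kind is easy to automate. The sketch below, with invented records and a hypothetical `audit` helper, compares an ICD-9/claims extract against the unit's own logbook for infants under 750 g, flagging both missing patients and implausible length-of-stay averages.

```python
# Cross-check a claims/ICD-9 extract against the NICU logbook before
# trusting it for benchmarking. All records below are invented.

claims_extract = [  # (patient_id, birthweight_g, length_of_stay_days)
    ("A01", 720, 15), ("A02", 690, 14),
]
unit_logbook = [
    ("A01", 720, 88), ("A02", 690, 95), ("A03", 610, 120),
]

def audit(claims, logbook, bw_cutoff=750):
    """Return infants the claims data missed, plus average LOS in
    each source, restricted to birthweight below bw_cutoff grams."""
    claims_ids = {pid for pid, bw, _ in claims if bw < bw_cutoff}
    log_ids = {pid for pid, bw, _ in logbook if bw < bw_cutoff}
    missing = log_ids - claims_ids  # in the logbook, absent from claims
    avg_claims = sum(los for _, bw, los in claims if bw < bw_cutoff) / len(claims_ids)
    avg_log = sum(los for _, bw, los in logbook if bw < bw_cutoff) / len(log_ids)
    return missing, avg_claims, avg_log

missing, claims_los, log_los = audit(claims_extract, unit_logbook)
print(missing)              # one infant dropped by the claims extract
print(claims_los, log_los)  # a 14.5-day average LOS for <750 g infants
                            # should immediately raise suspicion
```

The mismatch between the two averages is exactly the kind of gross error the text describes: no NICU discharges a cohort of infants under 750 g in about 2 weeks.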
Neonatologists must be aware of the potential pitfalls that exist in comparing
such charge-based data. Current neonatal care arises from a regionalized system
of care with planned differences in the level of severity of illness in the patients
cared for in different level units. If the impact of the regionalized system is not
recognized, erroneous conclusions about care are particularly likely to occur. For
example, if two units are compared for the outcome length of stay and unit A is
found to have an average length of stay of 8.5 days, and unit B is found to have
an average length of stay of 12.5 days, an investigator who is unfamiliar with
clinical aspects of neonatology might erroneously conclude that unit A is more
efficient than unit B. Numerous other potential explanations exist, however, for
the difference in length of stay including the following: unit A is a level 2 unit
and unit B is a level 3 unit caring for more complex and ill patients; unit A
transfers all complex patients to another institution, whereas unit B accepts
patients from other units; unit B has equally complex patients as unit A but
transfers stabilized patients back to their original hospitals to complete their
convalescence. Neonatologists must be aware of these limitations and ensure that
all comparative reports are adjusted for severity of illness.
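The length-of-stay pitfall can be made concrete with a small sketch. The units, strata, and stay lengths below are invented, and the two-level "severity" label is a stand-in; a real comparison would use a validated severity-of-illness score.

```python
# Why crude average length of stay (LOS) misleads: compare units within
# severity strata, not overall. All data here are hypothetical.

stays = [  # (unit, severity_stratum, los_days)
    ("A", "low", 6), ("A", "low", 8), ("A", "high", 20),
    ("B", "low", 7), ("B", "high", 22), ("B", "high", 25), ("B", "high", 30),
]

def mean_los(records, unit, stratum=None):
    """Average LOS for one unit, optionally within one severity stratum."""
    vals = [los for u, s, los in records
            if u == unit and (stratum is None or s == stratum)]
    return sum(vals) / len(vals)

# Crude comparison: unit A looks far "more efficient" than unit B ...
print(round(mean_los(stays, "A"), 1), round(mean_los(stays, "B"), 1))
# ... but within the low-severity stratum the units are comparable;
# unit B simply cares for more high-severity infants.
print(mean_los(stays, "A", "low"), mean_los(stays, "B", "low"))
```

The crude means differ by roughly 10 days while the within-stratum means are identical, which is the erroneous "unit A is more efficient" conclusion the text warns against.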
Protecting participants in quality improvement initiatives
In benchmarking and other quality improvement projects, one principle must
be kept in focus: patient protection. Any quality improvement project that
changes practice has the potential for unanticipated adverse consequences just
as that potential exists in randomized clinical trials. Participants in randomized
trials are protected from such consequences by the vigilance of the investigators,
oversight by institutional review boards, and data safety monitoring committees.
Strict regulations have evolved to ensure that subjects in trials are informed of the
risks and consent to participate. Patients who are affected by quality improvement
projects deserve the same protections. Casarett et al [14] have recently
proposed standards for determining when quality improvement projects should be
considered research. They propose that quality improvement projects be reviewed
and regulated as research if most patients involved are not expected to benefit
directly from the knowledge gained or if additional risks or burdens are imposed
to make the results generalizable. Whether the first of these criteria should require
institutional review as research is controversial. This standard should certainly be
applied in any quality improvement project that meets the second criterion.
Applying benchmarking to improve quality in practice
Benchmarking to improve the outcomes of care can be done at any institution.
But doing benchmarking well can be expensive and time consuming. The
necessary steps are outlined in Box 2.

Box 2. Steps in a successful benchmarking initiative

1. Compare outcomes to those of similar institutions.
2. Select an outcome to improve.
3. Scrutinize practices to become familiar with the details of care.
4. Identify comparator institutions and conduct interviews or site visits to identify potentially better practices.
5. Involve all disciplines of the care team in the project.
6. Using a physician champion, build consensus for the practice change.
7. Implement the change.
8. Measure the impact of the change on both the process and the outcomes frequently: weekly or monthly. Disseminate the results to all members of the care team.
9. Re-evaluate the project within 6 weeks of implementation to assess successes and barriers.
10. Redesign areas in need of additional improvement and begin a new cycle of change.

The payoffs for this investment are
improved quality of care. Better quality of care often translates directly into more
efficient care with shorter lengths of stay and a more favorable impact on the
hospital’s financial well being. Health administrators are well aware of this link
between quality and efficiency and often financially support teams that are
motivated to undertake these efforts. Rogowski et al [15] analyzed the economic
implications of collaborative quality improvement and reported improved patient
outcomes and substantial cost savings.
The necessary ingredients for a successful benchmarking initiative include
health team members dedicated to improving care, access to comparative data on
outcomes, and clinicians with a willingness to change current practices. These
characteristics represent the finest traditions in clinical medicine. Most physicians
intuitively recognize benchmarking processes as being similar to the opinions
they collect from respected colleagues when confronted with a challenging case.
True benchmarking expands and formalizes this process and provides access to
data that inform decisions. Some physicians are wary of participating in
benchmarking or other quality improvement processes for fear that they will be
asked to relinquish autonomy or practice ‘‘cookbook’’ medicine. Nothing could
be farther from the truth. Physicians are encouraged to continue to individualize
the treatment of their patients, while having a better understanding of the complex
interplay of processes, personnel, and organizations that impact that care.
Neonatologists are fortunate to have available an organized forum to shepherd
benchmarking and quality improvement efforts. The Vermont-Oxford Network
Evidence-Based Quality Improvement Collaborative for Neonatology, established
in 1998, provides high-quality support for NICUs beginning their collaborative
improvement efforts. For a nominal fee, member institutions participate in
2-year programs that facilitate improvement. Quarterly and annual reports allow
institutions to continue their improvement initiatives and monitor their progress
over time. These processes have not been tested rigorously in randomized
trials, but numerous reported projects suggest that they are effective. As
evidence of this, many health care insurers specifically
reward Vermont-Oxford Network institutions with higher scores on report cards.
Improving the quality of neonatal care and the outcomes of tiny patients are
goals that all neonatologists support. Benchmarking adds a tool to the armamen-
tarium to produce these changes and complements improvements driven by
evidence-based medicine.
References
[1] Wennberg J, Roos N, Sola L, et al. Use of claims data systems to evaluate health outcomes:
mortality and reoperation following prostatectomy. JAMA 1987;257:933–6.
[2] Wennberg J, Mulley AG, Hanley D, et al. An assessment of prostatectomy for benign urinary tract
obstruction. JAMA 1988;259:3027–30.
[3] Winslow CM, Kosecoff JB, Chassin MR, Kanouse DE, Brook RH. The appropriateness of carotid
endarterectomy. N Engl J Med 1988;318:721–7.
[4] Winslow C, Chassin M, Kanouse DE, Brook RH, et al. The appropriateness of performing
coronary artery bypass surgery. JAMA 1988;260:505–9.
[5] Horbar JD. Variations in neonatal intensive care. In: Lucey J, editor. Hot topics in neonatology.
Washington: 2001.
[6] Horbar JD. Hospital and patient characteristics associated with variation in 28 day mortality rates
for VLBW infants. Pediatrics 1997;99:149–56.
[7] Kahn DJ, Richardson DK, Gray JE, et al. Variation in neonatal intensive care units in narcotic
administration. Arch Pediatr Adolesc Med 1998;152:844–51.
[8] Ringer SA, Richardson DK, Sacher RA, Keszler M, Churchill WH. Variations in transfusion
practice in neonatal intensive care. Pediatrics 1998;101:194–200.
[9] Walsh-Sukys MC, Tyson J, Wright LL, Bauer CR, Korones SB, Stevenson DK, et al. Persistent
pulmonary hypertension of the newborn in the era before nitric oxide: practice variation and
outcomes. Pediatrics 2000;105:14–20.
[10] Plesk PE, Al-Aweel IC, Gray JE, Richardson DK. Characterizing practice style in neonatal
intensive care. Pediatr Res 1998;43:226A; Quality improvement methods in clinical medicine.
Pediatrics 1999;103:203–14.
[11] O’Connor GT, Plume SK, Olmstead EM, et al. A regional prospective study of in-hospital
mortality associated with coronary artery bypass grafting. JAMA 1991;266:803–9.
[12] O’Connor GT, Plume SK, Olmstead EM, et al. A regional intervention to improve the hospital
mortality associated with coronary artery bypass graft surgery. JAMA 1996;275:841–6.
[13] Horbar JD, Rogowski JR, Plesk PE, et al. Collaborative quality improvement for neonatal
intensive care. Pediatrics 2001;107:14–22.
[14] Casarett D, Karlawish JH, Sugarman J. Determining when quality improvement initiatives
should be considered research: proposed criteria and implications. JAMA 2000;283:2275–80.
[15] Rogowski JA, Horbar JD, Plesk PE, et al. Economic implications of neonatal intensive care unit
collaborative quality improvement. Pediatrics 2001;107:23–9.