
ACADEMIC EMERGENCY MEDICINE November 1998. Volume 5. Number 11 1101

A Comparison of Standardized and Narrative Letters of Recommendation

DANIEL V. GIRZADAS JR., MD, ROBERT C. HARWOOD, MD, MPH, JOSEPH DEARIE, MD, SHAYLA GARRETT, MD

Abstract. Objective: To compare the Council of Emergency Medicine Residency Directors’ (CORD’s) standardized letters of recommendation (SLORs) with traditional narrative letters of recommendation (NLORs) with regard to interrater reliability, consistency, and time of interpretation. Methods: In part I of the study, four members of the residency selection committee each evaluated the same 20 SLORs and 20 NLORs from which all identifying characteristics had been deleted. Using Likert-type scales of the global assessment, each letter was assigned a numeric value from 1 to 7. The interrater reliability was calculated for both types of letters using the Kendall coefficient of concordance. Average time to interpretation of the letters was also determined. In part II, using the same numeric values as in part I, 207 single-author SLOR/NLOR pairs were evaluated to determine whether the global assessment of the SLOR was consistent with that of its partner NLOR. Interpretation of the NLOR was performed blinded to the SLOR. Statistical analysis was calculated using Spearman correlation coefficients. Results: In part I of the study, the interrater reliability of the SLOR was 0.97, as compared with 0.78 for the NLOR. The average time to interpret the global assessment of the SLOR was 16 seconds, vs 90 seconds for the NLOR. In part II of the study, of the 207 SLOR/NLOR pairs, 112 (54%) were assigned the same numeric value, 80 (39%) differed by one, 13 (6%) differed by two, and two (1%) differed by three, for an overall correlation of 0.58. Conclusions: Compared with NLORs, the CORD SLOR offers better interrater reliability with less interpretation time. Single-author SLOR/NLOR pairs submitted for a single applicant do not correlate well. Residency selection committees must decide whether the added work of interpreting NLORs is beneficial. Key words: letter of recommendation; postgraduate education; emergency medicine; residency; selection. ACADEMIC EMERGENCY MEDICINE 1998; 5:1101-1104

TRADITIONAL narrative letters of recommendation (NLORs) are a factor in the resident selection process considered to be more influential than U.S. Medical Licensing Examination (USMLE) scores.1 Along with transcripts and the dean’s letter, they are an important pre-interview source of information about an applicant’s interpersonal and clinical skills.2 Accurate interpretation of NLORs requires time and a significant amount of experience, and even experienced interpreters find the task difficult.3 Frequently, important information is missing or worded in a manner that is subject to a range of interpretation.4

With the aim of making data extraction more precise and efficient, the Council of Emergency Medicine Residency Directors (CORD) has developed a standardized letter of recommendation (SLOR). A SLOR would be expected to require less time and experience to interpret than a NLOR. It would ensure that information considered important to residency selection committees was not omitted. The experience of the previous application cycle seems to bear this out. A separate problem has developed, however. Frequently an author of a letter of recommendation (LOR) for a single applicant submits both a SLOR and a NLOR. Both letters are usually interpreted because one cannot be certain the same information is conveyed in both formats. This increases the workload of interpreting recommendations. If it can be demonstrated that the two types of recommendations convey equivalent information, the more time-consuming NLOR would be unnecessary. This would decrease the workload of resident selection.

The first objective of our study was to determine whether the SLOR conveys information equivalent to that of the NLOR. We also measured the interrater reliability of both the SLOR and the NLOR. Finally, we determined the time required to make a global assessment of both types of letters.

From the Department of Emergency Medicine, Christ Hospital and Medical Center, Oak Lawn, IL (DVG, RCH, JD, SG). Received December 26, 1997; revision received June 11, 1998; accepted June 25, 1998. Address for correspondence and reprints: Daniel V. Girzadas Jr., MD, Department of Emergency Medicine, Christ Hospital and Medical Center, 4440 West 95th Street, Oak Lawn, IL 60453.

METHODS

Study Design. This was a retrospective review of LORs received as part of the standard application process between September and December 1996. A LOR could be submitted by a physician from any specialty. Letters reviewed included those for applicants who were rejected, interviewed, or ranked. Because of the retrospective nature of this project, it was considered exempt from institutional review board review.

Study Protocol. In part I of our study, we established seven-point Likert-type scales for the NLOR and the SLOR.5 For the NLOR, statements were classified and assigned a numeric value according to an unpublished classification system developed by one of the investigators (RCH) and used in the residency selection process of our department (Table 1). The SLOR was also assigned a numeric value of 1-7 (Table 2). If there were inconsistencies, the letter was assigned a numeric value according to the most positive phrase.

TABLE 1. Narrative Letter of Recommendation (NLOR) Classification System

Score  Classification
7      Includes glowing statements such as “is one of the finest medical students of the year,” “is one of the best medical students I have ever worked with,” “richly deserves the honors awarded in the rotation,” or “receives my highest recommendation.”
6      May include some honors grades, top 15-20%, near honors. “Functions as an intern.”
5      Contains the obligatory “good fund of knowledge,” “punctual,” “hardworking,” “progressed well,” “should be an excellent candidate for postgraduate training,” along with some superlatives.
4      Contains mildly complimentary but noncommittal language. Pleasantly describes an average student and tries to put a good spin on the description.
3      May be completely neutral as if the writer has never met the student, or have some subtle descriptions of the student’s averageness or contains slightly negative comments.
2      Contains troublesome or negative comments with little or no balancing superlatives. Almost guarantees “no interview.”
1      Is hard to come by as most students do not ask someone who dislikes them or who has been disappointed in their performance to write them a letter of recommendation. All by itself guarantees “no interview.”

To establish our seven-point numeric system as stable or constant, we determined its interrater reliability. Four raters evaluated the same 20 NLORs and 20 SLORs. Two raters were very experienced and two raters were inexperienced evaluators of LORs. The letters were selected nonrandomly to encompass global assessments ranging from most positive to negative. (In part I, we chose letters that would provide a spectrum of recommendations ranging from poor to outstanding. We believed a random selection would have provided mostly letters in the 5-6 range, since these were most common.) All identifying characteristics were deleted from each letter. The NLORs were not paired with the SLORs; the raters were given a set of 20 NLORs and a different set of 20 SLORs. The raters were asked to assess one entire set prior to assessing the remaining set. They were asked to rank each letter according to the established seven-point Likert-type scale.

In part II of our study, we examined 207 SLOR/NLOR pairs. Virtually all paired letters that were submitted to our residency program in this application cycle were included. Each pair was written by a single author for a single applicant. The author could be from any specialty. Each NLOR/SLOR pair was interpreted by one of the same four raters as in part I using the same two ranking systems described above. Blinded to the corresponding SLOR, each NLOR was interpreted first and assigned a numeric value. Immediately after, each SLOR was interpreted and assigned a numeric value.

Data Analysis. Interrater reliability among the four raters for both the NLOR and the SLOR was calculated using the Kendall coefficient of concordance. Time of interpretation was determined by timing one experienced rater and one junior rater for a total of 80 letters.
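As an illustration of the statistic named above, the following is a minimal sketch of the Kendall coefficient of concordance (W) for m raters scoring n letters. The function names and the toy ratings are hypothetical and are not the study data. The sketch uses the tie-corrected form of W because seven-point ratings produce many tied ranks; the paper does not state whether a tie correction was applied.

```python
from collections import Counter
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores 1..n (ascending); tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Kendall's W; ratings[r][k] is rater r's score for letter k."""
    m = len(ratings)        # number of raters
    n = len(ratings[0])     # number of letters
    ranked = [average_ranks(row) for row in ratings]
    rank_totals = [sum(row[k] for row in ranked) for k in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((total - mean_total) ** 2 for total in rank_totals)
    # Tie correction: sum of (t^3 - t) over every group of tied scores, per rater.
    tie_term = sum(c ** 3 - c for row in ratings for c in Counter(row).values())
    denominator = m ** 2 * (n ** 3 - n) - m * tie_term
    if denominator == 0:
        raise ValueError("W is undefined when every rater ties every letter")
    return 12 * s / denominator


# Hypothetical scores from four raters for six letters (NOT the study data).
toy_ratings = [
    [7, 6, 5, 5, 3, 2],
    [7, 6, 6, 5, 3, 1],
    [6, 6, 5, 4, 3, 2],
    [7, 5, 5, 5, 2, 2],
]
print(round(kendalls_w(toy_ratings), 3))
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings), which is the scale on which the reported values of 0.97 and 0.78 should be read.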

The numeric assignment of the SLOR/NLOR pair was correlated using the Spearman rank-order correlation coefficient.
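For readers unfamiliar with the rank-order coefficient, here is a minimal sketch that computes Spearman's rho as the Pearson correlation of average ranks, which handles the tied scores that a seven-point scale inevitably produces. The function names and the example scores are hypothetical; the paper does not describe the software or tie handling the authors used.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n (ascending); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical paired scores for four applicants (NOT the study data).
slor_scores = [5, 5, 6, 7]
nlor_scores = [5, 6, 6, 7]
print(round(spearman_rho(slor_scores, nlor_scores), 3))
```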

RESULTS

In part I of our study, we determined an interrater reliability of the SLOR of 0.97. The interrater reliability of the NLOR was 0.78. The average time required to interpret a SLOR was 16 seconds, compared with 90 seconds for the NLOR. (This average time represents the sum of the time it took for an experienced rater and an inexperienced rater to interpret each packet of 20 letters, divided by 40 total evaluations. We did not measure the time it took to interpret each letter.)
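The averaging described in the parenthetical can be written out as a short sketch. The per-rater packet times below are hypothetical values chosen only so that they sum to the totals implied by the reported averages (16 s x 40 = 640 s for SLORs, 90 s x 40 = 3,600 s for NLORs); the paper reports only the averages, not the packet times.

```python
RATERS = 2                 # one experienced and one inexperienced rater
LETTERS_PER_PACKET = 20
EVALUATIONS = RATERS * LETTERS_PER_PACKET  # 40 total evaluations per letter type


def average_seconds_per_letter(packet_seconds):
    """Sum each rater's time for a whole packet, then divide by 40 evaluations."""
    if len(packet_seconds) != RATERS:
        raise ValueError("expected one packet time per rater")
    return sum(packet_seconds) / EVALUATIONS


# Hypothetical packet times in seconds (assumed split between the two raters).
hypothetical_slor_packets = [300.0, 340.0]     # sums to 640 s
hypothetical_nlor_packets = [1700.0, 1900.0]   # sums to 3600 s

slor_avg = average_seconds_per_letter(hypothetical_slor_packets)
nlor_avg = average_seconds_per_letter(hypothetical_nlor_packets)
print(slor_avg, nlor_avg, nlor_avg / slor_avg)
```

Under these reported averages, a NLOR took about 5.6 times as long to interpret as a SLOR.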

In part II of our study, 112 (54%) of the 207 SLOR/NLOR pairs were assigned the same numeric value. Eighty pairs (39%) differed by one point on the scale, 13 (6%) differed by two, and two (1%) differed by three. The overall correlation was 0.58.
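The reported percentages can be checked directly from the counts given above. The sketch below is only a consistency check on the published numbers; the variable names are illustrative.

```python
# Number of SLOR/NLOR pairs by absolute difference in assigned score (from the Results).
difference_counts = {0: 112, 1: 80, 2: 13, 3: 2}

total_pairs = sum(difference_counts.values())

# Percentage of pairs at each difference, rounded to whole percent as in the text.
percentages = {
    diff: round(100 * count / total_pairs)
    for diff, count in difference_counts.items()
}

for diff, count in difference_counts.items():
    print(f"differ by {diff}: {count} pairs ({percentages[diff]}%)")
print("total pairs:", total_pairs)
```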

DISCUSSION

Accurate interpretation of LORs is essential, since decisions based on these letters can profoundly affect a resident’s future. Evaluative processes must be developed to minimize any error in classification. Ideally, a reliability of more than 0.95 should be achieved.6 Part I of our study showed that the interrater reliability of the SLOR is better than that of the NLOR. We believe that the method of evaluating NLORs developed by Harwood is straightforward. Yet, despite having used it in our interpretation of every NLOR over the last three residency application cycles, we still found that subjectivity played a significant role in final decision making. Interpretation of the SLOR, however, was strictly algorithmic. This left little room for subjectivity and improved the reliability between raters.

In our study, evaluation of an applicant’s LORs was performed by physicians with a range of experience in letter interpretation. Two of the physicians were senior members of our residency selection committee having a cumulative experience of interpreting tens of thousands of LORs. The other two physicians were resident members of the selection committee who cumulatively had interpreted fewer than 500 LORs. This diversity would be expected to decrease interrater reliability. However, an analysis of our data found that the interrater reliability of both the SLOR and the NLOR was not affected by level of experience. The SLOR had better interrater reliability than did the NLOR regardless of the interpreter’s experience. This speaks to the strength of the SLOR. It offers a high level of interrater reliability for both experienced and novice interpreters of LORs. It allows residents and junior faculty members to play a greater role in the evaluation of residency applications.

There currently is no reference criterion standard for the interpretation of LORs.7 This is because any assessment of clinical performance is inherently subjective. Previous studies have shown that NLORs are not valid when compared with the criterion standard of actual resident performance, and that they frequently do not contain the necessary information to adequately judge applicants.8 Schaider et al.9 recently showed that when using actual resident performance as the criterion standard, there was no difference in the predictive value between a preprinted questionnaire and a NLOR if reviewed retrospectively. That study recommended using only the SLOR to evaluate applicants. If it truly is crucial to have a high reliability for an evaluation that determines an applicant’s future, the SLOR is superior to the NLOR. Additionally, it forces the writer of the recommendation to describe for residency selection committees specific character traits of interest that are frequently not addressed or are worded vaguely in NLORs.

Using our algorithm, the time required to interpret the SLOR is much less than that required to interpret the NLOR. This is an added benefit of using the SLOR, and can reduce the time necessary to evaluate and select applicants.

TABLE 2. Standardized Letter of Recommendation (SLOR) Classification System

Score  Classification
7      Guaranteed match
6      Outstanding/very likely to match
5      Excellent
4      Very good
3      Good
2      Would not rank
1      Would not rank, plus negative comment

Part II of our study suggests that there is a moderately low correlation between the SLOR and the NLOR written by the same author. Consequently, if writers of recommendations continue to submit both formats, residency selection committees must either evaluate both the SLOR and the NLOR and increase their workload, or choose to read only one format. Our results combined with those of Schaider et al.9 suggest that, if there is no significant difference in the predictive value of the two formats, one should choose the more reliable and faster format, the SLOR.

LIMITATIONS AND FUTURE QUESTIONS

We chose not to pair the SLORs and the NLORs in part I of our study but did pair them in part II. In part I, our main objective was to evaluate the interrater reliability of both formats of letters. We did not directly compare the two types of letters in this part of the study. Thus we believed we could allow the raters to focus on a single format over 20 letters before having to concentrate on the other format. In part II, we directly compared one format with the other. A residency selection committee would normally interpret letters written by a single author together as a pair. We therefore thought it was relevant to pair the letters for this aspect of the study.

We focused on the global assessment of letters because we believed that is what most interpreters of LORs try to accomplish.2,10 The SLOR consistently provided information regarding an applicant’s commitment to emergency medicine (EM), work ethic, interpersonal skills, and ability to develop a cohesive treatment plan. Frequently NLORs lacked information about these separate characteristics. We therefore could not compare some specific traits between the two letter types.

The correlation of the single-author SLOR/NLOR pairs would be improved to 0.93 if we allowed for a variance of one point on the Likert scale. However, we were interested to know whether the SLOR and NLOR conveyed equivalent recommendations. Thus we thought it was important to keep variance between both letter formats to a minimum in our interpretation of their correlation.
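The 0.93 figure matches the proportion of pairs whose two scores agree within one point, computed from the counts reported in the Results. That reading is an assumption, since the paper does not spell out how the value was obtained; the helper name below is hypothetical.

```python
# Number of SLOR/NLOR pairs by absolute difference in assigned score (from the Results).
difference_counts = {0: 112, 1: 80, 2: 13, 3: 2}


def agreement_within(counts, tolerance):
    """Fraction of pairs whose scores differ by at most `tolerance` points."""
    total = sum(counts.values())
    agreeing = sum(count for diff, count in counts.items() if diff <= tolerance)
    return agreeing / total


exact = agreement_within(difference_counts, 0)       # 112 of 207 pairs
within_one = agreement_within(difference_counts, 1)  # 192 of 207 pairs
print(round(exact, 2), round(within_one, 2))
```

Note that this is a proportion of agreement rather than a rank correlation, so it is not directly comparable with the Spearman coefficient of 0.58.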

Both types of letters suffered from inconsistencies in classification in the scales. Particularly in NLORs, authors used phrases that corresponded to different numeric rankings (e.g., “good fund of knowledge” and “my highest recommendation”). Similar but less common inconsistencies also occurred in the SLORs (e.g., “excellent” and “guaranteed match”). To improve reliability, it would be useful for the CORD to develop a set of guidelines to delineate how to write different levels of the SLOR. This would make interpreting the SLOR even more definite because we would all follow the same standard.

Interpretation of a NLOR was always done prior to evaluating the accompanying SLOR. We believed this would avoid bias because our standard process for ranking NLORs allows for some subjective interpretation. On the other hand, interpretation of the SLOR is strictly algorithmic. It does not allow for subjectivity and thus it would not be expected to be biased by the NLOR.

The issue of LORs written by non-EM authors is problematic. Our study was undertaken during a year when the CORD SLOR was used by both EM and non-EM authors. Recently the CORD has asked that only EM faculty use the SLOR. It would seem that the SLOR could be adapted with minimal changes to suit the needs of non-EM authors. It is likely that a non-EM SLOR could provide the same benefits to residency selection committees as does the CORD SLOR.

Many questions remain about this comparison. Why do SLORs and NLORs not correspond closely? Is it due to differences in content or to the subjective and inexact nature of the narrative format? Only emergency physicians are using the CORD SLOR this year; will this lead to more consistency between the two formats? To what extent are residency selection committees still interpreting NLORs when a SLOR is also sent by the same author? It would be helpful to have a prospective study with predetermined outcome measures evaluating the performance of residents who had LORs that did not correlate. This would provide a stronger measure of the predictive power of a specific letter format. Our study suggests that the SLOR had better performance characteristics, but its predictive ability compared with the NLOR has not yet been evaluated.

CONCLUSION

Compared with NLORs, the CORD SLOR offers better interrater reliability with less interpretation time. Single-author SLOR/NLOR pairs submitted for a single applicant do not correlate well. Residency selection committees must decide whether the added work of interpreting NLORs is beneficial.

Thanks to Nancy Cipparrone of Advocate Health Care Research and Education Institute for her statistical support and help preparing the manuscript. Thanks also to Joyce Fedeczko, MALS, and Library Staff of Advocate Health Sciences Library Network for research assistance.

References

1. Frankville D, Benumof MI. Relative importance of the factors used to select residents: a survey. Anesthesiology. 1991; 75:A876.
2. Baker DJ, Bailey MK, Brahen NH, Conroy JM, Dorman HB, Haynes GR. Selection of anesthesiology residents. Acad Med. 1993; 68:161-3.
3. Garmel GM. Letters of recommendation: what does good really mean? [letter]. Acad Emerg Med. 1997; 4:833-4.
4. O’Halloran CM, Altmaier EM, Smith WL, Franken EA. Evaluation of resident applicants by letters of recommendation: a comparison of traditional and behavior based formats. Investig Radiol. 1993; 28:274-7.
5. Likert R. Technique for the measurement of attitudes. Arch Psychiatry. 1932; 140:1-55.
6. Cozby PC. Methods in Behavioral Research, 6th ed. Mountain View, CA: Mayfield Publishing, 1997.
7. Karras DJ. Statistical methodology: II. Reliability and validity assessment in study design, Part B. Acad Emerg Med. 1997; 4:144-7.
8. Leichner P, Eusebio-Torres E, Harper D. The validity of reference letters in predicting resident performance. J Med Educ. 1981; 56:1019-21.
9. Schaider JJ, Rydman RJ, Greene CS. Predictive value of letters of recommendation vs questionnaires for emergency medicine resident performance. Acad Emerg Med. 1997; 4:801-5.
10. Greenburg AG, Doyle J, McClure DK. Letters of recommendation for surgical residencies: what they say and what they mean. J Surg Res. 1994; 56:192-8.
