
Running Head: EMPIRICAL KEYING OF SITUATIONAL JUDGMENT TESTS

Empirical Keying of Situational Judgment Tests:

Rationale and Some Examples

Kelley J. Krokos American Institutes for Research

Ph: 202.403.5259 Fx: 202.403.5033 [email protected]

Adam W. Meade

North Carolina State University Ph: 919.513.4857 Fx: 919.515.1716

[email protected]

April R. Cantwell North Carolina State University

Ph: 919.515.2251 Fx: 919.515.1716

[email protected]

Samuel B. Pond, III North Carolina State University

Ph: 919.515.2251 Fx: 919.515.1716 [email protected]

Mark A. Wilson

North Carolina State University Ph: 919.515.2251 Fx: 919.515.1716

[email protected]


Press Paragraph

Recently there has been increased interest in the use of situational judgment tests (SJTs) for

employee selection and promotion. SJTs have respectable validity coefficients with performance

criteria, though validity coefficients vary from study to study. We propose the use of empirical

keying in order to help maximize the utility of SJTs. Though others have used such methods, we

provide a much needed theoretical rationale for such scoring procedures by illustrating the

distinction among SJTs, cognitive ability, and biodata. Results indicate that some empirical

keying approaches are advantageous for predicting a leadership criterion compared to traditional

subject matter expert SJT scoring.

Abstract

There has been increased interest in the use of situational judgment tests (SJTs) for employee

selection and promotion. We provide a much needed theoretical rationale for empirical keying

of SJTs. Empirical results indicate that some empirical keying approaches are more

advantageous than subject matter expert SJT scoring.


Empirical Keying of Situational Judgment Tests: Rationale and Some Examples

SJTs are becoming increasingly popular as selection, promotion, and developmental tools

(Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; Hanson & Ramos, 1996; McDaniel,

Finnegan, Morgeson, Campion, & Braverman, 1997), and with good reason; several researchers

have had considerable success in predicting performance with SJTs (McDaniel, Morgeson,

Finnegan, Campion, & Braverman, 2001; Phillips, 1993; Weekley & Jones, 1999), with less

adverse impact than is typically found in measures of cognitive ability (Hanson & Ramos, 1996;

Motowidlo & Tippins, 1993; Weekley & Jones, 1997).

Despite these promising findings, one persistent problem with the use of SJTs is that

validity coefficients often vary widely. For example, some authors have found no significant

correlation between SJT scores and employee performance (Smiderle, Perry, & Cronshaw,

1994), while others have found validity coefficients of .45 (Phillips, 1993) and .56 (Stevens &

Campion, 1999). Still others have found widely divergent results for men and women (Phillips,

1992) or by construct examined (Motowidlo, Dunnette, & Carter, 1990). Undoubtedly, there are

also many unpublished studies of SJTs showing varying or non-significant results.

One possible explanation for these problems may lie in the way SJTs are scored.

Traditionally, subject matter experts (SMEs) have determined the “correct” responses to SJT

items. However, we propose that an empirical approach to item scoring has several theoretical

and practical advantages over the SME approach. Though we are not the first to suggest the use

of empirical keys for SJTs, we do provide a much needed theoretical rationale for their use that

has not previously been discussed. In this study, we discuss the advantages of empirical

approaches and illustrate their use with an SJT predicting a leadership criterion.


Scoring of SJTs. In a recent review, McDaniel and Nguyen (2001) describe approaches

to scoring SJTs. The first and most common approach is to ask subject matter experts (SMEs) to

decide which response alternative is best for each item. With this approach, items with little or

no SME agreement are deleted or rewritten. Results with the SME scoring approach vary, though they are generally positive. A second scoring approach identified by McDaniel and Nguyen (2001) involves pilot testing an SJT and identifying the “correct” responses based on central

tendency statistics, though no example or explanation of how this should be implemented was

given. The last approach discussed by McDaniel and Nguyen (2001) is the use of empirical

methods to determine the scoring key.

Although empirical scoring approaches are rarely used for SJTs, some research evidence

suggests that SJTs scored in this way can yield moderate validity coefficients. Dalessio (1994)

successfully used an empirical keying technique for an SJT to predict turnover among insurance

agents. Weekley and Jones (1997) used empirical scoring based on mean criterion performance

of service workers and found a cross-validity coefficient of .22. Finally, although the

relationships between the SJT scores and performance criteria were not assessed, Lievens (2000)

developed an empirical scoring key for an SJT using correspondence analysis and discriminant

analysis.

In contrast to the paucity of studies examining empirical scoring procedures of SJTs,

biodata research has a long history of using empirical scoring procedures. These procedures are easily

adaptable to SJT items. However, Hogan (1994) briefly reviewed the entire history of empirical

keying methods and found that few studies had compared different empirical keying procedures.

In one of the few studies comparing multiple empirical keying techniques, Devlin et al. (1992)


found that the vertical percent method (application blank method; England, 1961) was among the

best at predicting academic performance for college freshmen, with cross-validities typically in the .4-.5 range. The horizontal percent (Stead & Shartle, 1940) and phi coefficient methods (cf. the Lecznar & Dailey, 1950, correlational method) also proved useful in their study, with

validities only slightly lower than those of the vertical percent methods they investigated (Devlin

et al., 1992). The mean criterion method had greater variation in cross-validities across different time spans, though the cross-validities were between .2 and .5. However, there was more shrinkage upon cross-validation for this method than for most others.

Rationale for Empirical Keying of SJTs

Though use of empirical keying for biodata has been criticized as “dust-bowl

empiricism” (Dunnette, 1962; Mumford & Owens, 1987; Owens, 1976), we believe that it may

actually be preferable to SME-based scoring procedures for some SJTs. On the surface, SJTs

seem to be somewhat closely related to both cognitive ability tests and biodata. However, we

contend that SJTs are unique measurement methods (Hanson, Horgen, & Borman, 1998) and

thus have unique properties that make them particularly well suited for empirical keying.

First, we explicitly reject the notion of “correct” and “incorrect” answers for most SJTs.

The notion that SJTs should have correct and incorrect responses likely stems at least in part

from the relationships between SJTs and cognitive ability tests, which generally do have correct

and incorrect answers. For one, research suggests that SJT scores are highly related to scores on tests of general cognitive ability (McDaniel et al., 1997, 2001). In addition, SJTs are used in

ways and contexts that are typical for the use of cognitive ability tests, such as personnel

selection. However, despite these relationships and shared contexts of use, SJT items, unlike


typical academic or cognitive ability test items, are not designed to have a single irrefutable

correct answer. In contrast, SJT items are typically designed to capture the more complex, social

or practical aspects of performance in work situations. McDaniel et al. (1997) suggest that SJTs

are indistinguishable from tests of tacit knowledge. To the extent that this is true, SJTs measure

something different than general cognitive or academic intelligence (Sternberg, Wagner,

Williams, & Horvath, 1995). To capture this type of knowledge, test items pose problems that

are not well defined and may have more than one correct response (Sternberg & Wagner, 1993;

Sternberg, Wagner, & Okagaki, 1993).

Finally, an examination of typical SJT items reveals that there is generally no clear right

or wrong answer. This is actually a desirable feature of SJTs as transparent items would quickly

lead to ceiling effects that would fail to discriminate between high and low performers. Note,

however, that this limitation is not an issue in many biodata scales where items can be based on

external, objective, and verifiable previous life experiences (Mael, 1991).

We believe that all response options for an SJT item vary along a continuum of best to

worst. The exact location of an option on this continuum is difficult to determine and will vary

by item and perhaps also by the job for which the applicant is applying. Some items may be

written with one, clearly best option while others may be written with less distinct response

alternatives. Transparent items lead to ceiling effects while ambiguous items make it

exceedingly difficult for SMEs to achieve consensus about the appropriateness of each option.

When SJT scores are based on a scoring key that was developed by SMEs, an SJT score

represents the extent to which each respondent agrees with the judgments of the SMEs. By

requiring a high degree of consensus among SMEs, researchers can increase the likelihood that


answers will not be too specific to the opinions of the particular group of SMEs. Unfortunately,

however, this procedure also increases the likelihood that correct answers will be the most

transparent options. Thus, more transparent items rather than less transparent ones are likely to

be retained when SMEs are employed to determine the keyed answer. In addition, the option

ultimately determined to be best by the SMEs will depend to some extent upon the unique

perspective of a particular SME group and the group dynamics involved in obtaining consensus.

Deciding between an SME-based key and an empirical key is really a question of who will

serve as the SMEs. When traditional SME scoring is used, an SJT score is an index of

agreement among respondents and SMEs. The extent to which these scores are construct valid is

dependent upon both the validity of the SMEs’ conceptualization of the construct and the validity

of the SMEs’ assessment of the relationship between the response options and the construct. As such, low validity coefficients for SME-scored SJTs could be due to differences in perceptions of the construct between respondents (e.g., job applicants) and SMEs (e.g., a small group of

supervisors); poor SME judgment as to which response option is most indicative of the construct;

or overly transparent “best” answers chosen not only by SMEs, but also by both high and low

performing respondents.

In contrast, when empirical keying is used, the de facto “SMEs” are the high-performing respondents, as identified by the criterion of interest. More specifically, response options that best differentiate between high- and low-performing incumbents are given more weight than other options, even though the latter may in many ways seem to be better responses. Using empirical keying, the most transparent option (and seemingly the best option) may be endorsed by a majority of respondents; however, if high- and low-performing respondents equally endorse


the response option, it will not differentiate between criterion groups and consequently will not,

in effect, be weighted. In contrast, a response option that is not endorsed frequently but is

endorsed much more often by high-performing respondents than low-performing respondents

will be weighted much more heavily with many of the empirical scoring methods. In general,

this will be desirable so long as the number of respondents endorsing the response option is not

so small as to risk severe shrinkage in cross-validation. We should point out, though, that if

a criterion does not fully capture the performance domain, it might be preferable to use SME

judgment to determine the correct answers to SJT items. In such cases, however, attention to

better criterion development would be a pressing concern.

While much of the criticism leveled at empirical keying methods of biodata scoring

concerns the lack of theory behind the choice of predictors (Mumford & Owens, 1987; Owens,

1976), such criticism is not necessarily relevant to SJTs. SJT items are typically written based

on job analysis data and are thus believed to be related to job-relevant behaviors and criteria from their inception. As a result, empirical keying of these items serves merely to define the optimal

relationship between those items and the criterion.

In sum, we believe that there may be some utility in investigating empirical keying as an

alternative for SJT scoring. In this research, we investigated the use of empirical keying as an

alternative to the traditional SME-based scoring procedures for an SJT developed to select recipients of a highly competitive four-year scholarship at a major university.

Method

Participants

Participants were 219 undergraduate students (scholars) from a large university who were


recipients of a highly competitive four-year academic scholarship. Roughly 55% were female

and 45% were male, while approximately 36% were freshmen, 26% sophomores, 22% juniors,

and 16% seniors. Note that while the sample is composed of students, this was not a lab study or

a sample of convenience; the students were the target population for the SJT.

Measures

Phase one of the project involved criterion development, during which the appropriate

behaviors associated with four performance dimensions (Leadership, Scholarship, Service, and

Character) were identified. For example, the Leadership dimension included behaviors such as

knowing when to take a leadership versus a support role, being comfortable in ambiguous

situations, developing cooperative relationships, and handling conflict appropriately. Results for

the Leadership dimension are reported in this study in order to simplify the presentation of

results and because leadership is most readily generalizable to other organizational settings.

Phase two involved the development of the SJT item stems and response options. The

SJT item stems were developed by the program research team using the data gathered in the

criterion development phase. The response options were developed by a group of SMEs

including university faculty and the scholarship program directorate. The items and response

options subsequently underwent additional rigorous reviews and modifications by SMEs and the

research team. The final SJT was composed of three detailed scenarios that describe situations

that scholarship recipients may encounter. Each scenario comprised several multiple-choice items. Respondents were instructed to indicate which of the five response options they

would most likely do and which they would least likely do.

Phase three involved developing the SME based scoring key. SMEs who were both


intimately familiar with the scholarship program and who had advanced training in assessment

methodology determined the most effective answers for each item. For the most part, only

response options with more than 70% agreement among the SMEs were retained as the correct

option. However, in some cases there was less agreement among SMEs, and in these cases

preferential weighting was given to a core group of SMEs (i.e., the program’s director and one

key faculty advisor). In this study, we analyzed responses to the “most likely” questions in order to

simplify analyses and presentation of results.

Performance Criteria. Performance rating content and materials were developed in

phase four based on the data gathered during the criterion development phase. Performance ratings

were made primarily by the scholarship program director. When clarification was needed, a

mentor or other program director was consulted for further information. Two dimensions of

leadership were rated independently: “effectiveness of leadership skills” and “actively seeks a leadership role.” Initial analyses indicated that these two ratings correlated highly (r=.79, p<.01), so they were combined into a single index.

Procedure

SJT scores for the leadership dimension were calculated using the traditional SME

scoring approach and several empirical keying methods shown previously to be of some utility in

either SJT or biodata research. Small calibration sample size is the biggest determinant of

shrinkage in cross-validation (Hough & Paullin, 1994); therefore, two-thirds of the total sample

was randomly assigned to the calibration sample while the remaining one-third was retained for

the cross-validation sample.
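As a concrete illustration of this split, the following minimal sketch (in Python, with illustrative variable names; this is not the study's actual analysis code) assigns two-thirds of the cases to a calibration sample and reserves the remainder for cross-validation. It assumes X is a 0/1 matrix of option endorsements and y a vector of leadership criterion ratings.

import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

def split_calibration_holdout(X, y, calibration_fraction=2 / 3, rng=rng):
    """Randomly assign two-thirds of cases to calibration and one-third to cross-validation."""
    order = rng.permutation(len(y))
    n_cal = int(round(calibration_fraction * len(y)))
    cal, val = order[:n_cal], order[n_cal:]
    return (X[cal], y[cal]), (X[val], y[val])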

More specifically, six empirical keying techniques were investigated. Each technique


yields a numeric value, or weight, for each response option. Each weight was combined with the individual’s score on that option (0 if the option was not selected, 1 if it was), and the weighted options were then used in a regression equation to predict performance on the leadership criterion. The empirical techniques employed are described below.

Vertical and horizontal percent methods. In order to compute weights via the vertical

and horizontal percent methods, the calibration sample was first divided into high and low

performance groups with respect to the criterion. The sample was split into thirds based on criterion scores, and only the lowest and highest thirds of the sample were used for weighting. Vertical percent weights were computed by taking the percentage of persons in the high-performing group choosing each option and subtracting the percentage of persons in the low-performing group choosing that option. Horizontal weights were computed by taking the number of persons in the high-performing group choosing each response option and dividing this number by the total

number of people in the sample choosing that option. We then multiplied this number by ten to

derive the final horizontal weights (see Devlin et al., 1992).
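A minimal sketch of these two weighting schemes appears below (cf. England, 1961; Stead & Shartle, 1940; Devlin et al., 1992). It assumes the same 0/1 endorsement matrix X and criterion vector y as above, and it reads "the sample" in the horizontal percent method as the combined high and low groups; both the names and that reading are assumptions, not the study's exact procedure.

import numpy as np

def extreme_groups(X, y):
    """Endorsements for the top and bottom thirds of the calibration sample on the criterion."""
    lo_cut, hi_cut = np.percentile(y, [100 / 3, 200 / 3])
    return X[y >= hi_cut], X[y <= lo_cut]          # ties can make groups slightly larger than a third

def vertical_percent_weights(X, y):
    """Percent of high performers choosing each option minus percent of low performers."""
    hi, lo = extreme_groups(X, y)
    return 100 * hi.mean(axis=0) - 100 * lo.mean(axis=0)

def horizontal_percent_weights(X, y):
    """Share of an option's endorsers who are high performers, times ten (Devlin et al., 1992)."""
    hi, lo = extreme_groups(X, y)
    endorsers = hi.sum(axis=0) + lo.sum(axis=0)    # assumes "the sample" means the two extreme groups
    with np.errstate(divide="ignore", invalid="ignore"):
        w = 10 * hi.sum(axis=0) / endorsers
    return np.where(endorsers > 0, w, 0.0)         # options no one chose receive zero weight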

Correlational Methods 1 and 2. The dichotomously scored response options were

correlated with the leadership performance criterion. The resulting zero-order correlation was treated as the weight. For this study, we chose two alpha levels for retaining response options as predictors. In Correlational Method 1, we used the α=.25 level, which corresponded to zero-order correlations of roughly r=.10 in magnitude. For Correlational Method 2, we keyed only item

responses significant at the α=.10 level (roughly r=.14).
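The following sketch shows one way the correlational key could be computed (cf. Lecznar & Dailey, 1950): each dichotomously scored option is correlated with the criterion, and the correlation itself serves as the weight when it is significant at the chosen alpha level. The function name, the data layout, and the SciPy dependency are assumptions for illustration.

import numpy as np
from scipy import stats

def correlational_weights(X, y, alpha=0.10):
    """Zero-order correlation of each option with the criterion, kept only if p < alpha."""
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        if X[:, j].std() == 0:                     # option chosen by everyone or no one: skip
            continue
        r, p = stats.pearsonr(X[:, j], y)          # point-biserial, since the option is scored 0/1
        if p < alpha:
            weights[j] = r                         # the correlation itself serves as the weight
    return weights

Under these assumptions, alpha=0.25 would correspond roughly to Correlational Method 1 and alpha=0.10 to Correlational Method 2.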

Mean criterion method. In order to generate the empirical scoring key for the mean

criterion method, we computed mean criterion performance scores associated with each response


option. These mean scores were then used as the empirical weights in computing predictor

scores for persons choosing each response option.
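In code, the mean criterion key reduces to a single pass over the options, as in the sketch below (same assumed X and y as above; illustrative names only).

import numpy as np

def mean_criterion_weights(X, y):
    """Each option's weight is the mean criterion score of the respondents who chose it."""
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        chose = X[:, j] == 1
        if chose.any():
            weights[j] = y[chose].mean()
    return weights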

Unit weighting method. With the unit weighting method, response options associated

with the highest mean criterion scores were assigned a value of 1.0 while other responses were

assigned a value of 0. However, options associated with the highest mean criterion scores were subject to the restriction that at least 10% of the sample must have chosen the option, in order to reduce the risk of significant results arising by chance alone (see Weekley & Jones, 1999).
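The sketch below implements one reading of this rule: within each item, the option whose endorsers have the highest mean criterion score is keyed 1, provided at least 10% of the sample chose it, and all other options are keyed 0. The per-item reading, the item_of_option mapping, and the names are assumptions for illustration.

import numpy as np

def unit_weights(X, y, item_of_option, min_endorsement=0.10):
    """Key 1.0 for the best option within each item, 0 elsewhere, with a 10% endorsement floor."""
    n_options = X.shape[1]
    option_means = np.full(n_options, -np.inf)
    for j in range(n_options):
        chose = X[:, j] == 1
        if chose.mean() >= min_endorsement:        # the 10% restriction described above
            option_means[j] = y[chose].mean()
    weights = np.zeros(n_options)
    for item in np.unique(item_of_option):
        cols = np.where(item_of_option == item)[0]
        if np.isfinite(option_means[cols]).any():
            best = cols[np.argmax(option_means[cols])]
            weights[best] = 1.0                    # option with the highest mean criterion score
    return weights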

Each of the empirical keying techniques results in a numeric value for each response option. This value was used as the beta weight in a regression equation that sought to predict performance on the leadership criterion using the weighted response options as predictors.
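To tie the pieces together, the sketch below shows how any of the keys above could be turned into predictor scores and checked against the criterion in both samples, in the spirit of the correlations reported in Table 2. It assumes the helper functions and the calibration/cross-validation split from the earlier sketches; all names are illustrative and this is not the study's actual analysis code.

import numpy as np
from scipy import stats

def key_score(X, weights):
    """Composite predictor: weighted sum of the options each respondent endorsed."""
    return X @ weights

def calibrate_and_cross_validate(weight_fn, X_cal, y_cal, X_val, y_val, **kwargs):
    """Derive a key on the calibration sample only, then correlate scores with the criterion in both samples."""
    weights = weight_fn(X_cal, y_cal, **kwargs)
    r_cal, _ = stats.pearsonr(key_score(X_cal, weights), y_cal)
    r_val, _ = stats.pearsonr(key_score(X_val, weights), y_val)
    return r_cal, r_val                            # the drop from r_cal to r_val is the shrinkage

For example, calibrate_and_cross_validate(correlational_weights, X_cal, y_cal, X_val, y_val, alpha=0.10) would produce calibration and cross-validation correlations of the kind reported for Correlational Method 2, under these assumptions.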

Results

Descriptive statistics for SME and empirical keying methods of scoring as well as the

criterion measure of performance are presented in Table 1. Table 2 contains correlations

between the predictor scores and criterion ratings.

As can be seen in Table 2, the SME-based scoring of the leadership dimension left much

to be desired. The SME-based leadership scores had only a marginally significant relationship

with performance for the calibration sample. However, the SME-based predictor was not

significantly related to performance for the cross-validation sample. The results of the empirical

keying approaches were decidedly mixed. Though all empirical keying approaches had large, significant correlations with performance in the calibration sample, only the correlational methods were significantly related to performance in the cross-validation sample.

Discussion


In this study, we found that the predictive validity of an SJT could be improved by utilizing

several types of empirical keying procedures. In addition, we have detailed several theoretical

reasons why empirical keying may be preferable to SME scoring for some SJTs. However, we

also found that empirical keying is not a panacea for all that ails a predictor. Instead, we found many techniques shown to be predictive of performance in biodata contexts to be of little use for our SJT measure. Our study also illustrates why the most pervasive problem with empirical scoring procedures, a general failure to cross-validate, is so serious: validities in our sample shrank considerably

between calibration and cross-validation despite our best efforts to split the sample so that the

majority of the data was used to derive stable empirical keying weights.

Though previous authors have discussed some advantages of the correlational method

(Lecznar & Dailey, 1950; Weekley & Jones, 1999), we were somewhat surprised by the clearly

superior behavior of this type of empirical keying in our study. Perhaps this is because fewer (but higher-quality) predictors were used with the correlational method. The more selective of the two correlational methods (Correlational Method 2) enjoyed considerably higher cross-validities than did the less restrictive of the two. This is to be expected to some extent. However, because item responses with weaker relationships to the criterion also receive smaller weights, the size of the difference was still somewhat surprising.

Though the empirical keying approaches examined in this study often

fared no better than the SME approach, we stress some of the positive aspects of empirical

keying. First, empirical keying can serve as a validity check on SME ratings of the “correct”

response option. If the response option chosen by SMEs as the correct response does not

distinguish between high and low performers with respect to the criterion measure, then perhaps


that option is not correct after all. The unit weighting scoring procedure used in this study

exemplified this function. Researchers who reject the use of external keying procedures on

philosophical grounds may still derive benefit from their use as a validity check and as part of the SJT development process. Requiring 75% agreement among SMEs to keep an item is a stringent but common criterion (Legree, 1994; Lievens, 2000). Using the unit weighting approach in combination with the SME approach may allow researchers to relax the agreement criterion slightly, given the empirical information offered by the unit weighting empirical keying approach.

Second, we believe that the empirical keying approaches significantly improve upon

the SME scoring approach because they counter a number of its weaknesses and introduce

important information into the scoring process. Empirical keying approaches inherently reject

the notion of right and wrong answers to SJT items. That is, (most) empirical keying approaches

award “partial credit” of sorts to a person choosing a response option that differentiates between higher and lower performance on the criterion. For example, two options that correlate strongly with the criterion would both be weighted heavily, rather than just one as in traditional correct/incorrect

scoring. Conversely, negative weighting penalizes choices associated with poorer performance.

Also, practitioners not terribly comfortable with pure empirical keying could

consider using a hybrid approach in which only items written to measure specific competencies

are used to predict criteria deemed relevant. With this approach, practitioners can maintain a theoretical link between competencies and criteria because only items written to measure those competencies, rather than all SJT items, are used as predictors. This approach is not purely

empirical but instead is more akin to the family of approaches in biodata research known as

construct-based rational scoring (Hough & Paullin, 1994). With this type of scoring, the exact


nature of the relationship between the items and the construct (i.e., the scoring) is determined

empirically, though the theoretical link between predictor and criteria remains.

As with any research, there are some potential limitations associated with this study. One

limitation is the use of a student sample and an SJT designed for use with this sample. However,

note that the SJT, while designed for use with a student sample, was rigorously developed. In

addition, we attempted to choose the construct most relevant to organizations in our

investigation. Though it is somewhat unlikely that an organization would hire an employee

based on an SJT designed to measure leadership, it is entirely possible that such an SJT might be

used as one factor in promotion decisions or for personal/career development purposes. Also,

although a student sample was used in this study, these students are among the best in the

country, with remarkable standardized test scores, clear leadership in extracurricular activities, and great promise, and indeed expectations, for future leadership positions.

A further limitation of the study was the relatively small sample used to derive the

external weights. When deriving weights that reflect the relationship between the item response

options and the criterion for the population as a whole, the larger the sample used to derive these

weights, the better (Hough & Paullin, 1994). Also, the SJT contained a relatively small number of

items. This was a function of both pragmatic concerns over test length and other design

considerations outside the control of the researchers. In general, large sample sizes and a large

number of items will lead to the best and most stable prediction of performance.

Another limitation was the low initial validity coefficient associated with the SME-based

approach. These low initial coefficients do not set a very high bar for the empirical keying approaches to clear.


Despite these limitations, we feel that this study provides promising results. By combining well-developed, content-valid items with externally derived empirical scoring for the item response options, an optimal balance can be struck for scoring an SJT.


References

Chan, D. & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in

situational judgment tests: Subgroup differences in test performance and face validity

perceptions. Journal of Applied Psychology, 82, 143-159.

Clevenger, J., Pereira, G. M., Wiechmann, D., Schmitt, N., & Harvey, V. S. (2001). Incremental

validity of situational judgment tests. Journal of Applied Psychology, 86, 410-417.

Dalessio, A. T. (1994). Predicting insurance agent turnover using a video-based situational

judgment test. Journal of Business and Psychology, 9, 23-32.

Devlin, S. E., Abrahams, N. M., & Edwards, J. E. (1992). Empirical keying of biographical data:

Cross-validity as a function of scaling procedure and sample size. Military Psychology, 4,

119-136.

Dunnette, M. D. (1962). Personnel management. Annual Review of Psychology, 13, 285-314.

England, G. W. (1961). Development and Use of Weighted Application Blanks. Dubuque, IA:

Brown.

Hanson, M. A., Horgen, K. E., & Borman, W. C. (1998). Situational judgment tests as measures

of knowledge/expertise. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, Dallas, TX.

Hanson, M. A., & Ramos, R. A. (1996). Situational judgment tests. In R. S. Barrett (Ed.), Fair employment strategies in human resource management (pp. 119-124). Westport, CT: Quorum Books/Greenwood Publishing Group, Inc.

Hogan, J. B. (1994). Empirical keying of background data measures. In G. S. Stokes & M. D.

Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical


information in selection and performance prediction (pp. 69-107). Palo Alto, CA: CPP

Books.

Hough, L., & Paullin, C. (1994). Construct-oriented scale construction: The rational approach. In

G. S. Stokes & M. D. Mumford (Eds.), Biodata handbook: Theory, research, and use of

biographical information in selection and performance prediction (pp. 109-145). Palo

Alto, CA: CPP Books.

Lecznar, W. B., & Dailey, J. T. (1950). Keying biographical inventories in classification test

batteries. American Psychologist, 5, 279.

Legree, P. J. (1994). The effect of response format on reliability estimates for tacit knowledge

scales (No. ARI Research Note 94-25). Alexandria, VA: U.S. Army Research Institute

for the Behavioral and Social Sciences.

Lievens, F. (2000). Development of an empirical scoring scheme for situational inventories.

European Review of Applied Psychology/Revue Europeenne de Psychologie Appliquee,

50, 117-125.

Mael, F. A. (1991). A conceptual rationale for the domain and attributes of biodata items.

Personnel Psychology, 44, 763-792.

McDaniel, M. A., Finnegan, E. B., Morgeson, F. P., Campion, M. A., & Braverman, E. P.

(1997). Predicting job performance from common sense. Paper presented at the 12th annual meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.

McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P.

(2001). Use of situational judgment tests to predict job performance: A clarification of

the literature. Journal of Applied Psychology, 86, 730-740.


McDaniel, M. A., & Nguyen, N. T. (2001). Situational judgment tests: A review of practice and

constructs assessed. International Journal of Selection and Assessment, 9, 103-113.

Mead, A. D., & Drasgow, F. (2003). Examination of a resampling procedure for empirical

keying. Paper presented at the 18th Annual Meeting of the Society for Industrial and

Organizational Psychology, Orlando, FL.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure:

The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

Motowidlo, S. J., & Tippins, N. (1993). Further studies of the low-fidelity simulation in the form

of a situational inventory. Journal of Occupational and Organizational Psychology, 66,

337-344.

Mumford, M. D., & Owens, W. A. (1987). Methodology review: Principles, procedures, and

findings in the application of background data measures. Applied Psychological

Measurement, 11, 1-31.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York: McGraw-

Hill, Inc.

Owens, W. A. (1976). Background data. In M. D. Dunnette (Ed.), Handbook of Industrial and

Organizational Psychology (1st ed., pp. 609-644). Chicago: Rand McNally.

Phillips, J. F. (1992). Predicting sales skills. Journal of Business and Psychology, 7, 151-160.

Phillips, J. F. (1993). Predicting negotiation skills. Journal of Business and Psychology, 7, 403-

411.

Russell, C. J., & Klein, S. R. (2003). Toward optimization and insight: Bootstrapping a

situational judgment empirical key. Paper presented at the 18th Annual Meeting of the


Society for Industrial and Organizational Psychology, Orlando, FL.

Smiderle, D., Perry, B. A., & Cronshaw, S. F. (1994). Evaluation of video-based assessment in

transit operator selection. Journal of Business and Psychology, 9, 3-22.

Stead, N. H., & Shartle, C. L. (1940). Occupational counseling techniques. New York: American

Book.

Sternberg, R. J., & Wagner, R. K. (1993). The g-ocentric view of intelligence and job

performance is wrong. Current Directions in Psychological Science, 2, 1-5.

Sternberg, R. J., Wagner, R. K., & Okagaki, L. (1993). Practical intelligence: The nature and role

of tacit knowledge in work and at school. In J. M. Puckett (Ed.), Mechanisms of everyday cognition (pp. 205-227). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common

sense. American Psychologist, 50, 912-927.

Stevens, M. J., & Campion, M. A. (1999). Staffing work teams: Development and validation of a

selection test for teamwork settings. Journal of Management, 25, 207-228.

Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50,

25-49.

Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology,

52, 679-700.


Table 1

Descriptive Statistics for Predictors and Leadership Criteria Performance Ratings

                                  Calibration Sample     Cross-Validation Sample
                                        (N=144)                  (N=75)
Variable                           Mean    Std. Dev.       Mean    Std. Dev.
SME Method                         4.28      1.39          3.99      1.24
Correlational Method 1             0.20      0.37          0.14      0.30
Correlational Method 2             0.26      0.27          0.22      0.22
Vertical %                       -42.61     58.55         53.79     36.67
Horizontal %                      41.06      4.87        128.33      6.65
Mean Criterion                    92.72      1.03         93.31      1.46
Unit Weighting                    16.49      4.30         16.03      4.66
Leadership Performance Rating      4.42      0.93          4.58      0.98

Note: Correlational Method 1 used predictors significant at the p<.25 level. Correlational Method 2 used predictors significant at the p<.10 level.


Table 2

Correlations between Predictors and Leadership Criteria Performance Ratings

Predictor                        Calibration Sample (N=144)    Cross-Validation Sample (N=75)
SME Method                                .15*                           -.15
Correlational Method 1                    .52**                           .21*
Correlational Method 2                    .49**                           .28**
Vertical %                                .42**                           .06
Horizontal %                              .51**                           .12
Mean Criterion                            .61**                           .07
Unit Weighting                            .43**                           .05

Note: *p<.10. **p<.05. Correlational Method 1 used predictors significant at the p<.25 level. Correlational Method 2 used predictors significant at the p<.10 level.