
Psychological Bulletin, 1984, Vol. 96, No. 1, 72-98

Copyright 1984 by the American Psychological Association, Inc.

Validity and Utility of Alternative Predictors of Job Performance

John E. Hunter
Michigan State University

Ronda F. Hunter
College of Education, Michigan State University

Meta-analysis of the cumulative research on various predictors of job performance shows that for entry-level jobs there is no predictor with validity equal to that of ability, which has a mean validity of .53. For selection on the basis of current job performance, the work sample test, with mean validity of .54, is slightly better. For federal entry-level jobs, substitution of an alternative predictor would cost from $3.12 billion (job tryout) to $15.89 billion per year (age). Hiring on ability has a utility of $15.61 billion per year, but affects minority groups adversely. Hiring on ability by quotas would decrease this utility by 5%. A third strategy—using a low cutoff score—would decrease utility by 83%. Using other predictors in conjunction with ability tests might improve validity and reduce adverse impact, but there is as yet no data base for studying this possibility.

A crucial element in the maintenance of high productivity in both government and private industry is the selection of people with high ability for their jobs. For most jobs, the only presently known predictive devices with high validity are cognitive and psychomotor ability tests. Recent work on validity generalization (Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt & Hunter, 1980; Schmidt, Hunter, Pearlman, & Shane, 1979) has shown that most findings of low validity are due to artifacts, mainly statistical error due to small sample size. Similar and broader conclusions follow from reanalyses of past meta-analyses by Lent, Aurbach, and Levin (1971), as noted by Hunter (1980b), and from inspection of unpublished large-sample studies done in the United States Army by Helm, Gibson, and Brogden (1957), in Hunter (1980c), and in Schmidt, Hunter, and Pearlman (1981). High selection validity translates into considerable financial savings for most organizations. Hunter (1979a) estimated that if the Philadelphia, Pennsylvania, Police Department were to drop its use of a cognitive ability test to select entry-level officers, the cost to the city would be more than $170 million over a 10-year period. Schmidt, Hunter, McKenzie, and Muldrow (1979) show that over a 10-year period, with a 50% selection ratio, the federal government could save $376 million by using a cognitive ability test to select computer programmers rather than by selecting at random.1

1 Random selection, though rarely if ever practiced, provides a base line for comparison of utility gains and losses that best demonstrates the potential magnitude of the differences.

Requests for reprints should be sent to John E. Hunter, Department of Psychology, Michigan State University, East Lansing, Michigan 48824.

Hunter and Schmidt (1982b) applied a utility model to the entire national work force. Gains from using cognitive tests rather than selecting at random are not as spectacular at the national level as might be predicted by findings from single organizations. This is because high-ability people who are selected for crucial top-level jobs will not be available for lower-level jobs, where they would bring higher productivity. However, even considering this cancellation of effect, Hunter and Schmidt estimated in 1980 that productivity differences between complete use of cognitive ability tests and no use of the tests would amount to a minimum of $80 billion per year. That is, productivity differences due to use or nonuse of present tests would be, in the current state of the economy, about as great as total corporate profits, or about 20% of the total federal budget.

To replace cognitive ability tests with any instrument of lower validity would incur very high costs, because even minute differences in validity translate into large dollar amounts. These costs would be shared by everyone, regardless of sex or group affiliation, because failure to increase productivity enough to compensate for inflation and to compete in foreign markets affects the entire economy.

Adverse Impact and Test Fairness

Unfortunately, the use of cognitive ability tests presents a serious problem for American society; there are differences in the mean ability scores of different racial and ethnic groups that are large enough to affect selection outcomes. In particular, blacks score, on the average, about one standard deviation lower than whites, not only on tests of verbal ability, but on tests of numerical and spatial ability as well (e.g., see U.S. Employment Service, 1970). Because fewer black applicants achieve high scores on cognitive ability tests, they are more likely to fall below selection cutoff scores than are white applicants. For example, if a test is used to select at a level equivalent to the top half of white applicants, it will select only the top 16% of the black applicants. This difference is adverse impact as defined by the courts.

Fifteen years ago the elimination of adverse impact seemed a straightforward but arduous task: It was a question of merely amending the tests. Testing professionals reasoned that if there were no mean difference between racial groups in actual ability, the mean difference in test scores implied that tests were unfair to black applicants. If tests were unfair to black applicants, then it was only necessary to remove all test content that was culturally biased. Not only would adverse impact then vanish, but also the validity of the test would increase. Moreover, if a test were culturally unfair to blacks, it would probably prove to be culturally unfair to disadvantaged whites as well, and reforming the tests would eliminate that bias also.

However, the empirical evidence of the last 15 years has not supported this reasoning (Schmidt & Hunter, 1981). Evidence showing that single-group validity (i.e., validity for one group but not for others) is an artifact of small sample sizes (Boehm, 1977; Katzell & Dyer, 1977; O'Connor, Wexley, & Alexander, 1975; Schmidt, Berner, & Hunter, 1973) has shown that any test that is valid for one racial group is valid for the other. Evidence showing differential validity (i.e., validity differences between subgroups) to be an artifact of small sample size (Bartlett, Bobko, Mosier, & Hannan, 1978; Hunter, Schmidt, & Hunter, 1979) suggests that validity is actually equal for blacks and whites. Finally, there is an accumulation of evidence directly testing for cultural bias, showing results that are consistently opposite of those predicted by the test bias hypothesis. If test scores for blacks were lower than their true ability scores, then their job performance would be higher than that predicted by their test scores. In fact, however, regression lines for black applicants are either below or equal to the regression lines for white applicants (see review studies cited in Hunter, Schmidt, & Rauschenberger, in press; Schmidt & Hunter, 1980, 1981). This finding holds whether job performance measures are ratings or objective measures of performance, such as production records or job sample measures.

The evidence is clear: The difference in ability test scores is mirrored by a corresponding difference in academic achievement and in performance on the job. Thus, the difference in mean test scores reflects a real difference in mean developed ability. If the difference is the result of poverty and hardship, then it will vanish as poverty and hardship are eliminated. However, because the difference currently represents real differences in ability, construction of better tests will not reduce adverse impact. In fact, better tests, being somewhat more reliable, will have slightly more adverse impact.

Adverse impact can be eliminated from the use of ability tests, but only by sacrificing the principle that hiring should be in order of merit. If we hire solely on the basis of probable job performance, then we must hire on ability from the top down, without regard to racial or ethnic identification. The result will be lower hiring rates for minority applicants. If we decide to moderate adherence to the merit principle so as to apply the principle of racial and ethnic balance, then the method that loses the least productivity in the resulting work force is that of hiring on the basis of ability from the top down within each ethnic group separately. This method produces exact quotas while maximizing the productivity of the workers selected within each ethnic group.

Many organizations are presently using a third method. They use ability tests, but with a very low cutoff score. This method does reduce adverse impact somewhat, but it reduces the labor savings to the organization by far more (Hunter, 1979a, 1981a; Mack, Schmidt, & Hunter, undated). The low-cutoff method is vastly inferior to quotas in terms of either productivity or ethnic balance. This point will be considered in detail in the section on economic analysis.

Adverse Impact Gap and the Promise of Alternative Predictors

As long as there is a true difference in ability between groups, there will be a gap in their relative performance that can never be completely closed. The fact that tests are fair means that differences in mean test score between groups are matched by differences in mean job performance. However, differences in test score are not matched by equivalent differences in job performance, because the differences are affected by the validity of the test. Figure 1 illustrates this point: If the predicted criterion score (e.g., performance rating) for any test score is the same for all applicants, and if test validity is .50 (the validity for most jobs after correction for restriction in range and unreliability of criterion measures), then a difference of one standard deviation on the mean test score corresponds to half a standard deviation's difference on performance.

If the test score cutoff is set to select the top half of the white applicants, 50% will perform above the mean on the criterion. At the same cutoff, only 16% of the black applicants will be selected, although 31% would perform at or above the mean for white applicants. Improving the test cannot entirely overcome this difference. If the test is improved so that there is half a standard deviation difference on both test and criterion, 50% of whites as against 31% of blacks will be selected. (This is the true meaning of Thorndike's, 1971, claim that a test that is fair to individuals might be unfair to groups.)
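These percentages follow directly from the normal curve. The short sketch below, which is not part of the original article, reproduces them under the stated assumptions: a one standard deviation group difference in test scores, a test validity of .50, and a cutoff set at the white mean.

```python
from scipy.stats import norm

d_test = 1.0        # black-white difference in mean test score, in SD units
validity = 0.50     # assumed operational validity of the test
d_criterion = validity * d_test  # implied difference in mean job performance

# A cutoff at the white mean selects 50% of whites...
white_selected = 1 - norm.cdf(0.0)
# ...but only the top ~16% of blacks, whose mean is 1 SD lower.
black_selected = 1 - norm.cdf(0.0, loc=-d_test)

# Share of blacks who would perform at or above the white criterion mean.
black_above_white_mean = 1 - norm.cdf(0.0, loc=-d_criterion)

print(f"whites selected: {white_selected:.0%}")                                  # 50%
print(f"blacks selected: {black_selected:.0%}")                                  # about 16%
print(f"blacks at or above white mean performance: {black_above_white_mean:.0%}")  # about 31%
```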

Although the gap cannot be entirely closed, it can be narrowed by increasing the validity of the selection process. If we could find predictors of determinants of performance other than ability to add to the prediction supplied by ability tests, then we could simultaneously increase validity and decrease adverse impact. That is, the most likely successful approach to reduced adverse impact is not through the discovery of substitutes for ability tests—there is no known test that approaches cognitive tests in terms of validity for most jobs—but through the discovery of how best to use alternative measures in conjunction with cognitive ability tests.

Distinguishing Method From Content

In considering the use of alternatives, especially their use in conjunction with ability tests, it is helpful to distinguish the means of measurement (method) from the specification of what is to be measured (content). Most contemporary discussion of alternative predictors of job performance is confused by the failure to make this distinction. Much contemporary work is oriented toward using ability to predict performance but using something other than a test to assess ability. The presumption is that an alternative measure of ability might find less adverse impact. The existing cumulative literature on test fairness shows this to be a false hope. The much older and larger literature on the measurement of abilities also suggested that this would be a false hope. All large-sample studies through the years have shown that paper-and-pencil tests are excellent measures of abilities and that other kinds of tests are usually more expensive and less valid. Perhaps an investigation of characteristics other than ability would break new ground, if content were to be considered apart from method.

An example of confusion between content and method is shown in a list of alternative predictors in an otherwise excellent recent review published by the Office of Personnel Management (Personnel Research and Development Center, 1981). This review lists predictors such as reference checks, self-assessments, and interviews as if they were mutually exclusive. The same is true for work samples, assessment centers, and the behavioral consistency approach of Schmidt, Caplan, et al. (1979). However, reference checks often ask questions about ability, as tests do; about social skills, as assessment centers do; or about past performance, as experience ratings do. Self-assessments, interviews, and tests may all be used to measure the same characteristics. A biographical application blank may seek to assess job knowledge by asking about credentials, or it may seek to assess social skill by asking about elective offices held.

What characteristics could be assessed that might be relevant to the prediction of future job performance? The list includes: past performance on related jobs; job knowledge; psychomotor skills; cognitive abilities; social skills; job-related attitudes such as need for achievement (Hannan, 1979), locus of control (Hannan, 1979), or bureaucratic value orientation (Gordon, 1973); emotional traits such as resistance to fear or stress or to enthusiasm. Such characteristics form the content to be measured in predicting performance.

Figure 1. This shows that a fair test may still have an adverse impact on the minority group. (From "Concepts of Culture Fairness" by R. L. Thorndike, 1971, Journal of Educational Measurement, 8, pp. 63-70. Copyright 1971 by the National Council on Measurement in Education. Reprinted by permission.)

What are existing measurement methods? They come in two categories: methods entailing observation of behavior, and methods entailing expert judgment. Behavior is observed by setting up a situation in which the applicant's response can be observed and measured. The observation is valid to the extent that the count or measurement is related to the characteristic. Observation of behavior includes most tests, that is, situations in which the person is told to act in such a way as to produce the most desirable outcome. The response may be assessed for correctness, speed, accuracy, or quality. Behavior may also be observed by asking a person to express a choice or preference, or to make a judgment about some person, act, or event. A person's reports about memories or emotional states can also be considered as observations, although their validity may be more difficult to demonstrate. If judgment is to be used as the measurement mode, then the key question is "Who is to judge?" There are three categories of judges: self-assessments, judgments by knowledgeable others such as supervisor, parent, or peer, and judgments by strangers such as interviewers or assessment center panels. Each has its advantages and disadvantages. The data base for studies of judgment is largest for self-judgments, second largest for judgments by knowledgeable others, and smallest for judgments by strangers. However, the likelihood of distortion and bias in such studies is in the same order.

Figure 2 presents a scheme for the classification of potential predictors of job performance. Any of the measurement procedures along the side of the diagram can be used to assess any of the characteristics along the top. Of course the validity of the measurement may vary from one characteristic to another, and the relevance of the characteristics to particular jobs will vary as well.

Figure 2. Measurement content and method: A two-dimensional structure in which various procedures are applied to the assessment of various aspects of thought and behavior. (Methods, along the side, include observation, either informal observation on the job or controlled measurement by tests scored for speed, accuracy, quality, adaptiveness and innovation, or choice, as well as inspection and appraisal, incidence counts, and coding of company records; and judgment of traits or of relevant experience by self, informed others, or strangers. Content, along the top, includes performance, knowledge, ability, psychomotor skill, social skill, attitude, and emotional traits.)

Meta-Analysis and Validity Generalization

In the present study of alternative predictors of job performance, the results of hundreds of studies were analyzed using methods called validity generalization in the area of personnel selection and called meta-analysis in education or by people using the particular method recommended by Glass (1976). The simplest form of meta-analysis is to average correlations across studies, and this was our primary analytic tool. However, we have also used the more advanced formulas from personnel selection (Hunter, Schmidt, & Jackson, 1982), because they correct the variance across studies for sampling error. Where possible we have corrected for the effects of error of measurement and range restriction. Because validity generalization formulas are not well known outside the area of personnel selection, we review the basis for such corrections in this section.

The typical method of analyzing a large number of studies that bear on the same issue is the narrative review. The narrative review has serious drawbacks, however (Glass, McGaw, & Smith, 1981; Hunter, Schmidt, & Jackson, 1982; Jackson, 1980). Much better results can be obtained by quantitative analysis of results across studies. Glass (1976) coined the word meta-analysis for such quantitative analysis. A variety of meta-analysis methods have been devised, including the counting of positive and negative results, the counting of significance tests, and the computing of means and variances of results across studies. The strengths and weaknesses of the various methods are reviewed in Hunter, Schmidt, and Jackson (1982).

The most common method of meta-analysis is that which was proposed by Glass (1976) and presented in detail in Glass, McGaw, and Smith (1981). The steps in this method are:

1. Gather as many studies on the issue as possible.

2. Code the data so that the outcome measure in each study is assessed by the same statistic.

3. Assess the general tendency in the findings by averaging the outcome statistic across studies.

4. Analyze the variance across studies by breaking the studies down into relevant subgroups and comparing the means of the subgroups.

In the area of personnel selection, the Glass (1976) method has been in use for at least 50 years (Dunnette, 1972; Ghiselli, 1966, 1973; Thorndike, 1933). Ghiselli gathered hundreds of validation studies on all methods of predicting job performance. Each validation study was coded to yield a validity coefficient, that is, a Pearson correlation coefficient between the predictor of job performance (such as an ability test score) and the measure of job performance (such as a rating of performance by the worker's immediate supervisor). Ghiselli then averaged validity coefficients across studies (although he used the median instead of the mean). He analyzed the variance of validity across studies by breaking the studies down into families of jobs and comparing average validity across job families.

The Glass (1976) method has one main deficiency. If the variance across studies is taken at face value, then the problem of sampling error is neither recognized nor dealt with adequately (Hedges, 1983; Hunter, 1979b; Hunter, Schmidt, & Jackson, 1982; Schmidt & Hunter, 1977). Schmidt and Hunter (1977) showed that ignoring sampling error leads to disastrous results in the area of personnel selection. Because he ignored the effect of sampling error on the variance of findings across studies, Ghiselli (1966, 1973) concluded that tests are only valid on a sporadic basis, that validity varies from one setting to another because of subtle differences in job requirements that have not yet been discovered. He therefore concluded that no predictor could be used for selection in any given setting without justifying its use with a validation study conducted in that exact setting.

The following example shows how different conclusions may be drawn when sampling error is considered. In the present study we found that the average population correlation for the interview as a predictor of supervisor ratings of performance is .11 (.14 if corrected for error of measurement in ratings). For simplicity, let us suppose that the population correlation for the interview in every setting is .11. Let us also suppose that all studies on the interview were done with a sample size of 68 (Lent, Aurbach, & Levin, 1971, found that the average sample size across more than 1,500 studies was 68). Then the average observed correlation across studies would be close to .11 (depending on the total sample size across studies), but the standard deviation would be far from zero; the observed standard deviation would be .12. The correlations would be spread over a range from -.25 to +.47. A little over 16% of the correlations would be negative, and a little over 16% would be .20 or greater.
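A small simulation, not part of the original article, illustrates the size of this sampling error. It draws many hypothetical studies of N = 68 from a population in which the true correlation is .11 and reports the spread of the observed validity coefficients; the exact percentages will vary slightly from run to run and from the rounded figures in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n, n_studies = 0.11, 68, 20_000

# Population covariance matrix for a bivariate normal with correlation rho.
cov = [[1.0, rho], [rho, 1.0]]

observed_r = np.empty(n_studies)
for k in range(n_studies):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    observed_r[k] = np.corrcoef(x, y)[0, 1]

print(f"mean observed r:  {observed_r.mean():.3f}")    # close to .11
print(f"SD of observed r: {observed_r.std():.3f}")     # about .12
print(f"share negative:   {np.mean(observed_r < 0):.1%}")
print(f"share >= .20:     {np.mean(observed_r >= 0.20):.1%}")
```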

The correct conclusion from this example is, "The validity of the interview is .11 in all settings." However, if the variance of results across studies is taken at face value, as it is in Glass's (1976) method, then the conclusion is, "The validity of the interview varies enormously from one setting to the next. In about one sixth of settings, however, the validity reaches a reasonable level of .20 or better." If the number of studies is small, then some feature of the studies would by chance be correlated with the variation in observed correlations, and the reviewer would say something like, "Studies done with female interviewers were more likely to produce useful levels of validity." If the number of studies is large, then no study feature would be correlated with outcome and the conclusion would probably read, "The validity of the interview varies according to some subtle feature of the organization which is not yet known. Further research is needed to find out what these moderator variables are."

Use of the significance test in this context adds to the confusion. The correlation would be significant in only 14% of the studies (that is, the significance test would be wrong in 86% of the studies, reflecting the usual low level of power in psychological studies in which the null hypothesis is false), leading to the false conclusion, "The interview was found to be valid in only 14% of studies—9% greater than chance—and thus cannot be used except in that 9% of settings in which a local validation study has shown it to be valid."
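The 14% figure can be checked with a standard power calculation for a true correlation of .11 tested at the .05 level with N = 68. The sketch below is ours, not the article's, and uses the Fisher z approximation.

```python
from math import atanh, sqrt
from scipy.stats import norm

rho, n, alpha = 0.11, 68, 0.05

# Fisher z approximation: atanh(r) is roughly normal with SD 1/sqrt(n - 3).
se = 1.0 / sqrt(n - 3)
z_rho = atanh(rho)
z_crit = norm.ppf(1 - alpha / 2) * se   # two-tailed critical value on the z scale

# Probability that an observed correlation falls beyond the critical value.
power = (1 - norm.cdf(z_crit, loc=z_rho, scale=se)) + norm.cdf(-z_crit, loc=z_rho, scale=se)
print(f"power: {power:.1%}")   # roughly 14%
```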

It is possible to correct the variance across studies to eliminate the effect of sampling error. Schmidt and Hunter (1977) devised a formula for doing so under the rubric of validity generalization, though that first formula had an error that was corrected in Schmidt, Hunter, Pearlman, and Shane (1979). Schmidt and Hunter and others (Callender & Osburn, 1980; Hedges, 1983; Raju & Burke, in press) have developed a variety of formulas since then. The historical basis for these formulas is reviewed in Hunter, Schmidt, and Pearlman (1982), and the mathematical basis for comparing them is given in Hunter, Schmidt, and Jackson (1982). In the context of personnel selection, the formulas all give essentially identical results.

Schmidt and Hunter (1977) noted that study outcomes are distorted by a number of artifacts other than sampling error. They presented meta-analysis formulas that correct for error of measurement and range restriction as well as sampling error. The comparable formulas of Hunter, Schmidt, and Jackson (1982) were used when applicable in this study. However, there are artifacts that are not controlled by even these formulas, the most important of which are bad data, computational errors, transcriptional errors, and typographical errors. Thus, even the corrected standard deviations reported in the present study are overestimates of the actual variation in validity across jobs.
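The core of the procedure can be sketched in a few lines. The function below was written for this transcript rather than taken from the article; it implements a bare-bones meta-analysis in the Hunter-Schmidt style: a sample-size-weighted mean correlation, the observed variance across studies, and the variance expected from sampling error alone, whose difference estimates the true variation in validity. The sampling-error term uses one common approximation based on the mean correlation, and the input data are hypothetical.

```python
def bare_bones_meta_analysis(rs, ns):
    """Sample-size-weighted mean r, observed variance, and the residual SD
    left after subtracting the variance expected from sampling error."""
    total_n = sum(ns)
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling-error variance, averaged over studies with weights n.
    var_error = sum(n * (1 - mean_r ** 2) ** 2 / (n - 1) for n in ns) / total_n
    var_residual = max(var_obs - var_error, 0.0)
    return mean_r, var_obs, var_error, var_residual ** 0.5

# Hypothetical validity coefficients and sample sizes, for illustration only.
rs = [0.02, 0.25, 0.11, -0.05, 0.18, 0.09]
ns = [68, 54, 112, 40, 75, 90]
mean_r, var_obs, var_err, sd_residual = bare_bones_meta_analysis(rs, ns)
print(f"mean r = {mean_r:.3f}, observed var = {var_obs:.4f}, "
      f"sampling-error var = {var_err:.4f}, residual SD = {sd_residual:.3f}")
```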

Productivity Gains and Losses

A further advantage of meta-analysis is that the validity information from the meta-analysis can be combined with estimated costs to form a utility analysis for the use of alternative predictors in place of written tests. Dollar impact figures can be generated in a form appropriate for use by individual firms or organizations. The cost of testing is a significant part of this impact. Because the cost of paper-and-pencil tests is negligible in comparison with the gains to be obtained from a selection process, this factor has usually been ignored in the literature. Cost is a more significant factor, however, for assessment centers, miniature training and evaluation, and probationary periods; taking cost into consideration adds to the accuracy of utility estimates for all predictors.
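For readers unfamiliar with such utility analyses, the sketch below shows the standard Brogden-Cronbach-Gleser utility formula that underlies dollar estimates of this kind; the article itself does not give the formula in this section, and every number in the example is hypothetical, chosen only to show the mechanics. The gain equals the number hired, times tenure, times validity, times the dollar standard deviation of performance, times the mean predictor standard score of those selected, minus testing costs.

```python
def selection_utility(n_hired, tenure_years, validity, sd_y_dollars,
                      mean_z_selected, n_applicants, cost_per_applicant):
    """Brogden-Cronbach-Gleser utility gain, in dollars, relative to random selection."""
    gain = n_hired * tenure_years * validity * sd_y_dollars * mean_z_selected
    cost = n_applicants * cost_per_applicant
    return gain - cost

# Hypothetical figures for illustration only (not the article's federal estimates).
utility = selection_utility(
    n_hired=100,            # people hired per year
    tenure_years=10,        # average tenure of those hired
    validity=0.53,          # operational validity of the predictor
    sd_y_dollars=10_000,    # SD of yearly performance in dollars
    mean_z_selected=0.80,   # mean predictor z of hires under top-down selection at a 50% ratio
    n_applicants=200,
    cost_per_applicant=10,  # cost of administering the predictor
)
print(f"estimated utility gain: ${utility:,.0f}")
```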

Design of This Study

This study was begun with the intent of using meta-analyses to study the cumulative research on alternative predictors in four ways: validity of the alternative predictor considered alone and in conjunction with ability tests, and adverse impact of the predictor considered alone and in conjunction with ability tests. However, we could discover a data base for only one of the four: validity of the predictor considered alone. Few studies consider more than one predictor on the same data set. Furthermore, most of the studies of multiple predictors do not present the correlations between the predictors, so that multiple regression cannot be done. The lack of data on adverse impact has to do with the age of the studies. Because early studies found low validity for alternative predictors, research on alternative predictors tended to die out by the time that analysis by racial group began to become routine. The sparse literature that does exist is summarized in Reilly and Chao (1982).

In all the new meta-analyses conducted for this study, it was possible to correct the variance of validity for sampling error. Corrections for reliability and range restriction were made when the necessary data were available, as noted below. Correction for error of measurement in job performance was done using the upper bound interrater reliability of .60 for ratings (King, Hunter, & Schmidt, 1980) and a reliability of .80 for training success as measured by tests. Except for ability, correlations are not corrected for restriction in range. Empirical values for range restriction are known only for ability tests, and most of the alternative predictors considered here are relatively uncorrelated with cognitive ability. Also, when Hunter (1980c) analyzed the United States Employment Service data base, he found incumbent-to-applicant standard deviation ratios of .67, .82, and .90 for cognitive, perceptual, and psychomotor ability, respectively. The figure of .90 for psychomotor ability suggests that restriction in range on alternative predictors stems largely from indirect restriction due to selection using ability.
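The correction for criterion unreliability mentioned here is the usual correction for attenuation: the observed validity is divided by the square root of the criterion reliability. A minimal sketch, not from the article, applied to the interview figures quoted elsewhere in the text (.11 observed against supervisor ratings with reliability .60):

```python
from math import sqrt

def correct_for_criterion_unreliability(r_observed, reliability_criterion):
    """Classical correction for attenuation on the criterion side."""
    return r_observed / sqrt(reliability_criterion)

# Observed interview validity of .11 against ratings with reliability .60.
print(round(correct_for_criterion_unreliability(0.11, 0.60), 2))  # 0.14
```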

There are two predictors for which restriction in range may be a problem: the interview and ratings of training and experience. For example, some of the studies of the interview might have been done in firms using the interview for selection, in which case restriction in range was present. The key finding for the interview is a mean validity of .14 (in predicting supervisor ratings of performance, corrected for error of measurement). If the restriction in range is as high as that for cognitive ability, then the true mean validity of the interview is .21. The observed average correlation for ratings of training and experience is .13. If the extent of restriction in range matches that for cognitive ability, then the correct correlation would be .19.
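These figures can be reproduced with the standard univariate (Thorndike Case II) correction for range restriction, using the incumbent-to-applicant standard deviation ratio of .67 reported above for cognitive ability. The sketch is ours, not the article's.

```python
from math import sqrt

def correct_for_range_restriction(r_restricted, u):
    """Thorndike Case II correction; u = restricted SD / unrestricted SD."""
    return (r_restricted / u) / sqrt(1 - r_restricted**2 + (r_restricted / u)**2)

u = 0.67  # incumbent-to-applicant SD ratio observed for cognitive ability
print(round(correct_for_range_restriction(0.14, u), 2))  # interview: 0.21
print(round(correct_for_range_restriction(0.13, u), 2))  # training and experience ratings: 0.19
```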

No correction for restriction in range was needed for biographical inventories because it was always clear from context whether the study was reporting tryout or follow-up validity, and follow-up correlations were not used in the meta-analyses.

Studies analyzed. This study summarizes results from thousands of validity studies. Twenty-three new meta-analyses were done for this review. These analyses, reported in Table 8, review work on six predictors: peer ratings, biodata, reference checks, college grade point average (GPA), the interview, and the Strong Interest Blank. These meta-analyses were based on a heterogeneous set of 202 studies: 83 predicting supervisor ratings, 53 predicting promotion, 33 predicting training success, and 33 predicting job tenure.

Table 1 reanalyzes the meta-analysis done by Ghiselli (1966, 1973). Ghiselli summarized thousands of studies done over the first 60 years of this century. His data were not available for analysis using state-of-the-art methods.

Tables 1 through 7 review meta-analyses done by others. Table 2 contains a state-of-the-art meta-analysis using ability to predict supervisor ratings and training success. This study analyzed the 515 validation studies carried out by the United States Employment Service using its General Aptitude Test Battery (GATB). Tables 3 through 6 review comparative meta-analyses by Dunnette (1972), Reilly and Chao (1982), and Vineberg and Joyner (1982), who reviewed 883, 103, and 246 studies, respectively. Few of the studies reviewed by Reilly and Chao date as far back as 1972; thus, their review is essentially independent of Dunnette's. Vineberg and Joyner reviewed only military studies, and cited few studies found by Reilly and Chao. Thus, the Vineberg and Joyner meta-analysis is largely independent of the other two. Reilly's studies are included in the new meta-analyses presented in Table 8.

Tables 6 and 7 review meta-analyses of a more limited sort done by others. Table 6 presents unpublished results on age, education, and experience from Hunter's (1980c) analysis of the 515 studies done by the United States Employment Service. Table 6 also presents meta-analyses on job knowledge (Hunter, 1982), work sample tests (Asher & Sciarrino, 1974), and assessment centers (Cohen, Moses, & Byham, 1974), reviewing 16, 60, and 21 studies, respectively.

Validity of Alternative Predictors

This section presents the meta-analyses that cumulate the findings of hundreds of validation studies. First, the findings on ability tests are reviewed. The review relies primarily on a reanalysis of Ghiselli's lifelong work (Ghiselli, 1973), and on the first author's recent meta-analyses of the United States Employment Service data base of 515 validation studies (Hunter, 1980c). Second, past meta-analyses are reviewed, including major comparative reviews by Dunnette (1972), Reilly and Chao (1982), and Vineberg and Joyner (1982). Third, new meta-analyses are presented using the methods of validity generalization. Finally, there is a summary comparison of all the predictors studied, using supervisor ratings as the measure of job performance.

Predictive Power of Ability Tests

There have been thousands of studies assessing the validity of cognitive ability tests. Validity generalization studies have now been run for over 150 test-job combinations (Brown, 1981; Callender & Osburn, 1980, 1981; Lilienthal & Pearlman, 1983; Linn, Harnisch, & Dunbar, 1981; Pearlman, Schmidt, & Hunter, 1980; Schmidt et al., 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, & Caplan, 1981a, 1981b; Schmidt, Hunter, Pearlman, & Shane, 1979; Sharf, 1981). The results are clear: Most of the variance in results across studies is due to sampling error. Most of the residual variance is probably due to errors in typing, computation, or transcription. Thus, for a given test-job combination, there is essentially no variation in validity across settings or time. This means that differences in tasks must be large enough to change the major job title before there can be any possibility of a difference in validity for different testing situations.

The results taken as a whole show far less variability in validity across jobs than had been anticipated (Hunter, 1980b; Pearlman, 1982; Schmidt, Hunter, & Pearlman, 1981). In fact, three major studies applied most known methods of job classification to data on training success; they found no method of job analysis that would construct job families for which cognitive ability tests would show a significantly different mean validity. Cognitive ability has a mean validity for training success of about .55 across all known job families (Hunter, 1980c; Pearlman, 1982; Pearlman & Schmidt, 1981). There is no job for which cognitive ability does not predict training success. Hunter (1980c) did find, however, that psychomotor ability has mean validities for training success that vary from .09 to .40 across job families. Thus, validity for psychomotor ability can be very low under some circumstances.

Recent studies have also shown that ability tests are valid across all jobs in predicting job proficiency. Hunter (1981c) recently reanalyzed Ghiselli's (1973) work on mean validity. The major results are presented in Table 1, for Ghiselli's job families. The families have been arranged in order of decreasing cognitive complexity of job requirements. The validity of cognitive ability decreases correspondingly, with a range from .61 to .27. However, even the smallest mean validity is large enough to generate substantial labor savings as a result of selection. Except for the job of sales clerk, the validity of tests of psychomotor ability increases as job complexity decreases. That is, the validity of psychomotor ability tends to be high on just those job families where the validity of cognitive ability is lowest. Thus, multiple correlations derived from an optimal combination of cognitive and psychomotor ability scores tend to be high across all job families. Except for the job of sales clerk, where the multiple correlation is only .28, the multiple correlations for ability range from .43 to .62.

Table 1
Hunter's (1981c) Reanalysis of Ghiselli's (1973) Work on Mean Validity

                                      Mean validity        Beta weight
Job family                          Cog    Per    Mot      Cog    Mot      R
Manager                             .53    .43    .26      .50    .08    .53
Clerk                               .54    .46    .29      .50    .12    .55
Salesperson                         .61    .40    .29      .58    .09    .62
Protective professions worker       .42    .37    .26      .37    .13    .43
Service worker                      .48    .20    .27      .44    .12    .49
Trades and crafts worker            .46    .43    .34      .39    .20    .50
Elementary industrial worker        .37    .37    .40      .26    .31    .47
Vehicle operator                    .28    .31    .44      .14    .39    .46
Sales clerk                         .27    .22    .17      .24    .09    .28

Note. Cog = general cognitive ability; Per = general perceptual ability; Mot = general psychomotor ability; R = multiple correlation. Mean validities have been corrected for criterion unreliability and for range restriction using mean figures for each predictor from Hunter (1980c) and King, Hunter, and Schmidt (1980).

There is a more uniform data base for studying validity across job families than Ghiselli's heterogeneous collection of validation studies. Over the last 35 years, the United States Employment Service has done 515 validation studies, all using the same aptitude battery. The GATB has 12 tests and can be scored for 9 specific aptitudes (U.S. Employment Service, 1970) or 3 general abilities. Hunter (1980a, 1980b, 1981b, 1981c) has analyzed this data base, and the main results are presented in Table 2. After considering 6 major methods of job analysis, Hunter found that the key dimension in all methods was the dimension of job complexity. This dimension is largely captured by Fine's (1955) data dimension, though Fine's things dimension did define 2 small but specialized industrial job families: set-up work and jobs involving feeding/offbearing. (Fine developed scales for rating all jobs according to their demands for dealing with data, things, and people.) The present analysis therefore groups job families by level of complexity rather than by similarity of tasks. Five such job families are listed in Table 2, along with the percentage of U.S. workers in each category. The principal finding for the prediction of job proficiency was that although the validity of the GATB tests varies across job families, it never approaches zero. The validity of cognitive ability as a predictor decreases as job complexity decreases. The validity of psychomotor ability as a predictor increases as job complexity decreases. There is corresponding variation in the beta weights for the regression equation for each job family, as shown in Table 2. The beta weights for cognitive ability vary from .07 to .58; the beta weights for psychomotor ability vary from .00 to .46. The multiple correlation varies little across job families, from .49 to .59, with an average of .53.

The results of all cumulative analyses are congruent. If general cognitive ability alone is used as a predictor, the average validity across all jobs is .54 for a training success criterion and .45 for a job proficiency criterion. Regarding the variability of validities across jobs, for a normal distribution there is no limit as to how far a value can depart from the mean. The probability of its occurrence just decreases from, for example, 1 in 100 to 1 in 1,000 to 1 in 10,000, and so forth. The phrase effective range is used for the interval that captures most of the values. In the Bayesian literature, the convention is 10% of cases. For tests of cognitive ability used alone, then, the effective range of validity is .32 to .76 for training success and .29 to .61 for job proficiency. However, if cognitive ability tests are combined with psychomotor ability tests, then the average validity is .53, with little variability across job families. This very high level of validity is the standard against which all alternative predictors must be judged.

Past Comparisons of Predictors

Three meta-analytic studies (Reilly & Chao, 1982; Dunnette, 1972; Vineberg & Joyner, 1982) have compared the validity of different predictors; two studies were completed after the present study began. All three studies have been reanalyzed for summary here.

Table 3 presents a summary of the findings of Dunnette (1972). The predictors are presented in three groups. Group 1 shows Dunnette's data as reanalyzed by the present authors in terms of three general abilities rather than in terms of many specific aptitudes.


The resulting validities are comparable with those found in the United States Employment Service data base as shown in Table 2. Group 2 includes three predictors that might be used for entry-level jobs. Of these, only biodata has a validity high enough to compare with that of ability. The interview is a poor predictor, and amount of education as a predictor did not work at all for the jobs Dunnette was interested in. Group 3 includes two predictors that can be used only if examinees have been specifically trained for the job in question: job knowledge and job tryout. The validity of .51 for job knowledge is comparable with the validity of .53 for ability tests. The validity of .44 for job tryout is lower than one might expect. The most likely cause of the low validity of job tryout is the halo effect in supervisor ratings. Because of the halo effect, if the performance of the worker on tryout was evaluated by a supervisor rating, then the upper bound on the reliability of the tryout measurement would be .60 (King et al., 1980). In that case, error in supervisor ratings reduces the reliability of the predictor rather than that of the criterion. If so, then job tryout with perfect measurement of job performance might have a validity of .57 (corrected using a reliability of .60) rather than the observed validity of .44.
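The .57 figure is the same correction for attenuation applied on the predictor side rather than the criterion side; a one-line check (ours, not the article's):

```python
from math import sqrt

r_observed, tryout_reliability = 0.44, 0.60
print(round(r_observed / sqrt(tryout_reliability), 2))  # 0.57
```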

Table 4 presents a summary of the findings of Reilly and Chao (1982). Most of their findings are included in the meta-analyses presented in Table 8. The validity of .38 for biodata is comparable with the .34 found by Dunnette (1972), and the finding of .23 for the interview is roughly comparable with the .16 found by Dunnette. The finding of .17 for academic achievement, however, is much higher than the zero for education Dunnette found. Dunnette used amount of education as the predictor, however, whereas Reilly and Chao used grades. One would expect grades to be more highly correlated with cognitive ability than amount of education. For that reason, grades would have higher validity than amount of education.

Table 3
Meta-Analysis Derived From Dunnette (1972)

Predictors                     No. correlations    Average validity
Group 1
  Cognitive ability                  215                 .45
  Perceptual ability                  97                 .34
  Psychomotor ability                 95                 .35
Group 2
  Biographical inventories           115                 .34
  Interviews                          30                 .16
  Education                           15                 .00
Group 3
  Job knowledge                      296                 .51
  Job tryout                          20                 .44

Reilly and Chao's (1982) estimate of some validity for self-assessments is probably an overestimate. Under research conditions people can provide reasonable self-assessments on dimensions—for example, typing speed (Ash, 1978, r = .59), spelling (Levine, Flory, & Ash, 1977, r = .58), and intelligence (DeNisi & Shaw, 1977, r = .35)—on which they received multiple formal external assessments. However, there is a drastic drop in validity for self-assessments of dimensions on which persons have not received external feedback. Some examples are typing statistical tables (Ash, 1978, r = .07), filing (Levine, Flory, & Ash, 1977, r = .08), and manual speed and accuracy (DeNisi & Shaw, 1977, r = .19). As the assessment shifts from task in school to task on job, the validity drops to very low levels. Furthermore, these correlations are misleadingly high because they are only indirect estimates of validity. For example, if a person's self-assessment of intelligence correlates .35 with his or her actual intelligence, and if intelligence correlates .45 with job performance, then according to path analysis the correlation between self-assessment and job performance would be about .35 × .45 = .16. This correlation would be even lower if self-assessments in hiring contexts proved to be less accurate than self-assessments under research conditions.

We consider four studies that actually used self-assessments in validation research. Bass and Burger (1979) presented international comparisons of the validity of self-assessments of performance in management-oriented situational tests measuring such traits as leadership style and attitude toward subordinates. The average correlation with advancement, across a total sample size of 8,493, was .05, with no variation after sampling error was eliminated. Johnson, Guffey, and Perry (1980) correlated the self-assessments of welfare counselors with supervisor ratings and found a correlation of -.02 (sample size = 104). Van Rijn and Payne (1980) correlated self-estimates on 14 dimensions of fire fighting with objective measures such as work sample tests. Only 2 of the 14 correlations were significant, and both were negative (sample size = 209). Farley and Mayfield (1976) correlated self-assessments with later performance ratings for 1,128 insurance salespersons and found no significant relation (hence, r = .05 for this sample size). The average validity of self-assessments across these 4 studies is zero. Self-assessments appear to have no validity in operational settings.

Reilly and Chao (1982) also found that projective tests and handwriting analysis are not valid predictors of job performance.

Table 4
Meta-Analysis Derived From Reilly and Chao (1982)

Predictor                  No. correlations    Average validity
Biographical inventory            44                 .38
Interview                         11                 .23
Expert recommendation             16                 .21
Reference check                    7                 .17
Academic achievement              10                 .17
Self-assessment                    7                 Some
Projective tests                   5                 Little
Handwriting analysis               3                 None

Table 5 presents the summary results of the Vineberg and Joyner (1982) study, corrected for error of measurement in ratings using a reliability of .60. These researchers reviewed only military studies. The most striking feature of Table 5 is the low level of the validity coefficients. The rank order of the correlations for supervisor rating criteria is about the same as in other studies (i.e., training performance, aptitude, biodata, education, interest), but the level is far lower for all variables except education. This fact may reflect a problem with military operational performance ratings, evidence for which can be found in one of the Vineberg and Joyner tables. They found that if a performance test was used as the criterion, then the validity was more than twice as high as when performance ratings were used as the criterion. Many of the ratings were actually suitability ratings (the military term for ratings of potential ability). The difference between suitability ratings and performance ratings is similar to the difference between ratings of potential and ratings of performance that has shown up in assessment center studies, and is discussed later.

Table 5
Meta-Analysis of Military Studies Using Supervisory Ratings as the Criterion

                                       Performance measures
                              Global rating   Suitability   All ratings

Number of correlations
Aptitude                           101             11            112
Training performance                51              7             58
Education                           25             10             35
Biographical inventory              12              4             16
Age                                 --             10             10
Interest                            15             --             15

Average validity (a)
Aptitude                           .21            .35            .28
Training performance               .27            .29            .28
Education                          .14            .36            .25
Biographical inventory             .20            .29            .24
Age                                 --            .21            .21
Interest                           .13             --            .13

Note. Derived from Vineberg and Joyner (1982).
(a) Global ratings, suitability, and all ratings were corrected for error of measurement using a reliability of .60.

Table 6 presents the results for four other meta-analytic studies. The data on age, education, and experience were taken from the United States Employment Service data base (Hunter, 1980c). The validity results for age in Table 6 were gathered on samples of the general working population. Thus, these samples would have included few teenagers. However, there have been military studies that reported noticeably lower performance in recruits aged 17 and younger. That is, although there were no differences as a function of age for those 18 years and older, those 17 years of age and younger were less likely to complete training courses. This was usually attributed to immaturity rather than to lack of ability.

Hunter's (1982) job knowledge study largely involved studies in which job knowledge was used as a criterion measure along with work sample performance or supervisor ratings. The validity for job knowledge tests is high—.48 with supervisor ratings and .78 with work sample performance tests. However, the use of job knowledge tests is limited by the fact that they can only be used for prediction if the examinees are persons already trained for the job.

The same can be said for the work sample tests considered by Asher and Sciarrino (1974). These tests are built around the assumption that the examinee has been trained to the task. In fact, the verbal work sample tests in their study differ little from job knowledge tests, especially if knowledge is broadly defined to include items assessing application of knowledge to the tasks in the job. The motor work samples, from jobs that include such tasks as computer programming and map reading, are harder to describe, but are apparently distinguished by a subjective assessment of the overlap between the test and the job.

Table 6
Other Meta-Analysis Studies

Study/predictor                  Criterion              Mean validity    No. correlations
Hunter (1980c)
  Age                            Training                   -.01               90
                                 Proficiency                -.01              425
                                 Overall                    -.01              515
  Education                      Training                    .20               90
                                 Proficiency                 .10              425
                                 Overall                     .12              515
  Experience                     Training                    .01               90
                                 Proficiency                 .18              425
                                 Overall                     .15              515
Hunter (1982)
  Content valid job              Work sample                 .78               11
    knowledge test               Performance ratings         .48               10
Asher & Sciarrino (1974)
  Verbal work sample             Training                    .55            30 verbal
                                 Proficiency                 .45
  Motor work sample              Training                    .45            30 motor
                                 Proficiency                 .62
Cohen et al. (1974)
  Assessment center              Promotion                   .63 (a)
                                 Performance                 .43 (b)

Note. For Hunter (1980c), 90 studies used the training criterion, 425 studies used the proficiency criterion; for Hunter (1982), a total of 16 studies were used; for Cohen et al. (1974), a total of 21 studies were used.
(a) Median correlation.
(b) Corrected for attenuation.

The assessment center meta-analysis is now discussed.

New Meta-Analyses

Table 7 presents new meta-analyses by the authors that are little more than reanalyses of studies gathered in three comprehensive narrative reviews. Kane and Lawler (1978) missed few studies done on peer ratings, and O'Leary (1980) was similarly comprehensive in regard to studies using college grade point average. Schmidt, Caplan, et al. (1979) completely reviewed studies of training and experience ratings.

Table 7
Meta-Analyses of Three Predictors

Study/predictor                      Criterion               No. studies   No. subjects   Mean validity   No. correlations
Kane & Lawler (1978)
  Peer ratings                       Promotion                    13           6,909           .49              13
                                     Supervisor ratings           31           8,202           .49              31
                                     Training success              7           1,406           .36               7
O'Leary (1980)
  College grade point average        Promotion                    17           6,014           .21              17
                                     Supervisor ratings (a)       11           1,089           .11              11
                                     Tenure                        2             181           .05               2
                                     Training success              3             837           .30               3
Schmidt, Caplan, et al. (1979) and Johnson et al. (1980)
  Traditional t & e rating                                                                     .13 (a)          65
  Behavioral consistency e rating                                                              .49               5

Note. t = training; e = experience.
(a) Corrected for unreliability of supervisor ratings.

Peer ratings have high validity, but it should be noted that they can be used only in very special contexts. First, the applicant must already be trained for the job. Second, the applicant must work in a context in which his or her performance can be observed by others. Most peer rating studies have been done on military personnel and in police academies for those reasons. The second requirement is well illustrated in a study by Ricciuti (1955), conducted on 324 United States Navy midshipmen. On board, the men worked in different sections of the vessel and had only limited opportunity to observe each other at work. In the classroom each person was observed by all his classmates. Because of this fact, even though the behavior observed on summer cruises is more similar to eventual naval duty than behavior during the academic term, the peer ratings for the cruise had an average validity of only .23 (uncorrected), whereas the average validity for peer ratings during academic terms was .30 (uncorrected).

Schmidt, Caplan, et al. (1979) reviewed studies of training and experience ratings. The researchers contrasted two types of procedures: traditional ratings, which consider only amounts of training, education, and experience; and behavioral consistency measures, which try to assess the quality of performance in past work experience. They reported on two meta-analytic studies on traditional training and experience ratings: Mosel (1952) found an average correlation of .09 for 13 jobs, and Molyneaux (1953) found an average correlation of .10 for 52 jobs. If we correct for the unreliability of supervisor ratings, then the average for 65 jobs is .13. Schmidt, Caplan, et al. (1979) reported the results of four studies that used procedures similar to their proposed behavioral consistency method: Primoff (1958), Haynes (undated), and two studies by Acuff (1965). In addition, there has since been a study that used their procedure, Johnson et al. (1980). The meta-analysis of these five studies shows an average validity of .49, which is far higher than the validity of traditional ratings.
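The .13 figure can be reproduced by pooling the two traditional-rating results, weighting each mean correlation by its number of jobs, and then correcting the pooled value for unreliability in the supervisor-rating criterion. The short Python sketch below is illustrative only; the criterion reliability of .60 is an assumed value often used for supervisor ratings in this literature, not a figure given in this passage.

# Job-weighted mean of the two traditional training-and-experience results,
# then disattenuation for unreliability in the supervisor-rating criterion.
results = [(13, 0.09), (52, 0.10)]   # (number of jobs, mean r): Mosel (1952), Molyneaux (1953)
n_jobs = sum(n for n, _ in results)
r_bar = sum(n * r for n, r in results) / n_jobs
r_yy = 0.60                          # assumed reliability of supervisor ratings (illustrative)
r_corrected = r_bar / r_yy ** 0.5
print(n_jobs, round(r_bar, 3), round(r_corrected, 2))   # 65 jobs, ~.098 observed, ~.13 corrected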

However, the comparison of traditional training and experience ratings with the behavioral consistency method is to some extent unfair. The behavioral consistency method data reviewed here were from situations in which the applicants had already worked in the job for which they were applying. Traditional ratings have often been applied in entry-level situations where the person is to be trained for the job after hiring. Generalized behavioral consistency scales have been constructed and used, but there have been no criterion-related validation studies of such scales to date.

Table 8 presents the new meta-analyses done for this report. Supervisor ratings were corrected for error of measurement. Promotion in most studies meant job level attained (i.e., salary corrected for years with the company or number of promotions), although several studies used sales. Training success refers to training class grades, although most studies used a success-failure criterion. Tenure refers to length of time until termination.

There is a theoretically interesting comparison between training success findings and supervisor rating findings.

Table 8
New Meta-Analyses of the Validity of Some Commonly Used Alternatives to Ability Tests

Alternative measure              Average validity   SD of validity   No. correlations   Total sample size

Criterion: supervisor ratings
Peer ratings                          .49               .15                31                 8,202
Biodata(a)                            .37               .10                12                 4,429
Reference checks                      .26               .09                10                 5,389
College GPA                           .11               .00                11                 1,089
Interview                             .14               .05                10                 2,694
Strong Interest Inventory             .10               .11                 3                 1,789

Criterion: promotion
Peer ratings                          .49               .06                13                 6,909
Biodata(a)                            .26               .10                17                 9,024
Reference checks                      .16               .02                 3                   415
College GPA                           .21               .07                17                 6,014
Interview                             .08               .00                 5                 1,744
Strong Interest Inventory             .25               .00                 4                   603

Criterion: training success
Peer ratings                          .36               .20                 7                 1,406
Biodata(a)                            .30               .11                11                 6,139
Reference checks                      .23                —                  1                 1,553
College GPA                           .30               .16                 3                   837
Interview                             .10               .07                 9                 3,544
Strong Interest Inventory             .18               .00                 2                   383

Criterion: tenure
Peer ratings(b)                        —                 —                  —                    —
Biodata(a)                            .26               .15                23                10,800
Reference checks                      .27               .00                 2                 2,018
College GPA                           .05               .07                 2                   181
Interview                             .03               .00                 3                 1,925
Strong Interest Inventory             .22               .11                 3                 3,475

Note. GPA = grade point average.
(a) Only cross validities are reported.
(b) No studies were found.


Table 2 shows that for cognitive ability, the validity is usually higher for training success than for supervisor ratings. This is not generally true for the other predictors in Table 8 (or for psychomotor ability either). This fits with the higher emphasis on learning and remembering in training.

In Table 8, it is interesting to note that across all jobs, the average validity of college GPA in the prediction of promotion was .21 with a standard deviation of .07. A standard deviation of .07 is large in comparison with a mean of .21. A further analysis reveals the reason for this variation. Additional analysis by the authors shows an average validity of GPA for managers of .23 across 13 studies with a total sample size of 5,644. The average validity for people in sales, engineering, and technical work was -.02 across four studies with a sample size of 370. This might mean that the motivational aspects of achieving high college grades are more predictive of motivation in management than in other professions. It might also mean that college grades are considered part of the management evaluation process. That is, it may be that college grades are a criterion contaminant in management promotion. These hypotheses could be tested if there were more studies comparing college grades with supervisor ratings, but only one such study was found for managers, and it had a sample size of only 29.

The validity coefficients in Table 8 are not free of sampling error, especially those for which the total sample size is below 1,000. For example, the average correlation of .05 for college GPA as a predictor of tenure is based on a total sample size of only 181. The 95% confidence interval is .05 ± .15. However, the average correlations for supervisor ratings are all based on total sample sizes of 1,789 or more, with an average of 4,753.
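The ±.15 half-width quoted above can be checked with the usual large-sample standard error for a correlation. The snippet below is a sketch of that check, not necessarily the exact method the authors used.

import math

def ci_halfwidth(r, n, z=1.96):
    # Approximate 95% half-width using SE = (1 - r**2) / sqrt(n - 1).
    return z * (1 - r ** 2) / math.sqrt(n - 1)

# College GPA predicting tenure: mean r = .05 on a total sample of 181
print(round(ci_halfwidth(0.05, 181), 2))   # ~0.15, i.e., the interval .05 +/- .15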

Problems Associated With Two Widely Used Alternative Predictors

Biographical data. For entry-level jobs, biodata predictors have been known to have validity second in rank to that of measures of ability. The average validity of biodata measures is .37 in comparison with .53 for ability measures (with supervisor ratings as the job performance measure). Thus, to substitute biodata for ability would mean a loss of 30% of the saving gained by using ability tests rather than random selection, whereas other predictors would cause larger losses. However, the operational validity of biodata instruments may be lower than the research validity for several reasons.

Biodata predictors are not fixed in their content, but rather constitute a method for constructing fixed predictors. The data reported in the literature are based on specialized biodata instruments; each study used a keyed biographical form where the key is specific to the organization in which the key is tested. There is evidence to suggest that biodata keys are not transportable.

Biodata keys also appear to be specific to the criterion measure used to develop the key. For example, Tucker, Cline, and Schmitt (1967) developed separate keys for supervisor ratings and for tenure that were both checked in the same population of pharmaceutical scientists. A cross validation (as described below) showed that the ratings key predicted ratings (r = .32) and that the tenure key predicted tenure (r = .50). However, the cross correlations between the two criteria were both negative (r = -.07 for the ratings key predicting tenure, and r = -.09 for the tenure key predicting ratings). That is, in their study the key predicting one criterion was actually negatively predictive of the other.

Biodata keys often suffer decay in validity over time. Schuh (1967) reviewed biodata studies in which follow-up validity studies had been done. Within these studies, the mean tryout validity is .66, but in successive follow-up validity studies this value drops from .66 to .52 to .36 to .07. Brown (1978) claimed to have found an organization in which this decay did not occur. However, Brown evaluated his data on the basis of statistical significance rather than on the basis of validity. Because the data were collected from a sample of 14,000, even small correlations were highly significant. Thus, all his correlations registered as significant regardless of how small they were and how much they differed in magnitude. His data showed the validity of a key developed for insurance salespersons in 1933 on follow-up studies conducted in 1939 and 1970. If we examine validity rather than significance level, the validities were .22 in 1939 and .08 in 1970; this shows the same decay as found in other studies.

There is a severe feasibility problem with biodata. The empirical process of developing biodata keys is subject to massive capitalization on chance. That is, evidence may be built from what is only a chance relation. The typical key is developed on a sample of persons for whom both biodata and performance ratings are known. Each biographical item is separately correlated with the criterion, and the best items are selected to make up the key. However, on a small sample this procedure picks up not only items that genuinely predict the criterion for all persons, but also items that by chance work only for the particular sample studied. Thus, the correlation between the keyed items and the criterion will be much higher on the sample used to develop the key than on the subsequent applicant population. This statistical error can be eliminated by using a cross-validation research design, in which the sample of data is broken into two subsamples. The key is then developed on one sample, but the correlation that estimates the validity for the key is computed on the other sample. Cross-validity coefficients are not subject to capitalization on chance. Only cross validities are reported in Table 8.
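The cross-validation design described above can be illustrated with a minimal simulation. The sketch below uses purely random (zero-validity) item responses, so any apparent validity in the development half is capitalization on chance; the holdout half gives the honest, near-zero estimate. All data here are synthetic and hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 400, 130
items = rng.normal(size=(n_people, n_items))   # simulated biodata item responses (pure noise)
criterion = rng.normal(size=n_people)          # simulated performance ratings (unrelated to items)

dev, hold = np.arange(0, 200), np.arange(200, 400)   # development half and holdout half

# "Develop" the key: keep the 10 items that look best in the development sample.
r_dev = np.array([np.corrcoef(items[dev, j], criterion[dev])[0, 1] for j in range(n_items)])
key = np.argsort(r_dev)[-10:]

score_dev = items[dev][:, key].sum(axis=1)
score_hold = items[hold][:, key].sum(axis=1)
print(round(np.corrcoef(score_dev, criterion[dev])[0, 1], 2),    # inflated tryout validity
      round(np.corrcoef(score_hold, criterion[hold])[0, 1], 2))  # cross validity, near zero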

Although the cross-validation research design eliminates the bias in the estimate of the validity that results from capitalization on chance, it does not eliminate the faulty item selection that results from capitalization on chance. Sampling error hurts item selection in two ways: Good items are missed, and poor items are included. For example, suppose that we start with an item pool of 130 items made up of 30 good items, with a population correlation of .13, and 100 bad items, with a population correlation of zero. Let the average interitem correlation be .20. Then, using a sample of 100, we would on the average select 11 of the good items and 5 of the bad items, resulting in a test with a validity of .18. Using a sample of 400, we would select 25 of the good items and 5 of the bad items, resulting in a test with a validity of .23. That is, the validity of the biodata key increases by 28% when the sample size is increased from 100 to 400.
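The two composite validities in this example (.18 and .23) follow from a standard formula for the validity of a unit-weighted sum of standardized items: the sum of the item validities divided by the standard deviation of the composite. The sketch below reproduces the figures under the stated assumptions (item validity .13, mean interitem correlation .20).

import math

def composite_validity(item_validities, mean_interitem_r):
    # Validity of a unit-weighted sum of standardized items.
    k = len(item_validities)
    composite_sd = math.sqrt(k + k * (k - 1) * mean_interitem_r)
    return sum(item_validities) / composite_sd

# Key built on N = 100: 11 good items (r = .13) and 5 chance items (r = 0)
print(round(composite_validity([0.13] * 11 + [0.0] * 5, 0.20), 2))   # ~0.18
# Key built on N = 400: 25 good items and 5 chance items
print(round(composite_validity([0.13] * 25 + [0.0] * 5, 0.20), 2))   # ~0.23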

Thus, the use of biodata requires a large sample of workers, say 400 to 1,000, for the tryout study, and more large samples approximately every 3 years to check for decay in validity. Only very large organizations have any jobs with that many workers, so the valid use of biodata tends to be restricted to organizations that can form a consortium.

One such consortium deserves special mention: that organized by Richardson, Henry, and Bellows for the development of the Supervisory Profile Record (SPR). The latest manual shows that the data base for the SPR has grown to include input from 39 organizations. The validity of the SPR for predicting supervisor ratings is .40, with no variation over time and little variation across organizations once sampling error is controlled. These two phenomena may be related. The cross-organization development of the key may tend to eliminate many of the idiosyncratic and trivial associations between items and performance ratings. As a result, the same key applies to new organizations and also to new supervisors over time.

Richardson, Henry, and Bellows are also validating a similar instrument called the Management Profile Record (MPR). This instrument is used with managers above the level of first-line supervisor. It has an average validity of .40 across 11 organizations, with a total sample size of 9,826, for the prediction of job level attained. The correlation is higher for groups with more than 5 years of experience.

Two problems might pose legal threats to the use of biographical items to predict work behavior: Some items might be challenged for invasion of privacy, and some might be challenged as indirect indicators of race or sex (see Hunter & Schmidt, 1976, pp. 1067-1068).

Assessment centers. Assessment centers have been cited for high validity in the prediction of managerial success (Huck, 1973), but a detracting element was found by Klimoski and Strickland (1977). They noted that assessment centers have very high validity for predicting promotion but only moderate validity for predicting actual performance. Cohen et al. (1974) reviewed 21 studies and found a median correlation of .63 predicting potential or promotion but only .33 (.43 corrected for attenuation) predicting supervisor ratings of performance. Klimoski and Strickland interpreted this to mean that assessment centers were acting as policy-capturing devices that are sensitive to the personal mannerisms that top management tends to use in promotion. To the extent that these mannerisms are unrelated to actual performance, the high correlation between assessment centers and promotion represents a shared error in the stereotype of a good manager.

In a follow-up study, Klimoski and Strickland (1981) gathered data on 140 managers. Their predictors were preassessment supervisor ratings of promotion potential, supervisor ratings of performance, and the assessment center rating. Three years later the criterion data were management grade level attained, supervisor rating of potential, and supervisor rating of performance. The assessment center rating predicted grade level with a correlation of .34 and predicted future potential rating with a correlation of .37, but predicted future performance with a correlation of -.02. Preassessment supervisor ratings of potential predicted future ratings of potential with a correlation of .51 and grade level with a correlation of .14, but predicted future performance ratings with a correlation of only .08. Preassessment performance ratings predicted performance ratings 3 years later with a correlation of .38, but predicted ratings of potential and grade level with correlations of .10 and .06. Thus, assessment center ratings and supervisor ratings of potential predict each other quite highly, but both are much poorer predictors of performance. Klimoski and Strickland interpreted these data to mean that managerial performance could be better predicted by alternatives to assessment centers, such as current performance ratings.

Klimoski and Strickland's (1981) correlations of .37 and .02 between assessment center ratings and ratings of potential and performance are lower, though in the same direction, than the average correlations found by Cohen et al. (1974), which were .63 and .33, respectively. (Note that the sample size in the Klimoski and Strickland study was only 140.) One might argue that in the Klimoski and Strickland study there was a big change in content over the 3-year period, which made future performance unpredictable. However, current job performance ratings predict future job performance ratings in the Klimoski and Strickland study with a correlation of .38. That is, current ratings of job performance predict future ratings of job performance as well as the assessment center predicts future potential ratings.

There appear to be two alternative hypotheses to explain the data on assessment centers. First, it may be that supervisors actually generate better ratings of performance when asked to judge potential for promotion than when asked to judge current performance. Alternatively, it may be that assessment centers, supervisors, and upper-level managers share similar stereotypes of the good manager. To the extent that such stereotypes are valid, the assessment center will predict later performance, as in the Cohen et al. (1974) study showing an average correlation of .43 in predicting later supervisor performance ratings. The higher correlations for ratings of potential against promotion record would then measure the extent to which they share erroneous as well as valid stereotypic attributes.

Comparison of Predictors

If predictors are to be compared, the criterion for job performance must be the same for all. This necessity inexorably leads to the choice of supervisor ratings (with correction for error of measurement) as the criterion, because there are prediction studies for supervisor ratings for all predictors. The following comparisons are all for studies using supervisor ratings as the criterion. The comparisons are made separately for two sets of predictors: those that can be used for entry-level jobs where training will follow hiring, and those used for promotion or certification. Table 9 shows the mean validity coefficient across all jobs for 11 predictors suitable for predicting performance in entry-level jobs. Predictors are arranged in descending order of validity. The only predictor with a mean validity of essentially zero is age. The validities for experience, the interview, training and experience ratings, education, and the Strong Interest Inventory are greater than zero but much lower than the validity for ability. Later analysis shows that even the smallest drop in validity would mean substantial losses in productivity if any of these predictors were to be substituted for ability. In fact, the loss is directly proportional to the difference in validity.



Table 9
Mean Validities and Standard Deviations of Various Predictors for Entry-Level Jobs for Which Training Will Occur After Hiring

Predictor                            Mean validity    SD    No. correlations   No. subjects
Ability composite                        .53          .15         425            32,124
Job tryout                               .44           —           20                —
Biographical inventory                   .37          .10          12             4,429
Reference check                          .26          .09          10             5,389
Experience                               .18           —          425            32,124
Interview                                .14          .05          10             2,694
Training and experience ratings          .13           —           65                —
Academic achievement                     .11          .00          11             1,089
Education                                .10           —          425            32,124
Interest                                 .10          .11           3             1,789
Age                                     -.01           —          425            32,124

Table 10 presents the average validity across all jobs for the predictors that can be used in situations where the persons selected will be doing the same or substantially the same jobs as they were doing before promotion. Indeed, five of the six predictors (all but ability) are essentially ratings of one or more aspects or components of current performance. Thus, these predictors are essentially predicting future performance from present or past performance. By a small margin, ability is on the average not the best predictor; the work sample test is best. The validity of all these predictors is relatively close in magnitude. The drop from work sample test to assessment center is about the same as the drop from ability to the second best predictor for entry-level jobs (job tryout). Even so, the following economic analysis shows that the differences are not trivial.

Combinations of Predictors

For entry-level hiring, it is clear that ability is the best predictor in all but unusual situations (those in which biodata might predict as well as ability). However, there is an important second question: Can other predictors be used in combination with ability to increase validity? There are too few studies with multiple predictors to answer this question directly with meta-analysis. However, there are mathematical formulas that place severe limits on the extent to which the currently known alternatives to ability can improve prediction. This section discusses two such limits.

First, there is a mathematical inequality that shows that the increase in validity for an alternate predictor used in combination with ability is at most the square of its validity. Second, we note that if a second predictor is used with ability, then it will increase validity only if the second predictor is given the small weight that it deserves.

Consider first the maximum extent to which the multiple correlation for ability and an alternate predictor used together can exceed the validity of ability used alone (i.e., can exceed an average of .53). In the few studies using more than one type of predictor, the correlations between alternative predictors tend to be positive. More importantly, the beta weights are all positive. That is, there is no evidence of suppressor effects in predictor combinations. If we can assume that there are no suppressor effects, then we can derive an extremely useful inequality relating the multiple correlation R to the validity of the better predictor (which we will arbitrarily label Variable 1), that is, a comparison between R and r1. If R is the multiple correlation and r1 and r2 are the zero-order correlations for the individual predictors, then in the absence of suppressor effects

R ≤ √(r1² + r2²).

We can then use a Taylor series to obtain an inequality in which the square root is eliminated:

√(r1² + r2²) ≤ r1 + r2²/(2r1).


Table 10
Mean Validities and Standard Deviations of Predictors to Be Used for Promotion or Certification, Where Current Performance on the Job Is the Basis for Selection

Predictor                                    Mean validity    SD    No. correlations   No. subjects
Work sample test                                  .54          —            —                —
Ability composite                                 .53         .15          425            32,124
Peer ratings                                      .49         .15           31             8,202
Behavioral consistency experience ratings         .49         .08            5                —
Job knowledge test                                .48         .08           10             3,078
Assessment center                                 .43          —            —                —

Combining these inequalities yields

R ≤ r1 + r2²/(2r1).

We denote the validity of the ability composite as rA and the validity of the alternate predictor as rx. Then the validity of the composite is at most

rA + rx²/(2rA).

For the average case in which rA = .53, this means that the composite validity is at most

.53 + rx²/1.06, or approximately .53 + rx².

That is, the increase in validity due to the alternate predictor is at most the square of its validity. For example, the validity of the interview for an average job is .14. If the interview were used in a multiple regression equation with ability, the increase in validity would be at most from .53 to .55.
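The bound can be checked numerically. The sketch below computes the actual two-predictor multiple correlation for ability (.53) and the interview (.14) over a range of predictor intercorrelations that keep both beta weights positive, and compares it with the upper bound derived above.

import math

r1, r2 = 0.53, 0.14   # validities of ability and of the interview for an average job

def multiple_R(r1, r2, r12):
    # Two-predictor multiple correlation.
    return math.sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

bound = r1 + r2**2 / (2 * r1)   # no-suppressor upper bound, about .55
for r12 in (0.0, 0.1, 0.2, 0.26):   # intercorrelations with both beta weights nonnegative
    print(f"r12 = {r12:.2f}: R = {multiple_R(r1, r2, r12):.3f} <= bound {bound:.3f}")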

Second, if an additional predictor is used with ability, it will increase validity only if the second predictor is given the small weight it deserves. Many companies now use several predictors for selection. However, the predictors are not being combined in accordance with the actual validities of the predictors but are weighted equally. Under those conditions the validity of the total procedure can be lower than the validity of the best single predictor. For example, consider an average job in which the interview is given equal weight with the appropriate ability composite score, instead of being given its beta weight. In that case, although the validity of ability alone would be .53, the validity of the inappropriate combination of ability and interview would be at most .47.
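The .47 figure follows from the validity of an equally weighted sum of two standardized predictors, (r1 + r2) / sqrt(2 + 2 r12), which is largest when the predictors are uncorrelated. A minimal check:

import math

def equal_weight_validity(r1, r2, r12):
    # Validity of the sum of two equally weighted standardized predictors.
    return (r1 + r2) / math.sqrt(2 + 2 * r12)

r_ability, r_interview = 0.53, 0.14
for r12 in (0.0, 0.2, 0.4):
    print(f"r12 = {r12:.1f}: validity = {equal_weight_validity(r_ability, r_interview, r12):.2f}")
# at r12 = 0 the composite tops out near .47, below the .53 for ability used alone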

Economic Benefits of Various Selection Strategies

Differences in validity among predictors only become meaningful when they are translated into concrete terms. The practical result of comparing the validities of predictors is a number of possible selection strategies. Then the benefit to be gained from each strategy may be calculated. This section presents an analysis of the dollar value of the work-force productivity that results from various selection strategies.

Three strategies are now in use for selection based on ability: hiring from the top down, hiring from the top down within racial or ethnic groups using quotas, and hiring at random after elimination of those with very low ability scores. There is also the possibility of replacing ability measures by other predictors. These strategies can be compared using existing data. The comparison is presented here using examples in which the federal government is the employer, but equations are presented for doing the same utility analysis in any organization.

An analysis of the results of using alternative predictors in combination cannot be made because current data on correlations between predictors are too sparse. However, in the case of the weaker predictors only severely limited improvement is possible.

Computation of Utility

Any procedure that predicts job performance can be used to select candidates for hiring or promotion. The work force selected by a valid procedure will be more productive than a work force selected by a less valid procedure or selected randomly. The extent of improvement in productivity depends on the extent of accuracy in prediction. If two predictors differ in validity, then the more valid predictor will generate better selection and hence produce a work force with higher productivity. The purpose of this section is to express this difference in productivity in meaningful quantitative terms.

A number of methods have been devised to quantify the value of valid selection (reviewed in Hunter, 1981a). One method is to express increased productivity in dollar terms. The basic equations for doing this were first derived more than 30 years ago (Brogden, 1949; Cronbach & Gleser, 1965), but the knowledge necessary for general application of these equations was not developed until recently (Hunter, 1981a; Hunter & Schmidt, 1982a, 1982b; Schmidt, Hunter, McKenzie, & Muldrow, 1979). The principal equation compares the dollar production of those hired by a particular selection strategy with the dollar production of the same number of workers hired randomly. This number is called the (marginal) utility of the selection procedure. Alternative predictors can be compared by computing the difference in utility of the two predictors. For example, if selection using ability would mean an increased production of $15.61 billion and selection using biodata would mean an increased production of $10.50 billion, then the opportunity cost of substituting biodata for ability in selection would be $5.11 billion.

The utility of a selection strategy depends on the specific context in which the strategy is used. First, a prediction of length of service (tenure) must be considered. The usual baseline for calculating utility is the increase in productivity of people hired in one year. Assume that N persons are to be hired in that year. The gains or losses from a selection decision are realized not just in that year, but over the tenure of each person hired. Moreover, this tenure figure must include not only the time each person spent in their particular jobs, but also the time they spent in subsequent jobs if they were promoted. Persons who can serve in higher jobs are more valuable to the organization than persons who spend their entire tenure in the jobs for which they were initially selected. In the utility equation, we denote the average tenure of those hired as T years.

Second, the effect of selection depends on the dollar value of individual differences in production. This is measured by the standard deviation of performance in dollar terms, denoted here as σy.

Third, the effect of selection depends on the extent to which the organization can be selective; the greater the number of potential candidates, the greater the difference between those selected and those rejected. If everyone is hired, then prediction of performance is irrelevant. If only one in a hundred of those who apply is hired, then the organization can hire only those with the highest qualifications. Selectiveness is usually measured by the selection ratio, the percentage of those considered who are selected. In utility equations, the effect of the selection ratio is entered indirectly, as the average predictor score of those selected: The more extreme the selection ratio, the higher the predictor scores of those hired, and hence the higher the average predictor score. In the utility equation, M denotes the average predictor score of those selected, where for algebraic simplicity the predictor is expressed in standard score form (i.e., scored so that for applicants the mean predictor score is 0 and the standard deviation is 1). If the predictor scores have a normal distribution, then the number M can be calculated directly from the selection ratio. Conversion values are presented in Table 11.

This completes the list of administrative factors that enter the utility formula: N = the number of persons hired, T = the average tenure of those hired, σy = the standard deviation of performance in dollar terms, and M = the expression of the selection ratio as the average predictor score in standard score form.

One psychological measurement factor enters the utility equation: the validity of the predictor, rxy, which is the correlation between the predictor scores and job performance for the applicant or candidate population. If this validity is to be estimated by the correlation from a criterion-related validation study, then the observed correlation must be corrected for error of measurement in job performance and corrected also for restriction in range. Validity estimates from validity generalization studies incorporate these corrections.
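The two corrections mentioned here are the standard psychometric ones. The sketch below applies a Thorndike Case II range-restriction correction followed by disattenuation for criterion unreliability; the numerical inputs (observed r = .25, criterion reliability .60, range-restriction ratio u = 1.5) are illustrative assumptions only, not values from this article.

import math

def correct_range_restriction(r, u):
    # Thorndike Case II: u = SD of predictor among applicants / SD among incumbents.
    return (u * r) / math.sqrt(1 + (u**2 - 1) * r**2)

def correct_criterion_unreliability(r, ryy):
    # Disattenuation for measurement error in the job performance criterion.
    return r / math.sqrt(ryy)

r_observed = 0.25                                   # illustrative observed validity
r_unrestricted = correct_range_restriction(r_observed, 1.5)
rho = correct_criterion_unreliability(r_unrestricted, 0.60)
print(round(r_unrestricted, 2), round(rho, 2))      # ~0.36, then ~0.47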

The formula for total utility then computes the gain in productivity due to selection for one year of hiring:

U = N · T · rxy · σy · M.


Table 11
Average Standard Score on the Predictor Variable for Those Selected at a Given Selection Ratio, Assuming a Normal Distribution on the Predictor

Selection ratio (%)    Average predictor score (M)
100                    0.00
 90                    0.20
 80                    0.35
 70                    0.50
 60                    0.64
 50                    0.80
 40                    0.97
 30                    1.17
 20                    1.40
 10                    1.76
  5                    2.08
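The entries in Table 11 are the mean standard score of the selected group under a normal predictor distribution: phi(z_cut) / SR, where z_cut is the cutoff corresponding to the selection ratio SR. The sketch below reproduces the column to within the rounding used in the published table.

from statistics import NormalDist

def mean_selected_score(selection_ratio):
    # Mean standard score of those above the cutoff for a standard normal predictor.
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - selection_ratio)
    return nd.pdf(z_cut) / selection_ratio

for sr in (0.90, 0.50, 0.10, 0.05):
    print(f"{int(sr * 100)}%: {mean_selected_score(sr):.2f}")   # ~0.20, 0.80, 1.75, 2.06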

The standard deviation of job performance in dollars, σy, is a difficult figure to estimate. Most accounting procedures consider only total production figures, and totals reflect only mean values; they give no information on individual differences in output. Hunter and Schmidt (1982a) solved this problem by their discovery of an empirical relation between σy and annual wage: σy is usually at least 40% of annual wage. This empirical baseline has subsequently been explained (Schmidt & Hunter, 1983) in terms of the relation of the value of output to pay, which is typically almost 2 to 1 (because of overhead). Thus, a standard deviation of performance of 40% of wages derives from a standard deviation of 20% of output. This figure can be tested against any study in the literature that reports the mean and standard deviation of performance on a ratio scale. Schmidt and Hunter (1983) recently compiled the results of several dozen such studies, and the 20% figure came out almost on target.

To illustrate, consider Hunter's (1981a) analysis of the federal government as an employer. According to somewhat dated figures from the Bureau of Labor Statistics, as of mid-1980 the average number of people hired in one year is 460,000, the average tenure is 6.52 years, and the average annual wage is $13,598. The average validity of ability measures for the government work force is .55 (slightly higher than the .53 reported for all jobs, because there are fewer low-complexity jobs in government than in the economy as a whole). The federal government can usually hire the top 10% of applicants (i.e., from Table 11, M = 1.76). Thus, the productivity gain for one year due to hiring on the basis of ability rather than at random is

U = N · T · rxy · σy · M = (460,000)(6.52)(.55)(40% of $13,598)(1.76) = $15.61 billion.

More recent figures from the Office of Personnel Management show higher tenure and a correspondingly lower rate of hiring, but generate essentially the same utility estimate.

The utility formula contains the validity of the predictor rxy as a multiplicative factor. Thus, if one predictor is substituted for another, the utility is changed in direct proportion to the difference in validity. For example, if biodata measures are to be substituted for ability measures, then the utility is reduced in the same proportion as the validities, that is, .37/.55 or .67. The productivity increase over random selection for biodata would be .67(15.61), or $10.50 billion. The loss in substituting biodata for ability measures would be the difference between $15.61 and $10.50 billion, a loss of $5.11 billion per year.
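The two dollar figures above can be recomputed directly from the utility equation. The sketch below plugs in the federal-government values used in the text; the printed results differ from the published $15.61 and $10.50 billion only by rounding carried through the original calculation.

def utility(n_hired, tenure_years, validity, sd_dollars, mean_predictor_score):
    # Gain over random selection for one year of hiring: U = N * T * r_xy * sigma_y * M.
    return n_hired * tenure_years * validity * sd_dollars * mean_predictor_score

N, T, M = 460_000, 6.52, 1.76
sigma_y = 0.40 * 13_598                    # 40% of the average annual wage

u_ability = utility(N, T, 0.55, sigma_y, M)
u_biodata = utility(N, T, 0.37, sigma_y, M)
print(f"ability ${u_ability / 1e9:.2f}B, biodata ${u_biodata / 1e9:.2f}B, "
      f"loss ${(u_ability - u_biodata) / 1e9:.2f}B")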

Table 12 presents utility and opportunity cost figures for the federal government for all the predictors suitable for entry-level jobs. The utility estimate for job tryout is quite unrealistic because (a) tryouts cannot be used with jobs that require extensive training, and (b) the cost of the job tryout has not been subtracted from the utility figure given. Thus, for most practical purposes, the next best predictor after ability is biodata, with a cost for substituting this predictor of $5.11 billion per year. The cost of substituting other predictors is even higher.

Table 13 presents utility and opportunity cost figures for the federal government for predictors based on current job performance. Again, because of the higher average complexity of jobs in the government, the validity of ability has been entered in the utility formula as .55 rather than .53. A shift from ability to any of the other predictors means a loss in utility. The opportunity cost for work sample tests ($280 million per year) is unrealistically low because it does not take into account the high cost of developing each work sample test, the high cost of redeveloping the test every time the job changes, or the high cost of administering the work sample test. Because the range of validity is smaller for predictors based on current performance than for entry-level predictors, the range of opportunity costs is smaller as well.

Productivity and Racial Balance

The best known predictor for entry-level jobs is ability. Therefore, any departure from the optimal selection strategy of hiring from the top down on ability will necessarily lead to loss of productivity. It has been shown that ability tests are fair to minority group members in that they correctly predict job performance, indicating that the differences between racial groups in mean performance on ability tests are real.

Table 12
Utility of Alternative Predictors That Can Be Used for Selection in Entry-Level Jobs Compared With Utility of Ability as a Predictor

Predictor/strategy                        Utility(a)    Opportunity costs(a)
Ability composite                           15.61                —
Ability used with quotas                    14.83             -0.78
Job tryout                                  12.49(b)          -3.12
Biographical inventory                      10.50             -5.11
Reference check                              7.38             -8.23
Experience                                   5.11            -10.50
Interview                                    3.97            -11.64
Training and experience rating               3.69            -11.92
Academic achievement (college)               3.12            -12.49
Education                                    2.84            -12.77
Interest                                     2.84            -12.77
Ability with low cutoff score(c)             2.50            -13.11
Age                                         -0.28            -15.89

Note. Economic benefits or losses in productivity are shown for 1 year of hiring, with the federal government as the employer, if all personnel decisions are based on the use of a single predictor. Utility indicates the amount saved over random selection; opportunity costs indicate the amount lost if a predictor other than ability is used. The use of ability with quotas and the use of ability as a screen (i.e., the use of ability measures with very low cutoff scores) are shown as if they were alternative predictors rather than alternative strategies for using ability as a predictor.
(a) In billions of dollars.
(b) Does not include the cost of transporting applicants to the workplace, paying for the training and tryout period, or the social cost of firing some applicants after they have been on the job.
(c) Cutoff score was set to reject only the bottom 20% of applicants.

Table 13
Utility of Predictors That Can Be Used for Promotion or Certification Compared With the Utility of Ability as a Predictor

Predictor/strategy                              Utility(a)    Opportunity costs(a)
Ability: hiring top down                          15.61                —
Work sample test                                  15.33              -.28
Ability: using quotas                             14.83              -.78
Peer ratings                                      13.91             -1.70
Behavioral consistency experience ratings         13.91             -1.70
Job knowledge test                                13.62             -1.99
Assessment center                                 12.20             -3.41
Ability with low cutoff score(b)                   2.50            -13.11

Note. Economic benefits or losses in productivity are shown for 1 year of hiring or promoting, with the federal government as the employer, if all personnel decisions are based on the use of a single predictor. Utility indicates the amount saved over random selection; opportunity costs indicate the amount lost if a predictor other than ability is used. The use of ability with quotas and the use of ability as a screen (i.e., the use of ability measures with very low cutoff scores) are shown as if they were alternative predictors rather than alternative strategies for using ability as a predictor.
(a) In billions of dollars.
(b) Cutoff score was set to reject only the bottom 20% of applicants.

Therefore, the use of ability tests for selection will inevitably mean a lower hiring rate for minority applicants. Many would argue that racial balance in the work force is so important that the merit principle should be abandoned in favor of affirmative action. We do not debate this ethical position here. (See Hunter & Schmidt, 1976, for a review of such discussions.) However, there are utility implications of hiring on a basis other than merit, and these are considered here.

There are two strategies that can be used which base hiring on ability but adjust hiring rates in the direction of affirmative action: using quotas and using ability tests with very low cutoff scores. The hiring strategy that meets affirmative action goals with the least loss in productivity is hiring on the basis of ability with quotas. For example, if an organization is to hire 10% of all applicants, it will hire from the top down on ability within each group separately until it has hired the top 10% of white applicants, the top 10% of black applicants, the top 10% of Hispanic applicants, and so on.

The other affirmative action strategy is to use ability to select, but to set a very low cutoff score and hire randomly from above that cutoff point. This strategy has two disadvantages in comparison with quotas: It leads to the selection of low-ability applicants from all groups, and it does not completely satisfy affirmative action goals.

The utility implications of affirmative-action strategies can be quantified (Hunter, Schmidt, & Rauschenberger, 1977). Once this had been done for the low-cutoff strategy, the disastrous implications were immediately apparent (Hunter, 1979a, 1981a; Mack, Schmidt, & Hunter, undated). These comparisons are illustrated in Tables 12 and 13. Table 12 shows that for entry-level jobs, using ability with quotas is the second-best strategy, whereas the low-cutoff strategy is the worst predictor except for age, which has an average validity of less than zero. Table 13 shows that for promotion or certification the quota method is third and the low-cutoff method is far poorer than any other.

It is evident that the use of low cutoff scores negates most of the gain derived from the predictive power of ability measures. This method is also vastly inferior to quotas for purposes of racial balance. Quotas guarantee that the hiring rate will be the same in each group, but low cutoff scores do not equalize hiring rates. Consider those jobs in which the discrepancy in ability is greatest: jobs of high complexity for which cognitive ability is the sole predictor. If the cutoff is set to hire the top 10% of white applicants, it will select only 1.1% of the black applicants, a relative minority hiring rate of 11%. However, if the cutoff is set so that 80% of applicants pass, the relative minority hiring rate is still only 52%. The use of a low cutoff score raises the relative hiring rate from 11% to 52%, but this is far short of the 100% that would be produced by the use of quotas.
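The 11% and 52% relative hiring rates can be approximated by applying a single cutoff to two normal distributions whose means differ by about one standard deviation. The one-standard-deviation gap used below is an assumption for illustration, so the 80% case comes out a few points above the 52% reported in the text.

from statistics import NormalDist

nd = NormalDist()

def relative_hiring_rate(majority_pass_rate, d=1.0):
    # Ratio of minority to majority pass rates for one cutoff on a normal predictor,
    # with the minority mean assumed d standard deviations below the majority mean.
    cutoff = nd.inv_cdf(1 - majority_pass_rate)
    minority_pass_rate = 1 - nd.cdf(cutoff + d)
    return minority_pass_rate / majority_pass_rate

print(round(relative_hiring_rate(0.10), 2))   # ~0.11 when only the top 10% are hired
print(round(relative_hiring_rate(0.80), 2))   # ~0.55 when 80% pass (the text reports 52%)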

Comparison of the quota method with the low-cutoff method reveals a stark difference. It would cost the government only $780 million to achieve racial balance using ability with quotas. It would cost $13.11 billion (17 times as much) to use the low-cutoff method, which achieves only a poor approximation of racial balance.

Current interest in predictors other than ability has been spurred primarily by the desire to find an equally valid predictor with much less adverse impact. The alternative predictors for entry-level jobs are all much less valid than are ability measures, and their use implies considerable economic loss. Also, they are not without adverse impact. For example, Schmidt, Greenthal, Hunter, Berner, and Seaton (1977) constructed a work sample test to be used for apprentice machinists in the auto industry. They found racial differences in mean production time as large as the racial differences in mean cognitive ability. The sparse literature on predictors other than ability is reviewed in Reilly and Chao (1982).

Future Research

We have shown that, for entry-level jobs, predictors other than ability have validity so much lower that substitution would mean massive economic loss. It is likely, however, that some of the alternative predictors measure social skills or personality traits that are relevant to job performance and are not assessed by cognitive, perceptual, or psychomotor ability tests. If so, then validity could be increased by using such predictors in conjunction with ability tests. However, the current research base is completely inadequate for setting up such programs, because alternative predictors have been studied in isolation. Only a handful of studies have considered more than one predictor at a time. In particular, the correlations between alternative predictors are unknown; hence, generalized multiple regression cannot be done.

In the introduction to this article, we noted that adverse impact would be reduced if the validity of prediction could be increased. Validity cannot be increased by replacing ability tests with any of the now known alternative predictors. However, validity could be increased in some jobs by adding the appropriate second predictor and using it with the appropriate weight. If that second predictor has less adverse impact, then the composite selection strategy would have more validity and less adverse impact than the use of ability alone. Unfortunately, there is no research base for suggesting a generally useful strategy of that type. Instead, the use of a second predictor with ability would require a large-sample local or consortium-based multivariate validation study.


References

Acuff, R. (1965). A validity study of an aircraft electrician electrical worker job element examination. Pasadena, CA: U.S. Naval Laboratories in California, Test Development and Personnel Research Section.
Ash, R. A. (1978, August). Self assessments of five types of typing ability. Paper presented at the annual meeting of the American Psychological Association, Montreal.
Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.
Bartlett, C. J., Bobko, P., Mosier, S. B., & Hannan, R. (1978). Testing for fairness with a moderated multiple regression strategy: An alternative to differential analysis. Personnel Psychology, 31, 233-241.
Bass, B. M., & Burger, P. (1979). Assessment of managers: An international comparison. New York: Free Press.
Boehm, V. R. (1977). Differential prediction: A methodological artifact? Journal of Applied Psychology, 62, 146-154.
Brogden, H. E. (1949). When testing pays off. Personnel Psychology, 2, 171-183.
Brown, S. H. (1978). Long-term validity of a personnel history item scoring procedure. Journal of Applied Psychology, 63, 673-676.
Brown, S. H. (1981). Validity generalization in the life insurance industry. Journal of Applied Psychology, 66, 664-670.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity generalization. Journal of Applied Psychology, 65, 543-558.
Callender, J. C., & Osburn, H. G. (1981). Testing the consistency of validity with computer-generated sampling distributions of the multiplicative model variance estimate: Results of petroleum industry validation research. Journal of Applied Psychology, 66, 274-281.
Cohen, B., Moses, J. L., & Byham, W. C. (1974). The validity of assessment centers: A literature review. Pittsburgh, PA: Development Dimensions Press.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana, IL: University of Illinois Press.
DeNisi, A. S., & Shaw, J. B. (1977). Investigation of the uses of self-reports of abilities. Journal of Applied Psychology, 62, 641-644.
Dunnette, M. D. (1972). Validity study results for jobs relevant to the petroleum refining industry. Washington, DC: American Petroleum Institute.
Farley, J. A., & Mayfield, E. C. (1976). Peer nominations without peers? Journal of Applied Psychology, 61, 109-111.
Fine, S. A. (1955). A structure of worker functions. Personnel and Guidance Journal, 34, 66-73.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York: Wiley.
Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26, 461-477.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Gordon, L. V. (1973). Work environment preference schedule manual. New York: Psychological Corp.
Hannan, R. L. (1979). Work performance as a function of the interaction of ability, work values, and the perceived environment (Research Report No. 22). College Park: University of Maryland.

Haynes, E. (undated). Tryout of job element examination for shipfitter. Washington, DC: U.S. Civil Service Commission, Personnel Research and Development Center.
Hedges, L. V. (1983). A random effects model for effect sizes. Psychological Bulletin, 93, 388-395.
Helm, W. E., Gibson, W. A., & Brogden, H. E. (1957). An empirical test of shrinkage problems in personnel classification research (Personnel Board Technical Research Note 84). Arlington, VA: Adjutant General's Office, Personnel Research Branch, U.S. Army.
Huck, J. R. (1973). Assessment centers: A review of the external and internal validities. Personnel Psychology, 26, 191-212.
Hunter, J. E. (1979a). An analysis of validity, differential validity, test fairness, and utility for the Philadelphia Police Officers Selection Examination prepared by the Educational Testing Service. (Report to the Philadelphia Federal District Court, Alvarez v. City of Philadelphia)
Hunter, J. E. (1979b). Cumulating results across studies: A critique of factor analysis, canonical correlation, MANOVA, and statistical significance testing. Paper presented at the annual meeting of the American Psychological Association, New York City.
Hunter, J. E. (1980a). The dimensionality of the General Aptitude Test Battery (GATB) and the dominance of general factors over specific factors in the prediction of job performance. Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1980b). Validity generalization and construct validity. In Construct validity in psychological measurement: Proceedings of a colloquium on theory and application in education and measurement (pp. 119-130). Princeton, NJ: Educational Testing Service.
Hunter, J. E. (1980c). Test validation for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1981a). The economic benefits of personnel selection using ability tests: A state-of-the-art review including a detailed analysis of the dollar benefit of U.S. Employment Service placements and a critique of the low-cutoff method of test use. Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1981b). Fairness of the General Aptitude Test Battery (GATB): Ability differences and their impact on minority hiring rates. Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1981c, April). False premises underlying the 1978 Uniform Guidelines on Employee Selection Procedures: The myth of test invalidity. Paper presented to the Personnel Testing Council of Metropolitan Washington, Washington, DC.
Hunter, J. E. (1982, May). The validity of content valid tests and the basis for ranking. Paper presented at the International Personnel Management Association Conference, Minneapolis, MN.
Hunter, J. E., & Schmidt, F. L. (1976). A critical analysis of the statistical and ethical implications of five definitions of test fairness. Psychological Bulletin, 83, 1053-1071.


Hunter, J. E., & Schmidt, F. L. (1982a). The economic benefits of personnel selection using psychological ability tests. Industrial Relations, 21, 293-308.
Hunter, J. E., & Schmidt, F. L. (1982b). Fitting people to jobs: The impact of personnel selection on national productivity. In M. D. Dunnette & E. A. Fleishman (Eds.), Human performance and productivity: Human capability assessment (pp. 233-284). Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Advanced meta-analysis: Quantitative methods for cumulating research findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Pearlman, K. (1982). The history and accuracy of validity generalization equations: A response to the Callender and Osburn reply. Journal of Applied Psychology, 67, 853-858.
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. (1977). Fairness of psychological tests: Implications of four definitions for selection utility and minority hiring. Journal of Applied Psychology, 62, 245-260.
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. (in press). Methodological, statistical, and ethical issues in the study of bias in psychological tests. In C. R. Reynolds (Ed.), Perspectives on bias in mental testing. New York: Plenum Press.
Jackson, G. B. (1980). Methods for integrative reviews. Review of Educational Research, 50, 438-460.
Johnson, J. C., Guffey, W. L., & Perry, R. A. (1980, July). When is a T and E rating valid? Paper presented at the annual meeting of the International Personnel Management Association Assessment Council, Boston, MA.
Kane, J. S., & Lawler, E. E. (1978). Methods of peer assessment. Psychological Bulletin, 85, 555-586.
Katzell, R. A., & Dyer, F. J. (1977). Differential validity revived. Journal of Applied Psychology, 62, 137-145.
King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507-516.
Klimoski, R. J., & Strickland, W. J. (1977). Assessment centers: Valid or merely prescient? Personnel Psychology, 30, 353-361.
Klimoski, R. J., & Strickland, W. J. (1981). The comparative view of assessment centers. Unpublished manuscript, Department of Psychology, Ohio State University.
Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971). Research design and validity assessment. Personnel Psychology, 24, 247-274.
Levine, E. L., Flory, A. P., & Ash, R. A. (1977). Self-assessment in personnel selection. Journal of Applied Psychology, 62, 428-435.
Lilienthal, R. A., & Pearlman, K. (1983). The validity of Federal selection tests for aid technicians in the health, science, and engineering fields (OPRD 83-1). Washington, DC: U.S. Office of Personnel Management, Office of Personnel Research and Development. (NTIS No. PB83-202051)
Linn, R. L., Harnisch, D. L., & Dunbar, S. B. (1981). Validity generalization and situational specificity: An analysis of the prediction of first year grades in law school. Applied Psychological Measurement, 5, 281-289.
Mack, M. J., Schmidt, F. L., & Hunter, J. E. (undated). Dollar implications of alternative models of selection: A case study of park rangers. Unpublished manuscript. (Available from F. Schmidt, Office of Personnel Management, Washington, DC 20415)
Molyneaux, J. W. (1953). An evaluation of unassembled examinations. Unpublished master's thesis, George Washington University, Washington, DC.
Mosel, J. N. (1952). The validity of rational ratings of experience and training. Personnel Psychology, 5, 1-10.

O'Connor, E. J., Wexley, K. N., & Alexander, R. A. (1975). Single-group validity: Fact or fallacy? Journal of Applied Psychology, 60, 352-355.
O'Leary, B. S. (1980). College grade point average as an indicator of occupational success: An update (PRR-80-23). Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center. (NTIS No. PB81-121329)
Pearlman, K. (1982). The Bayesian approach to validity generalization: A systematic examination of the robustness of procedures and conclusions (Doctoral dissertation, George Washington University, 1982). Dissertation Abstracts International, 42, 4960-B.
Pearlman, K., & Schmidt, F. L. (1981, August). Effects of alternate job grouping methods on selection procedure validity. In E. L. Levine (Chair), Job analysis/job formulas: Current perspectives on research and application. Symposium conducted at the annual meeting of the American Psychological Association, Los Angeles.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict training success and job proficiency in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Personnel Research and Development Center. (1981). Alternative selection procedures (Federal Personnel Manual Bulletin 331-3). Washington, DC: U.S. Office of Personnel Management.
Primoff, E. S. (1958). Report on validation of an examination for electrical repairer, McClellan Field, California. Washington, DC: U.S. Civil Service Commission, Personnel Research and Development Center.
Raju, N. S., & Burke, M. J. (in press). Two new procedures for studying validity generalization. Journal of Applied Psychology.
Reilly, R. R., & Chao, G. T. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1-62.
Ricciuti, H. N. (1955). Ratings of leadership potential at the United States Naval Academy and subsequent officer performance. Journal of Applied Psychology, 39, 194-199.
Schmidt, F. L., Berner, J. G., & Hunter, J. E. (1973). Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 58, 5-9.
Schmidt, F. L., Caplan, J. R., Bemis, S. E., Decuir, R., Dunn, L., & Antone, L. (1979). The behavioral consistency method of unassembled examining (TM-79-21). Washington, DC: U.S. Civil Service Commission, Personnel Research and Development Center. (NTIS No. PB80-139942)

Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. (1980). Validity generalization results for computer programmers. Journal of Applied Psychology, 65, 643-661.

Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and employee attitudes. Personnel Psychology, 30, 187-197.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.
Schmidt, F. L., & Hunter, J. E. (1980). The future of criterion-related validity. Personnel Psychology, 33, 41-60.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmidt, F. L., & Hunter, J. E. (1983). Individual differences in productivity: An empirical test of the estimate derived from studies of selection procedure utility. Journal of Applied Psychology, 68, 407-414.
Schmidt, F. L., Hunter, J. E., & Caplan, J. R. (1981a). Selection procedure validity generalization (transportability) results for three job groups in the petroleum industry. Washington, DC: American Petroleum Institute.
Schmidt, F. L., Hunter, J. E., & Caplan, J. R. (1981b). Validity generalization results for jobs in the petroleum industry. Journal of Applied Psychology, 66, 261-273.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. (1979). The impact of valid selection procedures on work force productivity. Journal of Applied Psychology, 64, 609-626.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences and validity of aptitude tests in selection: A red herring. Journal of Applied Psychology, 66, 166-185.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology, 32, 257-281.
Schuh, A. J. (1967). The predictability of employee tenure: A review of the literature. Personnel Psychology, 20, 133-152.
Sharf, J. C. (1981, April). Recent developments in the field of industrial and personnel psychology. Paper presented at the conference sponsored by the Personnel Testing Council of Metropolitan Washington and BNA Systems, Washington, DC.
Thorndike, R. L. (1933). The effect of the interval between test and retest on the constancy of the IQ. Journal of Educational Psychology, 25, 543-549.
Thorndike, R. L. (1971). Concepts of culture fairness. Journal of Educational Measurement, 8, 63-70.
Tucker, M. F., Cline, V. B., & Schmitt, J. R. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, 131-138.
U.S. Employment Service. (1970). Manual for the USES General Aptitude Test Battery, Section III: Development. Washington, DC: U.S. Department of Labor, Manpower Administration.
van Rijn, P., & Payne, S. S. (1980). Criterion-related validity research base for the D.C. firefighter selection test (PRR-80-28). Washington, DC: U.S. Office of Personnel Management, Office of Personnel Research and Development. (NTIS No. PB81-122087)
Vineberg, R., & Joyner, J. N. (1982). Prediction of job performance: Review of military studies. Alexandria, VA: Human Resources Research Organization.

Received January 11, 1984