identifying differentially population lawrence j

13
261 Identifying Test Items That Perform Differentially in Population Subgroups: A Partial Correlation Index Lawrence J. Stricker Educational Testing Service Verbal items on the GRE Aptitude Test were analyzed for race (white vs. black) and sex differ- ences in their functioning, using a new proce- dure—item partial correlations with subgroup standing (race or sex), controlling for total score—as well as two standard methods—compari- sons of subgroups’ item characteristic curves and item difficulties. The partial correlation index agreed with the item characteristic curve index in the proportions of items identified as performing differentially for each race and sex. These two in- dexes also agreed in the particular items that they identified as functioning differentially for the sexes, but not in the items that they identified as perform- ing differently for the races. The partial correlation index consistently disagreed with the item difficulty index in the proportions of items identified as func- tioning differentially and in the particular items in- volved. The items identified by the partial correla- tion index as performing differentially, like the items identified by the other indexes, generally did not differ in type or content from items not so identified, with one major exception: this index identified items with female content as functioning differently for the sexes. Test bias, whether associated with race, sex, or other population subgroups, is a highly complex issue. Bias, in principle, can manifest itself in a variety of ways, and little consensus exists about what constitutes acceptable evidence of its exis- tence or about what are the best means of deal- ing with it (Flaugher, 1978). The problem of minimizing test bias has been attacked at both the score and the item level. One promising ap- proach is to identify, during the course of test construction, items that perform differently for the relevant subgroups, the presumption being that these items are biased. Such items can then be inspected and, if need be, excluded from the test. Item Indexes A variety of procedures has been proposed for this purpose (see the reviews by Jensen, 1980; Merz, 1978; Petersen, 1980; Rudner, Getson, & Knight, 1980; Shepard, 1981). All these methods have two major features in common: (1) they as- sume that the test is unidimensional and that the total score (or the typical item) is relatively free of differential functioning, and (2) they rely on internal analyses of the test to identify de- viant items. Item difficulty. The best known and most widely used procedure involves plots of the cor- responding item difficulty values for subgroups, outlying items being considered as performing differently (e.g., Angoff & Ford, 1973). This method controls for overall differences in the ability of the subgroups by comparing the items with each other; items are identified that diverge Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227 . May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Upload: others

Post on 05-Apr-2022

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Identifying Differentially Population Lawrence J

261

Identifying Test Items That Perform Differentiallyin Population Subgroups: A Partial Correlation Index

Lawrence J. StrickerEducational Testing Service

Verbal items on the GRE Aptitude Test wereanalyzed for race (white vs. black) and sex differ-ences in their functioning, using a new proce-dure—item partial correlations with subgroupstanding (race or sex), controlling for totalscore—as well as two standard methods—compari-sons of subgroups’ item characteristic curves anditem difficulties. The partial correlation indexagreed with the item characteristic curve index inthe proportions of items identified as performingdifferentially for each race and sex. These two in-dexes also agreed in the particular items that theyidentified as functioning differentially for the sexes,but not in the items that they identified as perform-ing differently for the races. The partial correlationindex consistently disagreed with the item difficultyindex in the proportions of items identified as func-tioning differentially and in the particular items in-volved. The items identified by the partial correla-tion index as performing differentially, like theitems identified by the other indexes, generally didnot differ in type or content from items not soidentified, with one major exception: this indexidentified items with female content as functioningdifferently for the sexes.

Test bias, whether associated with race, sex, orother population subgroups, is a highly complexissue. Bias, in principle, can manifest itself in avariety of ways, and little consensus exists aboutwhat constitutes acceptable evidence of its exis-

tence or about what are the best means of deal-

ing with it (Flaugher, 1978). The problem ofminimizing test bias has been attacked at boththe score and the item level. One promising ap-proach is to identify, during the course of testconstruction, items that perform differently forthe relevant subgroups, the presumption beingthat these items are biased. Such items can thenbe inspected and, if need be, excluded from thetest.

Item Indexes

A variety of procedures has been proposed forthis purpose (see the reviews by Jensen, 1980;Merz, 1978; Petersen, 1980; Rudner, Getson, &Knight, 1980; Shepard, 1981). All these methodshave two major features in common: (1) they as-sume that the test is unidimensional and thatthe total score (or the typical item) is relativelyfree of differential functioning, and (2) they relyon internal analyses of the test to identify de-viant items.Item difficulty. The best known and most

widely used procedure involves plots of the cor-responding item difficulty values for subgroups,outlying items being considered as performingdifferently (e.g., Angoff & Ford, 1973). Thismethod controls for overall differences in the

ability of the subgroups by comparing the itemswith each other; items are identified that diverge

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 2: Identifying Differentially Population Lawrence J

262

in their difficulty for the subgroups and, hence,favor one of them.The drawback to this approach is that items

are not uniform in every respect; they differ incharacteristics that, in and of themselves, pro-duce divergencies in difficulty. Items with

greater discriminating power will invariably di-verge more; and those with greater difficulty willdiverge more whenever the subgroups differ intheir tendency to guess.This index is computed, as outlined by Angoff

and Ford (1973), by converting item difficultiesfor each subgroup to normal deviates and, inturn, linearly transforming them to &dquo;deltas.&dquo;The pair of delta values for each item are plottedon a graph and a line of best fit, correspondingto the main axis, is calculated. The perpendicu-lar distance of the location of each item fromthis line is then determined. This distance,which represents the item’s divergence, is con-

ventionally considered to be significant if it ex-ceeds a fixed number of standard deviations of

the distance values for the set of items. Thesedistances will tend to have (1) a mean of zero, be-cause they are deviations from a line of best fit;and (2) a normal distribution, providing that thedistributions of deltas for the two subgroups arebivariate normal.Item characteristic curve. Perhaps the

soundest approach, in principle, compares thesubgroups’ item characteristic curves (ICCs) foreach item, the items with divergent curves beingseen as functioning differentially (e.g., Lord,1977a). This method controls for subgroup dif-ferences in overall ability by comparing the diffi-culty of an item for examinees from the differentsubgroups who are at the same ability level;items are identified that consistently favor a sub-group at various levels or that favor one sub-

group at a particular level of ability and favorthe other subgroup at a different level.One problem with this method is that the ex-

treme regions of the curves are unstable whenthe score distributions for the subgroups areconsiderably different. Unreliability in the totalscores also affects this procedure. A practical

limitation is that each subgroup must be verylarge (perhaps 1,000 or more).As described by Lord (1980), this index is ob-

tained by computing the ICC for each item withthe LOGIST program (Wood, Wingersky, &Lord, 1976) for a three-parameter model. Eachitem’s guessing (c) parameter is determinedfrom the pooled data for the two or more

samples being compared, its discrimination (a)and difficulty (b) parameters are estimated sep-arately for each sample, and. all of the b param-eters are put on the same scale. The significanceof the difference between the curves for a pair ofsamples is then assessed by an approximate X2test that simultaneously evaluates discrepanciesin both a and b parameters.Partial correlation. Another index, though

promising, has not been employed previously inthis connection. This index, suggested by Lord(1977b), is analogous to the partial correlationproposed by Darlington (1971) as one definitionof test bias at the score level. Darlington arguedthat a test can be considered as biased if it has a

significant partial correlation with subgroupstanding (e.g., white vs. black) when the criteri-on is held constant. The present index is the

item’s partial correlation with subgroup stand-ing, total score being held constant. Items withsignificant correlations are interpreted as per-forming differentially. This procedure, like theICC index, controls for subgroup differences inoverall ability. However, unlike the ICC index,this index compares the general difficulty of theitem for the subgroups, essentially averaging dif-ferences between them at the various levels of

ability; items are identified that consistentlyfavor one subgroup.This index is free of the major problems asso-

ciated with other methods, though it shares theassumptions about the unidimensionality of thetest and the lack of differential functioning inthe total score. It also has several practical ad-vantages : the index is simple to compute, ap-plies to comparatively small samples (perhaps asfew as 400 if sampling fluctuations in this indexare to be reduced to a minimum), is flexible (any

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 3: Identifying Differentially Population Lawrence J

263

number of different kinds of subgroups can beaccommodated simultaneously; and the sub-

groups can be represented by dummy-coded cat-egories, e.g., race, or quantitative variables, e.g.,socioeconomic status), provides for a signifi-cance test, and permits a straightforwardevaluation of the size of effect. The main draw-

back of this procedure is that, in contrast to theICC index, it cannot identify items which vary intheir functioning in the subgroups at differentlevels of ability, favoring a subgroup at one pointand penalizing it at another.Operationally, the partial correlation index is

computed by the following formula:

whereri, is the correlation between the item

responses (incorrect or omit = 0,correct = 1) and subgroup stand-ing (e.g., for race, white = 0,black = 1; for sex, male = 0,female = 1);

riToo is the correlation between the item

response and the total score, ad-

justed for item overlap (the item isremoved from the score in cal-

culating the correlation) and cor-rected for attenuation in the score;and

rTooS is the correlation between the totalscore and subgroup standing, cor-rected for attenuation in theformer.

All these correlations are product-moment coef-ficients. In the common case where subgroupstanding is a dichotomous variable, and hencethe correlation between the item and subgroupstanding is a phi coefficient, the significance ofthis partial correlation can be determined withthe special test of significance for a phi coeffi-cient, X2 = N+2, evaluated with one degree offreedom (Guilford, 1965). This test is an ap-

proximation, because variables are not only par-

tialed out of this phi coefficient but also cor-rected for attenuation. In other situations, thesignificance of this partial correlation can be ap-proximated with the standard test of signifi-cance for a partial correlation coefficient.Two points should be noted about this index.

First, the indexes for a set of items will tend tobe normally distributed around a mean of zero,providing that the total score represents the onlyvariance common to the items and subgroupstanding, for this partial correlation is akin tothe residual value in a factor matrix of the itemsand the subgroup variable after extraction of afactor representing the total score (Harman,1976). Second, the magnitude of the indexes fordifferent items cannot be precisely compared,because this measure is based on correlations fora dichotomous item, and these, in turn, are lim-ited by the proportions passing and failing it

(Lord & Novick, 1968; Magnusson, 1967). Thisproblem does not affect the significance of theindex, and hence items can be directly comparedwith regard to their significance (see Appendix).

Studies of Items

A number of studies have examined race andsex differences in the performance of ability testitems, using many kinds of statistical indexes(see the review by Jensen, 1980). Although theseinvestigations are necessarily limited by the

problems in the methods that they employed,consistent findings-by virtue of the differencesin item pools, populations, and specific detailsof the analytic procedures involved-can be in-terpreted, at least tentatively, as substantive re-sults not attributable to statistical artifacts.Much of the research has concerned the Pre-

liminary Scholastic Aptitude Test (PSAT; Don-lon & Angoff, 1971), Scholastic Aptitude Test(SAT; Donlon & Angoff, 1971), and GraduateRecord Examinations (GRE) Aptitude Test

(Conrad, Trismen, & Miller, 1977), which aresimilar in the types of items used and their con-tent. Studies of the PSAT are few in number andlimited in scope. Negligible evidence of differ-

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 4: Identifying Differentially Population Lawrence J

264

ential functioning for whites and blacks was ob-served in overall analyses of this test (Angoff &Ford, 1973; Cleary & Hilton, 1968). Individualitems were not examined, and differential per-formance for males and females was not investi-

gated.The differential functioning of items for

whites and blacks as well as for males and fe-males has been extensively studied on the SAT.Most studies identified few items that performeddifferently (e.g., Coffman, 1961). However, a

noteworthy investigation Marco, and Lord(Lord, 1977a), the only one using the ICC index,identified a large number of verbal items thatfunctioned differently for whites and blacks.The findings about the items uncovered in thesestudies are sparse. In general, items identified asperforming differentially for whites and blacksor for males and females did not favor one raceor sex. A consistent exception is that discreteverbal items (e.g., antonyms, analogies) withpractical content favored males and those withhuman relationships content favored females

(e.g., Coffman, 1961).Race and sex differences in item functioning

have also been investigated on the GRE Apti-tude Test, but only to a limited extent. As in thecase of the SAT research, these investigationsidentified few items and uncovered very littleabout their characteristics (e.g., Donlon, Hicks,& Wallmark, 1980). The major study (Donlon etal., 1980) found that discrete verbal items withhuman relationships content and reading cormprehension items with argumentative content

favored females and that reading comprehen-sion items with science content and data inter-

pretation items favored males.

PurposeThe aim of this study was to explore the use-

fulness of the partial correlation index in identi-.fying race (white vs. black) and sex differences inthe performance of verbal items on the GRE Ap-titude Test. Specifically, the goals were (1) to de-termine whether differentially functioning itemscan be identified by this procedure; (2) to assess

whether they are the same as those identified bythe item difficulty and ICC indexes; and (3) toattempt to ascertain characteristics that distin-guish the items identified by the variousmethods.

Method

Test Form and Items

The 80 operational items in the Verbal sectionof the GRE Aptitude Test (Form ZGR2) wereanalyzed. This form was chosen because it con- =tains reading passages with black as well as fe-male content and because data are available fora large number of examinees, the form havingbeen used in several administrations, beginningin October 1977.

Samplers

Samples of whites and blacks (excluding Chi-canos, Puerto Ricans, and other Hispanics) ofboth sexes were drawn from the 57,458 exam-inees who took this form of the test in the Octo-

ber 1977 and January 1978 administrations. Inorder to minimize extraneous influences of edu-cational background and test-taking experience,the samples were limited to examinees with thecharacteristics listed below.1. Never took the GRE Aptitude Test previ-

ously.2. Had no test irregularity (e.g., examinee did

not request cancellation of scores).3. Undergraduate major in social science. 1

4. Never attended graduate school full time.

1The BIQ classification of social science majors was used:American Studies, Anthropology, Business and Commerce,Communications, Economics, Education (including M.A. inTeaching), Educational Administration, Educational Psy-chology, Geography, Government, Guidance and Counsel-ing, History, Industrial Relations and Personnel, Interna-tional Relations, Journalism, Law, Library Science, PhysicalEducation, Political Science, Psychology, Public Adminis-tration, Slavic Studies, Social Psychology, Social Work, So-ciology, Urban Development (Regional Planning), and OtherSocial Sciences.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 5: Identifying Differentially Population Lawrence J

265

5. Currently a senior or an unenrolled collegegraduate.

6. Native of United States (i.e., U.S. citizen,communicates best in English, and has do-mestic address).

This information was obtained from the Bio-

graphical Information Questions (BIQ) and reg-istration form, completed when the examineesapplied to take the test at these administrations,and from other records about these administra-tions and earlier ones.Four basic samples were obtained for use in

the main analyses of race and sex differences initem performance: white males (lit = 1,122),white females (N - 1,471), black males

(N = 284), and black females (IV = 626). The twowhite samples comprised a randomly selectedone-fourth of the qualified white examinees, andthe black samples contained all the qualifiedblack examinees.Two pseudosamples, each the same size as an

actual sample but composed of a different raceor sex, were also obtained for supplementaryanalyses to verify the approximate significancetests for the partial correlation and ICC indexes.These nonoverlapping samples were randomlydrawn from qualified white males not in thebasic sample: (1) a pseudosample of black males(lV = 284), of the same size as the basic sampleof black males, for comparison with the basicsample of white males in the analyses of race dif-ferences in performance (i.e., the analogue of theanalysis for white and black males, using twogroups of white males); and (2) a pseudosampleof black females (N = 626), of the same size asthe basic sample of black females, for compari-son with the pseudosamples of black males in theanalyses of sex differences in performance (i.e.,the analogue of the analysis for black males andfemales, using two groups of white males). Thesetwo pseudosamples were selected because theycorresponded to the smallest basic sample ineach analysis and because this basic sample,when combined with the other sample in theanalysis, had the smallest number of examinees.Verification of the significance tests is most

critical for these relatively small samples.

Itemlndexes

The partial correlation, item difficulty, andICC indexes were computed as described previ-ously. In calculating the partial correlation in-dex, cases with a &dquo;not reached&dquo; response on anitem were excluded in computing the three com-ponents of the index for that item. In calculatingthe ICC index, the data were pooled for the fourbasic samples in determining the item’s c pa-rameter, and all the b parameters were put onthe scale for white males. The .05 level was usedfor the significance tests of the partial correla-tion and ICC indexes. For the item difficulty in-dex, a distance was considered to be significantif it exceeded 1.96 standard deviations of the dis-tances for the set of items in the same analysis.Because this criterion represents the .95 limits ofa normal distribution, .05 of the distances in ananalysis will be significant, providing only thatthe distances are normally distributed.

Classification of Items by ~ ~~a~ Content

The items’ type and content were categorized,using adaptations of the classification schemesemployed in test construction. The items wereclassified into four types (Antonyms, Analogies,Sentence Completions, and Reading Compre-hension) and six content areas (Aesthetic-Phil-osophical, World of Practical Affairs, Science,Human Relationships, Black, and Female).

Statistical Analysis

Parallel analyses of differential functioningwere carried out for race and sex. Race differ-ences in performance were analyzed separatelyfor males and females; sex differences were

analyzed separately for whites and blacks. Thepartial correlation, item difficulty, and ICC in-dexes were calculated in these analyses for allthe items.The correspondence between the items identi-

°

fied by the three indexes was assessed by un-weighted Kappa coefficients (Cohen, 1960;Fleiss, Cohen, & Everitt, 1969). For this pur-

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 6: Identifying Differentially Population Lawrence J

266

pose, items identified by the partial correlationand item difficulty indexes were classified as sig-nificant and positive (i.e., favoring blacks or fe-males), significant and negative (i.e., favoringwhites or males), or nonsignificant; and itemsidentified by the ICC index were classified assignificant or nonsignificant. However, in com-parisons of the partial correlation or item diffi-culty indexes with the ICC index, the items

identified by the first two indexes, like thoseidentified by the last, were simply classified assignificant or nonsignificant.The association between the items identified

by the indexes and the items’ type or content wasassessed by y2 tests based on the classification ofthe items by type or content and by their signifi-cance, using the previously described categories(significant and positive, significant and nega-tive, and nonsignificant for the partial correla-tion and item difficulty indexes; significant andnonsignificant for the ICC index).

Results and Discussion

Main AnalysesThe number of items identified as significant

or nonsignificant by the three indexes in themain analyses of differential functioning appearin Table 1. The frequency distributions for thepartial correlation index are given in Table 2.In the analyses of race differences in perform-

ance, the partial correlation index identified assignificant, for males as well as females, almostone-half of the items. Most of the indexes wereless than .10 in absolute size, and only one ex-ceeded .20. The indexes were about evenly distri-buted in sign, roughly one-half reflecting func-tioning that favors whites and the other repre-senting functioning that favors blacks. The ICCindex identified as significant about one-fourthof the items for each sex, a substantial butsmaller proportion than that identified by thepartial correlation index. The item difficulty in-dex, in contrast to the others, identified as sig-nificant less than one-tenth of the items for

males as well as females. About an equal num-

ber of distances were positive and negative, one-half representing performance favoring whitesand the remainder reflecting performance favor-ing blacks.In the analyses of sex differences in function-

ing, the partial correlation index identified assignificant more items for whites than blacks.For whites, almost one-half the items were

identified; for blacks, in comparison, only aboutone-quarter were identified. Most of the indexeswere less than .10 in absolute size; only two wereabove .20. The indexes were about evenly distri-buted in sign, one-half reflecting functioningthat favors males and one-half representingfunctioning that favors females. The ICC indexalso identified as significant substantially moreitems for whites than for blacks-over one-thirdfor the former and about one-fifth for the latter.The item difficulty index identified as signifi-cant fewer than one-tenth of the items for whitesand blacks. The distances were about equally di-vided in sign, one-half representing performancefavoring males and the other reflecting function-ing that favors females.One striking finding in both the race and sex

analyses of differential functioning was the over-all consistency with which substantial propor-tions of items were identified as significantacross samples and across the partial correlationand ICC indexes. This agreement in the resultsfor the two measures indicates that the unusual-

ly high proportions of items detected as per-forming differently by the indexes in this

staxdy-well above the level observed in all previ-ous research, except Marco and Lord’s (Lord,1977 a) investigation of race differences with theICC index-cannot be attributed to the specificproperties of the particular measure. Equallynoteworthy, though, the results for the partialcorrelation index indicate that the effects in-volved were weak, judging from the magnitudeof the indexes.An important exception to this trend for siz-

able proportions of items to be identified as sig-nificant is the relatively small proportions iden-tified in the analyses of sex differences forblacks. This divergence in the results for blacks

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 7: Identifying Differentially Population Lawrence J

267

Table 1

Identification of Items by Indexes in Main and Pseudo

Analyses of Differential Functioning (n=80)

_ -~_..~-- __.. _..

Note. In the race analyses, positive indexes signify items favoring blacks and

negative indexes items favoring whites; in the sex analyses, positive indexes sig-nify items favoring females and negative indexes items favoring males.

may be at least in by thesmaller size of the black samples, which neces-affects the magnitude of the partial cor-relation and ICC indexes required for signif-icance.The findings about the roughly symmetrical

distribution of partial correlation indexes in

each analysis imply that only one factor, repre-sented by the total score, accounts for most ofthe variance common to the items and subgroupstanding. This outcome does not bear on the is-sue of whether other common factors exist that

only run through the items themselves and notthrough the subgroup variable too. Such a pos-

Table 2

Frequency Distributions of Partial Correlation Index in

- -

Main Analyses of Differential F~a~~~.l~n~ng (~-80)

Note. In the race analyses, positive indexes signify items favor-

ing blacks and negative indexes items favoring whites; in the sex

analyses, positive indexes signify items favoring females and nega-tive indexes items favoring males.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 8: Identifying Differentially Population Lawrence J

268

sibility is consistent with other research that

points to the factorial complexity of the items(Powers & Swinton, 1981).

P ®~~ ~1~ AnalysesThe number of items identified as significant

or nonsignificant by the indexes in the supple-mentary analyses of differential functioning in-volving the pseudosamples appear in Table 1.

In the analyses of race differences in perfor-mance for the basic sample of white males andthe pseudosample of black males, the partialcorrelation indexes identified as significantfewer than one-tenth of the items. This propor-tion was substantially lower than the cor-

responding proportion of significant itemsidentified by this index in the main analyses.The ICC and item difficulty indexes also identi-fied less than one-tenth of the items, appreciablylower than the proportion for the ICC index inthe main analyses but the same as the cor-

responding proportion for the item difficulty in-dex.

Similarly, in the analyses of sex differences infunctioning for the pseudosample of black malesand the pseudosample of black females, the par-tial correlation index identified as significantfewer than one-tenth of the items. This propor-tion was appreciably lower than the correspond-ing one for the items in the main analyses. TheICC and item difficulty indexes also identifiedless than one-tenth of the items, far below thecorresponding proportion for the ICC index inthe main analyses but identical to the proportionfor the item difficulty index.These results indicate that the significance

tests for the partial correlation and ICC indexeswere reasonably accurate, for the proportions ofsignificant indexes in these analyses for pairs ofsamples of the same race and sex (white males)were close to the proportion expected on thebasis of the significance level employed (.05).These findings about the actual proportions ofsignificant indexes obtained by chance under-

score the point that the substantial proportionsof items identified by the partial correlation andICC indexes in the main analyses were well inexcess of chance and the low proportions identi-fied by the item difficulty index in the mainanalyses were at the chance level.

Correspondence between. Indexes

The number of items consistently identified assignificant and nonsignificant by the differentindexes together with the corresponding Kappacoefficients appear in Table 3 for the main

analyses of differential functioning.In the analyses of race differences in perfor-

mance, none of the indexes significantly(p > .05) agreed with each other in the itemsthat they identified as significant. In the analy-ses of sex differences in functioning, in contrast,the partial correlation and ICC indexes closelyagreed, for both races, in the items identified(Kappa = .62 to .79, p < .01), though the itemdifficulty index did not significantly (p > .05)agree with the two others.The striking divergence in agreement between

the partial correlation and ICC indexes in theanalyses of race and sex differences is note-

worthy. It is puzzling that the two indexes large-ly measured the same thing in one kind of analy-sis but not in another. One speculation is thatthe ICC index identified in the analyses of racedifferences, but not in the analyses of sex differ-ences, items that favored a subgroup at one abil-ity level but favored a second subgroup at an-other ability level. Hence, the partial correlationindex, not being sensitive to such effects, identi-fied different items than the ICC index in the

analyses of race differences, but both indexesidentified many of the same items in the analy-ses of sex differences where these effects wereabsent. The complete lack of agreement betweenthe items identified by the item difficulty indexand by the others primarily reflects the dis-

parities in the proportions of items identified assignificant by these indexes.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 9: Identifying Differentially Population Lawrence J

269

Table 3

Agreement in Identification of Items by Different Indexesin Main Analyses of Differential Functioning (n=80)

-~

, Significant at the .01 level.

Association with Item and Content

The number of items identified as significantand nonsignificant by the indexes in the mainanalyses of differential functioning for whitesand blacks are classified by item type and itemcontent in Table 4. The x2s also appear in thistable. (x2s were not computed for the item diffi-culty index because of the small number of itemsidentified as significant.) The corresponding sta-tistics for the analyses of differential perfor-mance for males and females appear in Table 5.In the analyses of race differences in perfor-

mance, the items identified by the partial cor-relation index were significantly (p < .05) asso-ciated, for males only, with item type and con-tent. The relationship with item type was mainlydue to large numbers of Antonyms and smallnumbers of Reading Comprehension items withpositive indexes, which represent functioningthat favors blacks. The link with item content

was primarily due to large numbers of Aesthetic-

Philosophical and Human Relationships itemsas well as small numbers of World of PracticalAffairs items with positive indexes. None of theassociations for items identified by the ICC in-dex were significant (p > .05).

In the analyses of sex differences in function-ing, the items identified by the partial correla-tion index were significantly (p < .05) associ-

ated, for blacks only, with item type and, forwhites as well as blacks, with content. The con-nection with item type was mainly due to largenumbers of Reading Comprehension items withpositive indexes, which reflect performance thatfavors females; and small numbers of ReadingComprehension items and large numbers of

Analogies with negative indexes, which repre-sent functioning that favors males. The relation-ship with item content, for whites, was mainlydue to large numbers of Female items with posi-tive indexes. For blacks, the link with item con-tent was also primarily due to large numbers ofFemale items with positive indexes as well as to

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 10: Identifying Differentially Population Lawrence J

270

~

fi

£U£

C’qC

.~(Jc:

~

.~

&dquo;’&dquo;c:

~t

U6@o

CJ)OJCJ)

roc:c:(

.~

COx

fi<:t CJ)

OJ

~ ’~..0 0COr U

I- 2

cowPc:2

c:8w

1:)c:

OJ

I-

::iCJ)U

1OJ

oS>.

..0

(J)

E

~@o

c:.~

&dquo;’&dquo;col

&Scaron;Pc:OJ

;S

QQ)

L

0’1

Wo>

(J)e

~(J)

~Q)-0

.f.:iill

’q<0oQ)C

-0C<0

(J)

(J<0

::õ

oImo

<0

(J)E

£z o o

&dquo;- -1 ...,..., Q) Q)c > >

.~ ~ ~ Q1

(J) ’&dquo;

(I) 00

a o o

I U aill Q) Q)-0 L L

~ ~ ..¡.JP P

c0U m m

.~ ~ ~Ic C C~ to m(I) U Up I M

C C° o cl

I il .;ro * *Z *

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 11: Identifying Differentially Population Lawrence J

271

r,11

i )I

I o<o! [

IQ i

fn1

y« £ z I I J;l# ’ ~ l

~CO (§ !

£ I

i i ~~ i

/

4-o ((

< « # ~ l

i I l

~

§fl Q

A1

£ o

~ UP

P # ~PZ

8

#U

§Q

NG7

(/)E

~....o

c:

.~

~,.,..¡ .

~

£ Pa

c:OJ

~

I

~i

~

(/)

~ccE

CJ’B.~

o>CC

OJE

~OJ

~’&dquo;u

.S m.~ >

CC / C m llCJ’B

c:

) c11 m

N m

Nm~~ s,&dquo;-r

Iio

#1 ’

s

I, ;~~ .3 ~- ’«

z 8 o

o

8 w

<

c > >CJ’B QJ <II

OJ<J&dquo;&dquo;B ,....

m 13 0i ° °x

o o<II Q) <II’O .C L

e p -t-J

~ .;.Jlu CD CD

’....;.J »

C c #I OJ u u

/8 S Sj c c c

) cm m

In (n * * C1

o >I< >I<

¡Z >I<

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 12: Identifying Differentially Population Lawrence J

272

large numbers of Science items with negative in-dexes. The items identified by the ICC indexwere significantly (p < .05) associated, for whitesonly, with item content. This link was mainlydue to the identification of large numbers of Fe-male items and small numbers of Human Rela-

tionships items. None of the associations withitem type for items identified by this index weresignificant (p > .05).The overall lack of association between the

items identified by the indexes in the analyses ofrace and sex differences and the items’ type andcontent is congruent with the sparse findings inprevious work discussed earlier. A major excep-tion is that the partial correlation index consis-tently identified, for both races, items with fe-male content which favored females. However,apart from the finding, obtained for blacks only,that this same index identified items with sci-

ence content which favored males, the resultsdid not confirm other relationships suggested byearlier research: items with practical content

favoring males and items with human relation-ships content favoring females.The tendency for the limited associations ob-

served between the items that were identifiedand their type and content to center around thepartial correlation index and content deservesattention. The involvement of the partial cor-relation index provides some support for thevalidity of this procedure. In addition, the linkwith content indicates that this classificationscheme is not only meaningful but also has somevalue in identifying differentially performingitems; the contrasting lack of association withitem type implies either that this means of cate-gorizing items is not sensible or that this charac-teristic is unrelated to differential functioning inthese items.

Conclusion

The present evidence concerning the partialcorrelation index’s agreement with the ICC in-dex and associations with item content, alongwith its inherent ease of calculation, applicabil-

ity to small samples, flexibility in accommodat-ing different kinds of subgroups, and provisionfor an appraisal of statistical significance andmagnitude of effect, suggest that this measuremay be useful in future work in this area.

References

Angoff, W. H., & Ford, S. F. Item-race interaction ona test of scholastic aptitude. Journal of Education-al Measurement, 1973,10, 95-106.

Cleary, T. A., & Hilton, T. L. An investigation of itembias. Educational and Psychological Measure-ment, 1968, 28, 61-75.

Coffman, W. E. Sex differences in responses to itemsin an aptitude test. In E. M. Huddleston (Ed.),Eighteenth Yearbook, National Council on Mea-surement in Education. Ames IA: National Coun-cil on Measurement in Education, 1961.

Cohen, J. A. A coefficient of agreement for nominalscales. Educational and Psychological Measure-ment, 1960, 20, 37-46.

Conrad, L., Trismen, D., & Miller, R. (Eds.). Gradu-ate Record Examinations technical manual.Princeton NJ: Educational Testing Service, 1977.

Darlington, R. B. Another look at "cultural fair-ness." Journal of Educational Measurement,1971, 8, 71-82.

Donlon, T. F., & Angoff, W. H. The Scholastic Apti-tude Test. In W. H. Angoff (Ed.), The CollegeBoard admissions testing program: A technical re-port on research and development activities relat-ing to the Scholastic Aptitude Test and achieve-ment tests. New York: College Entrance Examina-tion Board, 1971.

Donlon, T. F., Hicks, M. M., & Wallmark, M. M.Sex differences in item responses on the GraduateRecord Examination. Applied Psychological Mea-surement, 1950, 4, 9-20.

Flaugher, R. L. The many definitions of test bias.American Psychologist, 1975, 33, 671-679.

Fleiss, J. L., Cohen, J., & Everitt, B. S. Large samplestandard errors of Kappa and weighted Kappa.Psychological Bulletin, 1969, 72, 323-327.

Guilford, J. P. Fundamental statistics in psychologyand education (4th ed.). New York: McGraw-Hill,1965.

Harman, H. H. Modern factor analysis (3rd ed.).Chicago: University of Chicago Press, 1976.

Jensen, A. R. Bias in mental testing. New York: FreePress, 1980.

Lord, F. M. A study of item bias, using item charac-teristic curve theory. In Y. H. Poortinga (Ed.),

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 13: Identifying Differentially Population Lawrence J

273

Basic problems in cross-cultural research. Am-sterdam: Swets & Zeitlinger, 1977. (a)

Lord, F. M. Personal communication, November 29,1977. (b)

Lord, F. M. Personal communication, July 13, 1979.Lord, F. M. Applications of item response theory to

practical testing problems. Hillsdale NJ: Erlbaum,1980.

Lord, F. M., & Novick, M. R. Statistical theories ofmental test scores. Reading MA: Addison-Wesley,1968.

Magnusson, D. Test theory. Reading MA: Addison-Wesley, 1967.

Merz, W. R. Test fairness and test bias: A review ofprocedures. In M. J. Wargo & D. R. Green (Eds.),Achievement testing of disadvantaged and minor-ity students for educational program evaluation.Monterey CA: CTB/McGraw-Hill, 1978.

Petersen, N. S. Bias in the selection rule&mdash;Bias in thetest. In L. J. T. van der Kamp, W. F. Langerak, &D. N. M. de Gruijter (Eds.), Psychometrics foreducational debates. New York: Wiley, 1980.

Powers, D. E., & Swinton, S. S. Extending the mea-surement of graduate admission abilities beyondthe verbal and quantitative domains. Applied Psy-chological Measurement, 1981, 5, 141-158.

Rudner, L. M., Getson, P. R., & Knight, D. L. Biaseditem detection techniques. Journal of EducationalStatistics, 1980, 5, 213-233.

Shepard, L. A. Identifying bias in test items. In B. F.Green (Ed.), Issues in testing: Coaching, disclo-sure, and ethnic bias. San Francisco: Jossey-Bass,1981.

Wood, R. L., Wingersky, M. S., & Lord, F. M.LOGIST&mdash;A computer program for estimatingexaminee ability and item characteristic curve pa-rameters (ETS RM 76-6). Princeton NJ: Educa-tional Testing Service, 1976.

Acknowledgments

This study was supported in part by the GraduateRecord Examinations Board. Portions of this articlewere presented at the meeting of the Society of Multi-variate Experimental Psychology, Los Angeles, No-vember 1979. Thanks are due Douglas N. Jackson,Frederic M. Lord, Ledyard R. Tucker, and MadelineM. Wallmarkfor advising about the research design

and statistical analysis; Frans C’. van der Lee for ar-ranging the retrieval of the data; Henrietta Gallagherfor statistical calculating; rS’atya P. Agarwal, Kather-ine Kornhauser, Barbara .I. Phillips, ~4lf~ed Rogers,Caroline Roth, and Marilyn S. Wingersky for com-puter programming; and Donald L. Alderman,Robert A. Altman, William H. ~4ngoff, Mary JoClark, Nancy S. Cole, Thomas R Donlon, Bert F.Green, Neal M. Kingston, Frederic M. Lord, andC’heryl L. Wild for critically reviewing drafts of thisarticle.

Author’s Address

Send requests for reprints or further information toLawrence J. Stricker, Educational Testing Service,Princeton NJ 08541.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/