
Behaviormetrika, Vol.34, No.1, 2007, 1–26

SELECTION OF ITEM RESPONSE MODEL BY GENETIC ALGORITHM

Kojiro Shojima∗

A method of selecting an item response model with a genetic algorithm is proposed, where a model indicator variable is regarded as a chromosome that distinguishes one individual from another. This scheme enables a model for each item to be selected automatically. The genetic algorithm with the set of techniques implemented here is called the simple genetic algorithm, and the results obtained from simulation studies were satisfactory. Simulation studies and numerical examples were also used to examine which of two prevailing models, the graded response model and the generalized partial credit model, was the more useful. The simulation results showed that the graded response model fit the data more flexibly, since it fit data generated under the generalized partial credit model more frequently than in the opposite case. However, the generalized partial credit model was more suitable for two real data sets.

1. Introduction

Item response theory (IRT) (Lord, 1980; Hambleton & Swaminathan, 1985; van der Linden & Hambleton, 1997) is a very powerful methodology for examining the characteristics of ability tests and psychological questionnaires. IRT was first developed to describe the parametric relationships between the abilities of test takers and correct answers to true/false items. Recently, various models that can be applied to nominal, categorical ordered, and continuous scores, as well as their multidimensional extensions, have been developed by many researchers. There are many polytomous models for categorical ordered scores: the graded response model (GRM; Samejima, 1969), the rating scale model (Andrich, 1978), the partial credit model (PCM; Masters, 1982), the generalized PCM (GPCM; Muraki, 1992), the steps model (Verhelst, Glas & de Vries, 1997), and the sequential model (Tutz, 1990). Each model was constructed for the phenomenon of concern, although Thissen & Steinberg (1986) and van der Ark (2001) classified them into a few categories. This abundance of models is attractive, since we can deal with various types of phenomena. However, test designers are sometimes not confident about which polytomous model for categorical ordered scores to select from the many available, because it is not always apparent that the model under consideration is the most compatible with the phenomenon at hand.

It is easy to choose the best model where the response format specifies it. For example, the nominal categories model (NCM; Bock, 1972) is the most suitable, or more precisely, the only model for multiple-choice items. When such items have a single correct alternative, a 2- or 3-parameter model for binary data can also be applied to them, even though the response format is multiple-choice with nominal categories. There are many such practical applications. Also, Wang & Zeng (2000) applied Samejima's (1973, 1974) continuous response model (CRM) to ordinary data with a relatively large number of categories. Furthermore, Shojima & Toyoda (2004) applied CRM to essay questions with a scoring range of [0,30].

Key Words and Phrases: item response theory, genetic algorithm, genetic programming, polytomous model, graded response model, generalized partial credit model, EM algorithm

∗ Research Division, National Center for University Entrance Examinations, 2–19–23 Komaba Meguro–ku, Tokyo 153–8501, Japan. E-mail: [email protected]

Samejima (1996) noted that it is important to examine how well a model fits the psychological process presumed to underlie the data. However, it is difficult to determine the most suitable model in this way, especially for new items whose properties are not well known. Various models should be applied to such items, and model-data fit indices should be computed as material for deciding on a model. Moreover, the IRT model applied can be varied item by item, as done by Thissen (1991), Ogasawara (1998), and Shojima & Toyoda (2004).

The principle of one model for one test currently prevails, and there are only a few examples of applying various models to items, mainly because the response format is uniform across items in the same test or psychological questionnaire. If GRM is suitable for some items and GPCM for others in the same test, either of them is adopted for all items in many cases. Similarly, the 3-parameter model for dichotomous responses may be applied to all items in a test even if NCM fits some items more closely.

Such a policy or philosophy has been welcomed or tacitly accepted by many test practitioners and psychologists. However, the philosophy of a better model for each item would also be acceptable. From the standpoint of the phenomena, the response process is very likely to vary across items even if the response format is identical, because the response process is determined not by the response format but by the qualitative content of the item. Therefore, a situation where the applied model differs across items is not surprising, but natural. Applying one model a priori to a test or questionnaire should not be recommended, especially for items whose properties are not well known. At least in the phase of developing a test, applying various models to items effectively deepens our knowledge about the items. It is especially necessary to apply various patterns of models to items with categorical ordered scores, since there are several candidate polytomous models, as mentioned above.

Let us assume a situation with three models, $M_1$, $M_2$, and $M_3$, and five items. The concern is then which model is preferable for each item. More precisely, which combination is more desirable for the five items, for example, ($M_1$, $M_1$, $M_2$, $M_3$, $M_3$) or ($M_2$, $M_2$, $M_3$, $M_1$, $M_2$)? This selection cannot be done item by item, since each item under a certain IRT model is related to the other items through the latent traits, θs. In other words, items are treated as locally independent under IRT, but they are inevitably globally dependent.

However, there are $K^n$ patterns of model combinations when there are $n$ items and $K$ models. It is impossible to explore all patterns in practice when $n$ and $K$ are large. Moreover, many patterns do not even need to be tested. Therefore, an efficient method of searching for the best or most acceptable pattern is required. The purpose of this study is to propose a framework for searching for patterns that fit the test using a genetic algorithm (GA) (Holland, 1975; Goldberg, 1989; Vose, 1999). GA is a search algorithm based on the mechanics of natural selection and natural genetics (Goldberg, 1989), and is a powerful tool for solving combinatorial optimization problems.
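To give a sense of how quickly the $K^n$ search space grows, the following minimal sketch simply evaluates the count for a few sizes. The function name `n_patterns` and the sizes other than three models with five items are illustrative assumptions, not part of the original study.

```python
def n_patterns(n_items: int, n_models: int) -> int:
    """Number of ways to assign one of n_models models to each of n_items items."""
    return n_models ** n_items


# Three models and five items, as in the example in the text.
small = n_patterns(5, 3)
# Two models and twelve items (an illustrative size).
medium = n_patterns(12, 2)
# Three models and fifty items: far too many patterns to enumerate.
large = n_patterns(50, 3)
```

Even the medium case would require fitting several thousand IRT models if searched exhaustively, which is the motivation for a guided search.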

A few applications of GA to multivariate analysis have recently been seen. Marcoulides & Drezner (2001) adopted GA for model specification within the context of structural equation modeling (Jöreskog & Sörbom, 1993). In the field of test theory, Jiang and Tang (1998) implemented GA to improve the estimation of item parameters for the 3-parameter logistic model. In addition, Fujimori (1999) applied GA to select items for a psychological questionnaire so as to make Cronbach's (1951) alpha coefficient sufficiently high.

2. Method

Let θ be the latent trait, or ability, which is assumed to take on any real number ($\theta \in \mathbb{R}$). Let $j$ denote an item, which is the smallest unit for measuring the latent trait concerned. Let us also assume that there are $K$ candidates for IRT models: $M_1, \cdots, M_K$. Furthermore, let $X_j$ be a random variable of a response to item $j$, and $x_j$ be its realization. Then, the probability of a test taker with latent trait θ obtaining $X_j = x_j$ becomes

$$\Pr(X_j = x_j \mid \theta, M_k) = P_{kx_j}(\theta \mid \lambda_{kj}) \tag{1}$$

under the $k$th model, where $\lambda_{kj}$ is the item parameter set for item $j$ under model $k$. For generality, $x_j$ in the equation above allows any type of variable such as nominal, ordinal, or continuous.

Let $N$ be the number of test takers and $n$ the number of items. Also, let $w_{jk}$ be a dichotomous indicator that is coded 1 if the $k$th model is applied to item $j$, and 0 otherwise. Then, the likelihood of the data matrix $X = \{x_{ij}\}$ ($N \times n$) becomes

$$p(X \mid \theta, \Lambda, W) = \prod_{i=1}^{N} \prod_{j=1}^{n} \prod_{k=1}^{K} P_{kx_j}(\theta_i \mid \lambda_{kj})^{w_{jk}}, \tag{2}$$

where $\theta = [\theta_1 \cdots \theta_N]'$, $\Lambda = [\Lambda_1, \cdots, \Lambda_K]$, $\Lambda_k = [\lambda_{k1}, \cdots, \lambda_{kn}]$ ($k = 1, 2, \cdots, K$), and $W = \{w_{jk} \mid w_{jk} \in \{0, 1\}\}$ ($n \times K$). Item parameters can be estimated by optimizing the function above using the marginal maximum likelihood estimation method (e.g., Bock & Aitkin, 1981; Tsutakawa, 1984; Mislevy, 1986; Tsutakawa & Lin, 1986; Tanner, 1993; Wang & Zeng, 1998) with the EM algorithm (Dempster, Laird & Rubin, 1977).

Nuisance parameters (trait parameters in IRT) are integrated out in the EM algorithm, and the log-likelihood function of the structural parameters (item parameters in IRT) is given as the objective function to be maximized. That is,

$$\begin{aligned} \ln p(\Lambda \mid X, W) &= E_{\theta \mid \Lambda^{(s)}, X, W}\left[\ln p(\Lambda \mid \theta, X, W)\right] \\ &= \sum_{i=1}^{N} \int_{\Theta_i} \ln p(x_i \mid \theta_i, \Lambda, W)\, p(\theta_i \mid \Lambda^{(s)}, x_i, W)\, d\theta_i + \ln p(\Lambda) + \text{const.}, \end{aligned} \tag{3}$$

where $\Theta_i$ is the parameter space of $\theta_i$, $\Lambda^{(s)}$ is the estimate of Λ obtained at the $s$th M-step, $p(\Lambda)$ is the prior distribution of Λ, and $x_i = [x_{i1} \cdots x_{in}]'$. In addition, $p(x_i \mid \theta_i, \Lambda, W)$ and $p(\theta_i \mid \Lambda^{(s)}, x_i, W)$ in (3) are

$$p(x_i \mid \theta_i, \Lambda, W) = \prod_{j=1}^{n} \prod_{k=1}^{K} P_{kx_j}(\theta_i \mid \lambda_{kj})^{w_{jk}} \tag{4}$$

and

$$p(\theta_i \mid \Lambda^{(s)}, x_i, W) = \frac{p(x_i \mid \theta_i, \Lambda^{(s)}, W)\, p(\theta_i)}{\int_{\Theta_i} p(x_i \mid \theta_i, \Lambda^{(s)}, W)\, p(\theta_i)\, d\theta_i}, \tag{5}$$

respectively, where $p(\theta_i)$ is the prior distribution of $\theta_i$.

To implement GA in practice, settings for the chromosome, initial population, fitness, selection, crossover, and mutation are needed. The GA characterized by these settings is called the simple genetic algorithm (SGA) (Vose, 1999).

2.1 Chromosomal Representation

$W$ can express various patterns, and which pattern of $W$ (hereafter, individual; note that actual respondents are called test takers) attains better fitness, as discussed below, is observed through generations (times). Each individual takes one of $K^n$ patterns. For example, individual $g$ ($= 1, 2, \cdots, G$) takes a pattern as follows:

$$W^g = \{w^g_{jk} \mid w^g_{jk} \in \{0, 1\}\} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 1 & \cdots & 0 \end{bmatrix}' \quad (n \times K), \tag{6}$$

where $w^g_{jk}$ is $w_{jk}$ of individual $g$. Equation (6) reveals that individual $g$ applies model 1 to item 1, model 2 to item 2, model $K$ to item 3, and so on. Here, as all the characteristics of individual $g$ are reflected in $W^g$, $W^g$ is said to be the chromosome of individual $g$.

Also,

$$\xi_g = [1 \;\; 2 \;\; K \;\; \cdots \;\; 3]' \quad (n \times 1) \tag{7}$$

is easier to program as the chromosome than $W^g$. Therefore, $\xi_g$ ($\in \mathbb{N}_K^n$; $\mathbb{N}_K \equiv \{1, 2, \cdots, K\}$) will be used as the chromosome to be operated on from now on. The $j$th element of $\xi_g$, $\xi_{gj}$ ($\in \mathbb{N}_K$), is said to be the locus, which is coded $k$ when $w^g_{jk} = 1$.
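The correspondence between the indicator matrix $W^g$ and the label vector $\xi_g$ can be sketched as follows. This is only an illustration: the helper names `w_to_xi` and `xi_to_w` and the small 4 × 4 example are assumptions introduced here, with $W$ stored in its $n \times K$ orientation (one row per item).

```python
def w_to_xi(W):
    """Convert an n x K 0/1 indicator matrix (rows = items) into model labels 1..K."""
    xi = []
    for row in W:
        if sum(row) != 1:
            raise ValueError("each item must be assigned exactly one model")
        xi.append(row.index(1) + 1)  # column position gives the model label
    return xi


def xi_to_w(xi, K):
    """Convert a list of model labels back into the n x K indicator matrix."""
    return [[1 if k == label else 0 for k in range(1, K + 1)] for label in xi]


# Hypothetical example with n = 4 items and K = 4 models.
W = [
    [1, 0, 0, 0],  # item 1 -> model 1
    [0, 1, 0, 0],  # item 2 -> model 2
    [0, 0, 0, 1],  # item 3 -> model K
    [0, 0, 1, 0],  # item 4 -> model 3
]
xi = w_to_xi(W)
```

The label vector stores $n$ integers instead of $n \times K$ indicators, which is why it is the more convenient object for crossover and mutation.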

2.2 Generating Initial Population

Let $\xi^{(t)}_g$ denote the chromosome of the $g$th individual at generation $t$. The initial population of $G$ chromosomes, $\xi^{(1)}_g$ ($g = 1, 2, \cdots, G$), is required to start GA, where $G$ is the number of individuals at each generation. Parameter $G$ can be varied through generations, but it is fixed at a certain natural number in many cases. The patterns of the $\xi^{(1)}_g$s become more varied as the value of $G$ increases. Therefore, the possibility of getting trapped in local optima decreases in searching for global optima. However, the computational burden is heavy when $G$ is large. Finally, drawing the value randomly from $\mathbb{N}_K$ is sufficient for the feature value (allele) of each locus, if there is no prior information about the locus (item).
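Random generation of the initial population might look like the following minimal sketch. The function name `initial_population`, the fixed seed, and the particular sizes are assumptions for illustration; alleles are drawn uniformly from $\{1, \cdots, K\}$, as suggested above for the case with no prior information.

```python
import random


def initial_population(G, n, K, rng):
    """Return G chromosomes of length n with alleles drawn uniformly from 1..K."""
    return [[rng.randint(1, K) for _ in range(n)] for _ in range(G)]


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
population = initial_population(G=16, n=12, K=2, rng=rng)
```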

2.3 Fitness Evaluation

Who shall live and who shall die is determined by each individual's fitness. An individual with a low fitness level is regarded as unsuitable under the given circumstances. Such individuals are therefore not given the chance to mate with another individual, which is the operation needed to generate their children's chromosomes for the next generation. Let $f^{(t)}_g = f(\xi^{(t)}_g)$ be individual $g$'s fitness at generation $t$ evaluated by a certain measure, $f$.

Various indices seem to be useful for $f$. The most straightforward index would be the $\chi^2$ statistic, which is obtained in the process of maximum likelihood estimation and is an index of the distance between the data and the model. However, a comparison of models with different numbers of parameters favors models with more parameters, since, generally speaking, a smaller $\chi^2$ statistic can easily be obtained by increasing the number of parameters in a model. In this case, information criteria such as Akaike's (1974) information criterion (AIC), Bozdogan's (1987) consistent AIC (CAIC), and Schwarz's (1978) Bayesian information criterion (SBIC) are effective, because they penalize overfitting by redundant parametrization in the model. These information criteria are formulated as follows:

$$\mathrm{AIC} = \chi^2 - 2\,df, \tag{8}$$

$$\mathrm{CAIC} = \chi^2 - (\ln N + 1)\,df, \tag{9}$$

and

$$\mathrm{SBIC} = \chi^2 - (\ln N)\,df, \tag{10}$$

where $N$ is the sample size and $df$ is the degrees of freedom. According to (8)–(10), a smaller value of an information criterion stands for a higher level of an individual's fitness.

The purpose in the problem setting of this study is to apply an appropriate model to each item. Therefore, an individual that comes nearer to this goal without redundant parameters obtains smaller values of the information criteria.
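Equations (8)–(10) translate directly into code. In the sketch below, the function name `information_criteria` and the input values ($\chi^2 = 300$, $df = 240$, $N = 1000$) are hypothetical and serve only to show the calculation.

```python
import math


def information_criteria(chi2, df, N):
    """Return AIC, CAIC and SBIC as defined in Eqs. (8)-(10); smaller is better."""
    return {
        "AIC": chi2 - 2 * df,
        "CAIC": chi2 - (math.log(N) + 1) * df,
        "SBIC": chi2 - math.log(N) * df,
    }


# Hypothetical fit statistics for one individual.
ic = information_criteria(chi2=300.0, df=240, N=1000)
```

With these inputs all three criteria are negative, which illustrates why they cannot be plugged into a selection rule that requires positive fitness values.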

2.4 Selection

The selection operation models natural selection in Darwinism. Numerous methods have been proposed, such as roulette wheel selection, uniform ranking selection, linear ranking selection, tournament selection, and steady-state selection. Roulette wheel selection is the most prevalent, and is believed to be similar to the mechanism occurring in nature (Zhang & Kim, 2000). Assuming that $p^{(t)}_g$ is the survival ratio of individual $g$ at generation $t$, roulette wheel selection is expressed as

$$p^{(t)}_g = \frac{f^{(t)}_g}{\sum_{g} f^{(t)}_g}, \tag{11}$$

provided that each evaluated fitness is larger than 0 ($f^{(t)}_g > 0$; $g = 1, 2, \cdots, G$; $t = 1, 2, \cdots, T$), where the $T$th generation is the final generation. In addition, a larger $f$ must stand for a higher level of fitness, since an individual with a larger $f$ acquires a higher survival rate.

In this study, the $\chi^2$ statistic or information criteria are used as $f^{(t)}_g$. However, according to (8)–(10), a better model-data fit without redundant parameters leads to a smaller $\chi^2$ statistic and smaller information criteria. In addition, information criteria are usually obtained as negative values. Therefore, roulette wheel selection cannot be used directly. It is possible to adopt the $\chi^2$ statistic or information criteria after scaling (Goldberg, 1989), which rescales an $f^{(t)}_g$ that cannot be used as it is. Typical ways of doing this are linear scaling, sigma truncation, and power law scaling. However, each scaling method includes various parameters to be tuned, which makes the method more complicated.

Within the context of this study, uniform ranking selection (Schwefel, 1995) and linear ranking selection (Baker, 1985) are easy to use because no manipulation is required for the fitness value itself. Assume that $f^{(t)}_g$ is the $h$th best value among $G$ individuals. Then, in uniform ranking selection, the survival rate of individual $g$ at generation $t$ is given by

$$p^{(t)}_g = \begin{cases} 1/S, & 1 \le h \le S \\ 0, & S < h \le G, \end{cases} \tag{12}$$

where $S$ ($< G$) is the number of individuals to survive that have the possibility of mating with one of the other surviving individuals in the next step. A selection probability of $1/S$ is assigned to each of the best $S$ individuals, and the rest are discarded. In addition, the proportion of $S$ over $G$, $R_s = S/G$, is often called the selection rate.

In linear ranking selection, the survival rate of individual $g$ at generation $t$ is defined as

$$p^{(t)}_g = \begin{cases} \{\eta - 2(\eta - 1)(h - 1)/(S - 1)\}/S, & 1 \le h \le S \\ 0, & S < h \le G. \end{cases} \tag{13}$$

An individual with a smaller $h$ ($\le S$) is given a higher survival rate, while the remainder, whose $h$ is larger than $S$, cannot survive. Parameter η ($1.0 \le \eta \le 2.0$) in the equation above is the selection pressure. As the survival ratios of all $S$ individuals become equal when η is 1.0, there is then no difference between uniform and linear ranking selection. As η approaches 2.0, a higher survival ratio is allotted to an individual with a smaller $h$ ($\le S$).
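Equations (12) and (13) can be sketched in one function, since uniform ranking is the special case η = 1.0 of linear ranking. The function name `ranking_probabilities` and the fitness values below are hypothetical; smaller fitness values are treated as better, as with the $\chi^2$ statistic, and $S \ge 2$ is assumed so that the division by $S - 1$ is defined.

```python
def ranking_probabilities(fitness, S, eta=1.0):
    """Survival probabilities under ranking selection (smaller fitness is better).

    eta = 1.0 gives uniform ranking selection, Eq. (12);
    1.0 < eta <= 2.0 gives linear ranking selection, Eq. (13).
    """
    G = len(fitness)
    if not 2 <= S < G:
        raise ValueError("this sketch assumes 2 <= S < G")
    # Indices of individuals ordered from best (rank h = 1) to worst.
    order = sorted(range(G), key=lambda g: fitness[g])
    probs = [0.0] * G  # individuals ranked below S keep probability 0
    for h, g in enumerate(order[:S], start=1):
        probs[g] = (eta - 2 * (eta - 1) * (h - 1) / (S - 1)) / S
    return probs


# Hypothetical chi-square values for G = 6 individuals.
chi2_values = [310.2, 295.7, 330.1, 301.4, 299.9, 340.0]
uniform = ranking_probabilities(chi2_values, S=3)
linear = ranking_probabilities(chi2_values, S=3, eta=2.0)
```

In both cases the probabilities of the $S$ survivors sum to one, and no rescaling of the fitness values themselves is needed.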

2.5 Crossover

Chromosomes of individuals at the next, $(t+1)$st, generation, $\xi^{(t+1)}_g$ ($g = 1, 2, \cdots, G$), are created in the crossover process by operating on those of individuals selected in the previous generation. Strings are randomly coupled to crossbreed, with individuals with higher survival rates having more opportunities to mate.

One-point crossover or two-point crossover is often adopted to generate the chromosomes of the children. Assuming that $\xi_A$ and $\xi_B$ are the chromosomes of a couple, in one-point crossover their chromosomes are exchanged at a randomly chosen point. Let $z_1 = [1'_{n_1} \; 0'_{n - n_1}]'$ be an $n \times 1$ vector, where $n_1$ is a natural number randomly drawn from $\mathbb{N}_{n-1} \equiv \{1, 2, \cdots, n - 1\}$, and $1_p$ and $0_p$ are vectors of size $p$ in which all elements are 1s and 0s, respectively. Then, the chromosomes of their children, $\xi_a$ and $\xi_b$, are created as

$$\xi_a = z_1 \odot \xi_A + (1_n - z_1) \odot \xi_B \tag{14}$$

and

$$\xi_b = (1_n - z_1) \odot \xi_A + z_1 \odot \xi_B, \tag{15}$$

where $\odot$ denotes the element-wise product.

Similarly, there are two cutoff points in $z$ under two-point crossover. Here, $z_2 = [1'_{n_1} \; 0'_{n_2} \; 1'_{n - n_1 - n_2}]'$ is substituted for $z_1$ in the equations above, where $n_1$ and $n_2$ are natural numbers randomly drawn from $\mathbb{N}_{n-1}$ provided that their sum is smaller than $n$ ($n_1 + n_2 < n$). Of course, the number of cutoff points can also be three or more.

Uniform crossover (Syswerda, 1989) is a method where the allele of each locus is drawn randomly from the corresponding locus of $\xi_A$ or $\xi_B$. That is, the mask $z^*$ is an $n$-bit random binary pattern of 0s and 1s. Furthermore, a method called elitism is often used to carry at least the most superior chromosome over to the next generation without change.

The number of individuals joining the crossover procedure can be controlled with the crossover ratio, under which some strings do not have the chance to couple even if they survive the selection step. As with selection and fitness evaluation methods, there are also numerous crossover methods that differ only slightly from one another.
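The three crossover variants differ only in how the binary mask $z$ is built, which the following sketch tries to make explicit. All function names are hypothetical. The two-point mask draws $n_1$ and then $n_2$ so that $n_1 + n_2 < n$ holds by construction, which is one possible way to satisfy the stated constraint and not necessarily the sampling scheme used in the study.

```python
import random


def crossover(parent_a, parent_b, z):
    """Eqs. (14)-(15): child a takes parent A where z = 1 and parent B where z = 0."""
    child_a = [a if m else b for a, b, m in zip(parent_a, parent_b, z)]
    child_b = [b if m else a for a, b, m in zip(parent_a, parent_b, z)]
    return child_a, child_b


def one_point_mask(n, rng):
    """z_1: n_1 ones followed by n - n_1 zeros, with n_1 drawn from 1..n-1."""
    n1 = rng.randint(1, n - 1)
    return [1] * n1 + [0] * (n - n1)


def two_point_mask(n, rng):
    """z_2: ones, zeros, ones, with n_1, n_2 >= 1 and n_1 + n_2 < n (needs n >= 3)."""
    n1 = rng.randint(1, n - 2)
    n2 = rng.randint(1, n - n1 - 1)
    return [1] * n1 + [0] * n2 + [1] * (n - n1 - n2)


def uniform_mask(n, rng):
    """z*: an independent random bit for every locus."""
    return [rng.randint(0, 1) for _ in range(n)]


rng = random.Random(1)
xi_A, xi_B = [1] * 12, [2] * 12  # two illustrative parent chromosomes
masks = [one_point_mask(12, rng), two_point_mask(12, rng), uniform_mask(12, rng)]
children = [crossover(xi_A, xi_B, z) for z in masks]
```

Because the two children use complementary masks, every allele of both parents is passed on to exactly one of them.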

2.6 Mutation

The feature value stored in each locus of the newly generated strings, the $\xi^{(t+1)}_g$s, is replaced by another allele with a certain probability. The parameter at this stage is the mutation rate. The higher the mutation rate, the lower the possibility of being trapped in local optima, although the computation time to converge is prolonged.
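A minimal sketch of the mutation step is given below. The function name `mutate` is hypothetical, and choosing the replacement uniformly among the other $K - 1$ alleles is an assumption made here; with $K = 2$ it reduces to switching between the two models.

```python
import random


def mutate(chromosome, K, rate, rng):
    """Replace each locus, with probability `rate`, by a different allele from 1..K."""
    mutated = []
    for allele in chromosome:
        if rng.random() < rate:
            alternatives = [k for k in range(1, K + 1) if k != allele]
            mutated.append(rng.choice(alternatives))
        else:
            mutated.append(allele)
    return mutated


rng = random.Random(2)
unchanged = mutate([1, 2, 1], K=2, rate=0.0, rng=rng)  # rate 0: nothing mutates
flipped = mutate([1, 2, 1], K=2, rate=1.0, rng=rng)    # rate 1: every locus mutates
```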


2.7 Procedure for Genetic Algorithm

The flow of the operations discussed in the sections above is as follows:

(1) Set the parameters, i.e., the gene length, the number of alleles, the function to evaluate fitness, the selection method, the crossover method, and the mutation rate. Here, the gene length is the test length (number of items), and the number of alleles is the number of models applied to items.

(2) Generate the initial population from random numbers.

(3) Analyze the data and compute the fitness of each individual.

(4) Determine whether the current generation is the final generation. The final generation may be a stage when the chromosomes of all individuals are identical, when the fitness of the most superior individual at generation $t + 1$ has converged in comparison with that at generation $t$, or when a criterion adopted by the analyst is met. If one of these is satisfied, computation is over. Otherwise, proceed to (5).

(5) Select individuals from the viewpoint of fitness.

(6) Carry out crossover on the chromosomes of the selected individuals, and create the candidates for the next generation.

(7) Mutate the chromosomes of the candidates, and go back to (3).
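Steps (1) to (7) can be put together in a compact sketch. It is not the implementation used in the study: the function name `run_simple_ga` is hypothetical, each mating produces a single child, and the fitness function is a toy stand-in (the Hamming distance to a hidden target chromosome) replacing the $\chi^2$ statistic, which would require fitting the IRT models by the EM algorithm.

```python
import random


def run_simple_ga(fitness, n, K, G=16, S=4, mutation_rate=0.05, T=10, seed=0):
    """Minimise `fitness` over chromosomes in {1..K}^n; returns (best, score)."""
    rng = random.Random(seed)
    # (2) Random initial population of G chromosomes.
    population = [[rng.randint(1, K) for _ in range(n)] for _ in range(G)]
    for t in range(1, T + 1):
        # (3) Evaluate fitness and rank individuals (smaller is better).
        ranked = sorted(population, key=fitness)
        best = ranked[0]
        # (4) Stop at generation T or when all chromosomes are identical.
        if t == T or all(c == best for c in ranked):
            break
        # (5) Uniform ranking selection: only the best S individuals may mate.
        survivors = ranked[:S]
        # Elitism: the best chromosome is copied unchanged.
        next_population = [list(best)]
        while len(next_population) < G:
            # (6) Uniform crossover of two survivors drawn without replacement.
            parent_a, parent_b = rng.sample(survivors, 2)
            child = [a if rng.random() < 0.5 else b
                     for a, b in zip(parent_a, parent_b)]
            # (7) Mutation: swap a locus for a different allele.
            child = [
                rng.choice([k for k in range(1, K + 1) if k != allele])
                if rng.random() < mutation_rate else allele
                for allele in child
            ]
            next_population.append(child)
        population = next_population
    return best, fitness(best)


# Toy fitness (hypothetical): number of loci that differ from a hidden target.
TARGET = [1] * 6 + [2] * 6


def toy_fitness(chromosome):
    return sum(a != b for a, b in zip(chromosome, TARGET))


best, score = run_simple_ga(toy_fitness, n=12, K=2)
```

Because of elitism, the best fitness value cannot get worse from one generation to the next, which makes the best-of-generation trajectory easy to monitor.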

3. Simulation Study

The proposed method can be used in several ways. For example, which is the better model for a multiple-choice single-answer item, the 3-parameter model for dichotomous responses or the nominal categories model (Bock, 1972)? Or what is the most suitable polytomous model for a Likert-type item or a testlet? Simulation studies under several patterns are needed to confirm the effectiveness of the proposed method; however, not all cases can be covered because of the computational burden the method involves. Here, let us focus on the polytomous model selection problem, because this issue is of practical concern to both IRT theorists and test administrators.

According to Thissen & Steinberg (1986), most polytomous models are divided into two classes: difference models and divided-by-total models. Let $U_j$ be a random variable for an ordered response to item $j$, and $c$ ($\in \{0, 1, 2, \cdots, C_j - 1\}$) be the actual response. Let us further assume that $P_{jc}(\theta) = \Pr(U_j = c \mid \theta)$, the so-called item category response function (ICRF), is the probability of a test taker with latent trait θ selecting category $c$ on item $j$. Then, in the difference models, $P_{jc}(\theta)$ is formulated as

$$P_{jc}(\theta) = P^+_{jc}(\theta) - P^+_{j,c+1}(\theta), \tag{16}$$

where $P^+_{jc}(\theta)$ is the cumulative probability of a response in category $c$ or higher.

On the other hand, in divided-by-total models, the ICRF is parametrized as

$$P_{jc}(\theta) = \frac{\pi_{jc}(\theta)}{\sum_{c} \pi_{jc}(\theta)}, \tag{17}$$

where $\pi_{jc}(\theta)$ ($> 0$) is the attractiveness or preference of category $c$ as a stimulus in terms of paired comparison. In short, the probability of category $c$ being selected is the ratio of its preference to the sum of the preferences of all categories.

Hemker, Sijtsma, Molenaar & Junker (1997) suggested that the best known polytomous model among the difference models applicable to categorical ordered scores was Samejima's (1969) GRM, and that its counterpart among the divided-by-total models was Masters' (1982) PCM. As a result, for test practitioners or psychologists who intend to apply a polytomous model to testlets or Likert-type items, it is difficult to choose between GRM and (G)PCM as the more suitable model.

Some studies have been published on this issue. For instance, Baker, Rounds & Zevon (2000) applied both GRM and GPCM to a psychological questionnaire measuring subjective well-being; they then reported that the $\chi^2$ fit index of the former model was smaller than that of the latter. Also, De Ayala, Dodd & Koch (1992) stated, within the context of computerized adaptive testing (e.g., van der Linden & Glas, 2000), that the ability estimates achieved by GRM were slightly more accurate than those by PCM. More theoretically, Andrich (1995) reported that adjacent categories could be collapsed under the Thurstone class of models, which includes GRM, when data conformed to the joining assumption, but that they could not under the Rasch class of models, of which PCM and GPCM are members. Finally, Samejima (1996) proposed four criteria for evaluating polytomous models, namely additivity, natural generalization to a continuous response model, satisfaction of the unique maximum condition, and orderliness of the modal points of the operating characteristics. Only GRM satisfies these four criteria at present.

These results seem to imply that GRM is more useful for categorical ordered response data than (G)PCM or the other polytomous models. However, Baker, Rounds & Zevon (2000) found that (G)PCM or other polytomous models might be desirable for other psychological questionnaires, as they limited their findings to a questionnaire on subjective well-being. Also, De Ayala, Dodd & Koch (1992) compared PCM, in which all slope parameters were fixed at 1.0, to GRM, which allowed slope parameters to vary across items. Since the number of substantial location parameters for PCM is equal to that of GRM, although the parametrization of PCM differs in the literature, the total number of PCM parameters is smaller than that of GRM. Therefore, there is the possibility that an insufficient data fit caused relatively inaccurate estimation results for PCM in comparison with those for GRM.

Whether the better model is GRM or GPCM is still a very important issue in practice, and the conclusion is likely to differ item by item. Therefore, the proposed method should be useful for determining the pattern of models in each case that is encountered. This problem is also central to the real data analysis that follows. Therefore, the simulation studies were set up to resemble the real data analysis so that their findings could be utilized there.

To comply with the situation above, $P_{kx_j}(\theta \mid \lambda_{kj})$ in (2)–(5) should be rewritten as

$$P_{kx_j}(\theta \mid \lambda_{kj}) = \prod_{c=0}^{C_j - 1} P_{kjc}(\theta)^{u_{ijc}} \tag{18}$$

for the case of polytomous response data, where $u_{ijc}$ is a dichotomous indicator that is coded 1 when the response of test taker $i$ to item $j$ is $c$, and 0 otherwise. In addition, assuming GRM is model 1, the $P_{kjc}$ of the GRM, which will be denoted $P_{1jc}$ hereafter, is given as

$$P_{1jc}(\theta) = \frac{1}{1 + \exp\{-1.7a_j(\theta - b_{jc})\}} - \frac{1}{1 + \exp\{-1.7a_j(\theta - b_{j,c+1})\}} \quad (c = 0, 1, \cdots, C_j - 1), \tag{19}$$

where $a_j$ and the $b_{jc}$s are the slope and location parameters, respectively. Also, $b_{j0}$ and $b_{jC_j}$ are defined as $-\infty$ and $\infty$, respectively. Therefore, the number of parameters for GRM per item is $C_j$: $a_j, b_{j1}, \cdots, b_{j,C_j-2}$, and $b_{j,C_j-1}$. Furthermore, the $P_{kjc}$ of GPCM as model 2, say $P_{2jc}$, is defined as

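$$P_{2jc}(\theta) = \frac{\exp\left\{\sum_{c'=0}^{c} 1.7a^*_j(\theta - b^*_{jc'})\right\}}{\sum_{h=0}^{C_j - 1} \exp\left\{\sum_{h'=0}^{h} 1.7a^*_j(\theta - b^*_{jh'})\right\}} \quad (c = 0, 1, \cdots, C_j - 1), \tag{20}$$

where $a^*_j$ and the $b^*_{jc}$s are the slope and location parameters for GPCM, respectively. The number of parameters for GPCM per item is also $C_j$, since both the numerator and denominator on the right of the equation above can be divided by $\exp\{1.7a^*_j(\theta - b^*_{j0})\}$, and the equation can be rewritten as

$$P_{2jc}(\theta) = \frac{\exp\left\{\sum_{c'=1}^{c} 1.7a^*_j(\theta - b^*_{jc'})\right\}}{1 + \sum_{h=1}^{C_j - 1} \exp\left\{\sum_{h'=1}^{h} 1.7a^*_j(\theta - b^*_{jh'})\right\}} \quad (c = 0, 1, \cdots, C_j - 1), \tag{21}$$

where the empty sum for $c = 0$ is taken as 0. Parameter $b^*_{j0}$, which always cancels for any $c$, is substantially meaningless. Therefore, the parameters for GPCM of item $j$ are $a^*_j, b^*_{j1}, \cdots, b^*_{j,C_j-2}$, and $b^*_{j,C_j-1}$.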
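The two item category response functions can be evaluated with a few lines of code. The sketch below follows Eq. (19) for GRM and the reduced form of Eq. (21) for GPCM; the function names `grm_probs` and `gpcm_probs` are hypothetical, and the parameter values (0.6, −2.0, −1.0, 0.0) are used only as an illustration.

```python
import math

D = 1.7  # scaling constant appearing in Eqs. (19)-(21)


def grm_probs(theta, a, b):
    """GRM category probabilities [P_0, ..., P_{C-1}] for thresholds b = [b_1, ..., b_{C-1}]."""
    # Cumulative probabilities of responding in category c or higher;
    # b_0 = -inf gives 1 and b_C = +inf gives 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-D * a * (theta - b_c))) for b_c in b]
    cumulative.append(0.0)
    return [cumulative[c] - cumulative[c + 1] for c in range(len(b) + 1)]


def gpcm_probs(theta, a, b):
    """GPCM category probabilities [P_0, ..., P_{C-1}] for step parameters b = [b_1, ..., b_{C-1}]."""
    # Running sums of 1.7 a (theta - b_c); the empty sum for category 0 is 0.
    exponents = [0.0]
    for b_c in b:
        exponents.append(exponents[-1] + D * a * (theta - b_c))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative four-category item evaluated at theta = 0.
grm = grm_probs(0.0, a=0.6, b=[-2.0, -1.0, 0.0])
gpcm = gpcm_probs(0.0, a=0.6, b=[-2.0, -1.0, 0.0])
```

Both functions return $C_j$ probabilities that sum to one, but the same parameter values generally give different curves under the two models, which is why the choice between them matters.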

3.1 Setup for Simulation

The number of items (gene length) was set at twelve, which is relatively small but nearly equal to that of the real data introduced in the next section. The number of alleles was two ($K = 2$) because there were two models: GRM and GPCM. The true gene was set to $\xi_T = [111111222222]'$, which indicates that the first six items were GRM items and the latter six were GPCM items. Also, as the number of categories of each item was four ($\{0, 1, 2, 3\}$), there were four parameters for both models, i.e., one slope and three location parameters. The true values for the six GRM items were $(a, b_1, b_2, b_3)$ = (0.6, −2.0, −1.0, 0.0), (0.6, −1.0, 0.0, 1.0), (0.6, 0.0, 1.0, 2.0), (1.2, −2.0, −1.0, 0.0), (1.2, −1.0, 0.0, 1.0), and (1.2, 0.0, 1.0, 2.0), and those for the six GPCM items were $(a^*, b^*_1, b^*_2, b^*_3)$ = (0.6, −2.0, −1.0, 0.0), (0.6, −1.0, 0.0, 1.0), (0.6, 0.0, 1.0, 2.0), (1.2, −2.0, −1.0, 0.0), (1.2, −1.0, 0.0, 1.0), and (1.2, 0.0, 1.0, 2.0). Item responses were then generated under each model with the corresponding parameters.

Two different numbers of individuals at each generation were prepared ($G$ = 16, 32), and two different sample sizes were arranged ($N$ = 1000, 2000). In addition, the likelihood-ratio chi-square ($\chi^2$) statistic (McKinley & Mills, 1985) with $(Q - 1) \times \sum_j (C_j - 1) - 4n$ degrees of freedom, where $Q$ is the number of quadrature points on the latent scale and four is the number of parameters per item, was adopted as the fitness evaluation function. There is no substantial difference between comparison by the information criteria and comparison by the $\chi^2$ statistic as long as the number of GRM parameters is equal to that of GPCM. We used uniform ranking selection as the selection method, where $S$ = 4, 8 were prepared when $G$ = 16, and $S$ = 8, 16 when $G$ = 32. That is, two levels of the selection rate ($R_s$ = 1/4, 1/2) were allotted to each $G$. The uniform crossover method was adopted, and as elitism, the one chromosome with the smallest $\chi^2$ (the best solution) at each generation was copied to the next generation without change. The mutation rate was 0.05. The number of generations was set to ten ($T = 10$), i.e., the 10th generation was the final generation. Finally, we replicated the procedure 100 times under each of the 8 (= 2 × 2 × 2) conditions.
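As a small worked example, the degrees of freedom and the eight simulation conditions can be enumerated as follows. The function name `chi2_df` is hypothetical, and $Q = 9$ is taken from the EM setup of this simulation; the arithmetic simply applies the formula as stated above.

```python
from itertools import product


def chi2_df(Q, categories, params_per_item=4):
    """Degrees of freedom (Q - 1) * sum_j (C_j - 1) - params_per_item * n."""
    n = len(categories)
    return (Q - 1) * sum(C - 1 for C in categories) - params_per_item * n


# Twelve four-category items and nine quadrature points: 8 * 36 - 48.
df = chi2_df(Q=9, categories=[4] * 12)

# The 2 x 2 x 2 design: sample size N, population size G, and survivors S = G * Rs.
conditions = [
    (N, G, G // divisor)
    for N, G, divisor in product((1000, 2000), (16, 32), (4, 2))
]
```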

To set up the EM algorithm, (3) and (5) at each E-step were numerically approximated by taking quadrature points on the θ axis, where the number of quadrature points was nine (−4.0 to 4.0 in steps of 1.0). In addition, Newton-Raphson iteration was implemented in each M-step to optimize (3), where the maximum number of iterations was set to six. Finally, the maximum number of EM iterations was also set to six.
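The quadrature approximation of the posterior in Eq. (5) can be sketched as below. The function name `posterior_weights` is hypothetical, and the standard normal prior on θ is an assumption made for this illustration, since the prior actually used is not specified here. The likelihood values are also made up: they mimic a single dichotomous item answered correctly.

```python
import math


def posterior_weights(likelihoods, nodes):
    """Discrete version of Eq. (5): posterior mass of theta at each quadrature node.

    `likelihoods[q]` is p(x_i | theta_q, Lambda, W) evaluated at node q.
    A standard normal prior is assumed (unnormalised; the constant cancels).
    """
    prior = [math.exp(-0.5 * q * q) for q in nodes]
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)  # replaces the integral in the denominator of Eq. (5)
    return [j / total for j in joint]


# Nine quadrature points from -4.0 to 4.0 in steps of 1.0.
nodes = [float(q) for q in range(-4, 5)]

# Made-up likelihood that increases with theta.
likelihoods = [1.0 / (1.0 + math.exp(-q)) for q in nodes]
weights = posterior_weights(likelihoods, nodes)
```

The weighted sum over nodes then replaces the integral in Eq. (3) when the expected log-likelihood is computed for the M-step.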

3.2 Simulation Results for GA

Let $\xi^{(t)}_{1r} = \{\xi^{(t)}_{1rj}\}$ be the chromosome of the most superior individual, i.e., the one with the smallest $\chi^2$ statistic in the population of $G$ individuals at generation $t$ in the $r$th replication. Let us assume that $h^{(t)}_r$ is a dichotomous indicator that is coded 1 if $\xi^{(t)}_{1r}$ is identical with the true gene ($\xi^{(t)}_{1r} \equiv \xi_T$), and 0 otherwise. Let us also assume that $h^{(t)}_{rj}$ is a dichotomous indicator that is 1 when the $j$th element of $\xi^{(t)}_{1r}$, $\xi^{(t)}_{1rj}$, is equal to that of the true gene, $\xi_{Tj}$, and 0 otherwise. Then, the following hit rates

$$HR^{(t)}_G = \frac{\sum_{r=1}^{100} h^{(t)}_r}{100} \tag{22}$$

and

$$HR^{(t)}_L = \frac{\sum_{r=1}^{100} \sum_{j=1}^{12} h^{(t)}_{rj}}{12 \times 100} \tag{23}$$

were adopted as the indices of true gene recovery at generation $t$. Subscripts $G$ and $L$ on the left of the equations above stand for the gene and the locus level of the hit rate. Figures 1 and 2 plot the transitions of $HR_G$ and $HR_L$ under each condition through the generations. Figure 1 is for $N$ = 1000, and Figure 2 for $N$ = 2000. The trajectories of the smallest $\chi^2$ statistic, averaged over 100 replications at each generation, are also plotted in the figures.
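The gene-level and locus-level hit rates of Eqs. (22) and (23) can be computed as in the sketch below, written for an arbitrary number of replications. The function name `hit_rates` and the three replications shown are hypothetical.

```python
def hit_rates(best_chromosomes, true_gene):
    """Return (HR_G, HR_L): shares of exactly recovered genes and of recovered loci."""
    R = len(best_chromosomes)
    n = len(true_gene)
    gene_hits = sum(1 for xi in best_chromosomes if xi == true_gene)
    locus_hits = sum(
        1
        for xi in best_chromosomes
        for allele, truth in zip(xi, true_gene)
        if allele == truth
    )
    return gene_hits / R, locus_hits / (n * R)


true_gene = [1] * 6 + [2] * 6

# Hypothetical best chromosomes from three replications.
best_chromosomes = [
    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],  # exact recovery
    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1],  # one wrong locus
    [2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],  # two wrong loci
]
hr_gene, hr_locus = hit_rates(best_chromosomes, true_gene)
```

The example shows why the locus-level rate can be high while the gene-level rate stays low: a single wrong locus is enough to miss the true gene.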

In general, both true gene and locus recovery under $N$ = 2000 were more successful than under $N$ = 1000. The hit rates for gene recovery under $G$ = 32 with $N$ = 2000 at the final generation were over 0.5, and all the hit rates for locus recovery were over 0.9. It is natural to assume on the basis of asymptotic theory that more accurate model inference can be made with a larger sample size. Also, the three indices ($HR_G$, $HR_L$, and the $\chi^2$ statistic) with $G$ = 32 were better through the generations than those with $G$ = 16. This was because more adaptive individuals were likely to emerge as $G$ became larger, which lowered the possibility of being trapped in local optima.

Figure 1: Transitions in Hit Rates and $\chi^2$ Statistics ($N$ = 1000). Each of the four $(G, R_s)$ conditions, (16, 1/4), (16, 1/2), (32, 1/4), and (32, 1/2), is plotted with its own symbol (× for (16, 1/2) and ♦ for (32, 1/4)).

As is empirically known in genetic programming, a large $G$ and a not very small $R_s$ are recommended if computational power is sufficient. The simulation results under $(G, R_s)$ = (32, 1/4) were satisfactory under the setup of this simulation study, in which the number of items was twelve and the sample size was 2000.

Finally, the accuracy of true value recovery for item parameters were examined as toindividuals succeeding in true gene recovery. Let λjr (∈ {aj , bj1, bj2, bj3, a∗

j , b∗j1 b∗j2, b∗j3})be an item parameter estimate for locus j (item j) of the most superior individual at thefinal generation under rth replication, and λjr (∈ {aj , bj1, bj2, bj3, a∗

j , b∗j1, b∗j2, b∗j3}) beits true value. Then,

MD =∑

r∈H(λjr − λjr)�{r ∈ H} (24)

and

RMSD =

√∑r∈H(λjr − λjr)2

�{r ∈ H} , (25)


Figure 2: Transitions in Hit Rates and χ2 Statistics (N=2000)Symbol � represents indices when (G, Rs) is under (16, 1/4). Also, ×, ♦, and �represent those under (16, 1/2), (32, 1/4), and (32, 1/2), respectively.

were adopted as indices of the accuracy of item parameter recovery, and the results are listed in Table 1. Notation H denotes the set of individuals succeeding in model specification in the final generation, and #{r ∈ H} is the number of such individuals. In practice, the MDs and RMSDs were small enough to be negligible.
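The two recovery indices can be sketched in a few lines of Python. This is a minimal illustration, assuming that the root in the RMSD is taken over the mean squared deviation; the function name `md_rmsd` and the sample estimates are hypothetical and do not come from the study.

```python
import math

def md_rmsd(estimates, true_value):
    """Return (MD, RMSD) of the estimates around a single true value.

    `estimates` holds one estimate per member of H, i.e. per replication
    whose best final individual recovered the true chromosome.
    """
    if not estimates:
        raise ValueError("H is empty: no replication recovered the true model")
    diffs = [est - true_value for est in estimates]
    md = sum(diffs) / len(diffs)                              # mean deviation (bias)
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # root mean squared deviation
    return md, rmsd

# Hypothetical estimates of a slope parameter whose true value is 0.6.
md, rmsd = md_rmsd([0.58, 0.61, 0.57, 0.62], 0.6)
print(f"MD = {md:.4f}, RMSD = {rmsd:.4f}")
```

MD keeps the sign of the deviations and therefore reflects bias, whereas RMSD also reflects the spread of the estimates.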

3.3 Discussion on GRM Versus GPCM

The results of model selection for GRM and GPCM, which are based on practical concerns, are reported in this section. The indices referred to here are

$$\mathrm{HR}_{\mathrm{GRM}} = \frac{\sum_{r=1}^{100} \sum_{j=1}^{6} h^{(10)}_{rj}}{6 \times 100} \tag{26}$$

and

$$\mathrm{HR}_{\mathrm{GPCM}} = \frac{\sum_{r=1}^{100} \sum_{j=7}^{12} h^{(10)}_{rj}}{6 \times 100}, \tag{27}$$

where $\mathrm{HR}_{\mathrm{GRM}}$ is the proportion of the first six (GRM) items for which GRM was correctly selected in the final generation over all 100 replications, and $\mathrm{HR}_{\mathrm{GPCM}}$ corresponds to the remaining


Table 1: Item Parameter Recovery Results (Averaged Over Items for Each Model).

                          GRM                          GPCM
N     G   Rs   NH    λ    MD        RMSD      λ     MD        RMSD
1000  16  1/4  21    a    −0.0234   0.0754    a∗    −0.0663   0.0868
                     b1   −0.0388   0.1076    b∗1   −0.0763   0.1291
                     b2   −0.0205   0.0834    b∗2   −0.0105   0.1072
                     b3   −0.0030   0.1041    b∗3    0.0461   0.1296
          1/2  18    a    −0.0179   0.0768    a∗    −0.0654   0.0848
                     b1   −0.0129   0.1044    b∗1   −0.0807   0.1425
                     b2   −0.0069   0.0836    b∗2   −0.0036   0.0965
                     b3    0.0021   0.1115    b∗3    0.0250   0.1185
      32  1/4  30    a    −0.0188   0.0814    a∗    −0.0703   0.0875
                     b1   −0.0129   0.1077    b∗1   −0.0519   0.1061
                     b2   −0.0067   0.0869    b∗2    0.0237   0.1013
                     b3   −0.0089   0.1026    b∗3    0.0376   0.1120
          1/2  24    a    −0.0116   0.0649    a∗    −0.0537   0.0795
                     b1   −0.0025   0.0892    b∗1   −0.0473   0.1124
                     b2   −0.0043   0.0737    b∗2    0.0040   0.0899
                     b3   −0.0082   0.0967    b∗3    0.0394   0.1078
2000  16  1/4  26    a    −0.0206   0.0653    a∗    −0.0604   0.0722
                     b1   −0.0045   0.0878    b∗1   −0.0398   0.0918
                     b2   −0.0050   0.0706    b∗2    0.0022   0.0793
                     b3   −0.0012   0.0934    b∗3    0.0548   0.0952
          1/2  35    a    −0.0192   0.0693    a∗    −0.0629   0.0734
                     b1   −0.0020   0.0762    b∗1   −0.0334   0.0822
                     b2    0.0014   0.0660    b∗2    0.0156   0.0739
                     b3    0.0010   0.0854    b∗3    0.0474   0.0868
      32  1/4  54    a    −0.0154   0.0623    a∗    −0.0639   0.0743
                     b1   −0.0046   0.0816    b∗1   −0.0520   0.0928
                     b2   −0.0086   0.0685    b∗2   −0.0065   0.0760
                     b3   −0.0063   0.0875    b∗3    0.0405   0.0863
          1/2  51    a    −0.0164   0.0686    a∗    −0.0689   0.0816
                     b1    0.0001   0.0843    b∗1   −0.0355   0.0871
                     b2   −0.0023   0.0667    b∗2   −0.0033   0.0718
                     b3   −0.0080   0.0905    b∗3    0.0559   0.0936

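six GPCM items. Furthermore,

$$\mathrm{HR}_{\mathrm{GRM},j} = \frac{\sum_{r=1}^{100} h^{(10)}_{rj}}{100} \quad (j = 1, 2, 3, 4, 5, 6) \tag{28}$$

and

$$\mathrm{HR}_{\mathrm{GPCM},j} = \frac{\sum_{r=1}^{100} h^{(10)}_{rj}}{100} \quad (j = 7, 8, 9, 10, 11, 12) \tag{29}$$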

were also calculated. These indices are listed in Table 2, where the true values for each item parameter have been inserted for reference. To sum up the table, we can predict (i) that each model specification will be successful with a large sample size, (ii) that true models will be correctly selected by majority decision when the number of replications is large, even with a small sample size, (iii) that hit rates for GPCM will be smaller than those for GRM, (iv) that hit rates for items with larger slope parameters will be smaller than those for items with smaller slope parameters, and (v)


Table 2: Hit Rates for Model Recovery.

GRM
                  HRGRM    HRGRM,j
N     S   Rs               1      2      3      4      5      6
1000  16  1/4     0.933    1.00   0.98   1.00   0.92   0.79   0.91
          1/2     0.920    1.00   0.95   0.99   0.93   0.81   0.84
      32  1/4     0.945    1.00   0.97   1.00   0.92   0.83   0.95
          1/2     0.905    1.00   0.88   0.99   0.86   0.79   0.91
2000  16  1/4     0.965    1.00   0.99   1.00   0.98   0.89   0.93
          1/2     0.957    1.00   0.96   1.00   0.94   0.87   0.97
      32  1/4     0.972    1.00   0.95   1.00   0.97   0.96   0.95
          1/2     0.958    1.00   0.90   1.00   0.98   0.94   0.93
True Values for   a        0.6    0.6    0.6    1.2    1.2    1.2
Item Parameters   b1      −2.0   −1.0    0.0   −2.0   −1.0    0.0
                  b2      −1.0    0.0    1.0   −1.0    0.0    1.0
                  b3       0.0    1.0    2.0    0.0    1.0    2.0

GPCM
                  HRGPCM   HRGPCM,j
N     S   Rs               7      8      9      10     11     12
1000  16  1/4     0.822    0.88   0.64   0.95   0.90   0.76   0.80
          1/2     0.832    0.89   0.77   0.89   0.87   0.73   0.84
      32  1/4     0.883    0.96   0.82   0.91   0.94   0.77   0.90
          1/2     0.885    0.95   0.81   0.92   0.89   0.82   0.92
2000  16  1/4     0.840    0.97   0.71   0.96   0.92   0.66   0.82
          1/2     0.897    0.95   0.83   0.97   0.92   0.79   0.92
      32  1/4     0.920    0.99   0.81   0.98   0.93   0.86   0.95
          1/2     0.928    0.99   0.91   0.99   0.93   0.83   0.92
True Values for   a∗       0.6    0.6    0.6    1.2    1.2    1.2
Item Parameters   b∗1     −2.0   −1.0    0.0   −2.0   −1.0    0.0
                  b∗2     −1.0    0.0    1.0   −1.0    0.0    1.0
                  b∗3      0.0    1.0    2.0    0.0    1.0    2.0

that hit rates for items whose location parameters are around the origin will be smaller than those for items with larger absolute values for location parameters.
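The hit-rate indices of Equations (26) to (29) amount to averaging a 0/1 indicator over replications and items. The sketch below assumes that the indicator equals 1 when the best individual of the final generation assigns the true model to an item; the function name `hit_rates` and the small indicator matrix are hypothetical and are not data from the study.

```python
def hit_rates(h, items):
    """Overall and per-item hit rates for the given 0-based item columns.

    h[r][j] is 1 if item j received its true model in replication r, else 0.
    """
    n_rep = len(h)
    per_item = {j + 1: sum(row[j] for row in h) / n_rep for j in items}
    # With the same number of replications per item, the mean of the per-item
    # rates equals the double sum divided by (items x replications).
    overall = sum(per_item.values()) / len(per_item)
    return overall, per_item

# Hypothetical indicators for 4 replications and 12 items (1 = hit).
h = [
    [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
]
hr_grm, hr_grm_j = hit_rates(h, range(0, 6))     # Items 1-6 (GRM)
hr_gpcm, hr_gpcm_j = hit_rates(h, range(6, 12))  # Items 7-12 (GPCM)
print(hr_grm, hr_gpcm)
```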

Points (i) and (ii) agree with the results in the previous subsection. The reasons for the results in (iii)–(v) should be examined more thoroughly with more elaborate simulations under various settings for item parameter true values. However, one possible reason is discussed below: the phenotype similarity between GRM and GPCM.

The phenotypes, or expressed shapes, of GRM and GPCM, namely the item response functions of the two models, are not very different, although their mathematical formulations differ considerably because they are based on different cognitive models underlying the response process. That is, the GRM shape can be approximately expressed by GPCM, and the GPCM shape by GRM.

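Kullback–Leibler (K–L) information (Kullback & Leibler, 1951) is useful for examining the distance between two distributions. The K–L information between two polytomous models, A and B, can be measured with the following equation:

$$\begin{aligned}
I_{KL}(M_A; M_B) &= \sum_c \int_\Theta \ln\!\left(\frac{P_{Ajc}(\theta)}{P_{Bjc}(\theta)}\right) P_{Ajc}(\theta)\, d\theta \\
&= \sum_c \int_\Theta \{\ln P_{Ajc}(\theta) - \ln P_{Bjc}(\theta)\}\, P_{Ajc}(\theta)\, d\theta,
\end{aligned} \tag{30}$$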


Table 3: Minimized K–L Information and Its Item Parameter Estimates.

          K–L       a∗       b∗1       b∗2       b∗3
λ2,1      15.9658   0.3160   −1.9183   −0.9285   −0.1908
λ2,2      16.4707   0.3163   −0.8734    0.0000    0.8734
λ2,3      15.9658   0.3160    0.1908    0.9285    1.9183
λ2,4       7.4806   0.8839   −1.7347   −0.9996   −0.2664
λ2,5       7.5238   0.8849   −0.7344    0.0000    0.7344
λ2,6       7.4806   0.8839    0.2664    0.9996    1.7347

          K–L       a        b1        b2        b3
λ1,7       8.9503   0.8010   −2.4390   −0.9988    0.4493
λ1,8       9.1602   0.7988   −1.4467    0.0000    1.4467
λ1,9       8.9503   0.8010   −0.4493    0.9988    2.4390
λ1,10      3.0924   1.4300   −2.1211   −0.9999    0.1212
λ1,11      3.0994   1.4297   −1.1213    0.0000    1.1213
λ1,12      3.0924   1.4300   −0.1212    0.9999    2.1211

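where $P_{Ajc}$ and $P_{Bjc}$ are the ICRSs of models A and B. Then, regarding GRM and GPCM as models 1 and 2, the following estimates

$$\begin{aligned}
\hat{\lambda}_{2j} &= \arg\min_{\lambda_{2j}} I_{KL}(M_1; M_2 \mid \lambda_{2j}) \\
&= \arg\min_{\lambda_{2j}} \left\{ \text{const.} - \sum_c \int_\Theta \ln P_{2jc}(\theta \mid \lambda_{2j})\, P_{1jc}(\theta)\, d\theta \right\} \quad (j = 1, 2, 3, 4, 5, 6)
\end{aligned} \tag{31}$$

and

$$\begin{aligned}
\hat{\lambda}_{1j} &= \arg\min_{\lambda_{1j}} I_{KL}(M_2; M_1 \mid \lambda_{1j}) \\
&= \arg\min_{\lambda_{1j}} \left\{ \text{const.} - \sum_c \int_\Theta \ln P_{1jc}(\theta \mid \lambda_{1j})\, P_{2jc}(\theta)\, d\theta \right\} \quad (j = 7, 8, 9, 10, 11, 12)
\end{aligned} \tag{32}$$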

were computed and are listed in Table 3, where each $\hat{\lambda}_{2j}$ (j = 1, 2, 3, 4, 5, 6) is a GPCM parameter estimate that minimizes the K–L distance to the item that was originally used as the GRM item in this simulation. On the other hand, $\hat{\lambda}_{1j}$ (j = 7, 8, 9, 10, 11, 12) is an item parameter estimate for GRM that is close to the GPCM item from the viewpoint of K–L information. The minimized K–L information is also listed in Table 3. For instance, Item 1, which is one of six GRM items with parameters (a, b1, b2, b3) = (0.6, −2.0, −1.0, 0.0), was close to the GPCM item whose parameters were (a∗, b∗1, b∗2, b∗3) = (0.3160, −1.9183, −0.9285, −0.1908), and its K–L statistic was 15.97.
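The K–L information of Equation (30) can be approximated numerically, as in the sketch below. It rests on several assumptions that are not taken from this section: logistic forms for the two models without a scaling constant, a trapezoidal rule, and the integration range [−4, 4] for Θ. All function names are hypothetical, and the value it prints is therefore not expected to reproduce the entries of Table 3.

```python
import math

def grm_probs(theta, a, b):
    """Category probabilities of a graded response item (logistic form)."""
    # Boundary curves P*(u >= c), padded with 1 (c = 0) and 0 (c = C + 1).
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bc))) for bc in b] + [0.0]
    return [star[c] - star[c + 1] for c in range(len(b) + 1)]

def gpcm_probs(theta, a, b):
    """Category probabilities of a generalized partial credit item."""
    z = [0.0]
    for bc in b:                         # cumulative sums of a * (theta - b_c)
        z.append(z[-1] + a * (theta - bc))
    m = max(z)                           # subtract the maximum for stability
    w = [math.exp(v - m) for v in z]
    total = sum(w)
    return [v / total for v in w]

def kl_information(p_a, p_b, lo=-4.0, hi=4.0, n=801):
    """Trapezoidal approximation of sum_c ∫ ln(P_A / P_B) P_A dθ on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        pa, pb = p_a(theta), p_b(theta)
        f = sum(x * (math.log(x) - math.log(y)) for x, y in zip(pa, pb))
        weight = 0.5 if i in (0, n - 1) else 1.0   # end points count half
        total += weight * f * step
    return total

def grm_item1(theta):
    # Item 1 of the simulation (GRM true values).
    return grm_probs(theta, 0.6, [-2.0, -1.0, 0.0])

def gpcm_fit1(theta):
    # GPCM parameters listed for Item 1 in Table 3.
    return gpcm_probs(theta, 0.3160, [-1.9183, -0.9285, -0.1908])

kl = kl_information(grm_item1, gpcm_fit1)
print(kl)
```

Minimizing this quantity over the parameters of the second model, as in Equations (31) and (32), would additionally require a numerical optimizer, which is omitted here.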

According to Table 3, it is obvious from the K–L statistics that GRM can express the GPCM shape more easily than GPCM can express the GRM shape, although it is difficult to interpret the size of the K–L statistics directly because they are not standardized. Also, it is clear that an item with a larger slope parameter can be more easily expressed with the other model. The order of the sizes of the K–L statistics roughly corresponds to the results in Table 2. However, point (v) was not reflected in the K–L statistics, and further investigations under various settings for the true values of item parameters are required.


Table 4: Profiles for Two Tests.

       World History A*                 Earth Science IA**
N      2,046                            1,424
n      12                               9

Item   Contents (Num. of Categories)    Contents (Num. of Categories)
1      Modern America (4)               Environmental Change (4)
2      Modern Mideast (4)               Air Pollution (4)
3      Modern Europe (4)                Climatology (4)
4      Population Movement (4)          Seabed Topography (4)
5      Modern East Asia (4)             Petrology (4)
6      Judea and England (4)            Mineralogy (4)
7      French Revolution (4)            Volcanology (4)
8      Chinese Revolution (4)           Seismology (Earthquake) (4)
9      Russian Revolution (4)           Seasonal Rain Front (4)
10     Free Trade Imperialism (4)
11     World Depression (4)
12     Vietnam War and ASEAN (4)

*The mean and SD of the test were 44.32 and 18.99, respectively.
**The full data set of the test comprised 12 items and 3,810 test takers, with a mean and SD of 55.89 and 16.62, respectively. The first nine items, to which 1,424 students had responded, were used for the analysis.

4. Real Data Analyses

In this section, two examples are presented in which the proposed method is applied to real data. The first of the two tests was for world history (World History A), and the second was for earth science (Earth Science IA); both were administered in the National Center Test for University Admissions (January 2005). Table 4 lists the profiles of the two tests. All items were testlets composed of three small items among which the assumption of local independence could not hold. The number of correct answers in each testlet was counted as the response data: $u_{ij} \in \{0, 1, 2, 3\}$. That is, the number of categories was four, which was equal to that in the setup of the previous simulation. The format of each small item was standardized as multiple choice, and reasoning ability by referring to figures and tables was required to obtain the correct answer.

A brief description of the contents of Items 1 and 2 of World History A (WH-A-1A and WH-A-1B) and Item 1 of Earth Science IA (ES-A-1A) is given here. First, WH-A-1A begins with a lead (with picture) describing an Expo held in the United States of America in the 1890s. Question 1 of this testlet is a four-option multiple-choice question that asks the student to select the answer that best describes what the Expo commemorated. Question 2 is also a four-option multiple-choice question. The student must select the sentence that best describes social conditions in the USA and Europe at the time. In Question 3, students are required to choose the one sentence out of four that correctly explains the USA’s expanding overseas influence at the time.

Second, WH-A-1B begins with a lead (with map) about the first Gulf War. Question 1 of this testlet is a fill-in-the-blank item that asks students to complete a modern Iraqi history table. In Question 2, students are asked the name of the sect of Islam founded


by Muhammad’s cousin Ali and his descendants. In Question 3, students are required to choose the answer that most correctly describes Islamic history from Aurangzeb to the Ottoman Empire, and up until the Iranian Revolution, or in other words, from the 13th to the 20th century. Questions 1 to 3 of WH-A-1B are all multiple-choice single-answer questions.

Finally, ES-A-1A begins with a description of the environmental changes that occurred following the formation of the ozonosphere during the Paleozoic era, about 450 million years ago. Question 1 of this testlet asks students to fill in two blanks. Both blanks must be correctly filled in. The first blank asks which is harmful to humans: ultraviolet rays, infrared rays, or solar wind, and the second asks which gas was the major ingredient of the atmosphere before the formation of the ozonosphere: methane or carbon dioxide. Question 2 is a four-option multiple-choice question that asks the distance of the ozonosphere from the surface of the earth. Question 3 is also four-option multiple-choice and asks the students to choose the sentence that correctly describes the global-scale environmental changes caused by depletion of the ozone layer, reduction of tropical rain forests, and volcanic eruptions.

The three items introduced above are testlets because their questions are not independent of each other. Moreover, it is not clear which model should be applied to each testlet, since it is difficult to identify the cognitive process that determines how many questions in a testlet students answer correctly. Therefore, which model is applicable to each testlet should be examined. The method proposed in Section 2 was applied to these two tests. The candidates were GRM and GPCM, and the likelihood-ratio chi-square (χ²) was used to evaluate fitness. Also, (G, Rs) was set to (32, 1/4) on the basis of the simulation results, since the combination (32, 1/4) was the most successful in true gene recovery. Other setups were the same as those in the simulations. That is, selection and crossover were uniform ranking selection and uniform crossover. Moreover, the individual with the smallest χ² was preserved without change in the next generation as elitism. The procedures for the EM algorithm and the Newton–Raphson method were identical to those implemented in the simulations. We then ran 100 replications under the above conditions.
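The GA setup named above (ranking selection, uniform crossover, elitism) can be sketched as below. This is only an illustration under stated assumptions: Rs is read as the fraction of top-ranked individuals that survive and are drawn as parents with equal probability, no mutation step is included, and a toy mismatch count stands in for the χ² that the method obtains from item parameter estimation. The names `next_generation`, `run_ga`, `toy_fitness`, and `TRUE_CHROMOSOME` are hypothetical.

```python
import random

def next_generation(population, fitness, survival_rate, rng):
    """One GA step: ranking selection, uniform crossover, elitism."""
    ranked = sorted(population, key=fitness)          # smaller fitness is better
    n_parents = max(2, int(len(ranked) * survival_rate))
    parents = ranked[:n_parents]                      # survivors, equal chance each
    children = [ranked[0][:]]                         # elitism: keep the best as is
    while len(children) < len(population):
        mother, father = rng.sample(parents, 2)
        # Uniform crossover: each locus is inherited from either parent.
        child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
        children.append(child)
    return children

def run_ga(n_items, pop_size, survival_rate, n_generations, fitness, seed=0):
    """Evolve chromosomes of alleles 1 (GRM) / 2 (GPCM); return best and history."""
    rng = random.Random(seed)
    population = [[rng.choice((1, 2)) for _ in range(n_items)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(n_generations):
        population = next_generation(population, fitness, survival_rate, rng)
        best_history.append(min(fitness(ind) for ind in population))
    return min(population, key=fitness), best_history

# Toy stand-in for the chi-square: mismatches against a hidden chromosome.
TRUE_CHROMOSOME = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

def toy_fitness(individual):
    return sum(a != b for a, b in zip(individual, TRUE_CHROMOSOME))

best, history = run_ga(12, 32, 0.25, 10, toy_fitness)
print(best, history)
```

Because the best individual is copied unchanged into each new generation, the best fitness value can never get worse from one generation to the next.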

Figures 3 and 4 plot the transition in the χ² statistic averaged over the 100 best individuals through the generations. Figure 3 is for World History A, and Figure 4 is for Earth Science IA. It is clear from the two figures that the mean of the χ² statistic, averaged over the individuals with the smallest χ² in each replication, became smaller through the generations. Also, model selection rates were bipolarized because the model applied to each item became fixed through the generations. Items whose ratios approached the floor indicated better fits under GRM than under GPCM, whereas items approaching the ceiling became more suitable to GPCM through the generations. Because the format of all items was the same, the diversity of selected models derived from the variability in item content or cognitive processes.

The chromosome patterns obtained in the final generations and their corresponding χ² statistics are summarized in Table 5. The χ² statistics fluctuated (the SDs were not 0) even within the same chromosome pattern because each process of reaching the final generation varied, although their standard deviations were generally very small. Furthermore, Tables


Figure 3: Transitions in Model Selection Rates and χ2 Statistics (World History A).

Figure 4: Transitions in Model Selection Rates and χ2 Statistics (Earth Science IA).

6 and 7 list item parameter estimates and their χ² statistics per item for the most predominant chromosome patterns. Table 6 is for World History A, and Table 7 is for Earth Science IA. Different processes of reaching the final generations caused slightly different item parameter estimates (the SDs were not 0) because the item parameter estimates obtained in


Table 5: Chromosome Patterns for 100 Final Generations.

World History A                     Chi-square
Pattern  Chromosome     N     Mean       SD        Min        Max
1        212222221222   56    282.1354    0.2422   281.6331   282.6228
2        212222221221   38    282.1266    0.3180   281.3491   282.8866
3        212212221222    2    284.5284    0.6024   284.1025   284.9543
4        212222222222    1    285.7801             285.7801   285.7801
5        212122221222    1    288.8612             288.8612   288.8612
6        211222221222    1    293.6171             293.6171   293.6171
7        112222221221    1    293.9622             293.9622   293.9622

Earth Science IA                    Chi-square
Pattern  Chromosome     N     Mean       SD        Min        Max
1        212212122      28    102.5791    2.6299    99.6997   109.9917
2        112212122      25    103.2395    4.1550   100.2762   120.6043
3        212222122      15    102.1514    3.0534    99.7133   109.1778
4        112212222       7    120.0058   11.2358   101.6335   127.0482
5        212211122       5    115.3978    2.9215   111.2440   119.4885
6        112222122       4    105.1447    3.6244   101.3757   109.5173
7        212212222       4    116.3312   11.6817   104.1530   126.5681
8        212221122       3    116.0083    0.2030   115.7749   116.1439
9        112211122       3    116.4365    0.3895   115.9896   116.7032
10       112112122       2    105.2668    4.8999   101.8021   108.7316
11       112222222       1    102.0081             102.0081   102.0081
12       212222222       1    111.5826             111.5826   111.5826
13       122212122       1    112.2558             112.2558   112.2558
14       112122122       1    122.5125             122.5125   122.5125

each previous generation were used as the starting values in the next generation in the optimization procedure. This caused diversity in item parameter estimates even with the same chromosome pattern, which led to different χ² statistics. However, the standard deviations of χ² (Tables 5, 6, and 7) and of the item parameter estimates (Tables 6 and 7) would be close to zero with a more rigorous convergence criterion in item parameter estimation. The number of chromosome patterns in Earth Science IA was larger than that for World History A despite its shorter gene length. This was because the sample size for Earth Science IA was smaller than that for World History A.

From Table 5, 56 out of 100 replications converged to [212222221222]′, and 38 replications converged to [212222221221]′ in World History A. This difference essentially results from the feature value (allele) of Item 12: 1 (GRM) or 2 (GPCM). It was difficult to establish whether one was better than the other because there was no significant difference between the χ² statistics of these two chromosome patterns. Deciding whether GRM or GPCM is more suited to Item 12 would require closer qualitative consideration; it would be dangerous to decide with only the quantitative information we have here.
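Comparing two chromosome patterns locus by locus shows directly which items account for the difference. The helper below is a hypothetical illustration that uses the allele coding 1 = GRM and 2 = GPCM and the two leading World History A patterns of Table 5.

```python
MODEL = {"1": "GRM", "2": "GPCM"}

def differing_items(chrom_a, chrom_b):
    """Return {item number: (model in a, model in b)} for loci that differ."""
    if len(chrom_a) != len(chrom_b):
        raise ValueError("chromosomes must have the same length")
    return {
        j: (MODEL[x], MODEL[y])
        for j, (x, y) in enumerate(zip(chrom_a, chrom_b), start=1)
        if x != y
    }

print(differing_items("212222221222", "212222221221"))
# {12: ('GPCM', 'GRM')}
```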

It was found that GPCM was more suited than GRM to the cognitive or response process of students doing WH-A-1A. This means it is plausible that the cognitive process that takes place when the students are doing the testlet is made up of stochastic steps progressing from 0 to 1, 1 to 2, and 2 to 3. This is because Questions 1 to 3 of WH-A-1A require students to deduce the correct answer by interpreting the lead, examining various clues in the picture, and using already-acquired information. On the other hand,


Table 6: Item Parameter Estimates for World History A.

Mean (SD)
Item  Model  χ² (df=20)   a or a∗    b1 or b∗1   b2 or b∗2   b3 or b∗3
1     GPCM   12.2797      0.3605     −1.2386      0.5664      2.3072
             (0.0571)     (0.0003)   (0.0042)    (0.0052)    (0.0067)
2     GRM    18.8759      0.8062     −2.1360     −0.4388      1.2782
             (0.0304)     (0.0005)   (0.0035)    (0.0046)    (0.0055)
3     GPCM   16.5956      0.3499     −1.8651     −0.1196      1.9419
             (0.0324)     (0.0001)   (0.0043)    (0.0048)    (0.0053)
4     GPCM   17.9634      0.4621     −1.1547      0.4591      2.0148
             (0.0764)     (0.0002)   (0.0045)    (0.0050)    (0.0055)
5     GPCM   20.0604      0.5677     −1.6922     −0.1106      0.9402
             (0.0216)     (0.0002)   (0.0042)    (0.0049)    (0.0051)
6     GPCM    8.0974      0.3817     −2.5151      0.0242      1.8607
             (0.0221)     (0.0004)   (0.0027)    (0.0049)    (0.0065)
7     GPCM   36.5708      0.6443     −0.4358      0.5935      1.7070
             (0.0773)     (0.0001)   (0.0047)    (0.0048)    (0.0054)
8     GPCM   18.6965      0.6560     −0.1239      0.8331      1.7143
             (0.0050)     (0.0003)   (0.0049)    (0.0050)    (0.0056)
9     GRM    31.1238      0.6830     −1.7715     −0.1009      1.6540
             (0.0428)     (0.0005)   (0.0037)    (0.0048)    (0.0059)
10    GPCM   39.7847      0.6017     −0.4320      0.7399      1.5047
             (0.0120)     (0.0001)   (0.0047)    (0.0049)    (0.0054)
11    GPCM   40.3242      0.4754     −0.5475      0.8598      2.3154
             (0.0863)     (0.0003)   (0.0044)    (0.0049)    (0.0068)
12    GPCM   21.7629      0.4375     −1.5431      0.0892      1.3373
             (0.0867)     (0.0004)   (0.0039)    (0.0050)    (0.0055)

These results were based on 56 individuals whose chromosomes were [212222221222]′ at the final generation.

WH-A-1B measures the students’ fragmentary knowledge without information processing, and the number of correct answers in this testlet is determined only by the amount of already-acquired knowledge. Therefore, GRM was applied to WH-A-1B because GRM is designed to measure the level of latent ability. These facts became clear only after analysis with the GA-implemented model selection method, and it was difficult to identify the models in advance.

For Items 7, 9, 10, and 11, GPCM, GRM, GPCM, and GPCM were selected, respectively. However, there was little to choose between GRM and GPCM for these items, since their χ² statistics indicated that the goodness of fit was not satisfactory. Other models should be applied to these items.
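To read the item-level χ² statistics against their reference distribution, the upper-tail probability for df = 20 can be computed in closed form, because the chi-square survival function with an even number of degrees of freedom is a finite sum. The sketch below assumes that the statistics are referred to a chi-square distribution with the df = 20 given in Table 6; the function name `chi2_sf_even_df` is hypothetical, and no significance level is implied by the source.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form below needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0               # i = 0 term of sum_i (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Item-level chi-square values for World History A taken from Table 6.
item_chi2 = {7: 36.5708, 9: 31.1238, 10: 39.7847, 11: 40.3242}
for item, x in item_chi2.items():
    print(item, round(chi2_sf_even_df(x, 20), 4))
```

Smaller tail probabilities correspond to poorer fit of the selected model for that item.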

For Earth Science IA, there was a tendency for GPCM to be preferred over GRM in model selection, as was observed for World History A. Three main chromosome patterns appear in Table 5: [212212122]′ (28 times), [112212122]′ (25 times), and [212222122]′ (15 times). This instability in model determination was caused by the unstable alleles of Items 1 and 5. Both GRM and GPCM were found to be applicable to Item 1 (ES-A-1A) because their χ² statistics were not high (df = 20). There is a possibility that the cognitive process that takes place when doing the testlet is at an intermediate level between GRM and GPCM. Therefore, either model could be applied to ES-A-1A (and Item 5). Of course, a more qualitative examination should be done for model selection.


Table 7: Item Parameter Estimates for Earth Science IA.

Mean (SD)
Item  Model  χ² (df=20)   a or a∗    b1 or b∗1   b2 or b∗2   b3 or b∗3
1     GPCM    9.7503      0.6981     −2.0011     −0.6760      0.4415
             (0.2837)     (0.0048)   (0.0163)    (0.0092)    (0.0050)
2     GRM    13.5438      0.6118     −3.1876     −0.8734      1.2814
             (0.0800)     (0.0030)   (0.0206)    (0.0093)    (0.0053)
3     GPCM    5.1949      0.3117     −3.9055     −1.6267      0.3153
             (0.0629)     (0.0028)   (0.0526)    (0.0185)    (0.0067)
4     GPCM    1.8805      0.5063     −1.5933     −0.2111      1.3324
             (0.0420)     (0.0041)   (0.0162)    (0.0069)    (0.0061)
5     GRM     4.1577      0.7081     −2.0503     −0.3191      1.4891
             (0.0678)     (0.0029)   (0.0144)    (0.0079)    (0.0047)
6     GPCM   22.1850      0.5895     −0.7966      0.2716      0.5650
             (0.1178)     (0.0028)   (0.0098)    (0.0052)    (0.0045)
7     GRM    26.5676      0.9397     −1.6688     −0.1795      1.1062
             (0.1826)     (0.0042)   (0.0134)    (0.0071)    (0.0036)
8     GPCM    7.9974      0.1645     −4.7618     −1.1294      2.5908
             (2.0815)     (0.0096)   (0.3578)    (0.0858)    (0.1578)
9     GPCM   11.3018      0.4851     −2.5580      0.0713      2.4546
             (0.0713)     (0.0029)   (0.0205)    (0.0060)    (0.0104)

These results were based on 28 individuals whose chromosomes were [212212122]′ at the final generation.

However, the χ² statistics of Items 6 and 7 would easily be rejected if the sample size were larger, although their actual values were not rejected under df = 20. Neither model was suitable, and other models or an unknown model may suit these items.

5. Discussion

A method of selecting item response models with a genetic algorithm was proposed, where the indicator variable W = {wjk} was regarded as a chromosome to distinguish other individuals. This scheme enabled a model for each item to be selected automatically. The GA with the set of techniques implemented in our method is called the simple genetic algorithm (SGA; Vose, 1999), and a GA with a more elaborate setup, for example, the adaptive GA (Davis, 1989; Srinivas & Patnaik, 1994), could also be used. However, the results of simulations with (S, Rs, N) = (32, 1/4, 2000) were satisfactory even under the SGA, although a larger S would be desirable when the number of items is larger than the 12 used in the simulation studies and the actual data example (World History A).

An issue with GRM and GPCM was investigated in the simulations and numerical examples. In short, we need to know which is the more useful of these two prevailing models. If anything, the results of simulation indicated that GRM was more useful, since cases where GRM fit the data generated under GPCM were observed more frequently than the opposite case. This demonstrated the flexibility of the mathematical form of GRM, and was consistent with the simulations done by De Ayala, Dodd, & Koch (1992). However, the GPCM was more beneficial for items in two real data sets, and this finding was actually different from the prediction. The data used by Baker, Rounds, & Zevon (2000) were obtained from a psychological questionnaire, and the adopted format was a Likert scale. However, the two data sets analyzed here were from testlet items where the assumption of local independence could not hold and the probability of a correct answer became larger when a previous item was answered correctly. GPCM would have a higher degree of application to such cases.

There must be some compatibility between the item format and the model, for instance, GRM for Likert-type items and GPCM for testlet items, since a certain item format leads to an appropriate cognitive process, and such a cognitive process indicates a specific model. However, an item format does not always specify the same model, because the selected models were not uniform in the numerical examples, in which every item had a multiple-choice format. The item content, as well as the item format, was also important in determining the model. Therefore, the proposed GA framework is useful, and the results of the analysis should help in examining item profiles from various angles. This exploratory method of model determination promises to enrich our knowledge about items.

However that may be, the true model for each item is always unknown. We need to compare the simulation results for (N, n) = (2000, 12) with the analysis example results for World History A, whose (N, n) = (2046, 12). From Figures 2 and 3, the greatest difference between them is in the χ² statistics. The value averaged over 100 replications in the final generations for World History A was slightly over 280, whereas those in the simulation were around 95. This was because each response data set used in the simulation was generated to follow the corresponding model. However, real data are not always consistent with the model. Model selection with the proposed procedure involves relative comparisons of values of the fitness evaluation function. Therefore, several items were not suitable for the selected models in an absolute sense.

When the numbers of models and items are two and twelve, the number of all chromosome patterns is 2^12 = 4,096. If we examine all the combinations, the probability of true model recovery is almost 1.0. However, only 32 × 10 = 320 model fits are required with a population of 32 individuals over 10 generations and Rs = 1/4. Therefore, our framework reduces the computation to 320/4096 (about 8%) of an exhaustive search in this case. It is efficient because the reproducibility of the true model was around 0.95 in Table 2.
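The arithmetic behind this comparison is shown below as a small sketch. It assumes that every individual in every generation counts as one model fit; the variable names are hypothetical.

```python
n_models, n_items = 2, 12
pop_size, n_generations = 32, 10

exhaustive = n_models ** n_items           # every possible chromosome pattern
ga_evaluations = pop_size * n_generations  # individuals evaluated by the GA
print(exhaustive, ga_evaluations, ga_evaluations / exhaustive)
# 4096 320 0.078125
```

The exhaustive count grows exponentially with the number of items, while the GA cost grows only with the population size and the number of generations.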

Two model candidates were used for the simulations and numerical examples in this study due to practical concerns, although the computational burden of this method was another reason. With greater computational power, other polytomous models such as the steps model (Verhelst, Glas & de Vries, 1997) and the sequential model (Tutz, 1990) could be possible candidates. The results for the numerical examples would then change. However, the specifications of the personal computer used in this study were sufficient for the present day (June 2005), i.e., the processor was an Intel(R) Pentium(R) 4 CPU at 3.60 GHz, and the RAM was 2 GB. Furthermore, there is the possibility that unknown models will fit the present data well. Therefore, the development of new polytomous models should also be continued.


REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–723.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Baker, F.B. (1992). Item Response Theory: Parameter Estimation Techniques. New York: Marcel Dekker, Inc.

Baker, J.E. (1985). Adaptive selection methods for genetic algorithms. In Proceedings of the 1st International Conference on Genetic Algorithms and Their Applications, pp.101–111.

Baker, J.G., Rounds, J.B., & Zevon, M.A. (2000). A comparison of graded response and Rasch partial credit models with subjective well–being. Journal of Educational and Behavioral Statistics, 25, 253–270.

Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.

Bock, R.D. (1997). The nominal categories model. In W.J. van der Linden, & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp.33–50). New York: Springer-Verlag.

Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443–459.

Bock, R.D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.

Bozdogan, H. (1987). Model selection and Akaike’s information criteria (AIC). Psychometrika, 52, 345–370.

Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Davis, L. (1989). Adapting operator probabilities in genetic algorithms. Proceedings of the 3rd International Conference on Genetic Algorithms, 61–69.

De Ayala, R.J., Dodd, B.G., & Koch, W.R. (1992). A comparison of the partial credit model and graded response models in computerized adaptive testing. Applied Measurement in Education, 5, 17–34.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

Fujimori, S. (1999). Item selection using a genetic algorithm. Bulletin of Human Science (Research Bulletin of Faculty of Human Science, Bunkyo University), 21, 57–66.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Boston: Addison Wesley Longman, Inc.

Hambleton, R.K., & Swaminathan, H. (1985). Item Response Theory. Boston: Kluwer–Nijhoff.

Hemker, B.T., Sijtsma, K., Molenaar, I.W., & Junker, B.W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62, 331–347.

Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor: The University of Michigan Press.

Jiang, H., & Tang, K.L. (1998). New method of calibrating IRT models. Paper presented at the Annual Meeting of the National Council on Measurement in Education (San Diego, CA, April 14–16).

Jöreskog, K.G., & Sörbom, D. (1993). LISREL 8: Structural Equation Modeling with the SIMPLIS Command Language. Hillsdale: Lawrence Erlbaum Associates.

Kullback, S., & Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale: Lawrence Erlbaum Associates.

Marcoulides, G.A., & Drezner, Z. (2001). Specification searches in structural equation modeling with a genetic algorithm. In G.A. Marcoulides, & R.E. Schumacker (Eds.), New Developments and Techniques in Structural Equation Modeling (pp.247–268). Mahwah: Lawrence Erlbaum Associates, Inc.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.

McKinley, R.L., & Mills, C.N. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9, 49–57.

Mislevy, R.J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.

Muraki, E. (1997). A generalized partial credit model. In W.J. van der Linden, & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp.153–164). New York: Springer-Verlag.

Ogasawara, H. (1998). A factor analysis model for a mixture of various types of variables. Behaviormetrika, 25, 1–12.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No.17.

Samejima, F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203–219.

Samejima, F. (1974). Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika, 39, 111–121.

Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17–35.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Schwefel, H.P. (1995). Evolution and optimum seeking. New York: Wiley.

Shojima, K., & Toyoda, H. (2004). Item parameter estimation when a test contains different item response models. The Japanese Journal of Educational Psychology, 52, 61–70.

Srinivas, M., & Patnaik, L.M. (1994). Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24, 656–667.

Syswerda, G. (1989). Uniform crossover in genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms, 2–9.

Tanner, M.A. (1993). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions. New York: Springer-Verlag.

Thissen, D. (1991). MULTILOG User’s Guide–Version 6. Chicago: Scientific Software International Inc.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.

Tsutakawa, R.K. (1984). Estimation of two–parameter logistic item response curves. Journal of Educational Statistics, 9, 263–276.

Tsutakawa, R.K., & Lin, H.Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267.

Tutz, G. (1990). Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology, 43, 39–55.

van der Ark, L.A. (2001). Relationships and properties of polytomous item response theory models. Applied Psychological Measurement, 25, 273–282.

van der Linden, W.J., & Glas, C.A.W. (Eds.), (2000). Computerized Adaptive Testing: Theory and Practice. Dordrecht: Kluwer Academic Publishers.

van der Linden, W.J., & Hambleton, R.K. (Eds.), (1997). Handbook of Modern Item Response Theory. New York: Springer-Verlag.

Verhelst, N.D., Glas, C.A.W., & de Vries, H.H. (1997). A steps model to analyze partial credit. In W.J. van der Linden, & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp.123–138). New York: Springer-Verlag.

Vose, M.D. (1999). The Simple Genetic Algorithm: Foundations and Theory. Cambridge: The MIT Press.

Wang, T., & Zeng, L. (1998). Item parameter estimation for a continuous response model using an EM algorithm. Applied Psychological Measurement, 22, 333–344.

Zhang, B.-T., & Kim, J.-J. (2000). Comparison of selection methods for evolutionary optimization. An International Journal on the Internet, 2, 55–70.

(Received March 15 2006, Revised October 20 2006)
