
Applying Ideal Point IRT Models to Score Single Stimulus and Pairwise Preference Personality Items

Stephen Stark (USF)

Oleksandr S. Chernyshenko (UC, NZ)

Fritz Drasgow (UIUC)

Overview

“Problems” with current personality assessment procedures

The case for ideal point response process assumptions in personality

Ideal point IRT models for single statement and pairwise preference items

Score comparability study

Personality Scale Construction Today

Rooted in Classical Test Theory (CTT) and Common Factor Theory (CFT)

Uses single stimulus format, fixed-length scales, and total scores in all analyses and interpretations

Existing inventories:

  Are static

  Contain a large number of relatively short scales

Problem #1

Current scales worked well for research purposes, where the interest is to “understand the relationship” between constructs

But these measures are not well suited for adaptive formats or feedback purposes:

  Item parameters are scale dependent

  Item difficulties do not directly correspond to item content, because of reverse scoring

  Scales are too short to have good precision

More flexible test construction technology is needed

Problem #2

CTT and CFT make a dominance response process assumption:

  This has been “adopted” from cognitive ability testing

To satisfy the constraints of the dominance assumption:

  Reverse scoring of negative items is introduced

  Neutral or extreme items are deleted from item pools because they have low item-total correlations (loadings)

This results in depleted item pools and scales with properties more suitable for scholarship exams

Dominance Response Process and Personality Items (MBR, 2001; JAP, 2006)

A person endorses an item if her standing on the latent trait, theta, is more extreme than that of the item.

Only appropriate for moderately positive/negative items (e.g., “I like/dislike parties”)

[Figure: dominance item response function. x-axis: Theta (-3 to 3); y-axis: Prob of Positive Response (0.0 to 1.0); labels: Item, Person.]

Ideal Point Process: A More Flexible Alternative?

A person endorses an item if her standing on the latent trait, theta, is near that of the item.

“My social skills are about average.” Disagree either because:

  Too introverted (uncomfortable talking to people)

  Too extraverted (great skills)

[Figure: ideal point item response function. x-axis: Theta (-3.0 to 3.0); y-axis: 0.0 to 1.0; labels: Item, Too Introverted, Too Extraverted.]
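The contrast between the two response processes can be sketched numerically. The snippet below is only an illustration: the logistic curve is a generic dominance-type function, and the bell-shaped curve is a hypothetical stand-in for an ideal point function, not one of the models used later in this talk. The function names and the parameter values (`a`, `width`) are assumptions chosen for the example.

```python
import math

def dominance_prob(theta, b, a=1.5):
    """Generic logistic (dominance) curve: endorsement rises with theta.
    a (slope) and b (item location) are illustrative values."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ideal_point_prob(theta, delta, width=1.0):
    """Hypothetical bell-shaped (ideal point) curve: endorsement is highest
    when theta is near the item location delta and falls off on both sides."""
    return math.exp(-((theta - delta) ** 2) / (2.0 * width ** 2))

# "My social skills are about average." is located near the middle (delta = 0).
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(dominance_prob(theta, b=0.0), 3),
          round(ideal_point_prob(theta, delta=0.0), 3))
```

Under the dominance curve the most extraverted respondent is the most likely to endorse the item; under the ideal point curve both the very introverted and the very extraverted respondent are unlikely to endorse it.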

Ideal Point Process and Personality (JAP, 2006; Psych Assessment, in press)

Ideal point IRT models provided better fit to a wider variety of personality items than dominance IRT models

Many nonmonotonic but highly discriminating items have been found

30% more items were retained in item pools:

  More items are available for scale construction

Conclusions and Further Basic Research

Ideal point process offers numerous advantages for improving current measures

More research is needed:

  Only a few ideal point models are available; more flexibility is needed

  Item and person parameter estimation must be improved (APM, 2005)

  Responses to adaptive scales may be more complicated than we think

Note that this research carries limited applied value, because traditional items are easily FAKED

Single Stimulus Response Format

Items consist of individual statements:

  I get along well with others. (A+)

  I try to be the best at everything I do. (C+)

  I insult people. (A-)

  My peers call me “absent minded.” (C-)

Agree/Disagree or Likert-type (SD, D, N, A, SA) response options are used

In each case, the socially desirable response is obvious.

How to Deal With Faking?

Social Desirability (SD) scales are often used to “detect” and “correct” for faking:

  Adjustments made to content scale scores

  Little effect on validity

Correcting for faking using SD scores is problematic, because…

  SD scales may function differently across testing situations (JAP, 2001)

Need to develop fake-resistant items

Search for Fake-Resistant Formats

Empirically keyed, nontransparent items:

  But problems with construct and face validity data

Biodata or situational judgments:

  Do not measure personality directly

  Can be easily faked as soon as respondents are told personality is being assessed

Forced-choice (FC) items:

  Halo and other biases are reduced (Borman et al., 2001)

  Intuitively, should reduce faking (Jackson et al., 2000)

Unidimensional Pairwise Preference Format

Create items by pairing stimuli that are on the same dimension but represent different locations on the trait continuum

Sociability item:

  I talk a lot. (+3)

  My social skills are about average. (0)

Respondent chooses the statement that is “More Like Me”

Navy Computer Adaptive Personality Scales (NCAPS) uses this format

Multidimensional Pairwise Preference Format

Create items by pairing stimuli that are similar in desirability but represent different dimensions

Positive item:

  I get along well with others. (A+)

  I set very high standards for myself. (C+)

Negative item:

  I insult people. (A-)

  I work just enough to pass my classes. (C-)

A variation of this approach is the tetrad format (Army AIM or SHL’s OPQ-32-i)

Scoring Forced Choice Measures

Traditional scoring of FC items is problematic:

  Unidimensional FC scale scores have bimodal distributions

  Multidimensional FC scores are ipsative:

    Inter-individual comparisons not possible

    Scale scores correlate negatively (even facets of Big 5)

Scoring lacks a formal psychometric model:

  Difficult to evaluate scoring accuracy

  Does not provide insight about item construction

  Not usable for adaptive testing
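Why traditional multidimensional FC scores are ipsative can be seen in a small sketch. It assumes a simplified scoring rule (one point to the dimension of the chosen statement); the dimension labels A, C, S, the pair list, and the function name are hypothetical and only serve the illustration.

```python
def traditional_fc_scores(choices, pairs):
    """Award one point to the dimension of the chosen statement in each pair.
    choices[i] is 0 if the first statement of pair i was picked, 1 otherwise.
    This is an assumed, simplified version of traditional FC scoring."""
    scores = {}
    for dim_first, dim_second in pairs:
        scores.setdefault(dim_first, 0)
        scores.setdefault(dim_second, 0)
    for choice, (dim_first, dim_second) in zip(choices, pairs):
        chosen_dim = dim_first if choice == 0 else dim_second
        scores[chosen_dim] += 1
    return scores

# Six hypothetical pairs, each matching statements from two dimensions.
pairs = [("A", "C"), ("A", "S"), ("C", "S"), ("A", "C"), ("C", "S"), ("A", "S")]

respondent_1 = traditional_fc_scores([0, 0, 0, 0, 1, 0], pairs)
respondent_2 = traditional_fc_scores([1, 1, 1, 1, 0, 1], pairs)

print(respondent_1, sum(respondent_1.values()))
print(respondent_2, sum(respondent_2.values()))
```

Every respondent's scale scores sum to the number of pairs, so a high score on one dimension forces lower scores on the others. That constraint is what rules out inter-individual comparisons and pushes scale intercorrelations negative.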

Are Forced Choice Scores Equivalent to Traditional Scores?

FC measures are gaining popularity

But direct comparisons of traditional FC and SS scores are not possible:

  “Score inflations” can only be evaluated within measures

  Correlations between measures are low

Before evaluating FC measures in operational settings:

  Scores must be normative

  Under honest conditions, FC and SS scores should be the same

Response Format Study (in review)

Used advances in IRT to obtain normative scores for Order, Self Control, and Sociability:

  36-item Single Stimulus measure

  36-pair Unidimensional Pairwise Preference measure

  36-pair Multidimensional Pairwise Preference measure

All scores were estimated using IRT

All items administered under honest conditions (N = 602 for self reports and N = 110 for observers)

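IRT Model for Single Stimulus Items

Generalized Graded Unfolding Model (GGUM; Roberts et al., 1998):

  GGUM fit personality items well (Chernyshenko, 2002)

  No reverse scoring needed

$$
P(Z_i = z \mid \theta_j) =
\frac{\exp\left\{\alpha_i\left[z(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\}
    + \exp\left\{\alpha_i\left[(M - z)(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\}}
     {\sum_{w=0}^{C}\left(\exp\left\{\alpha_i\left[w(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}
    + \exp\left\{\alpha_i\left[(M - w)(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}\right)}
$$

where $\theta_j$ is the trait level of respondent $j$; $\alpha_i$, $\delta_i$, and $\tau_{ik}$ are the discrimination, location, and threshold parameters of item $i$ (with $\tau_{i0} = 0$); $z = 0, \dots, C$ indexes the observable response categories; and $M = 2C + 1$.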
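The GGUM category probabilities can be computed directly from the formula. The following is a minimal sketch using only the Python standard library; the function name `ggum_probs` and all parameter values are assumptions made for illustration, not estimates from the study.

```python
import math

def ggum_probs(theta, alpha, delta, taus):
    """Return [P(Z=0), ..., P(Z=C)] for one item under the GGUM.

    taus holds the thresholds tau_1..tau_C; tau_0 = 0 is prepended here.
    """
    C = len(taus)
    M = 2 * C + 1
    tau = [0.0] + list(taus)
    numerators = []
    cumulative_tau = 0.0
    for z in range(C + 1):
        cumulative_tau += tau[z]  # running sum of tau_0..tau_z
        from_below = math.exp(alpha * (z * (theta - delta) - cumulative_tau))
        from_above = math.exp(alpha * ((M - z) * (theta - delta) - cumulative_tau))
        numerators.append(from_below + from_above)
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical dichotomous item (0 = Disagree, 1 = Agree) located at delta = 0.
for theta in (-2.0, 0.0, 2.0):
    p_disagree, p_agree = ggum_probs(theta, alpha=1.5, delta=0.0, taus=[-1.0])
    print(theta, round(p_agree, 3))
```

The probability of agreeing is highest when theta equals the item location and drops symmetrically on both sides, which is why no reverse scoring is needed.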

Example: “Ideal Point IRT” Order Scale

[Figure: GGUM fit plot for ORD23: “My room neatness is about average.” x-axis: Theta (-3.0 to 3.0); y-axis: Probability of Positive Response (0.0 to 1.0); curves: ORF, EMP.]

[Figure: GGUM fit plot for ORD24: “Half of the time I do not put things in their proper place.” x-axis: Theta (-3.0 to 3.0); y-axis: Probability of Positive Response (0.0 to 1.0); curves: ORF, EMP.]

IRT Model for Scoring Unidimensional Pairwise Preferences (Stark & Drasgow, 2002)

Zinnes and Griggs (1974) Probabilistic Unfolding Model (ZG model)

Idea: Respondent has an ideal point representing his/her perception of typical behavior (trait level)

Task: On each trial, respondent chooses the statement that better describes him/her


Equation for ZG Item Response Functions

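$$
P_{jk}(\theta) = 1 - \Phi(a_{jk}) - \Phi(b_{jk}) + 2\,\Phi(a_{jk})\,\Phi(b_{jk})
$$

$$
a_{jk} = (2\theta - \mu_j - \mu_k)/\sqrt{3}, \qquad b_{jk} = \mu_j - \mu_k
$$

where $\theta$ is the respondent's trait level, $\mu_j$ and $\mu_k$ are the locations of stimuli $j$ and $k$, and $\Phi$ is the cumulative standard normal distribution function.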

[Figure: IRF for stimulus pair j = 17, k = 18. x-axis: theta (ticks at 2.00 to 8.00); y-axis: Pj,k(theta), from 0.00 to 1.00.]
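The ZG item response function is easy to evaluate numerically. The sketch below assumes the standard form of the model, P = 1 - Phi(a) - Phi(b) + 2 Phi(a) Phi(b) with a = (2 theta - mu_j - mu_k) / sqrt(3) and b = mu_j - mu_k; the function names are hypothetical, and the stimulus locations reuse the Sociability example from the unidimensional pairwise preference slide (+3 and 0).

```python
import math

def std_normal_cdf(x):
    """Cumulative standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zg_prob(theta, mu_j, mu_k):
    """Probability of preferring stimulus j over stimulus k under the ZG model."""
    a = (2.0 * theta - mu_j - mu_k) / math.sqrt(3.0)
    b = mu_j - mu_k
    phi_a = std_normal_cdf(a)
    phi_b = std_normal_cdf(b)
    return 1.0 - phi_a - phi_b + 2.0 * phi_a * phi_b

# j: "I talk a lot." (+3)   k: "My social skills are about average." (0)
for theta in (0.0, 1.5, 3.0):
    print(theta, round(zg_prob(theta, mu_j=3.0, mu_k=0.0), 3))
```

A respondent located at the midpoint of the two statements (theta = 1.5) is indifferent, and the preference for the more extreme statement grows as theta moves toward +3.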

IRT Model for Scoring Multidimensional Pairwise Preferences (Stark, 2002; Stark, Chernyshenko, & Drasgow, 2005)

Respondent evaluates each stimulus (personality statement) separately and makes independent decisions about endorsement.

Stimuli may be on different dimensions.

Single stimulus response probabilities P{0} and P{1} are computed using a unidimensional ideal point model for “traditional” items (GGUM)

$$
P_{(s>t)_i}(\theta_{d_s}, \theta_{d_t})
= \frac{P_{st}\{1,0\}}{P_{st}\{1,0\} + P_{st}\{0,1\}}
\approx \frac{P_s\{1\}\,P_t\{0\}}{P_s\{1\}\,P_t\{0\} + P_s\{0\}\,P_t\{1\}}
$$

Model notation: $s$ and $t$ are the two stimuli in pair $i$; $\theta_{d_s}$ and $\theta_{d_t}$ are the respondent's trait levels on the dimensions those stimuli measure; 1 = Agree, 0 = Disagree.

Refer to the new pairwise preference model as MDPP
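The MDPP preference probability can be sketched by combining two single-stimulus agreement probabilities. The example below assumes a dichotomous GGUM for each statement; the function names and every parameter value are hypothetical and chosen only to show the mechanics.

```python
import math

def ggum_agree_prob(theta, alpha, delta, tau):
    """P{1} (Agree) for a dichotomous GGUM item (C = 1, so M = 3)."""
    d = theta - delta
    disagree = 1.0 + math.exp(alpha * 3.0 * d)
    agree = math.exp(alpha * (d - tau)) + math.exp(alpha * (2.0 * d - tau))
    return agree / (agree + disagree)

def mdpp_prob(theta_s, theta_t, item_s, item_t):
    """Probability of preferring statement s to statement t.

    theta_s, theta_t: trait levels on the dimensions measured by s and t.
    item_s, item_t: (alpha, delta, tau) tuples for the two statements.
    """
    p_s1 = ggum_agree_prob(theta_s, *item_s)
    p_t1 = ggum_agree_prob(theta_t, *item_t)
    endorse_s_only = p_s1 * (1.0 - p_t1)   # P_s{1} * P_t{0}
    endorse_t_only = (1.0 - p_s1) * p_t1   # P_s{0} * P_t{1}
    return endorse_s_only / (endorse_s_only + endorse_t_only)

# Hypothetical positive pair: an Agreeableness statement s and a
# Conscientiousness statement t, both located at delta = 2.
item_s = (1.5, 2.0, -1.0)
item_t = (1.5, 2.0, -1.0)
print(round(mdpp_prob(2.0, 0.0, item_s, item_t), 3))  # high A, average C
print(round(mdpp_prob(0.0, 2.0, item_s, item_t), 3))  # average A, high C
```

Because the probability depends on the respondent's standing on each dimension separately, trait scores estimated from these choices are not forced to sum to a constant.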

Normative Score Recovery

Roberts et al. (2000) and Stark (1998, 2002) showed in simulation studies:

  Accurate normative scores could be recovered for GGUM, ZG, and MDPP models

  10 items or pairs per dimension are sufficient to obtain reasonable estimates

But no empirical study has compared scores from these 3 formats, even under “honest” conditions

Results for Conscientiousness Facets

                        Order                 Self Control
Dimension     Format    GGUM   MDPP   ZG      GGUM   MDPP   ZG
Order         GGUM      .83
Order         MDPP      .76    .76
Order         ZG        .75    .74    .75
Self Control  GGUM      .34    .42    .39     .67
Self Control  MDPP      .18    .34    .25     .64    .69
Self Control  ZG        .23    .31    .28     .66    .62    .70

Correlations = reliability

Positive correlation for MDPP facet scores.

Results for Order and Sociability

                        Order                 Sociability
Dimension     Format    GGUM   MDPP   ZG      GGUM   MDPP   ZG
Order         GGUM      .83
Order         MDPP      .76    .76
Order         ZG        .75    .74    .75
Sociability   GGUM      -.08   -.20   -.13    .77
Sociability   MDPP      -.10   -.14   -.13    .79    .76
Sociability   ZG        -.05   -.18   -.12    .73    .73    .73

Correlations = reliability

Criterion Validities

                                 Order                 Sociability
Criterion                        GGUM   MDPP   ZG      GGUM   MDPP   ZG
Preventative Health Behaviors    .15    .16    .21     .09    .01    .04
Traffic Risk Behaviors           -.17   -.22   -.23    .08    .06    .18
Substance Avoidance              .09    .17    .14     -.20   -.18   -.18
Study Behaviors                  .38    .38    .38     .02    .01    .00

Criterion validities are comparable

Conclusions

Under honest conditions, MDPP, ZG, and SS versions of the questionnaire provided equivalent measurement and can be viewed as alternate forms

Moving toward FC formats did not affect the validity of personality scores.

Observing a positive correlation between Order and Self Control MDPP scales provided empirical evidence for normative scoring

Current Research

Results of this study speak in favor of using ZG and MDPP IRT models for scoring FC scales

Having IRT models makes the transition to adaptive testing easy

Adaptive format may offer the additional benefit of fake resistance (see NCAPS presentations for recent IMTA talks)

Current studies:

  How to best pair stimuli?

  How many unidimensional pairings are needed?

  Will increasing the number of dimensions lead to more fake-resistant scores?

  Can we better detect faking using forced choice than traditional format?
