Applying Ideal Point IRT Models to Score Single Stimulus and
Pairwise Preference Personality Items
Stephen Stark (USF)
Oleksandr S. Chernyshenko (UC, NZ)
Fritz Drasgow (UIUC)
2
Overview
“Problems” with current personality
assessment procedures
The case for ideal point response
process assumptions in personality
Ideal point IRT models for single
statement and pairwise preference items
Score comparability study
3
Personality Scale Construction Today
Rooted in Classical Test Theory (CTT) and
Common Factor Theory (CFT)
Uses single stimulus format, fixed length
scales and total scores in all analyses and
interpretations
Existing inventories
  Are static
  Contain a large number of relatively short scales
4
Problem # 1
Current scales worked well for research purposes,
where the interest is to “understand the
relationship” between constructs
But these measures are not well suited for
adaptive formats or feedback purposes
  Item parameters are scale dependent
  Item difficulties do not directly correspond to item content,
  because of reverse scoring
  Scales are too short to have good precision
More flexible test construction technology is needed
5
Problem # 2
CTT and CFT make a dominance response process
assumption
  This has been “adopted” from cognitive ability testing
To satisfy constraints of the dominance assumption
  Reverse scoring of negative items is introduced
  Neutral or extreme items are deleted from item pools
  because they have low item-total correlations (loadings)
This results in depleted item pools and scales with
properties more suitable for scholarship exams
6
Person endorses item if her standing on the latent trait, theta, is more extreme than that of the item.
Only appropriate for moderately positive/negative items (e.g., “I like/dislike parties”)
[Figure: probability of positive response plotted against theta (-3 to 3), with the item and person locations marked]
Dominance Response Process and Personality Items (MBR, 2001; JAP, 2006)
Person endorses item if her standing on the latent trait, theta, is near
that of the item.
Example: “My social skills are about average.” Disagree either because:
Too introverted (uncomfortable talking to people)
Too extraverted (great skills)
Ideal Point Process: A More Flexible Alternative?
[Figure: probability of positive response plotted against theta (-3.0 to 3.0); the item lies between the “Too Introverted” and “Too Extraverted” regions]
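The contrast between the two response processes can be sketched numerically. The snippet below is only an illustration, not a model from the talk: it assumes a two-parameter logistic curve for the dominance process and a simple squared-distance (single-peaked) curve for the ideal point process, and all parameter values and function names are hypothetical.

```python
import math

def dominance_prob(theta, location, discrimination=1.5):
    """2PL-style dominance curve: endorsement rises monotonically with theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - location)))

def ideal_point_prob(theta, location, width=1.0):
    """Illustrative single-peaked curve: endorsement falls with distance from the item."""
    return math.exp(-((theta - location) ** 2) / (2.0 * width ** 2))

# Hypothetical neutral item such as "My social skills are about average.",
# assumed to sit at location 0 on the trait continuum.
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  "
          f"dominance={dominance_prob(theta, 0.0):.2f}  "
          f"ideal point={ideal_point_prob(theta, 0.0):.2f}")
```

Under the dominance curve the most extraverted respondent is the most likely to endorse the neutral item; under the ideal point curve both very introverted and very extraverted respondents are unlikely to endorse it.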
8
Ideal Point Process and Personality (JAP, 2006; Psych Assessment, in
press)
Ideal point IRT models provided better fit to a
wider variety of personality items than
dominance IRT models
Many nonmonotonic, but highly discriminating
items have been found
30% more items were retained in item pools
  More items are available for scale construction
9
Conclusions and Further Basic Research
Ideal point process offers numerous advantages for
improving current measures
More research is needed
  Only a few ideal point models are available; more flexibility is
  needed
  Item and person parameter estimation must be improved
  (APM, 2005)
  Responses to adaptive scales may be more complicated
  than we think
Note that this research carries limited applied value,
because traditional items are easily FAKED
10
Single Stimulus Response Format
Items consist of individual statements
  I get along well with others. (A+)
  I try to be the best at everything I do. (C+)
  I insult people. (A-)
  My peers call me “absent minded.” (C-)
Agree/Disagree or Likert type (SD,D,N,A,SA) response options are used
In each case, socially desirable response is obvious.
11
How to Deal With Faking?
Social Desirability (SD) scales often used to
“detect” and “correct” for faking
  Adjustments made to content scale scores
  Little effect on validity
Correcting for faking using SD scores is
problematic, because…
  SD scales may function differently across testing situations
  (JAP, 2001)
Need to develop fake-resistant items
12
Search for Fake-Resistant Formats
Empirically keyed, nontransparent items
  But problems with construct and face validity data
Biodata or situational judgments
  Do not measure personality directly
  Can be easily faked as soon as respondents are told
  personality is being assessed
Forced-choice (FC) items
  Halo and other biases are reduced (Borman et al., 2001)
  Intuitively, should reduce faking (Jackson et al., 2000)
13
Unidimensional Pairwise Preference Format
Create items by pairing stimuli that are on the same dimension, but representing different locations on the trait continuum
Sociability item:
  I talk a lot. (+3)
  My social skills are about average. (0)
Respondent chooses statement that is “More Like Me”
Navy Computer Adaptive Personality Scales (NCAPS) uses this format
14
Multidimensional Pairwise Preference Format
Create items by pairing stimuli that are similar in desirability, but representing different dimensions
Positive item:
  I get along well with others. (A+)
  I set very high standards for myself. (C+)
Negative item:
  I insult people. (A-)
  I work just enough to pass my classes. (C-)
Variation of this approach is the tetrad format (Army AIM or SHL’s OPQ-32-i)
15
Scoring Forced Choice Measures
Traditional scoring of FC items is problematic
  Unidimensional FC scale scores have bi-modal distributions
  Multidimensional FC scores are ipsative
    Inter-individual comparisons not possible
    Scale scores correlate negatively (even facets of Big 5)
Scoring lacks a formal psychometric model
  Difficult to evaluate scoring accuracy
  Does not provide insight about item construction
  Not usable for adaptive testing
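The ipsativity problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not the scoring used by any operational instrument: traditional scoring is assumed to give one point to the dimension of the chosen statement, the choice rule is a hypothetical noisy comparison of trait levels, and the three true traits are generated as independent.

```python
import random
from itertools import combinations

random.seed(0)
DIMENSIONS = ["Order", "Self Control", "Sociability"]
# 36 pairs; each pair crosses two of the three dimensions (assumed design).
PAIRS = list(combinations(range(3), 2)) * 12

def traditional_fc_scores(true_traits):
    """Give one point to the dimension of the statement chosen in each pair."""
    scores = [0, 0, 0]
    for s, t in PAIRS:
        # Hypothetical choice rule: noisy comparison of the two trait levels.
        prefers_s = (true_traits[s] + random.gauss(0, 1)
                     > true_traits[t] + random.gauss(0, 1))
        scores[s if prefers_s else t] += 1
    return scores

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Respondents whose true traits are independent of one another.
people = [traditional_fc_scores([random.gauss(0, 1) for _ in DIMENSIONS])
          for _ in range(500)]
columns = list(zip(*people))

row_totals = {sum(p) for p in people}  # every respondent's scores sum to 36
off_diagonal = [covariance(columns[a], columns[b])
                for a, b in combinations(range(3), 2)]
print("distinct row totals:", row_totals)
print("scale covariances:", [round(c, 2) for c in off_diagonal])
```

Because every respondent's scale scores sum to the same constant, the covariances among scales must sum to a negative number even though the simulated traits are uncorrelated, which is why such scores cannot support inter-individual comparisons.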
16
Are Forced Choice Scores Equivalent to Traditional Scores?
FC measures are gaining popularity
But direct comparisons of traditional FC and SS scores are not possible
  “Score inflations” can only be evaluated within measures
  Correlations between measures are low
Before evaluating FC measures in operational settings:
  Scores must be normative
  Under honest conditions, FC and SS scores should be the same
17
Response Format Study (in review)
Used advances in IRT to obtain normative scores for
Order, Self Control and Sociability
  36-item Single Stimulus measure
  36-pair Unidimensional Pairwise Preference measure
  36-pair Multidimensional Pairwise Preference measure
All scores were estimated using IRT
All items administered under honest conditions
(N=602 for self reports and N=110 for observers)
18
IRT Model for Single Stimulus Items
Generalized Graded Unfolding Model
(GGUM; Roberts et al., 1998)
GGUM fit personality items well
(Chernyshenko, 2002)
No reverse scoring needed
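\[
P(Z_i = z \mid \theta_j) =
\frac{\exp\{\alpha_i[z(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}]\}
    + \exp\{\alpha_i[(M - z)(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}]\}}
     {\sum_{w=0}^{C}\left(\exp\{\alpha_i[w(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}]\}
    + \exp\{\alpha_i[(M - w)(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}]\}\right)}
\]
where z = 0, ..., C is the observed response, M = 2C + 1, \theta_j is the person location, \delta_i the item location, \alpha_i the item discrimination, and \tau_{ik} the thresholds (\tau_{i0} = 0)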
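A minimal sketch of how GGUM response probabilities can be computed, assuming the standard parameterization with discrimination alpha, item location delta, and thresholds tau_1..tau_C (tau_0 fixed at 0). The function name and the parameter values in the usage lines are hypothetical and are not estimates from the study.

```python
import math

def ggum_probs(theta, alpha, delta, taus):
    """Probabilities of observable responses z = 0..C under the GGUM.

    taus holds the thresholds tau_1..tau_C; tau_0 is fixed at 0.
    """
    C = len(taus)
    M = 2 * C + 1
    tau = [0.0] + list(taus)
    numerators = []
    for z in range(C + 1):
        tau_sum = sum(tau[: z + 1])
        # Two subjective responses (from below and from above the item)
        # map onto the same observable response z.
        below = math.exp(alpha * (z * (theta - delta) - tau_sum))
        above = math.exp(alpha * ((M - z) * (theta - delta) - tau_sum))
        numerators.append(below + above)
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical dichotomous item (0 = Disagree, 1 = Agree) located at delta = 0.
for theta in (-3.0, 0.0, 3.0):
    p_disagree, p_agree = ggum_probs(theta, alpha=1.5, delta=0.0, taus=[-1.0])
    print(f"theta={theta:+.1f}  P(agree)={p_agree:.3f}")
```

The probability of agreeing peaks when theta is near delta and drops off symmetrically on either side, so no reverse scoring is needed for items located in the middle or at the low end of the continuum.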
19
Example: “Ideal Point IRT” Order Scale
[Figure: GGUM fit plot for ORD23, “My room neatness is about average.” Probability of positive response plotted against theta (-3.0 to 3.0); curves labeled ORF and EMP]
[Figure: GGUM fit plot for ORD24, “Half of the time I do not put things in their proper place.” Probability of positive response plotted against theta (-3.0 to 3.0); curves labeled ORF and EMP]
20
IRT Model for Scoring Unidimensional Pairwise Preferences (Stark &
Drasgow, 2002)
Zinnes and Griggs (1974) Probabilistic
Unfolding Model (ZG model)
Idea: Respondent has ideal point
representing his/her perception of typical
behavior (trait level)
Task: On each trial, respondent chooses the
statement that better describes him/her
Equation for ZG Item Response Functions
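\[
P_{jk}(\theta) = 1 - \Phi(a_{jk}) - \Phi(b_{jk}) + 2\,\Phi(a_{jk})\,\Phi(b_{jk})
\]
\[
a_{jk} = (2\theta - \mu_j - \mu_k)/\sqrt{3}, \qquad b_{jk} = \mu_j - \mu_k
\]
where \Phi is the cumulative standard normal and \mu_j, \mu_k are the locations of stimuli j and k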
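As a rough sketch, the ZG item response function can be evaluated directly from the respondent's trait level and the two stimulus locations: P = 1 - Phi(a) - Phi(b) + 2 Phi(a) Phi(b), with a = (2 theta - mu_j - mu_k) / sqrt(3) and b = mu_j - mu_k. The function names are hypothetical; the stimulus locations in the usage lines are taken from the Sociability example pair (+3 and 0), and treating those scale values as model locations is an assumption.

```python
import math

def std_normal_cdf(x):
    """Cumulative standard normal distribution, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zg_prob(theta, mu_j, mu_k):
    """Probability that a respondent at theta prefers stimulus j to stimulus k."""
    a = (2.0 * theta - mu_j - mu_k) / math.sqrt(3.0)
    b = mu_j - mu_k
    phi_a, phi_b = std_normal_cdf(a), std_normal_cdf(b)
    return 1.0 - phi_a - phi_b + 2.0 * phi_a * phi_b

# Pair: "I talk a lot." (location +3) vs. "My social skills are about average." (0).
for theta in (0.0, 1.5, 3.0):
    print(f"theta={theta:.1f}  P(prefer 'I talk a lot')={zg_prob(theta, 3.0, 0.0):.3f}")
```

A respondent located near a stimulus is more likely to choose it as “More Like Me”, and when the two stimuli share a location the choice probability is 0.5 for everyone, which is why informative pairs need stimuli at different locations.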
[Figure: IRF for stimulus pair j = 17, k = 18; P_{j,k}(\theta) on the vertical axis, from 0.00 to 1.00]
23
Respondent evaluates each stimulus (personality statement)
separately and makes independent decisions about endorsement.
Stimuli may be on different dimensions.
Single stimulus response probabilities P{0} and P{1} computed
using a unidimensional ideal point model for “traditional” items
(GGUM)
IRT Model for Scoring Multidimensional Pairwise Preferences (Stark, 2002; Stark, Chernyshenko,
& Drasgow, 2005)
\[
P_{(s>t)_i}(\theta_{d_s}, \theta_{d_t})
= \frac{P_{st}\{1,0\}}{P_{st}\{1,0\} + P_{st}\{0,1\}}
\approx \frac{P_s\{1\}\,P_t\{0\}}{P_s\{1\}\,P_t\{0\} + P_s\{0\}\,P_t\{1\}}
\]
Model notation: 1 = Agree, 0 = Disagree
Refer to new pairwise preference model as MDPP
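The MDPP preference probability can be sketched by combining single-statement agreement probabilities: P(s preferred to t) is approximately P_s{1} P_t{0} / (P_s{1} P_t{0} + P_s{0} P_t{1}). The snippet assumes the dichotomous GGUM for P{1}; the function names and all item parameters are hypothetical and not taken from the study.

```python
import math

def ggum_agree_prob(theta, alpha, delta, tau):
    """P{1}: probability of agreeing with one statement (dichotomous GGUM)."""
    d = theta - delta
    agree = math.exp(alpha * (d - tau)) + math.exp(alpha * (2.0 * d - tau))
    disagree = 1.0 + math.exp(alpha * 3.0 * d)
    return agree / (agree + disagree)

def mdpp_prob(theta_s, theta_t, item_s, item_t):
    """Probability of preferring statement s to statement t.

    theta_s and theta_t are the respondent's standings on the dimensions
    measured by s and t; each item is a tuple (alpha, delta, tau).
    """
    ps1 = ggum_agree_prob(theta_s, *item_s)
    pt1 = ggum_agree_prob(theta_t, *item_t)
    prefer_s = ps1 * (1.0 - pt1)   # agree with s, disagree with t
    prefer_t = (1.0 - ps1) * pt1   # disagree with s, agree with t
    return prefer_s / (prefer_s + prefer_t)

# Hypothetical pair of positively worded statements on two dimensions.
statement_s = (1.5, 2.0, -1.0)   # e.g. an Agreeableness statement
statement_t = (1.5, 2.0, -1.0)   # e.g. a Conscientiousness statement
print(round(mdpp_prob(2.0, -1.0, statement_s, statement_t), 3))
```

A respondent who is high on the dimension of statement s and low on the dimension of statement t is very likely to pick s, so the pattern of choices carries normative information about both trait levels.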
25
Normative Score Recovery
Roberts et al. (2000) and Stark (1998, 2002)
showed in simulation studies:
  Accurate normative scores could be recovered for
  GGUM, ZG and MDPP models
  10 items or pairs per dimension are sufficient to
  obtain reasonable estimates
But no empirical study has compared scores
from these 3 formats, even under “honest”
conditions
26
Results for Conscientiousness Facets
Dimension     Format   Order               Self Control
                       GGUM  MDPP  ZG      GGUM  MDPP  ZG
Order         GGUM     .83
              MDPP     .76   .76
              ZG       .75   .74   .75
Self Control  GGUM     .34   .42   .39     .67
              MDPP     .18   .34   .25     .64   .69
              ZG       .23   .31   .28     .66   .62   .70
Correlations = reliability
Positive correlation for MDPP facet scores.
27
Results for Order and Sociability
Correlations = reliability
Dimension     Format   Order               Sociability
                       GGUM  MDPP  ZG      GGUM  MDPP  ZG
Order         GGUM     .83
              MDPP     .76   .76
              ZG       .75   .74   .75
Sociability   GGUM    -.08  -.20  -.13     .77
              MDPP    -.10  -.14  -.13     .79   .76
              ZG      -.05  -.18  -.12     .73   .73   .73
28
Criterion Validities
Criterion                       Order               Sociability
                                GGUM  MDPP  ZG      GGUM  MDPP  ZG
Preventative Health Behaviors    .15   .16   .21     .09   .01   .04
Traffic Risk Behaviors          -.17  -.22  -.23     .08   .06   .18
Substance Avoidance              .09   .17   .14    -.20  -.18  -.18
Study Behaviors                  .38   .38   .38     .02   .01   .00
Criterion validities are comparable
29
Conclusions
Under honest conditions, MDPP, ZG, and SS
versions of the questionnaire provided equivalent
measurement and can be viewed as alternate
forms
Moving toward FC formats did not affect the validity
of personality scores.
Observing a positive correlation between Order
and Self Control MDPP scales provided empirical
evidence for normative scoring
30
Current Research
Results of this study speak in favor of using ZG and MDPP IRT models for scoring FC scales
Having IRT models makes transition to adaptive testing easy
Adaptive format may offer additional benefit of fake resistance (see NCAPS presentations for recent IMTA talks)
Current studies:
  How to best pair stimuli?
  How many unidimensional pairings are needed?
  Will increasing # of dimensions lead to more fake-resistant scores?
  Can we better detect faking using forced choice than traditional format?