Introduction to Item Response Theory
TRANSCRIPT
[Slide 1]
Day 1 AM: An Introduction to Item Response Theory
Nathan A. Thompson
Vice President, Assessment Systems Corporation
Adjunct faculty, University of …
[email protected]
[Slide 2]
Welcome!
Thank you for attending!
Introductions and important info now
Software… download or USB
Please ask questions
◦ Also, slow me down or ask for translation!
Goal: provide an intro on IRT/CAT to those who are new
◦ For those with some experience, to provide new viewpoints and more resources/recommendations
[Slide 3]
Where I'm from, professionally
PhD, University of Minnesota
◦ CAT for classification
Test development manager for ophthalmology certifications
Psychometrician at Prometric (many certifications)
VP at ASC
[Slide 4]
Where I'm from, geographically
[Slide 5]
Except now things look like…
[Slide 6]
We do odd things in winter
[Slide 7]
Introduce yourselves
Name
Employer/organization
Types of tests you do and/or why you are interested in IRT/CAT
(There might be someone with similar interests here)
[Slide 8]
Another announcement
Newly formed: International Association for Computerized Adaptive Testing (IACAT)
◦ www.iacat.org
◦ Free membership
◦ Growing resources
◦ Next conference: August 2012, Sydney
[Slide 9]
Welcome!
This workshop is on two highly related topics: IRT and CAT
IRT is the modern paradigm for developing, analyzing, scoring, and linking tests
CAT is a next-generation method of delivering tests
CAT is not feasible without IRT, so we do IRT first
[Slide 10]
IRT – where are we going?
IRT, as many of you know, provides a way of analyzing items
However, it has drawbacks (no distractor analysis), so the main reasons to use IRT are at the test level
It solves certain issues with classical test theory (CTT)
But the two should always be used together
[Slide 11]
IRT – where are we going?
Advantages
◦ Better error characterization
◦ More precise scores
◦ Better linking
◦ Model-based
◦ Items and people on the same scale (CAT)
◦ Sample-independence
◦ Powerful test assembly
[Slide 12]
IRT – where are we going?
Keyword: paradigm or approach
◦ Not just another statistical analysis
◦ It is a different way of thinking about how tests should work, and how we can approach specific problems (scaling, equating, test assembly) from that viewpoint
[Slide 13]
Day 1
There will be four parts this morning, covering the theory behind IRT:
◦ Rationale: A graphical introduction to IRT
◦ Models (dichotomous and polytomous) and their response functions
◦ IRT scoring (θ estimation)
◦ Item parameter estimation and model fit
[Slide 14]
Part 1
A graphical introduction to IRT
[Slide 15]
What is IRT? Basic assumptions
1. Unidimensionality
◦ A unidimensional latent trait (one at a time)
◦ Item responses are independent of each other (local independence), except for the trait/ability that they measure
2. A specific form of the relationship between trait level and probability of a response
◦ The response function, or IRT model
◦ There are a growing number of models
[Slide 16]
What is IRT?
A theory of mathematical functions that model the responses of examinees to test items/questions
These functions are item response functions (IRFs)
Historically, it has also been known as latent trait theory and item characteristic curve theory
The IRFs are best described by showing how the concept is derived from classical analysis…
[Slide 17]
Classical item statistics
CTT statistics are typically calculated for each option

| Option | N   | Prop  | Rpbis  | Mean   |
|--------|-----|-------|--------|--------|
| 1      | 307 | 0.860 | 0.221  | 91.876 |
| 2      | 25  | 0.070 | -0.142 | 85.600 |
| 3      | 14  | 0.039 | -0.137 | 83.929 |
| 4      | 11  | 0.031 | -0.081 | 86.273 |
[Slide 18]
Classical item statistics
The proportions are often translated to a figure like this, where examinees are split into groups
[Slide 19]
Classical item statistics
The general idea of IRT is to split the previous graph up into more groups, and then find a mathematical model for the blue line
This is what makes the item response function (IRF)
[Slide 20]
Classical item statistics
Example with 10 groups
[Slide 21]
The item response function
Reflects the probability of a given response as a function of the latent trait (z-score)
Example:
[Slide 22]
The IRF
For dichotomously scored items, it is the probability of a correct or keyed response
Also called item characteristic curve (ICC) or trace line
Only one curve (correct response); all other responses are grouped as (1 − IRF)
For polytomous items (partial credit, etc.), it is the probability of each response
[Slide 23]
The IRF
How do we know exactly what the IRF for an item is?
We estimate parameters for an equation that draws the curve
For dichotomous IRT, there are three relevant parameters: a, b, and c
[Slide 24]
The IRF
a: The discrimination parameter; represents how well the item differentiates examinees; the slope of the curve at its center
b: The difficulty parameter; represents how easy or hard the item is with respect to examinees; the location of the curve (left to right)
c: The pseudoguessing parameter; represents the "base probability" of answering the question correctly; the lower asymptote
[Slide 25]
The IRF
a = 1, b = 0, c = 0.25
[Slide 26]
The IRF…
is the "basic building block" of IRT
will differ from item to item
can be one of several different models (now)
can be used to evaluate items (now)
is used for IRT scoring (next)
leads to "information" used for test design (after that)
is the basis of CAT (tomorrow)
[Slide 27]
Part 2
IRT models
[Slide 28]
IRT models
Several families of models
◦ Dichotomous
◦ Polytomous
◦ Multidimensional
◦ Facets (scenarios vs. raters)
◦ Mixed (additional parameters)
◦ Cognitive diagnostic
◦ We will focus on the first two
[Slide 29]
Dichotomous IRT models
There are 3 main models in use, as mentioned earlier: 1PL, 2PL, 3PL
The "L" refers to "logistic," which is the type of equation
IRT was originally developed decades ago with a cumulative normal curve
This meant that calculus was needed to evaluate it
[Slide 30]
The logistic function
An approximation was developed: the logistic curve
No calculus needed
There are two formats, based on the scaling constant D
If D = 1.702, the difference from the normal curve is < 0.01
If D = 1.0, a little more difference; called the true logistic form
It does not really matter which, as long as you are consistent
[Slide 31]
The logistic function
The basic form of the curve
[Slide 32]
Item parameters
We add parameters to slightly modify the shape to get it to match our data
For example, a 4-option multiple-choice item has a 25% chance of being guessed correctly
So we add a c parameter as a lower asymptote, which means that the curve is "squished" so it never goes below 0.25 (next)
[Slide 33]
Item parameters
Sample IRF to show c
[Slide 34]
Item parameters
We can also add a parameter (a) that modifies the slope
And a b parameter that slides the entire curve left or right
◦ Tells us the person z-score for which the item is appropriate
Items can be evaluated based on these just like with CTT statistics
A little more next…
[Slide 35]
Item parameters: a
The a parameter ranges from 0.0 to about 2.0 in practice (theoretically to infinity)
Higher means better discrimination
For achievement testing, 0.7 or 0.8 is good; aptitude testing runs higher
Helps you: remove items with a < 0.4? Identify items with a > 1.0 as great items?
[Slide 36]
Item parameters: b
For what person z-score is the item appropriate? (non-Rasch)
Should be between -3 and 3
◦ 99.9% of students are in that range
0.0 is an average person
1.0 is difficult (about the 85th percentile)
-1.0 is easy (about the 15th percentile)
2.0 is super difficult (98th)
-2.0 is super easy (2nd)
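Those percentile figures follow directly from treating b as a z-score on the standard normal distribution. A quick check with only the standard library (the exact values are 84.1% and 97.7%, which the slide rounds):

```python
from math import erf, sqrt

def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Interpreting each b value as a person z-score:
for b in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"b = {b:+.1f}: about {normal_cdf(b):.1%} of examinees fall below")
```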
[Slide 37]
Item parameters: b
If item difficulties are normally distributed, where does this item fall? (Rasch)
0.0 is an average item (NOT person)
[Slide 38]
Item parameters: c
The c parameter should be about 1/k, where k is the number of options
If it is higher, this indicates that the distractor options are not attractive
For example, suppose c = 0.5
This means there is a 50/50 chance of guessing correctly
That implies that even the lowest students are able to rule out two options and guess between the other two
[Slide 39]
Item parameters
Extreme example:
◦ What is 23 + 25?
 A. 48
 B. 47
 C. 3.141529…
 D. 1,256,457
[Slide 40]
The (3PL) logistic function
Here is the equation for the 3PL, so you can see where the parameters are inserted (item i, person j):

P(X_ij = 1 | θ_j) = c_i + (1 − c_i) / (1 + e^(−D·a_i·(θ_j − b_i)))

Equivalent formulations can be seen in the literature, such as writing the (1 − c_i) term above the line as the numerator of the fraction
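Transcribed into code, the 3PL is a one-liner. A minimal sketch (the parameter values here are arbitrary illustrations):

```python
from math import exp

def p_3pl(theta, a, b, c, D=1.702):
    """3PL: P(X = 1 | theta) = c + (1 - c) / (1 + e^(-D * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

# At theta = b, the curve sits exactly halfway between c and 1:
print(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625
```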
[Slide 41]
The (3PL) logistic function
a_i is the item discrimination parameter for item i,
b_i is the item difficulty or location parameter for item i,
c_i is the lower asymptote, or pseudoguessing parameter, for item i,
D is the scaling constant equal to 1.702 or 1.0.
[Slide 42]
The (3PL) logistic function
The P is due primarily to (θ − b)
The effect due to a and c is not as strong
That is, your probability of getting the item correct is mostly due to whether it is easy/difficult for you
◦ This leads to the idea of adaptive testing
[Slide 43]
3PL
IRT has 3 dichotomous models
I'll now go through the models in more detail, from the 3PL down to the 1PL
The 3PL is appropriate for knowledge or ability testing, where guessing is relevant
Each item will have an a, b, and c parameter
[Slide 44]
IRT models
Three 3PL IRFs: c = 0, 0.1, 0.2 (b = -1, 0, 1; a = 1, 1, 1)
[Figure: three item response functions plotted over θ from -3 to 3, probability from 0.0 to 1.0]
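The figure's main point – each curve flattens out at its own c on the left – can be checked numerically. This assumes the standard 3PL formula, with the parameters listed above:

```python
from math import exp

def p_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

# The three items from the figure: (a, b, c)
for a, b, c in [(1.0, -1.0, 0.0), (1.0, 0.0, 0.1), (1.0, 1.0, 0.2)]:
    # Far below the item's difficulty, the probability approaches c, not 0
    print(f"c = {c}: P(theta = -3) = {p_3pl(-3.0, a, b, c):.3f}")
```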
[Slide 45]
2PL
The 2PL assumes that there is no guessing (c = 0.0)
Items can still differ in discrimination
This is appropriate for attitude or psychological-type data with dichotomous responses
◦ I like recess time at school (T/F)
◦ My favorite subject is math (T/F)
[Slide 46]
IRT models
Three 2PL IRFs: a = 0.75, 1.5, 0.3; b = -1.0, 0.0, 1.0
[Figure: three item response functions plotted over θ from -3 to 3, probability from 0.0 to 1.0]
[Slide 47]
1PL
The 1PL assumes that all items are of equal discrimination
Items only differ in terms of difficulty
The raw score is now a sufficient statistic for the IRT score
Not the case with the 2PL or 3PL; it's not just how many items you get right, but which ones
◦ 10 hard items vs. 10 easy items
[Slide 48]
1PL
The 1PL is also appropriate for attitude or psychological-type data, but where there is no reason to believe items differ substantially in terms of discrimination
This is rarely the case
Still used: see the Rasch discussion later
[Slide 49]
1PL
Three 1PL IRFs: b = -1, 0, 1
[Figure: three item response functions plotted over θ from -3 to 3, probability from 0.0 to 1.0]
[Slide 50]
How to choose?
Characteristics of the items
Check with the data! (fit)
Sample size:
◦ 1PL = 100 minimum
◦ 2PL = 300 minimum
◦ 3PL = 500 minimum
Score report considerations (sufficient statistics)
[Slide 51]
The Rasch perspective
Another argument in the choice
There is a group of psychometricians (mostly from Australia and Chicago) who believe that the 1PL is THE model
Everything else is just noise
Data should be "cleaned" to reflect this
[Slide 52]
The Rasch perspective
How to clean? A big target is to eliminate guessing
But how do you know?
Slumdog Millionaire effect
[Slide 53]
The Rasch perspective
This group is very strong in their belief
Why? They believe it is "objective" measurement
Score scale centered on items, not people, so "person-free"
Software and journals devoted just to the Rasch idea
[Slide 54]
The Rasch perspective
Should you use it?
I was trained to never use Rasch
◦ The equal-discrimination assumption is completely unrealistic… we all know some items are better than others
◦ We all know guessing should not be ignored
◦ Data should probably not be doctored
◦ Instead, data should drive the model
[Slide 55]
The Rasch perspective
However, while some researchers hate the Rasch model, I don't
◦ It is very simple
◦ It works better with tiny samples
◦ It is easier to describe
◦ Score reports and sufficient statistics
◦ Discussion points from you?
◦ Nevertheless, I recommend IRT
[Slide 56]
Polytomous models
Polytomous models are for items that are not scored correct/incorrect, yes/no, etc.
Two types:
◦ Rating scale or Likert: "Rate on a scale of 1 to 5"
◦ Partial credit – very useful in constructed-response educational items
 My experience as a scorer
[Slide 57]
Polytomous models
Partial credit example with rubric:
◦ Open-response question: "2 + 3(4 + 5) ="
 0: no answer
 1: 2, 3, 4, or 5 (picks one)
 2: 14 (adds all)
 3: 45 (does (2 + 3) × (4 + 5))
 4: 27 (everything but adding the 2)
 5: 29 (correct)
[Slide 58]
The IRF
Polytomous example (CRFs):
[Slide 59]
Comparison table

| Model | Item Disc. | Step Spacing | Step Ordering | Option Disc. |
|-------|------------|--------------|---------------|--------------|
| RSM   | Fixed      | Fixed        | Fixed         | Fixed        |
| PCM   | Fixed      | Variable     | Variable      | Fixed        |
| GRSM  | Variable   | Fixed        | Fixed         | Fixed        |
| GRM   | Variable   | Variable     | Fixed         | Fixed        |
| GPCM  | Variable   | Variable     | Variable      | Fixed        |
| NRM   | Variable (each option) | Variable | Variable | Variable |

Fixed/Variable between items… more later, if time
[Slide 60]
Part 3
Ability (θ) estimation (IRT scoring)
[Slide 61]
Scoring
First: throw out your idea of a "score" as the number of items correct
We actually want something more accurate: the precise z-score
Because the z-score axis is called θ in IRT, the scoring is called θ estimation
[Slide 62]
Scoring
IRT utilizes the IRFs in scoring examinees
If an examinee gets a question right, they "get" the item's IRF
If they get the question wrong, they "get" the (1 − IRF)
These curves are multiplied across all items to get a final curve called the likelihood function
[Slide 63]
Scoring
Here's an example IRF: a = 1, b = 0, c = 0
[Slide 64]
Scoring
A "1 − IRF"
[Slide 65]
Scoring
We multiply those to get a curve like this…
[Slide 66]
Scoring - MLE
The score is the point on the x-axis where the likelihood is highest
This is the maximum likelihood estimate
In the example, 0.0 (average ability)
This obtains precise estimates on the θ scale
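The whole recipe – multiply an IRF or (1 − IRF) per item, then take the θ where the product peaks – fits in a few lines. A brute-force sketch with hypothetical item parameters (operational software uses cleverer search methods, discussed under Newton-Raphson):

```python
from math import exp

def p_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

# Hypothetical 5-item test: (a, b, c) per item, and one examinee's responses
items = [(1.0, -1.5, 0.2), (0.8, -0.5, 0.2), (1.2, 0.0, 0.2),
         (0.9, 0.5, 0.2), (1.1, 1.5, 0.2)]
responses = [1, 1, 1, 0, 0]   # right on the easy items, wrong on the hard ones

def likelihood(theta):
    """Product of IRF (correct) or 1 - IRF (incorrect) across the items."""
    L = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        L *= p if u == 1 else (1.0 - p)
    return L

# Brute-force MLE: scan theta from -4 to 4 and keep the highest point
grid = [i / 100.0 for i in range(-400, 401)]
theta_mle = max(grid, key=likelihood)
```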
[Slide 67]
Maximum likelihood
The LF is technically defined as:

L(u | θ_j) = ∏_{i=1}^{n} P_ij^(u_ij) · Q_ij^(1 − u_ij)

where u is a response vector of 1s and 0s, and Q_ij = 1 − P_ij
Note what the responses do to the exponents
[Slide 68]
Scoring - SEM
A quantification of just how precise the estimate is can also be calculated, called the standard error of measurement
This is assumed to be the same for everyone in classical test theory, but in IRT it depends on the items, the responses, and the level of θ
[Slide 69]
Scoring - SEM
Here's a new LF – the blue one has the same MLE but is less spread out
Both are two items, blue with a = 2
[Slide 70]
Scoring - SEM
The first LF had an SEM ≈ 1.0
The second LF had an SEM ≈ 0.5
We have more certainty about the second person's score
This shows how much high-quality items aid in measurement
◦ Same items and responses, except a higher a
[Slide 71]
Scoring - SEM
SEM is usually used to stop CATs
General interpretation: confidence interval
Plus or minus 1.96 SEMs (about 2) is 95%
So if the SEM in the example is 0.5 and the score is 0.0, we are 95% sure that the student's true ability is somewhere between -1.0 and +1.0
[Slide 72]
Scoring - SEM
If a student gives aberrant responses (cheating, not paying attention, etc.), they will have a larger SEM
This is not enough to accuse them of cheating (they could have just dozed off), but it can provide useful information for research
[Slide 73]
Scoring - SEM
The SEM CI is also used to make decisions
◦ Pass if 2 SEMs above a cutoff
[Slide 74]
Details on IRT scores
Student scores are on the θ scale, which is analogous to the standard normal z scale – same interpretations!
There are four methods of scoring:
◦ Maximum Likelihood (MLE)
◦ Bayesian Modal (or MAP, for maximum a posteriori)
◦ Bayesian EAP (expectation a posteriori)
◦ Weighted MLE (less common)
[Slide 75]
Maximum likelihood
Take the likelihood function "as is" and find the highest point
[Slide 76]
Maximum likelihood
Problem: all-incorrect or all-correct response vectors (the LF has no interior maximum)
[Slide 77]
Bayesian modal
Addresses that problem by always multiplying the LF by a bell-shaped curve (a prior), which forces it to have a maximum somewhere
Still find the highest point
[Slide 78]
Bayesian EAP
Argues that the curve is not symmetrical, and that we should not ignore everything except the maximum
So it takes the "average" of the curve by splitting it into many slices and finding the weighted average
The slices are called quadrature points or nodes
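Both Bayesian methods start from the same object – the LF multiplied by a bell-shaped prior – and differ only in the last step: modal (MAP) takes the highest point, EAP takes the weighted average over the nodes. A sketch with hypothetical 2PL items and a standard normal prior on evenly spaced nodes (operational software often uses more refined quadrature):

```python
from math import exp, pi, sqrt

def p_2pl(theta, a, b, D=1.702):
    return 1.0 / (1.0 + exp(-D * a * (theta - b)))

def normal_pdf(z):
    """Standard normal prior density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]   # (a, b) per item
responses = [1, 1, 1]   # all correct: the plain MLE would run off to +infinity

def posterior(theta):
    """Prior times likelihood; the prior guarantees an interior maximum."""
    L = normal_pdf(theta)
    for (a, b), u in zip(items, responses):
        p = p_2pl(theta, a, b)
        L *= p if u == 1 else (1.0 - p)
    return L

nodes = [i / 10.0 for i in range(-40, 41)]       # 81 quadrature points
weights = [posterior(q) for q in nodes]

theta_map = max(nodes, key=posterior)                                   # modal
theta_eap = sum(q * w for q, w in zip(nodes, weights)) / sum(weights)   # EAP
```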
[Slide 79]
Bayesian EAP
Example: see the 3PL tail
[Slide 80]
Bayesian EAP
Simple EAP overlay: θ ≈ -0.50
[Slide 81]
Bayesian
Why Bayesian?
◦ Nonmixed response vectors
◦ Asymmetric LF
Why not Bayesian?
◦ Biased inward – if you found the θ estimates of 1,000 students, the SD would be smaller with the Bayesian estimates, maybe 0.95
[Slide 82]
Newton-Raphson
Most IRT software actually uses a somewhat different approach to MLE and Bayesian modal estimation
The straightforward way is to calculate the value of the LF at each point in θ, within reason
For example, -4 to 4 at steps of 0.001
That's 8,000 calculations! Too much for 1970s computers…
[Slide 83]
Newton-Raphson
Newton-Raphson is a shortcut method that searches the curve iteratively for its maximum
Why? The same 0.001 level of accuracy in only 5 to 20 iterations
Across thousands of students, that is a huge number of calculations saved
But it has certain issues (local maxima or minima)… maybe time to abandon it?
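The iteration can be sketched for a 2PL likelihood. This uses the Fisher-scoring flavor of the step (expected information in place of the exact second derivative, a common stabilizing variant); the item parameters are hypothetical:

```python
from math import exp

def p_2pl(theta, a, b, D=1.702):
    return 1.0 / (1.0 + exp(-D * a * (theta - b)))

def mle_newton(items, responses, theta=0.0, tol=0.001, max_iter=20):
    """Climb the log-likelihood iteratively instead of scanning a grid."""
    D = 1.702
    for _ in range(max_iter):
        score = info = 0.0
        for (a, b), u in zip(items, responses):
            p = p_2pl(theta, a, b)
            score += D * a * (u - p)              # first derivative of ln L
            info += (D * a) ** 2 * p * (1.0 - p)  # expected information
        step = score / info
        theta += step
        if abs(step) < tol:                        # converged to ~0.001
            break
    return theta

items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
responses = [1, 1, 0]   # a mixed response vector, so the MLE is finite
theta_hat = mle_newton(items, responses)
```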
[Slide 84]
Examples
See the IRT Scoring and Graphing Tool
[Slide 85]
Part 4
Item parameter estimation
How do we get a, b, and c?
[Slide 86]
The estimation problem
Estimating student θ given a set of known item parameters is easy, because we have something established
But what about the first time a test is given?
All items are new, and there are no established student scores
[Slide 87]
The estimation problem
Which came first, the chicken or the egg?
Since we don't know, we go back and forth, trying one and then the other:
◦ Fix "temporary" z-scores
◦ Estimate item parameters
◦ Fix the new item parameters
◦ Estimate scores
◦ Do it again until we're satisfied
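That back-and-forth loop can be sketched with the Rasch model on made-up 6×4 data (one Newton step per side per cycle; real calibration software adds many safeguards, and MML marginalizes over θ rather than fixing point estimates):

```python
from math import exp

def p_rasch(theta, b):
    return 1.0 / (1.0 + exp(-(theta - b)))

# Toy response matrix: 6 persons x 4 items (1 = correct), for illustration only
X = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 0],
     [1, 1, 1, 0]]
thetas = [0.0] * len(X)
bs = [0.0] * len(X[0])

for _ in range(25):  # "do it again until we're satisfied"
    # Fix item parameters, update each person's score (one Newton step)
    for j, row in enumerate(X):
        p = [p_rasch(thetas[j], b) for b in bs]
        thetas[j] += sum(u - q for u, q in zip(row, p)) / sum(q * (1 - q) for q in p)
    # Fix the scores, update each item's difficulty
    for i in range(len(bs)):
        p = [p_rasch(t, bs[i]) for t in thetas]
        bs[i] -= sum(X[j][i] - p[j] for j in range(len(X))) / sum(q * (1 - q) for q in p)
    # Anchor the scale (Rasch convention: mean item difficulty = 0)
    mean_b = sum(bs) / len(bs)
    bs = [b - mean_b for b in bs]
```

Items answered correctly by fewer people end up with higher b, and persons with higher raw scores end up with higher θ, which is all the loop is trying to achieve.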
[Slide 88]
Calibration algorithms
There are two main calibration algorithms:
◦ Joint maximum likelihood (JML) – older
◦ Marginal maximum likelihood (MML) – newer, and works better with smaller samples… the standard
◦ Also conditional maximum likelihood, but it only works with the 1PL, so it is rarer
◦ New in research, but not in standard software: Markov chain Monte Carlo
[Slide 89]
Calibration algorithms
The term maximum likelihood is used here because we are maximizing the likelihood L(X | b, θ) of the entire data set, for all items i and persons j, where:
X is the data set of responses x_ij,
b is the set of item parameters b_i,
θ is the set of examinee θ_j's
[Slide 90]
Calibration algorithms
This means we want to find the b and θ that make that number the largest
So we set θ, find a good b, use it to score students and find a new θ, find a better b, etc…
◦ Marginal ML uses marginal distributions rather than exact points, hence it being faster and working better with smaller samples of people/items
[Slide 91]
Calibration algorithms
Note: rather than examine the LF (which gets incredibly small), software examines -2·ln(LF)
IRT software tracks these iterations because they provide information on model fit
See output
![Page 92: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/92.jpg)
Part 4 (cont.)
Assumptions of IRT: Model-data fit
![Page 93: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/93.jpg)
Checking fit
One assumption of IRT (#2) is that our data actually follow the IRT model!
This applies at both the item and the test level
It also applies to examinees: they should be getting items wrong that are above their θ and getting items correct that are below their θ
![Page 94: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/94.jpg)
Model-data fit
Whenever fitting any mathematical model to empirical data (not just IRT), it is important to assess fit
Fit refers to whether the model adequately represents the data
Alternatively, whether the data fall far from the model
![Page 95: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/95.jpg)
Model-data fit
There are two types of fit important in IRT:
◦Item (and test) – compares observed data to the IRF
◦Person – evaluates whether individual examinees are responding according to the model (easy items correct, hard items incorrect)
![Page 96: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/96.jpg)
Model-data fit
Remember the 10-group empirical IRF that I drew? This is great!
![Page 97: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/97.jpg)
Model-data fit
You’re more likely to see something like this:
![Page 98: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/98.jpg)
Model-data fit
Or even worse…
![Page 99: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/99.jpg)
Model-data fit
Note that if we drew an IRF in each of those graphs, it would be about the same
But it is obviously less appropriate in Graph #3 (“even worse”)
Fit analyses provide a way of quantifying this
![Page 100: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/100.jpg)
Item fit
The most basic approach is to subtract the observed proportion correct from the expected value for each slice (g) of θ
This is then summarized in a chi-square statistic
Bigger = worse fit
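A minimal sketch of this slice-and-compare computation, assuming the 3PL model (the function names, grouping scheme, and evaluation of the IRF at the slice mean are my own illustrative choices, not any particular program's method):

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_chi_square(theta, x, a, b, c, n_groups=10):
    """Chi-square item fit: slice examinees into equal-count theta groups,
    then compare the observed proportion correct in each slice to the
    probability the IRF predicts at the slice center."""
    edges = np.quantile(theta, np.linspace(0, 1, n_groups + 1))
    grp = np.clip(np.searchsorted(edges, theta, side="right") - 1, 0, n_groups - 1)
    chi2 = 0.0
    for g in range(n_groups):
        mask = grp == g
        if not mask.any():
            continue
        expected = irf_3pl(theta[mask].mean(), a, b, c)  # IRF at slice center
        observed = x[mask].mean()
        chi2 += mask.sum() * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2
```

An item whose responses track the IRF yields a small chi-square; an item whose observed proportions ignore θ yields a large one.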
![Page 101: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/101.jpg)
Item fit
Graphical depiction:
![Page 102: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/102.jpg)
Item fit
Better fit
![Page 103: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/103.jpg)
Item fit
The slices are called quadrature points
These are also used for item parameter estimation
The number of slices for the chi-square need not be the same as for estimation, but matching them helps interpretation
![Page 104: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/104.jpg)
Item fit
The chi-square is oversensitive to sample size
A better way is to compute standardized residuals
Divide the chi-square by its df = G − m, where G is the number of slices and m is the number of item parameters
This is more interpretable because of the well-known scale
0 is OK; examine items > 2
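That division is a one-liner; this tiny helper (the name is made up) just makes the df = G − m convention explicit, with 2 as the slide's flagging rule of thumb:

```python
def standardized_fit(chi2, n_slices, n_item_params):
    """Chi-square divided by its degrees of freedom, df = G - m,
    where G is the number of theta slices and m the number of
    item parameters (1, 2, or 3 for the 1PL/2PL/3PL)."""
    return chi2 / (n_slices - n_item_params)
```

For example, a chi-square of 14.0 over 10 slices for a 3PL item gives 14.0 / (10 − 3) = 2.0, right at the review threshold.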
![Page 105: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/105.jpg)
Item fit
For broad analysis of fit, use quantile plots (Xcalibre, Iteman, or Lertap)
◦3 to 7 groups
◦Can find hidden issues (my example: social desirability in Likert item #2)
See Xcalibre output
◦Fit statistics
◦Fit graphs (many more groups, plus the IRF)
![Page 106: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/106.jpg)
Person fit
Is an examinee responding oddly?
The most basic measure: take the log of the LF at its maximum (the θ estimate):

$l_0 = \ln \prod_{i=1}^{n} P_i(\hat\theta)^{u_i} \, Q_i(\hat\theta)^{1-u_i}$

A higher number means we are more sure of the estimate
But this is dependent on the level of θ, so we need it standardized: lz
![Page 107: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/107.jpg)
Person fit
lz is like a z-score for fit: z = (x − μ)/s
Less than −2 means bad fit

$l_z = \dfrac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}}$

$E(l_0) = \sum_{i=1}^{n} \left[ P_i(\hat\theta) \ln P_i(\hat\theta) + \left(1 - P_i(\hat\theta)\right) \ln\left(1 - P_i(\hat\theta)\right) \right]$

$\mathrm{Var}(l_0) = \sum_{i=1}^{n} P_i(\hat\theta)\left(1 - P_i(\hat\theta)\right) \left[ \ln \dfrac{P_i(\hat\theta)}{1 - P_i(\hat\theta)} \right]^2$
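The lz computation is straightforward to sketch for a single examinee, given a vector of scored responses u and the model probabilities P at the θ estimate (the function name is mine; this is an illustration, not Xcalibre's implementation):

```python
import numpy as np

def lz_person_fit(u, p):
    """Standardized person-fit statistic lz: the log-likelihood at the
    theta estimate (l0), centered by its expectation E(l0) and scaled
    by the square root of its variance."""
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e_l0) / np.sqrt(var_l0)
```

A model-consistent examinee (easy items correct, hard items incorrect) gets lz near or above zero; an aberrant pattern (the reverse) drives lz well below −2.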
![Page 108: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/108.jpg)
Person fit
lz is sensitive to the distribution of item difficulties
It works best when there is a range of difficulty
That is, if there are no items for high-ability examinees, none of them will have a good estimate!
Best to evaluate groups, not individuals
![Page 109: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/109.jpg)
How is fit useful?
Throw out items?
Throw out people?
Change the model used?
Bad fit can flag other possible issues
◦Speededness: fit (and N) gets worse at the end of the test
◦Multidimensionality: certain areas
![Page 110: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/110.jpg)
How is fit useful?
Note that this ties in with the estimation process
IRT calibration is not “one-click”
Review results, then make adjustments:
◦Remove items/people
◦Modify parameter distributions
◦Modify quadrature points
◦Etc.
![Page 111: Introduction to Item Response Theory](https://reader035.vdocuments.net/reader035/viewer/2022081416/58eed19a1a28ab565e8b45c3/html5/thumbnails/111.jpg)
Summary
That was a basic intro to the rationale of IRT
Next we start talking about some applications and uses
We will also examine IRT software and output