Patient reported outcome measures and the Rasch model
Helen Parsons
Outline
Patient reported outcome measures
◦ Quick overview
◦ Analysis problems
Rasch models
◦ Simple Rasch formulation
◦ Rasch extensions: polytomous data
Application of the Rasch model
◦ Using the Oxford Hip Score
◦ Model fit criteria
◦ DIF checking
Summary
Outcome measures
Outcome measures are widespread, with patient reported outcome measures (PROMs) increasingly used
They try to capture some latent trait of the respondent
◦ i.e. some trait that is difficult to measure directly, like “quality of life” or “anxiety”
Often in a self-report questionnaire format
◦ EQ5D
Some outcome measures are reported by clinicians
◦ HoNoS
Sometimes they incorporate clinical findings as well as questionnaire data
◦ DAS 28
Outcome measures have a variety of uses
◦ One-off assessment as a diagnostic tool
◦ Comparative assessment, such as measuring the outcome before and after an intervention
◦ Longitudinal analysis
◦ The NHS records and publishes [1] the aggregated results from 4 PROMs as part of the quality assurance process
1: http://www.ic.nhs.uk/proms
Analysis of outcome measures
As PROMs tend to be in a questionnaire format, results are often reported as a “total score”
◦ i.e. a sum of ordinal scores
Often not “nice” distributions
◦ Not normal
◦ Bi-modal
◦ Floor and ceiling effects
Analysis usually assumes linear relationships
◦ That is, moving from 4/10 to 5/10 is the same clinical gain as moving from 9/10 to 10/10
[Figure: Histogram of total score. x-axis: total (20 to 50); y-axis: frequency (0 to 35)]
Example of PROM baseline data [2]
Here a low score denotes good function
Most patients are on higher values
The tail is abruptly cut off on the right-hand side
◦ A patient can have worse function than others but score the same
2: Data from Nick: OHS from WAT trial (ref: slide 15)
Rasch Models
Part of Item Response Theory
Introduced by Georg Rasch (1901 - 1980)
◦ Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Used in psychometrics, so was created to describe a participant’s ability as measured by item difficulties
◦ Ability: the ‘latent trait’ of the participant, e.g. the “maths ability” of a student
◦ Difficulty: which levels of the latent trait the question can discriminate, i.e. “easy” items identify poor students whilst “hard” items show the difference between good and excellent students
Rasch formulation
Given a data matrix of (binary) scores of n persons (S1, S2, … Sn) on a fixed set of k items (I1, I2, … Ik) that measure the same latent trait, θ
Each subject Sv has a person parameter θv denoting their position on the latent trait (ability)
Each item Ii has an item parameter βi denoting its difficulty
Let:
◦ β represent the vector of item parameters
◦ θ represent the vector of person parameters
◦ X be the n x k data matrix with elements xvi equal to 0 or 1
Then:
\[ P(X_{vi} = x_{vi} \mid \theta_v, \beta_i) = \frac{\exp\left(x_{vi}(\theta_v - \beta_i)\right)}{1 + \exp(\theta_v - \beta_i)} \]
Also assume:
◦ Independence of answers between persons
   No group work, no cheating!
◦ A person’s answers are stochastically independent
   All dependent on ability only
   No person subgroups
◦ The latent trait is uni-dimensional, i.e. the model can be used to assess “shame” but not “anxiety and depression”
Rasch Models: Foundations, recent developments and applications. Fischer and Molenaar. Springer 1995.
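The formula for a single binary response can be sketched in a few lines of Python. This is only an illustration of the equation as stated on the slide; the function name and the numeric values are made up for the example and do not come from the talk or from any Rasch package.

```python
import math


def rasch_probability(x, theta, beta):
    """P(X = x | theta, beta) for a binary response x under the Rasch model.

    theta is the person parameter (ability) and beta the item parameter
    (difficulty), both on the same logit scale.
    """
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return math.exp(x * (theta - beta)) / (1.0 + math.exp(theta - beta))


# Illustrative (made-up) values, not estimates from the talk.
p_equal = rasch_probability(1, theta=0.5, beta=0.5)   # ability equals difficulty
p_able = rasch_probability(1, theta=2.0, beta=0.5)    # ability above difficulty
p_fail = rasch_probability(0, theta=2.0, beta=0.5)    # complementary outcome
```

When ability equals difficulty the probability of a positive answer is one half, and the two outcomes for the same person and item always sum to one.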
[Figure: ICC for Q1. x-axis: latent dimension (ability, -4 to 4); y-axis: probability to solve (0 to 1)]
When varying ability, the item response follows a logistic relationship
The probability of a positive answer is 0.5 when the person ability equals the item difficulty
Given a set difficulty, larger abilities have a greater chance of affirming the item
◦ i.e. better students score more!
This plot is called an “Item Characteristic Curve” (ICC)
Notice that the latent dimension is rescaled to centre on zero and measured in logistic units (logits)
Rasch advantages
A logistic model better captures a finite scale
Gives information on both persons and items
Model parameters are simple to obtain
◦ Total score is sufficient for calculating the person parameter
◦ Item score across persons is sufficient for calculating the item parameter
Extensions include
◦ Polytomous data
◦ 2 and 3 parameter IRT models
   2nd parameter adds a “discrimination” (slope) parameter
   3rd parameter allows “guessing”
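The sufficiency of the total score can be checked numerically for a small case. The sketch below, with hypothetical item difficulties, enumerates all response patterns for three binary items and shows that the probability of a particular pattern, given its total score, does not depend on ability. It is a brute-force illustration, not how any package estimates parameters.

```python
import math
from itertools import product


def pattern_probability(pattern, theta, betas):
    """Probability of a full 0/1 response pattern under the Rasch model."""
    prob = 1.0
    for x, beta in zip(pattern, betas):
        prob *= math.exp(x * (theta - beta)) / (1.0 + math.exp(theta - beta))
    return prob


def conditional_given_score(pattern, theta, betas):
    """P(pattern | total score of pattern), by enumerating all patterns."""
    score = sum(pattern)
    same_score = [p for p in product((0, 1), repeat=len(betas)) if sum(p) == score]
    total = sum(pattern_probability(p, theta, betas) for p in same_score)
    return pattern_probability(pattern, theta, betas) / total


betas = [-1.0, 0.0, 1.5]   # hypothetical item difficulties
pattern = (1, 1, 0)        # total score of 2

cond_low = conditional_given_score(pattern, theta=-2.0, betas=betas)
cond_high = conditional_given_score(pattern, theta=3.0, betas=betas)
```

Both conditional probabilities agree even though the abilities are very different, which is what makes the total score sufficient for the person parameter.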
Rasch extension: polytomous data
The Rasch model uses a pass/fail score
However, what happens when a respondent passes only part of an item?
◦ E.g. exam marking: questions with multiple marks available
◦ E.g. surveys: Likert format questions
Two model variants
Partial Credit Models
◦ Allow a different number of thresholds, each at a separate difficulty, for each item
Rating Scale Models
◦ Items all have the same number of thresholds, with identical spacing between them; items differ only in overall location
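A minimal sketch of the two variants, assuming the standard textbook parameterisation of the partial credit model (category probability proportional to the exponential of the cumulative sum of ability minus threshold), which the talk does not write out. The rating scale model is shown as a constrained partial credit model. Function names and numbers are hypothetical.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Probabilities of categories 0..m for one item, partial credit model.

    thresholds holds the m threshold parameters of the item; category x has
    weight exp(sum over j <= x of (theta - threshold_j)), with an empty sum
    (weight 1) for category 0.
    """
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def rsm_thresholds(item_location, shared_steps):
    """Rating scale model thresholds: one location per item plus steps shared by all items."""
    return [item_location + tau for tau in shared_steps]


# Hypothetical values for illustration only.
pcm_probs = pcm_category_probabilities(0.0, [-1.0, 0.5, 2.0])
at_threshold = pcm_category_probabilities(-1.0, [-1.0, 0.5, 2.0])
binary_case = pcm_category_probabilities(1.0, [0.0])
rsm_item_a = rsm_thresholds(-0.5, [-1.0, 0.0, 1.0])
rsm_item_b = rsm_thresholds(1.0, [-1.0, 0.0, 1.0])
```

With a single threshold the model reduces to the binary Rasch model, and at an ability equal to the first threshold categories 0 and 1 are equally likely. The two rating scale items have the same spacing between thresholds and differ only by a shift.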
Polytomous ICCs
[Figure: ICC plots for Questions 1, 2 and 3 under the rating scale model and under the partial credit model]
The same data (eRm package example) was used to create each model
RSM items only shift left and right
PCM items change shape as well as shift
Software and resources
Several payware packages available
◦ WINSTEPS: www.winsteps.com/
◦ RUMM2020: www.rummlab.com.au/
Freeware becoming available
◦ Several R packages now released
   eRm is used throughout this talk
   ltm and psych also have Rasch implementations
Growing literature base
◦ But introductory books and courses are hard to find!
Example: The Oxford Hip Score (slide 15)
Assesses hip function
Designed to assess patients undergoing hip replacement surgery
◦ Patient reported measure
◦ 12 questions; for each, patients choose the statement (out of 5 possible) that best reflects their situation
◦ Here, each item is marked 0-4 and the total score is summed
   Minimum of 0 indicates ‘perfect’ function
   Maximum of 48
Dawson J, Fitzpatrick R, Carr A, Murray D. Questionnaire on the perceptions of patients about total hip replacement. J Bone Joint Surg Br. 1996 Mar;78(2):185-90.
Data from the WAT trial
◦ The Warwick Arthroplasty Trial
◦ 126 participants at baseline
◦ 2 intervention groups: hip replacement v. resurfacing
Analysed using the partial credit model
◦ Where categories were not all used, the remaining categories were renumbered, starting from 0
Data available from the same cohort longitudinally
Costa ML, Achten J, Parsons N, Edlin RP, Foguet P, Prakash U, Griffin DR. A Randomised Controlled Trial of Total Hip Arthroplasty Versus Resurfacing Arthroplasty in the Treatment of Young Patients with Arthritis of the Hip Joint. BMJ 2012; 344:e2147.
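The renumbering of unused categories could look something like the following sketch. The helper name and the example responses are hypothetical; the talk does not show how the recoding was actually done.

```python
def renumber_categories(responses):
    """Map the categories actually observed for one item onto 0, 1, 2, ...

    The order of the categories is kept. None marks a missing response and is
    passed through unchanged. Returns the recoded responses and the mapping.
    """
    observed = sorted({r for r in responses if r is not None})
    mapping = {old: new for new, old in enumerate(observed)}
    recoded = [mapping[r] if r is not None else None for r in responses]
    return recoded, mapping


# Hypothetical item where categories 0 and 1 were never used.
item_responses = [2, 3, 4, 2, None, 4, 3]
recoded, mapping = renumber_categories(item_responses)
```

Keeping the mapping alongside the recoded data matters here, because a model fitted on renumbered categories cannot score a later response in a category that was dropped.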
[Figure: Person-Item Map, baseline data. Top panel: distribution of abilities (person parameter distribution). Lower panel: item difficulties for OHS0_1 to OHS0_12 along the latent dimension (-3 to 6), showing the category thresholds (1 to 4) and the mean difficulty of each item. Items in red indicate non-sequential categories.]
Item parameters
Question 9 has the lowest mean item parameter
◦ Indicating best function
◦ Have you been limping when walking?
Question 2 has the highest mean item parameter
◦ Indicating worst function
◦ Have you had any trouble washing and drying yourself?
Question 8 covers the widest set of difficulties
◦ Most discriminating item
◦ After a meal (sat at a table), how painful has it been for you to stand up from a chair?
4 questions have non-sequential thresholds
◦ Why does this happen?
Non-sequential categories
[Figure: ICC plots for a non-sequential item (Question 5, OHS0_5) and a sequential item (Question 11, OHS0_11). x-axis: latent dimension (-4 to 4); y-axis: probability (0 to 1); one curve per category 0 to 4, with the thresholds 0|1, 1|2, 2|3 and 3|4 marked]
Thresholds occur where curves cross
Non-sequential categories result from
◦ Underused categories
◦ Unexpected scoring patterns
Could suggest problems with the item
Fixed by
◦ Removal of the item
◦ Combining categories

Total score   OHS Q5 score
              0    1    2    3    4
Up to 10      1    0    0    0    0
11 to 20      8    7    1    0    0
21 to 30     21   16   16    2    4
31 to 40      0    8   14    7   12
41 +          0    0    1    1    7
All          30   31   32   10   23
Person parameters
Can associate scores and abilities
◦ Monotonically increasing relationship
Clear that an increase of 1 in score is associated with different increases in ability
◦ “Bigger” loss of function for low scorers
◦ Middle of the score scale gives similar abilities
[Figure: Plot of the person parameters. x-axis: person raw scores (0 to 40); y-axis: person parameters (theta, -2 to 6)]
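The non-linear link between raw score and ability can be illustrated with a simplified sketch. It assumes binary items with known (here hypothetical) difficulties, whereas the talk fits a partial credit model to 0-4 items, so it shows the idea rather than reproducing the analysis. The maximum likelihood ability for a raw score is the value at which the expected score equals the observed score, found here by bisection.

```python
import math


def expected_score(theta, betas):
    """Expected raw score at ability theta under the binary Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - beta))) for beta in betas)


def ability_for_raw_score(raw_score, betas, lower=-20.0, upper=20.0, tol=1e-10):
    """Ability at which the expected score equals raw_score (bisection search).

    Minimum and maximum raw scores have no finite estimate, so they are rejected.
    """
    if not 0 < raw_score < len(betas):
        raise ValueError("extreme raw scores have no finite estimate")
    while upper - lower > tol:
        middle = (lower + upper) / 2.0
        if expected_score(middle, betas) < raw_score:
            lower = middle
        else:
            upper = middle
    return (lower + upper) / 2.0


# Hypothetical item difficulties, symmetric about zero.
betas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
abilities = {score: ability_for_raw_score(score, betas) for score in range(1, len(betas))}
step_at_edge = abilities[2] - abilities[1]
step_in_middle = abilities[4] - abilities[3]
```

One extra raw point near the end of the scale corresponds to a larger change in ability than one extra point in the middle, and the extreme scores are excluded because no finite ability reproduces them.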
Model results
Baseline score
[Figure: Total score distribution and ability distribution at baseline]
Ability distribution: centred about zero, heavy tail
Comparison use
Have several models at different time points
◦ Could use baseline model throughout
◦ Could use new models at each time point
Have two treatment groups
◦ A and B
Four follow-up points post intervention
Baseline data: ability by treatment group
Change in function between baseline and 6 weeks follow up
[Figure: Plots by treatment group (A, B): raw scores at baseline, raw scores at 6 weeks, calculated abilities at baseline, calculated abilities at 6 weeks]
Using intention to treat groups
No significant differences between groups
Using baseline model at 6 weeks follow up
[Figure: Plots by treatment group (A, B): scores at 6 weeks and predicted abilities at 6 weeks]
No significant differences between groups
Using baseline model at 12 months follow up
[Figure: Scores at 12 months and predicted abilities at 12 months]
[Figure: Differences between scores at 12 months and baseline, and differences between abilities at 12 months (predicted) and baseline (calculated)]
No significant differences in either rating
Primary outcome of trial
Items at 6 weeks
[Figure: Person-Item Map at 6 weeks. Top panel: person parameter distribution. Lower panel: category thresholds for items OHS6w_1 to OHS6w_12 along the latent dimension (-3 to 4)]
Very different to baseline
• Question 4 now easiest (was Q9)
• Question 3 now hardest (was Q2)
• Double the number of reversed scales (8)
Suggests that patient function has changed greatly
Remember the baseline model
[Figure: Person-Item Map at baseline, repeated for comparison]
Items at 12 months
[Figure: Person-Item Map at 12 months. Top panel: person parameter distribution. Lower panel: category thresholds for items OHS12m_1 to OHS12m_12 along the latent dimension (-3 to 3)]
Notice wide range of abilities
◦ Some patients now “recovered”
◦ Some patients still with low function
Similar to baseline model
◦ Q9 easiest
◦ Q8 most discriminatory
◦ Q2 second most difficult
Model Comparisons
[Figure: Total OHS (20 to 60) against ability (-1 to 4) for the baseline model, the model at 6 weeks and the model at 12 months]
Abilities using baseline model
[Figure: Histogram of predicted abilities at 6 weeks. x-axis: pred.6w (0 to 6); y-axis: frequency (0 to 40)]
Scale calibrated from baseline data collection allows comparison of persons
Abilities using 6 week model
[Figure: Histogram of abilities at 6 weeks. x-axis: abil.6w (-4 to 4); y-axis: frequency (0 to 50)]
Scale calibrated from 6 week data collection allows comparison of items
Problems
Because no responders used the lowest two categories at baseline, the baseline model did not have the full range of scores
◦ Q1: how would you describe the pain you usually had from your hip?
This resulted in missing values at other collection points
◦ At 6 weeks: 7 no score, 14 total missing
◦ At 12 months: 3 no score, 9 total missing
Would need “calibration” data
◦ From “healthy” population?
◦ All time points?
Rasch model excludes maximum and minimum scores in model
◦ Can calculate post-hoc
Item fit statistics
Fit statistics are not standardised across software, so it’s hard to get a clear picture
◦ Names, formulae and boundaries are different
◦ There doesn’t appear to be a standard approach
Using WINSTEPS nomenclature
◦ As the manual is available online
◦ http://www.winsteps.com/winman/index.htm
But this is still work in progress!
◦ Not clear which implementation the eRm package uses
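Common statistics
Chi squared statistics
◦ Observed v model expected
Mean square residuals (MSQ)
t-statistics
◦ Transformation of MSQ
◦ Not certain where useful cut offs are
Two versions of each type
◦ Infit (information weighted, so responses from persons whose ability is near the item difficulty count most)
◦ Outfit (unweighted over the overall sample, so more sensitive to outlying responses)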
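As a rough sketch of the two mean square statistics, the code below uses the common textbook definitions for a binary item: outfit is the average squared standardised residual, and infit is the sum of squared residuals divided by the sum of the model variances. The talk itself notes that definitions vary across software, so this should be read as one convention, not as what eRm or WINSTEPS necessarily computes. All data values are made up.

```python
import math


def item_fit_msq(responses, thetas, beta):
    """Outfit and infit mean squares for one binary item.

    responses: 0/1 answers, one per person; thetas: person abilities;
    beta: item difficulty. Returns (outfit, infit).
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, thetas):
        p = 1.0 / (1.0 + math.exp(-(theta - beta)))   # model probability of a 1
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))               # model variance of the response
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]                  # hypothetical abilities
expected_pattern = [0, 0, 1, 1, 1]                    # low abilities fail, high pass
reversed_pattern = [1, 1, 0, 0, 0]                    # the opposite, unexpected pattern

fit_expected = item_fit_msq(expected_pattern, thetas, beta=0.0)
fit_reversed = item_fit_msq(reversed_pattern, thetas, beta=0.0)
```

Values near 1 indicate responses about as noisy as the model expects; the very orderly pattern gives values below 1 and the reversed pattern gives values well above 1, with outfit reacting more strongly than infit to the surprising answers from the extreme persons.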
Sample size dependence varies by statistic
Most are defined in terms of the standard Rasch model only
Person fit statistics also available
◦ Similar approach
When removing a misfitting item, the whole model must be recalculated
◦ Which then finds new poorly fitting items, etc.
   Removed over half of all items in ME data set
May be problems due to instrument not designed for Rasch analysis
◦ Subscales a major problem
Smith et al. Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology 2008, 8:33
WAT Baseline data
Item  Chisq    df   p-value  Outfit MSQ  Infit MSQ  Outfit t  Infit t
Q1    79.564   125  0.999    0.631       0.666      -4.44     -4.52
Q2    112.69   125  0.777    0.894       0.898      -1.27     -1.3
Q3    97.892   125  0.965    0.777       0.799      -2.57     -2.38
Q4    137.05   125  0.217    1.088       1.098      1.08      1.19
Q5    127.267  125  0.427    1.01        1.04       0.15      0.51
Q6    135.979  125  0.237    1.079       1.086      1.01      1.14
Q7    95.593   125  0.977    0.759       0.77       -3.13     -3
Q8    101.437  125  0.94     0.805       0.848      -2.44     -1.97
Q9    119.625  125  0.619    0.949       0.857      -0.41     -1.45
Q10   171.166  125  0.004    1.358       1.322      2.77      3.4
Q11   107.299  125  0.872    0.852       0.85       -1.78     -1.88
Q12   130.612  125  0.348    1.037       0.901      0.3       -0.99
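One pattern in the baseline table can be checked with simple arithmetic: the Outfit MSQ column appears to equal the Chi squared statistic divided by the number of participants at baseline (126). This is an observation from the reported numbers only, not a statement about how the software defines the statistic. The sketch below copies the two columns from the table, taking the second of the two rows the slide labels Q6 to be Q7.

```python
N_PERSONS = 126  # participants at baseline, as stated for the WAT trial data

# (item, Chisq, Outfit MSQ) copied from the baseline table.
baseline_fit = [
    ("Q1", 79.564, 0.631), ("Q2", 112.69, 0.894), ("Q3", 97.892, 0.777),
    ("Q4", 137.05, 1.088), ("Q5", 127.267, 1.01), ("Q6", 135.979, 1.079),
    ("Q7", 95.593, 0.759), ("Q8", 101.437, 0.805), ("Q9", 119.625, 0.949),
    ("Q10", 171.166, 1.358), ("Q11", 107.299, 0.852), ("Q12", 130.612, 1.037),
]

# Gap between Chisq / N and the reported Outfit MSQ for every item.
gaps = {item: abs(chisq / N_PERSONS - msq) for item, chisq, msq in baseline_fit}
largest_gap = max(gaps.values())
```

Every gap is within the rounding of the published three decimal places, which suggests the mean square here is the Chi squared statistic averaged over persons.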
Item Pathway Map
Other problems to consider:
◦ Lots of variability in item parameters
◦ 95% CIs for ability thresholds overlap
Person pathway map at baseline
[Figure: Person Map. x-axis: infit t statistic (-4 to 6); y-axis: latent dimension (-2 to 4); points labelled P1 to P126]
Add in 95% CI for each person
Often have misfitting persons, but not yet looked into how to deal with this
Differential item functioning
Rasch model requires that item difficulty does not change between groups
E.g. a shoulder function questionnaire asks about the ability to brush and style hair
◦ If (on average) women spend more effort on more elaborate hairstyles, it would not be surprising to see that women with the same level of function find doing their hair more difficult
Differential item functioning (DIF) checks if this is indeed the case
Confidence plot of thresholds
[Figure: Confidence intervals for each category threshold (OHS0_1.c1 to OHS0_12.c4, covering items 1, 4, 5, 8, 10 and 12) on the latent dimension (-2 to 6), shown separately for Group A and Group B]
Overall differences using Andersen’s LR test: no difference (p = 0.645)
Plot annotation: “Maybe something here”
However: 6 questions excluded as not all thresholds used by both groups
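The visual check behind a confidence plot can be sketched as follows: build a 95% interval for each threshold in each group and flag thresholds whose intervals do not overlap. This is only a rough screen in the spirit of the plot, not Andersen’s likelihood ratio test, and the threshold names, estimates and standard errors below are entirely hypothetical.

```python
def confidence_interval(estimate, standard_error, z=1.96):
    """Approximate 95% confidence interval from an estimate and its standard error."""
    half_width = z * standard_error
    return (estimate - half_width, estimate + half_width)


def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical (estimate, standard error) pairs per threshold and group.
thresholds = {
    "item_x.c1": {"A": (0.40, 0.30), "B": (0.65, 0.35)},
    "item_x.c2": {"A": (1.20, 0.25), "B": (2.60, 0.30)},
}

flagged = [
    name
    for name, groups in thresholds.items()
    if not intervals_overlap(
        confidence_interval(*groups["A"]), confidence_interval(*groups["B"])
    )
]
```

A flagged threshold is a candidate for closer inspection, since non-overlapping intervals are neither necessary nor sufficient for a formal DIF finding.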
Summary
Rasch models
◦ Give an alternative analysis approach to ordinal and binary scales
◦ Less “bodging” of assumptions!
◦ Give information on questions as well as respondents
◦ 1 parameter case of item response theory
Rasch models could potentially be used in PROM analysis
◦ Have potential applications in validation and construction of new PROMs
Things I’m still working on
When is it a good fit?
◦ Still working on model fit statistics
Then assess person fit statistics
◦ Does it matter at all?
How do you compare different populations?
◦ Is a calibration population the best way to go?
◦ How can you find a clinically meaningful change?
How does item information affect the analysis?
◦ Is it useful?!
Thanks for listening!