Patient reported outcome measures and the Rasch model

Helen Parsons

Outline

Patient reported outcome measures
◦ Quick overview
◦ Analysis problems

Rasch models
◦ Simple Rasch formulation
◦ Rasch extensions: polytomous data

Application of the Rasch model
◦ Using the Oxford Hip Score
◦ Model fit criteria
◦ DIF checking

Summary

Outcome measures

Outcome measures are widespread, with patient reported outcome measures (PROMs) increasingly used
Try to capture some latent trait of the respondent
◦ i.e. some trait that is difficult to measure directly, like “quality of life” or “anxiety”
Often in a self-report questionnaire format
◦ EQ-5D
Some outcome measures are reported by clinicians
◦ HoNOS
Sometimes incorporate clinical findings as well as questionnaire data
◦ DAS 28

Outcome measures have a variety of uses
◦ One-off assessment as a diagnostic tool
◦ Comparative assessment, such as measuring the outcome before and after an intervention
◦ Longitudinal analysis
◦ The NHS records and publishes [1] the aggregated results from 4 PROMs as part of the quality assurance process

1: http://www.ic.nhs.uk/proms

Analysis of outcome measures

PROMs tend to be in a questionnaire format, and are often summarised as a “total score”
◦ i.e. a sum of ordinal scores
Often not “nice” distributions
◦ Not normal
◦ Bi-modal
◦ Floor and ceiling effects
Analysis usually assumes linear relationships
◦ That is, moving from 4/10 to 5/10 is the same clinical gain as moving from 9/10 to 10/10

[Figure: Histogram of total score. x-axis: total, 20 to 50; y-axis: Frequency, 0 to 35]

Example of PROM baseline data [2]
Here a low score denotes good function
Most patients on higher values
Tail is abruptly cut off on RHS
◦ A patient can have worse function than others but score the same

2: Data from Nick: OHS from WAT trial (ref: slide 15)

Rasch Models

Part of Item Response Theory
Introduced by Georg Rasch (1901 - 1980)
◦ Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Used in psychometrics, so was created to describe a participant’s ability measured by item difficulties
◦ Ability: the ‘latent trait’ of the participant, i.e. “maths ability” of a student
◦ Difficulty: which levels of latent trait the question can discriminate, i.e. “easy” items identify poor students whilst “hard” items show the difference between good and excellent students

Rasch formulation

Given a data matrix of (binary) scores on n persons (S1, S2, … Sn) to a fixed set of k items (I1, I2, … Ik) that measure the same latent trait, θ
Each subject Sv has a person parameter θv denoting their position on the latent trait (ability)
Each item Ii has an item parameter βi denoting its difficulty

Let:
◦ β represent the vector of item parameters
◦ θ represent the vector of person parameters
◦ X be the n x k data matrix with elements xvi equal to 0 or 1

Then:

$$P(X_{vi} = x_{vi} \mid \theta_v, \beta_i) = \frac{\exp\big(x_{vi}(\theta_v - \beta_i)\big)}{1 + \exp(\theta_v - \beta_i)}$$

Also assume:
◦ Independence of answers between persons: no group work, no cheating!
◦ A person’s answers are stochastically independent: all dependent on ability only, no person subgroups
◦ The latent trait is uni-dimensional, i.e. can be used to assess “shame” but not “anxiety and depression”

Rasch Models: Foundations, recent developments and applications. Fischer and Molenaar. Springer 1995.
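The formula on this slide can be evaluated directly. The following is a minimal Python sketch, not taken from the talk; the function name `rasch_probability` and the example values of θ and β are illustrative assumptions.

```python
import math

def rasch_probability(x, theta, beta):
    """P(X = x | theta, beta) for a binary response x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return math.exp(x * (theta - beta)) / (1.0 + math.exp(theta - beta))

# Ability equal to difficulty gives an even chance of a positive answer.
p_equal = rasch_probability(1, theta=0.5, beta=0.5)

# A more able person has a higher chance of a positive answer on the same item.
p_able = rasch_probability(1, theta=2.0, beta=0.5)

# The probabilities of the two possible answers sum to one.
total = rasch_probability(0, 2.0, 0.5) + rasch_probability(1, 2.0, 0.5)
```

Only the difference θ − β enters the formula, which is why persons and items can be placed on the same latent scale.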

[Figure: ICC for Q1. x-axis: Latent Dimension (ability), -4 to 4; y-axis: Probability to Solve, 0.0 to 1.0]

When varying ability, the item response is a logistic relationship
The probability of a positive answer is 0.5 when the person ability equals item difficulty
Given a set difficulty, larger abilities have a greater chance of affirming the item
◦ i.e. better students score more!
This plot is called an “Item Characteristic Curve” (ICC)
Notice that the latent dimension is rescaled to centre zero and measured in logistic units (logits)

Rasch advantages

A logistic model better captures a finite scale
Gives information on both persons and items
Model parameters are simple to obtain
◦ Total score is sufficient for calculating the person parameter
◦ Item score across persons is sufficient for calculating the item parameter
Extensions include
◦ Polytomous data
◦ 2 and 3 parameter IRT models: the 2nd parameter adds a “discrimination” (slope) parameter, the 3rd parameter allows “guessing”
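The sufficiency claims can be read off the joint likelihood. The following sketch uses the notation of the formulation slide and assumes the independence conditions stated there. The symbols r_v (person v's total score) and s_i (item i's score across persons) are introduced here for illustration.

$$L(\theta, \beta) = \prod_{v=1}^{n} \prod_{i=1}^{k} \frac{\exp\big(x_{vi}(\theta_v - \beta_i)\big)}{1 + \exp(\theta_v - \beta_i)} = \frac{\exp\big(\sum_{v} r_v \theta_v - \sum_{i} s_i \beta_i\big)}{\prod_{v} \prod_{i} \big(1 + \exp(\theta_v - \beta_i)\big)}, \qquad r_v = \sum_{i} x_{vi}, \quad s_i = \sum_{v} x_{vi}$$

The data enter only through the totals r_v and s_i, so two people with the same total score receive the same ability estimate whichever items they affirmed.

For comparison, the 2 and 3 parameter models are usually written as below, with a_i a discrimination (slope) parameter and c_i a guessing parameter. Both symbols are assumptions of this sketch rather than notation from the talk.

$$P(X_{vi} = 1) = \frac{\exp\big(a_i(\theta_v - \beta_i)\big)}{1 + \exp\big(a_i(\theta_v - \beta_i)\big)} \quad \text{(2 parameter)}, \qquad P(X_{vi} = 1) = c_i + (1 - c_i)\,\frac{\exp\big(a_i(\theta_v - \beta_i)\big)}{1 + \exp\big(a_i(\theta_v - \beta_i)\big)} \quad \text{(3 parameter)}$$

Setting a_i = 1 and c_i = 0 recovers the Rasch model. Once a_i varies between items, the unweighted total score is no longer sufficient for the person parameter.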

Rasch extension: polytomous data

The Rasch model uses a pass/fail score
However, what happens when a respondent passes only part of the item?
◦ E.g. exam marking: questions with multiple marks available
◦ E.g. surveys: Likert format questions
Two model variants
◦ Partial Credit Models: allow a different number of thresholds, each at a separate difficulty, for each item
◦ Rating Scale Models: items all have the same number of thresholds, with identical spacing between thresholds, so items differ only in their overall location
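As an illustration of how the thresholds enter a polytomous model, here is a minimal Python sketch of the category probabilities under the usual partial credit parameterisation. The function name and threshold values are hypothetical, and this is not the eRm implementation.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities (0..m) for one item under a partial credit model."""
    # Category h has weight exp(sum over j <= h of (theta - delta_j));
    # category 0 has an empty sum, i.e. weight exp(0) = 1.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical thresholds for an item scored 0-4 (four thresholds, five categories).
thresholds = [-1.5, -0.5, 0.5, 1.5]
probs = pcm_probabilities(0.0, thresholds)

# At a threshold, the two adjacent categories are equally likely.
at_first_threshold = pcm_probabilities(-1.5, thresholds)

# With a single threshold the model reduces to the binary Rasch model.
binary = pcm_probabilities(1.0, [0.0])
```

A rating scale model can be sketched with the same function by building each item's thresholds as an item location plus one shared set of offsets, which is why its curves only shift along the latent dimension.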

Polytomous ICCs

[Figure: ICC plots for Question 1, Question 2 and Question 3 under the rating scale model and under the partial credit model]
The same data (eRm package example) was used to create each model
RSM items only shift left and right
PCM items change shape as well as shift

Software and resources

Several payware packages available
◦ WINSTEPS: http://www.winsteps.com/
◦ RUMM2020: http://www.rummlab.com.au/
Freeware becoming available
◦ Several R packages now released
◦ eRm is used throughout this talk
◦ ltm and psych also have Rasch implementations
Growing literature base
◦ But introductory books and courses are hard to find!

Example: The Oxford Hip Score (slide 15)

Assesses hip function
Designed to assess patients undergoing hip replacement surgery
◦ Patient reported measure
◦ 12 questions; for each, patients choose the statement (out of 5 possible) which best reflects their situation
◦ Here, each item is marked 0-4 and the total score summed
◦ Minimum of 0 indicates ‘perfect’ function
◦ Maximum of 48

Dawson J, Fitzpatrick R, Carr A, Murray D. Questionnaire on the perceptions of patients about total hip replacement. J Bone Joint Surg Br. 1996 Mar;78(2):185-90.

Data from the WAT trial
◦ The Warwick Arthroplasty Trial
◦ 126 participants at baseline
◦ 2 intervention groups: hip replacement v. resurfacing
Analysed using the partial credit model
◦ Where categories were not all used, the remaining categories were renumbered, starting from 0
Data available from the same cohort longitudinally

Costa ML, Achten J, Parsons N, Edlin RP, Foguet P, Prakash U, Griffin DR. A Randomised Controlled Trial of Total Hip Arthroplasty Versus Resurfacing Arthroplasty in the Treatment of Young Patients with Arthritis of the Hip Joint. BMJ 2012; 344:e2147.

Baseline data

[Figure: Person-Item Map at baseline. Top panel: person parameter distribution (distribution of abilities). Main panel: item difficulties for OHS0_1 to OHS0_12 on the latent dimension (-3 to 6), showing the category thresholds (1 to 4) and the mean difficulty of each item. Items in red indicate non-sequential categories; four items are flagged (*).]

Item parameters

Question 9 has the lowest mean item parameter
◦ Indicating best function
◦ Have you been limping when walking?
Question 2 has the highest mean item parameter
◦ Indicating worst function
◦ Have you had any trouble washing and drying yourself?
Question 8 covers the widest set of difficulties
◦ Most discriminating item
◦ After a meal (sat at a table), how painful has it been for you to stand up from a chair?
4 questions have non-sequential thresholds
◦ Why does this happen?

Non-sequential categories

[Figure: ICC plots for item OHS0_5 (non-sequential item, Question 5) and item OHS0_11 (sequential item, Question 11). x-axis: Latent Dimension, -4 to 4; y-axis: Probability to Solve, 0.0 to 1.0; one curve per category (Category 0 to Category 4), with the thresholds 0|1, 1|2, 2|3 and 3|4 marked on each plot.]
Thresholds occur where curves cross

Non-sequential categories result from
◦ Underused categories
◦ Unexpected scoring patterns
Could suggest problems with item
Fixed by
◦ Removal of item
◦ Combining categories

OHS Q5 score by total score band (number of patients)

Total score    0    1    2    3    4
Up to 10       1    0    0    0    0
11 to 20       8    7    1    0    0
21 to 30      21   16   16    2    4
31 to 40       0    8   14    7   12
41 +           0    0    1    1    7
All           30   31   32   10   23

Person parameters

Can associate scores and abilities
◦ Monotonically increasing relationship
Clear that an increase of 1 is associated with different increases of ability
◦ “Bigger” loss of function for low scorers
◦ Middle of score scale gives similar abilities

[Figure: Plot of the Person Parameters. x-axis: Person Raw Scores, 0 to 40; y-axis: Person Parameters (Theta), -2 to 6]
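The non-linear link between raw score and ability can be reproduced in a toy setting. The sketch below is a simplification: it uses the binary Rasch model with hypothetical item difficulties rather than the partial credit model fitted to the OHS, and finds by bisection the ability at which the expected score equals each raw score.

```python
import math

def expected_score(theta, difficulties):
    """Expected total score on binary Rasch items for a given ability."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def ability_for_score(raw_score, difficulties, lo=-10.0, hi=10.0):
    """Maximum likelihood ability for a non-extreme raw score (bisection)."""
    if not 0 < raw_score < len(difficulties):
        raise ValueError("extreme scores have no finite estimate")
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties for eight binary items, symmetric about zero.
difficulties = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 2.0]
abilities = [ability_for_score(r, difficulties) for r in range(1, 8)]

# Change in ability for each one-point increase in raw score.
steps = [b - a for a, b in zip(abilities, abilities[1:])]
```

In this toy example the steps are larger at the ends of the score range than in the middle, so a one-point change in raw score does not correspond to a constant change in ability. Scores of 0 and 8 are rejected because they have no finite estimate.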

Model results

[Figure: Baseline score. Panels: total score distribution and ability distribution. Annotations: centred about zero, heavy tail]

Comparison use

Have several models at different time points
◦ Could use baseline model throughout
◦ Could use new models at each time point
Have two treatment groups
◦ A and B
Four follow-up points post intervention

Baseline data: Ability by treatment group

Change in function between baseline and 6 weeks follow up

[Figure: plots by treatment group (A, B) of raw scores at baseline, raw scores at 6 weeks, calculated abilities at baseline and calculated abilities at 6 weeks]
Using intention to treat groups
No significant differences between groups

Using baseline model at 6 weeks follow up

[Figure: plots by treatment group (A, B) of scores at 6 weeks and predicted abilities at 6 weeks]
No significant differences between groups

Using baseline model at 12 months follow up

[Figure: plots of scores at 12 months, predicted abilities at 12 months, differences between scores at 12 months and baseline, and differences between abilities at 12 months (predicted) and baseline (calculated)]
No significant differences in either rating
Primary outcome of trial

Items at 6 weeks

[Figure: Person-Item Map at 6 weeks. Top panel: person parameter distribution. Main panel: item difficulties for OHS6w_1 to OHS6w_12 on the latent dimension (-3 to 4), showing category thresholds (1 to 4); eight items are flagged (*).]

Very different to baseline
• Question 4 now easiest (was Q9)
• Question 3 now hardest (was Q2)
• Double the number of reversed scales (8)
Suggests that patient function has changed greatly

Remember the baseline model

[Figure: the baseline Person-Item Map for OHS0_1 to OHS0_12, repeated for comparison (latent dimension -3 to 6; four items flagged)]

Items at 12 months

[Figure: Person-Item Map at 12 months. Top panel: person parameter distribution. Main panel: item difficulties for OHS12m_1 to OHS12m_12 on the latent dimension (-3 to 3), showing category thresholds; three items are flagged (*).]

Notice wide range of abilities
◦ Some patients now “recovered”
◦ Some patients still with low function
Similar to baseline model
◦ Q9 easiest
◦ Q8 most discriminatory
◦ Q2 second most difficult

Model Comparisons

[Figure: Total OHS (y-axis, 20 to 60) against ability (x-axis, -1 to 4) for the baseline model, the model at 6 weeks and the model at 12 months]

[Figure: two histograms. Abilities using baseline model: histogram of predicted abilities at 6 weeks (x-axis: pred.6w, 0 to 6; y-axis: Frequency). Abilities using 6 week model: histogram of abilities at 6 weeks (x-axis: abil.6w, -4 to 4; y-axis: Frequency).]

Scale calibrated from baseline data collection allows comparison of persons
Scale calibrated from 6 week data collection allows comparison of items

Problems

Because no responders used the lowest two categories at baseline, the baseline model did not have the full range of scores
◦ Q1: how would you describe the pain you usually had from your hip?
This resulted in missing values at other collection points
◦ At 6 weeks: 7 no score, 14 total missing
◦ At 12 months: 3 no score, 9 total missing
Would need “calibration” data
◦ From “healthy” population?
◦ All time points?
Rasch model excludes maximum and minimum scores in model
◦ Can calculate post-hoc

Item fit statistics

Fit statistics are not standardised across software, so it’s hard to get a clear picture
◦ Names, formulae and boundaries are different
◦ There doesn’t appear to be a standard approach
Using WINSTEPS nomenclature
◦ As the manual is available online
◦ http://www.winsteps.com/winman/index.htm
But this is still work in progress!
◦ Not clear which implementation the eRm package uses

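Common statistics

Chi squared statistics
◦ Observed v model expected
Mean square residuals (MSQ)
t-statistics
◦ Transformation of MSQ
◦ Not certain where useful cut offs are
Two versions of each type
◦ Infit: information-weighted, so most sensitive to responses from persons whose ability is close to the item difficulty
◦ Outfit: unweighted over the overall sample, so more sensitive to outlying responses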
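To make the infit and outfit distinction concrete, here is a minimal Python sketch for a single binary item. It follows the commonly quoted mean square definitions and is not necessarily what eRm computes; the abilities and response patterns are hypothetical.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Return (outfit MSQ, infit MSQ) for one binary item."""
    squared_residuals, variances, z_squared = [], [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # model expected score
        var = p * (1.0 - p)                                 # model variance
        squared_residuals.append((x - p) ** 2)
        variances.append(var)
        z_squared.append((x - p) ** 2 / var)                # squared standardised residual
    outfit = sum(z_squared) / len(z_squared)                # unweighted mean square
    infit = sum(squared_residuals) / sum(variances)         # information-weighted mean square
    return outfit, infit

# Hypothetical abilities for five persons answering an item of difficulty 0.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Responses in line with the model: able persons affirm the item.
outfit_good, infit_good = item_fit([0, 0, 1, 1, 1], abilities, 0.0)

# The least able person affirms the item and the most able does not.
outfit_bad, infit_bad = item_fit([1, 0, 1, 1, 0], abilities, 0.0)
```

Both statistics rise for the unexpected pattern, but outfit rises more, because the surprising answers come from persons far from the item difficulty, whose residuals infit down-weights.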

Sample size dependence varies by statistic
Most defined in terms of the standard Rasch model only
Person fit statistics also available
◦ Similar approach
When removing a misfitting item, the whole model must be recalculated
◦ Which then finds new poor fitting items, etc, etc
Removed over half of all items in ME data set
May be problems due to instrument not designed for Rasch analysis
◦ Subscales a major problem

Smith et al. Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology 2008, 8:33

WAT Baseline data

Item   Chisq     df    p-value   Outfit MSQ   Infit MSQ   Outfit t   Infit t
Q1      79.564   125   0.999     0.631        0.666       -4.44      -4.52
Q2     112.69    125   0.777     0.894        0.898       -1.27      -1.3
Q3      97.892   125   0.965     0.777        0.799       -2.57      -2.38
Q4     137.05    125   0.217     1.088        1.098        1.08       1.19
Q5     127.267   125   0.427     1.01         1.04         0.15       0.51
Q6     135.979   125   0.237     1.079        1.086        1.01       1.14
Q7      95.593   125   0.977     0.759        0.77        -3.13      -3
Q8     101.437   125   0.94      0.805        0.848       -2.44      -1.97
Q9     119.625   125   0.619     0.949        0.857       -0.41      -1.45
Q10    171.166   125   0.004     1.358        1.322        2.77       3.4
Q11    107.299   125   0.872     0.852        0.85        -1.78      -1.88
Q12    130.612   125   0.348     1.037        0.901        0.3       -0.99
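One regularity in the table is worth noting. In every row, the outfit MSQ matches the chi-squared statistic divided by 126, the number of participants at baseline (df + 1), to the precision shown. Whether the software defines the statistic this way is an assumption here and is not stated in the slides. A quick check, using values copied from the table:

```python
# (Chisq, Outfit MSQ) pairs copied from the WAT baseline item fit table.
rows = {
    "Q1": (79.564, 0.631), "Q2": (112.69, 0.894), "Q3": (97.892, 0.777),
    "Q4": (137.05, 1.088), "Q5": (127.267, 1.01), "Q6": (135.979, 1.079),
    "Q7": (95.593, 0.759), "Q8": (101.437, 0.805), "Q9": (119.625, 0.949),
    "Q10": (171.166, 1.358), "Q11": (107.299, 0.852), "Q12": (130.612, 1.037),
}

n_persons = 126  # participants at baseline, i.e. df + 1

# Largest absolute gap between Chisq / n_persons and the tabulated outfit MSQ.
max_gap = max(abs(chisq / n_persons - msq) for chisq, msq in rows.values())
```

On that reading the chi-squared column and the outfit MSQ column carry the same information, so they should not be counted as two independent pieces of evidence about an item.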

Item Pathway Map

Other problems to consider:
Lots of variability in item parameters
95% CI for ability thresholds overlap

Person pathway map at Baseline

[Figure: Person Map for persons P1 to P126. x-axis: Infit t statistic, -4 to 6; y-axis: Latent Dimension, -2 to 4]
Add in 95% CI for each person
Often have misfitting persons, but not looked into how to deal with this to date

Differential item functioning

Rasch model requires that item difficulty does not change between groups
E.g. a shoulder function questionnaire asks about the ability to brush and style hair
◦ If (on average) women spend more effort on more elaborate hairstyles, it would not be surprising to see that women with the same level of function find doing their hair more difficult
Differential item functioning (DIF) checks if this is indeed the case

Confidence plot of thresholds

[Figure: confidence intervals for the item thresholds, estimated separately for Group A and Group B, on the latent dimension (-2 to 6). Thresholds shown: OHS0_1 (c1, c2) and OHS0_4, OHS0_5, OHS0_8, OHS0_10 and OHS0_12 (c1 to c4 each). Annotation on plot: “Maybe something here”]
Overall differences using Andersen’s LR test: no difference (p = 0.645)
However: 6 questions excluded as not all thresholds used by both groups
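For reference, Andersen’s likelihood ratio test compares the conditional likelihood of the model fitted to everyone with the conditional likelihoods of models fitted separately within each group. A sketch for G groups follows, where L_c denotes the conditional likelihood and X_g the responses in group g. These symbols are introduced here for illustration.

$$LR = 2\left(\sum_{g=1}^{G} \log L_c\big(\hat{\beta}^{(g)}; X_g\big) - \log L_c\big(\hat{\beta}; X\big)\right)$$

Under the null hypothesis that the item parameters are equal across groups, LR is asymptotically chi-squared distributed, with degrees of freedom equal to (G − 1) times the number of free item parameters. A large p-value, as reported on this slide, gives no evidence against equal parameters. It covers only the items kept in the test.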

Summary

Rasch models
◦ Give an alternative analysis approach to ordinal and binary scales
◦ Less “bodging” of assumptions!
◦ Give information on questions as well as respondents
◦ 1 parameter case of item response theory
Rasch models could potentially be used in PROM analysis
◦ Have potential applications in validation and construction of new PROMs

Things I’m still working on

When is it a good fit?
◦ Still working on model fit statistics
Then assess person fit statistics
◦ Does it matter at all?
How do you compare different populations?
◦ Is a calibration population the best way to go?
◦ How can you find a clinically meaningful change?
How does item information affect the analysis?
◦ Is it useful?!

Thanks for listening!
