1 class 6 additional psychometric characteristics: validity and bias, responsiveness and sensitivity...

1

Class 6

Additional Psychometric Characteristics: Validity and Bias, Responsiveness and Sensitivity to Change

October 25, 2007

Anita L. Stewart Institute for Health & Aging

University of California, San Francisco

2

Overview

Validity

– Including bias

– How bias affects validity

Responsiveness, sensitivity to change

– Meaningfulness of change

3

Validity

Does a measure (or instrument) measure what it is supposed to measure?

And…Does a measure NOT measure what it is NOT supposed to measure?

4

Valid Scale? No!

There is no such thing as a “valid” scale

Validity is not a property of a measure but rather of the meaning of the scores on the measure (Messick, 1995)

Similar to reliability

5

Validation of Measures is an Iterative, Lengthy Process

Accumulation of evidence

– Different samples

– Longitudinal designs

6

Three Major Forms of Measurement Validity

Content

Criterion

Construct

7

Content Validity: Synonymous with Concept Development

Extent to which a set of items represents adequately the concept as defined

More relevant at the point of developing a set of questions to measure a particular concept (see class 1)

8

Content Validity: Relevant to Selecting Measures

Part of “conceptual adequacy”

Does the “candidate” measure adequately represent the concept YOU are intending to measure?

9

Content Validity Appropriate at Two Levels

Battery or instrument: Are all relevant concepts represented in a set of health measures?

Measure: Are all aspects of a defined concept represented in the items of a scale?

10

Example of Content Validity: Battery

You are studying health-related quality of life (HRQL) of people with clinical depression

– Your concept of HRQL includes sleep problems, ability to work, and social functioning

The SF-36 is one candidate

– It is missing sleep problems

– It does reflect the other domains

11

Criterion Validity

How well a measure is correlated with another measure considered to be more accurate or an accepted standard (criterion)

– Requires a “gold standard”

For patient-reported health measures, we rarely have a gold standard

– Some exceptions

12

Criterion Validity: Predict Outcome

If goal of measure is to predict health or some other outcome

– Criterion validity is the extent to which the measure predicts the outcome

Example: Criterion measure for measure of war-related stress is subsequent PTSD (Vogt et al., 2004, readings)

13

“Predictive” Validity

Measure estimates or predicts some behavior or event

Two types:

– Concurrent: predict event at same point in time

– Predictive: predict future event

14

Criterion Validity: Test Validity of Self-report Measure Against “Objective” Data

Test validity of self-report measure of health care utilization against criterion of computer records of utilization

– # MD visits past 6 months (self-report) correlated with computer records: 0.64

– # hospitalizations past 6 months (self-report) correlated with computer records: 0.74

Ritter PL et al, J Clin Epid, 2001;54:136-141

15

Validity Coefficient in Criterion Validity

Coefficient is correlation between measure and “criterion” measure

– The higher the coefficient, the more valid
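
A minimal sketch of how such a validity coefficient could be computed, assuming a Pearson correlation. The function name `pearson_r` and the data in `self_report` and `records` are hypothetical and are not taken from any study cited in these slides.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: self-reported MD visits vs. visits found in records.
self_report = [2, 5, 1, 8, 4, 6]
records = [3, 5, 1, 7, 6, 6]

print(f"Validity coefficient: {pearson_r(self_report, records):.2f}")
```

The self-report measure plays the role of the measure being validated and the records play the role of the criterion.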

16

Criterion Validity: Develop Screening Measure

Develop depression screening tool to identify persons with high levels of general distress

– Administer extensive battery to detect disorders only to those who screen positive

Criterion validity

– Extent to which the screening tool detects (predicts) those with disorder

17

Sensitivity and Specificity: Criterion Validity?

Sensitivity: proportion of true positive cases (have disorder) correctly detected by screener

– Goal: minimize missed cases (false negatives)

– Not the same as “positive predictive value” (proportion of screen-positives who have the disorder)

Specificity: proportion of true negative cases (no disorder) correctly identified by screener

– Goal: minimize false positives (cases without disorder who screen positive)

18

Sensitivity and Specificity: Depend on Screener Cutpoint and “Criterion”

Screener cutpoint

– Lower screener cutpoint (requires fewer depressive symptoms) increases sensitivity, reduces specificity

– Higher cutpoint (requires more symptoms) increases specificity, reduces sensitivity

Criterion

– Stringent DIS definition (e.g., lifetime depression)

– Standard DIS definition (e.g., recent depression)
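
The cutpoint trade-off can be illustrated with a small sketch. The symptom counts, the diagnoses, and the function name `screen_stats` are all hypothetical; the sketch assumes a score at or above the cutpoint counts as a positive screen.

```python
def screen_stats(scores, has_disorder, cutpoint):
    """Return (sensitivity, specificity) when score >= cutpoint is screen-positive."""
    pairs = list(zip(scores, has_disorder))
    true_pos = sum(1 for s, d in pairs if d and s >= cutpoint)
    false_neg = sum(1 for s, d in pairs if d and s < cutpoint)
    true_neg = sum(1 for s, d in pairs if not d and s < cutpoint)
    false_pos = sum(1 for s, d in pairs if not d and s >= cutpoint)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


# Hypothetical symptom counts and criterion diagnoses, for illustration only.
scores = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15]
has_disorder = [False, False, False, True, False, False, True, False, True, True]

for cutpoint in (6, 10, 13):
    sens, spec = screen_stats(scores, has_disorder, cutpoint)
    print(f"cutpoint {cutpoint}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up data, raising the cutpoint from 6 to 13 lowers sensitivity from 1.00 to 0.50 and raises specificity from 0.50 to 1.00.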

19

Sensitivity

                  DIS Depression
                  +                     -                   Total
Dep screener +    a                     b                   a + b
Dep screener -    c                     d                   c + d
Total             a + c = true cases    b + d = no dep

Sensitivity: a / (a + c), or proportion of true cases detected by screener

20

Specificity

                  DIS Depression
                  +                     -                   Total
Dep screener +    a                     b                   a + b
Dep screener -    c                     d                   c + d
Total             a + c = true cases    b + d = no dep

Specificity: d / (b + d), or proportion of “truly” non-depressed cases correctly identified as negative by screener

21

Sensitivity and Specificity Data

                   Major depression
CES-D score        Yes     No      Total
Depressed (>16)    14      79      93
Not depressed      2       213     215
Total              16      292

Breslau N, Psychiatry Research, 1985; 15: 219-229.

22

Sensitivity Data

                   Major depression
CES-D score        Yes     No      Total
Depressed (>16)    14      79      93
Total              16      292

Sensitivity: 14 / 16 = 88%

23

Specificity Data

                   Major depression
CES-D score        Yes     No      Total
Depressed (>16)    14      79      93
Not depressed      2       213     215
Total              16      292

Sensitivity: 14 / 16 = 88%

Specificity: 213 / 292 = 73%
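
The two calculations can be reproduced with a short sketch using the cell counts from the CES-D table (Breslau, 1985). The function names and the a, b, c, d variables follow the 2 x 2 layout shown earlier; they are illustrative, not part of the cited study.

```python
def sensitivity(a: int, c: int) -> float:
    """Proportion of true cases (a + c) that the screener flags as positive (a)."""
    return a / (a + c)


def specificity(b: int, d: int) -> float:
    """Proportion of non-cases (b + d) that the screener flags as negative (d)."""
    return d / (b + d)


# Cell counts from the CES-D table.
a, b = 14, 79   # screener positive: with / without major depression
c, d = 2, 213   # screener negative: with / without major depression

print(f"Sensitivity: {sensitivity(a, c):.3f}")
print(f"Specificity: {specificity(b, d):.3f}")
```

Rounded to whole percentages, the results match the 88% and 73% reported on the slide.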

24

Construct Validity

The major form of validation for measures of health and its determinants

25

Construct Validity Thoughts: Lee Sechrest

There is no point at which construct validity is established

Construct validity can only be established incrementally

– Our attempts to measure constructs help us better understand and revise these constructs

Sechrest L, Health Serv Res, 2005;40(5 part II), 1596

26

Construct Validity

“An impression of construct validity emerges from examining a variety of empirical results that together make a compelling case for the assertion of construct validity” (Sechrest, 2005)

27

Construct Validity: The Basic Question

Does measure relate to other measures in hypothesized ways?

– Do measures “behave as expected”?

28

Construct Validity Basics

A three-step process:

What is the hypothesis?

– State hypotheses regarding direction and magnitude of relationship

Test the hypothesis with data

Do the results support (confirm) the hypothesis?

29

Source of Hypotheses in Construct Validity

Prior literature in which specific associations between constructs have been observed

– e.g., other samples, with other measures of constructs you are testing

Theory that specifies how constructs should be related

30

Who Tests for Validity?

When measure is being developed, researchers should test construct validity with carefully chosen hypotheses

As more people use a measure, you can collect evidence of construct validity from results of various studies

– You determine if a particular pattern should exist, interpret results in light of your hypothesis

31

Validity Coefficient in Construct Validity

Coefficient usually is a correlation between one measure and “validity” measures

Validity criteria: no consensus

– McHorney: > .70 is strong, .30 to < .70 is substantial, < .30 is weak

– Nunnally: .30 to .40 is moderate
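
As an illustration only, the McHorney cutoffs could be applied as follows. The function name `mchorney_label` is hypothetical, the label is applied to the absolute value of the correlation, and treating exactly .70 as “strong” is an assumption, since the slide leaves that boundary unspecified.

```python
def mchorney_label(r: float) -> str:
    """Label the size of a validity coefficient using the McHorney cutoffs.

    Assumption: a coefficient of exactly .70 is treated as "strong".
    """
    size = abs(r)
    if size >= 0.70:
        return "strong"
    if size >= 0.30:
        return "substantial"
    return "weak"


for r in (0.74, 0.57, -0.33, 0.25):
    print(f"r = {r:+.2f}: {mchorney_label(r)}")
```

Because there is no consensus on these criteria, such labels are a convenience for reporting, not a test of validity.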

32

Four Basic Types of Construct Validity

Convergent validity

Discriminant validity

Convergent/discriminant validity

Factorial validity

33

Four Basic Types of Construct Validity

Convergent validity

Discriminant validity

Convergent/discriminant validity

ALL of these can be

– Concurrent

– Predictive

34

Convergent Validity: Definitions

Strength of association between two methods of measuring the same construct

– Two methods of measuring a construct should be correlated (measures should “converge” on the construct)

The extent to which two or more measures that purport to be measuring the same concept agree

35

Convergent Validity of Measures

Hypotheses are stated as expected direction and magnitude of correlations

“We expect X measure of depression to be positively and moderately correlated with two measures of psychosocial problems”

– The higher the depression, the higher the level of problems on both measures

36

Interpersonal Processes of Care (IPC) Measures and Satisfaction with Care

Hypothesis: IPC measures will be positively associated with satisfaction with care

Satisfaction

– 7-item scale focused on satisfaction with interpersonal aspects of care

37

Results: 6 of 7 Scales Independently Associated with Satisfaction

                                         B
Decided together (+)                     .10 (p<.001)
Compassionate, respectful (+)            .16 (p<.001)

Interactions with race/ethnicity:
Lack of clarity                          p < .01
Elicited concerns, responded             p < .01
Explained results                        p < .05
Disrespectful office staff               p < .05
Discriminated due to race/ethnicity      ns

R square                                 .56

38

Testing Validity of Expectations Regarding Aging Measure

Hypothesis 1: ERA-38 would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age

Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity

Sarkisian CA et al. Gerontologist. 2002;42:534

39

ERA-38 Validity Results

                       ERA-38     ERA Functional Independence
ADL                    .19**      .20***
PCS-12                 .27**      .32***
MCS-12                 .35**      .30**
Comorbidity            -.09*      ns
Depressive symptoms    -.33**     -.28**
Age                    -.24**     -.14**

40

ERA-38 Validity Results: Hypothesis 1


ADL .19** .20***

PCS-12 .27** .32***

MCS-12 .35** .30**



Age - .24** - .14**

44

Known Groups Validity: A Type of Discriminant Validity

Does the measure distinguish between groups known to differ in the concept being measured?

Known groups validity is testing for mean differences between groups “known” to differ on the construct being measured

45

Evidence of Known-Groups Validity: Hypothesis Supported

Group                          Average Score* on Health Perceptions
General population             86
Patients visiting doctors      78
Patients with heart disease    65

All pairwise comparisons statistically significant, p < .01

*Scores range from 0-100, higher is better

46

Another Example of a Known Groups Validity Hypothesis

Among three groups:

– General population

– Patients visiting providers

– Patients in a public health clinic

Hypothesis: Physical functioning will be the best in a general population and the worst in patients in a public health clinic

47

Mean Scores on MOS 20-item Short Form in Three Groups

                      General       MOS         Public Health
                      Population    Patients    Patients
Physical function     91            78          50
Role function         88            78          39
Pain                  --            69          42
Mental health         78            73          59
Health perceptions    74            63          41

Bindman AB et al., Medical Care 1990;28:1142

48

Also Convergent Validity of Items

Part of scale construction methods

– “Item convergence”

An item will correlate substantially with the scale it is hypothesized to represent

– An item will correlate at least 0.30 with the sum of the other items in a scale

» item-scale correlation corrected for overlap
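
A minimal sketch of an item-scale correlation corrected for overlap: the item is correlated with the sum of the other items, so the item does not contribute to both sides of the correlation. The function names and the `responses` data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_scale_r(responses, item):
    """Correlate one item with the sum of the remaining items in the scale.

    responses: one row per respondent, each row a list of item scores.
    item: zero-based index of the item being examined.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson_r(item_scores, rest_scores)


# Hypothetical responses: 5 respondents answering a 4-item scale.
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
]

for i in range(4):
    r = corrected_item_scale_r(responses, i)
    print(f"item {i + 1}: corrected item-scale r = {r:.2f}, meets 0.30: {r >= 0.30}")
```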

49

Types of Construct Validity

Convergent validity

Discriminant or divergent validity

Convergent/discriminant validity

Factorial validity

50

Convergent/Discriminant Validity: A Type of Construct Validity

Does measure correlate lower with measures it is not expected to be related to … than to measures it is expected to be related to?

The extent to which the pattern of correlations conforms to hypothesis is confirmation of construct validity

51

Basis for Convergent/Discriminant Hypotheses

All measures of health will correlate to some extent

Hypothesis is in terms of relative magnitude

52

Example of Convergent/Discriminant Validity Hypothesis

Specify expected pattern of relationships

A measure of physical functioning is “hypothesized” to be more highly related to a measure of mobility than to a measure of depression

53

Example of Convergent/Discriminant Validity Evidence

Pearson correlation:

Mobility Depression

Physical functioning .57 .25

54

Convergent/ Discriminant Validity of Items

Item correlates higher with the scale it is intended to measure (convergent) than with another scale (discriminant)

– Correlation of the item with the sum of the other items in its own scale is at least 2 standard errors higher than its correlation with other scales
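
The 2-standard-error comparison can be sketched as below. This assumes the standard error of a correlation is approximated by 1 / sqrt(n), a common large-sample approximation; the slide does not say which standard error is used, and the function name `item_discriminates` is hypothetical.

```python
from math import sqrt


def item_discriminates(r_own: float, r_other: float, n: int) -> bool:
    """True if the item's correlation with its own scale exceeds its
    correlation with another scale by at least 2 standard errors.

    Assumption: standard error of a correlation is roughly 1 / sqrt(n).
    """
    standard_error = 1 / sqrt(n)
    return (r_own - r_other) >= 2 * standard_error


# Hypothetical item: r = .57 with its own scale, r = .25 with another scale.
print(item_discriminates(0.57, 0.25, n=100))
print(item_discriminates(0.40, 0.30, n=100))
```

With n = 100 the assumed standard error is 0.10, so a difference of 0.20 or more is needed to count as support.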

55

Types of Construct Validity

Convergent validity

Discriminant or divergent validity

Convergent/discriminant validity

Factorial validity

56

Factorial Validity

Conduct a factor analysis on items

– Do you obtain same number of factors as in original study?

– Do you observe any new factors?

Useful when taking a measure into a new sample

57

Example of Factorial Validity

Factorial validity of the CES-D (Center for Epidemiologic Studies Depression Scale)

Probability sample of households, 2 communities

– 4 factors for 20-item CES-D scale in Caucasian adults

Radloff LS, Applied Psychol Measurement, 1977;1:385-401.

58

Example of Factorial Validity Test

Replicated same factor structure in various subgroups

»Blacks

»3 age groups

»3 educational groups

Radloff LS, Applied Psychol Measurement, 1977;1:385-401.

59

Example of Factorial Validity Test

Hertzog et al.

– Confirmed same four factors in two cross-sectional samples

» Community dwelling persons age 20-80

» Canadian adults age 55-78

Found high inter-factor correlations

Support for one “depression” score

Hertzog C et al., J Consulting Clin Psychol, 1990;2:64-72.

60

Construct Validity Thoughts: Lee Sechrest

There is no point at which construct validity is established

Construct validity can only be established incrementally

– Our attempts to measure constructs help us better understand and revise these constructs

Sechrest L, Health Serv Res, 2005;40(5 part II), 1596

61

“Validity of Measures is No Simple Matter” (Sechrest, 2005)

“Construct validity cannot be assumed simply because a measure correlates with some other measure or in a factor analysis it seems to have an appropriate factor structure”

These are “prerequisites”

62

“Validity of Measures is No Simple Matter” (cont)

Because of the wide range of constructs in the social sciences, many of which cannot be exactly defined…

– …once measures are developed and in use, we must continue efforts to understand them and their relationships to other measured variables.

63

Overview

Validity

– Including bias

Responsiveness, sensitivity to change


64

Revisiting Bias: How Bias Affects Validity

Bias affects “mean level” of true score

– Not “variation” of true score

65

Components of an Individual’s Observed Item Score (from Class 3)

Observed item score = true score + error (random + systematic)

66

Random versus Systematic Error

Observed item score = true score + error (random + systematic)

Random error: relevant to reliability

Systematic error: relevant to validity

67

Bias – systematic error

Affects validity of scores

– If scores contain systematic error, cannot know the “true” mean score

– Will obtain an observed score that is either systematically higher or lower than the “true” score
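
A small simulation can illustrate the distinction, under stated assumptions: true scores are drawn from a normal distribution, random error has mean zero, and bias is a constant added to every observed score. All numbers and the function name `simulate_means` are hypothetical.

```python
import random
from statistics import mean


def simulate_means(n=10_000, bias=4.0, seed=0):
    """Return the mean true score, the mean observed score with random error
    only, and the mean observed score with random error plus constant bias."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    random_only = [t + rng.gauss(0, 5) for t in true_scores]
    with_bias = [t + rng.gauss(0, 5) + bias for t in true_scores]
    return mean(true_scores), mean(random_only), mean(with_bias)


true_mean, random_mean, biased_mean = simulate_means()
print(f"true mean:               {true_mean:.1f}")
print(f"with random error only:  {random_mean:.1f}")
print(f"with random error + bias: {biased_mean:.1f}")
```

Random error leaves the mean of the observed scores close to the true mean, while the systematic component shifts it by roughly the size of the bias.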

68

Sources of Bias in Observed Scores of Individuals

Respondent

– Socially desirable responding

– Acquiescent response bias

– Cultural beliefs (e.g., not reporting distress)

– Endpoint aversion

– Halo effects

Observer

– Belief that respondent is ill

Instrument

69

Socially Desirable Responding

Tendency to respond in socially desirable ways

– to present oneself favorably

Affects observed score

– observed score is consistently lower or higher than true score in the direction of a more socially acceptable score

70

Socially Desirable Response Set – Looking “good”

After coming up with an answer to a question, respondent “screens” the answer

– Will this make the person like me less?

May “edit” their answer to be more desirable

Example: a woman has 2 drinks of alcohol a day, but responds that she drinks a few times a week

– Systematic underreporting of “risk” behavior

71

Ways to Minimize Socially Desirable Responding

Can write items to increase “acceptability” of an “undesirable” response

– “Have you followed your doctor’s recommendations?”

– “Have you had any problems following your doctor’s recommendations?”

Can write instructions to encourage candid reporting

72

Example of Bias Due to Cultural Norms or Beliefs

A person feels sad “most of the time”

Unwilling to admit this to the interviewer so answers “a little of the time”

– Not culturally appropriate to admit to negative feelings

– Always presents a positive personality

Observed response reflects less sadness than “true” sadness of respondent

73

Acquiescent Response Set - Yea Saying

Tendency to

– agree with statements regardless of content

– give “positive” response such as yes, true, satisfied

Extent and nature of bias depends on direction of wording of the questions

Minimizing acquiescence:

– Include positively- and negatively-worded items in the same scale

74

Endpoint Aversion

Some people are less likely to use extreme ends of any scale

– believe that nothing is so absolute

Reduces range of observed scores

75

Minimizing Endpoint Aversion

Label endpoints in “non-absolute” terms

– “usually” or “almost always” instead of “always”

Add “throwaway” endpoints

– If a 5-point scale is desired, create a 7-point scale with extreme endpoints

76

Halo Effects

Judgments on a specific question or statement about a person (e.g., physician) are based on attitudes about person as a whole – not discriminating each separate item

Especially in patient satisfaction measures

– Natural tendency to not complain

– People report high satisfaction with most items regardless of true experiences

77

Observer Sources of Bias

Different “proxies” provide consistent sources of bias

In reporting on a patient’s well-being

– Patients report highest levels

– Clinicians report levels in the middle

– Family members report the lowest levels

78

Multiple Sources of Information: Bias or Difference in Perspective?

In reporting on a patient’s well-being

– Patients report highest levels

– Clinicians report levels in the middle

– Family members report the lowest levels

No way to know which is the “true” score

– to say one score is “biased” implies another one is the “true score”

– in this situation, scores are simply different

79

Overview

Validity

– Including bias

Responsiveness, sensitivity to change


80

Two “Meanings” of Sensitivity and Responsiveness to Change

Measure is able to detect true changes

One can know how much change in the measure is meaningful

– regardless of statistical significance

– change scores are interpretable in terms of meaningfulness

– clinical importance

81

Sensitivity to Change

Sensitive to true differences or changes in the attribute being measured

Sensitive enough to measure differences in outcomes that might be expected given the relative effectiveness of treatments

– Ability of a measure to detect change statistically

82

Importance of Sensitivity

Need to know measure can detect change if planning to use it as outcome of intervention

Approaches for testing sensitivity are often simultaneous tests of

– effectiveness of an intervention

– sensitivity or responsiveness of measures

Thus, may look for changes in a self-reported measure that parallel changes in clinical measures in a randomized trial

83

Sensitivity to True Change as a Function of Number of Scale Levels

No Pain 0 1 2 3 4 5 6 7 8 9 10 A lot of Pain

84



0 = No pain, 1 = Some pain, 2 = A lot of pain

85




[Figure: pain ratings at Time 1 and Time 2]

86




[Figure: pain ratings at Time 1 and Time 2]

On this 3-level scale, the reduction occurred within the “some pain” response choice
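
The point about scale levels can be sketched in code. The mapping from the 0-10 scale to the 3-level scale (1-7 = some pain, 8-10 = a lot of pain) and the Time 1 and Time 2 ratings are hypothetical, chosen only to show a true change that falls within a single response choice.

```python
def to_three_level(pain: int) -> int:
    """Collapse a 0-10 pain rating to 0 = no pain, 1 = some pain, 2 = a lot of pain.

    The boundaries used here (1-7 = some pain, 8-10 = a lot of pain) are hypothetical.
    """
    if not 0 <= pain <= 10:
        raise ValueError("pain rating must be between 0 and 10")
    if pain == 0:
        return 0
    if pain <= 7:
        return 1
    return 2


# Hypothetical patient whose pain drops from 6 to 4 on the 0-10 scale.
time1, time2 = 6, 4
change_11_level = time2 - time1
change_3_level = to_three_level(time2) - to_three_level(time1)

print(f"change on 0-10 scale: {change_11_level}")
print(f"change on 3-level scale: {change_3_level}")
```

The 0-10 scale registers a 2-point reduction, while the 3-level scale registers no change at all.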

87

Measuring Sensitivity

Measure must be stable in those who are not changing

Measure must change in those who are actually changing (true change)

Not easy to quantify

– can administer multiple measures of same concept in intervention

» see which measures change the most

88

Responsiveness to Change:

Used DSM-IV criteria to classify patients with major depression at baseline into:

– Persistent depression

– Partial remission

– Full remission

Examined change scores in the PHQ-9 in relation to these “criteria”

– PHQ-9: a short screener for depression

Löwe B et al. Med Care, 2004;42:1194-1201

89

Changes in PHQ-9 Scores by Criteria of Change in Depression

                       Mean change    Effect size
3-mo follow-up
  Persistent           -3.2           -0.6
  Partial remission    -8.7           -1.9
  Full remission       -12.4          -2.2
6-mo follow-up
  Persistent           -4.4           -0.9
  Partial remission    -8.8           -1.8
  Full remission       -13.0          -2.6

Löwe et al, 2004, p. 1200
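
One common way to express change as an effect size is the mean change divided by the standard deviation of baseline scores. The slide does not state which formula Löwe et al. used, so the sketch below is an assumption, and the scores and the function name `effect_size` are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores.

    Assumption: this standardized effect size is one common definition,
    not necessarily the one used in the cited study.
    """
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical depression scores for five patients (higher = more symptoms).
baseline = [10, 12, 14, 16, 18]
follow_up = [6, 8, 10, 12, 14]

print(f"mean change: {mean(f - b for b, f in zip(baseline, follow_up)):.1f}")
print(f"effect size: {effect_size(baseline, follow_up):.2f}")
```

A negative value here means scores fell, which for a symptom measure indicates improvement.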

90

Relevant or Meaningful Change

“Is the observed change important?”

To clinician:

– meaningful to clinician

– change might influence patient management

To patient:

– patient notices change

– amount of change matters

91

Significance and Importance

Statistical significance is not sufficient for clinical importance

– Depends on sample size

– Can obtain statistical significance of a very small change
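
The dependence on sample size can be illustrated with the test statistic for a mean change, t = mean change / (SD of change / sqrt(n)). The 1-point change, the SD of 10, and the function name `paired_t` are hypothetical.

```python
from math import sqrt


def paired_t(mean_change: float, sd_change: float, n: int) -> float:
    """t statistic for testing whether a mean change differs from zero."""
    return mean_change / (sd_change / sqrt(n))


# Same small change (1 point, SD 10) at three hypothetical sample sizes.
for n in (25, 400, 10_000):
    print(f"n = {n}: t = {paired_t(1.0, 10.0, n):.1f}")
```

The change itself is identical in every row; only the sample size differs, yet the statistic grows from 0.5 to 10.0, moving from clearly non-significant to highly significant.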

92

“Minimal Important Difference” (MID)

MID = the minimal difference that is clinically important

Smallest difference considered to be worthwhile or important

– Context specific

93

Anchor-Based Approaches to MID

Anchor: external information on amount of change

Identify group that has changed by a “minimal” amount

– Clinical change

– Patient reported change

MID is the change in the health measure for this group
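
A minimal sketch of the anchor-based calculation, assuming the anchor is a patient-reported change rating and that “a little better” identifies the minimally changed group. The ratings, change scores, and the function name `anchor_based_mid` are hypothetical.

```python
from statistics import mean


def anchor_based_mid(records, minimal_label="a little better"):
    """Mean change in the health measure among respondents whose anchor
    rating indicates a minimal amount of change.

    records: list of (anchor_rating, change_in_health_measure) pairs.
    """
    changes = [change for rating, change in records if rating == minimal_label]
    if not changes:
        raise ValueError("no respondents in the minimal-change group")
    return mean(changes)


# Hypothetical anchor ratings paired with change scores on a 0-100 measure.
records = [
    ("no change", 0),
    ("a little better", 4),
    ("a little better", 6),
    ("much better", 15),
    ("no change", 1),
]

print(f"Anchor-based MID estimate: {anchor_based_mid(records)}")
```

The estimate depends entirely on which anchor category is treated as “minimal”, which is one reason the MID is context specific.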

94

Other Approaches to Assess Meaning of Change (Relative to a Measured Change)

Patient “noticed” change

– Since …., how would you rate the amount of change in your physical functioning?

» 7-point scale: very much better ……. very much worse

Patient “satisfied” with change

– How satisfied are you with the amount of change in physical functioning?

» 7-point scale: extremely satisfied … not at all satisfied

95

Other Measures of Perceived “Success”

Study of patients with hip or knee replacement

How successful was your (hip, knee) replacement in..

– allowing you to return to your normal daily activities?

– relieving your pain?

Response choices: extremely, very, moderately, slightly, not at all successful

Bayley KB et al. Medical Care 1995;33:AS226

96

Next Class (Class 7)

Measurement issues in health disparities research and research with diverse populations

97

Readings for Next Week: Teresi and Holmes, 2001

Methodological issues in cross-cultural research

Jeanne Teresi is a leading methodologist in item response theory

– Ways of examining bias in items

An introductory article in a special issue of the Journal of Mental Health & Aging on measurement issues in diverse populations

98

Readings for Next Week: Stewart and Nápoles-Springer 2003

Summarizes discussions of a small conference convened by six RCMAR centers in 2001

Reviews measurement issues for research in diverse race/ethnic and language groups

Suggestions for improving measures of socioeconomic status, discrimination, acculturation/enculturation, and quality of care

– 4 key determinants of health disparities

99

Homework

Complete rows 24-29 in matrix for your two measures

– Validity

– Responsiveness and sensitivity to change
