1
Class 6
Additional Psychometric Characteristics: Validity and Bias, Responsiveness and Sensitivity to Change
October 25, 2007
Anita L. Stewart
Institute for Health & Aging
University of California, San Francisco
2
Overview
Validity
– Including bias
– How bias affects validity
Responsiveness, sensitivity to change
– Meaningfulness of change
3
Validity
Does a measure (or instrument) measure what it is supposed to measure?
And…Does a measure NOT measure what it is NOT supposed to measure?
4
Valid Scale? No!
There is no such thing as a “valid” scale
Validity is not a property of a measure but rather of the meaning of the scores on the measure (Messick, 1995)
Similar to reliability
5
Validation of Measures is an Iterative, Lengthy Process
Accumulation of evidence
– Different samples
– Longitudinal designs
6
Three Major Forms of Measurement Validity
Content
Criterion
Construct
7
Content Validity: Synonymous with Concept Development
Extent to which a set of items represents adequately the concept as defined
More relevant at the point of developing a set of questions to measure a particular concept (see class 1)
8
Content Validity: Relevant to Selecting Measures
Part of “conceptual adequacy”
Does the “candidate” measure adequately represent the concept YOU are intending to measure?
9
Content Validity Appropriate at Two Levels
Battery or instrument: Are all relevant concepts represented in a set of health measures?
Measure: Are all aspects of a defined concept represented in the items of a scale?
10
Example of Content Validity: Battery
You are studying health-related quality of life (HRQL) of people with clinical depression
– Your concept of HRQL includes sleep problems, ability to work, and social functioning
The SF-36 is one candidate
– It is missing sleep problems
– It does reflect the other domains
11
Criterion Validity
How well a measure is correlated with another measure considered to be more accurate or an accepted standard (criterion)
– Requires a “gold standard”
For patient-reported health measures, we rarely have a gold standard
– Some exceptions
12
Criterion Validity: Predict Outcome
If goal of measure is to predict health or some other outcome
– Criterion validity is the extent to which the measure predicts the outcome
Example: Criterion measure for measure of war-related stress is subsequent PTSD (Vogt et al., 2004, readings)
13
“Predictive” Validity
Measure estimates or predicts some behavior or event
Two types:
– Concurrent: predict event at same point in time
– Predictive: predict future event
14
Criterion Validity: Test Validity of Self-report Measure Against “Objective” Data
Test validity of self-report measure of health care utilization against criterion of computer records of utilization
– # MD visits past 6 months (self-report) correlated with computer records 0.64
– # hospitalizations past 6 months (self-report) correlated with computer records 0.74
Ritter PL et al, J Clin Epid, 2001;54:136-141
15
Validity Coefficient in Criterion Validity
Coefficient is correlation between measure and “criterion” measure
– the higher the coefficient, the more valid
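A minimal sketch of a validity coefficient, assuming it is computed as a Pearson correlation between a self-report measure and its criterion. The function name `pearson_r` and the six data points are made up for illustration; they are not the Ritter et al. data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: self-reported vs. recorded MD visits for 6 people
self_report = [2, 4, 1, 6, 3, 5]
records = [3, 4, 1, 5, 2, 6]

validity_coefficient = pearson_r(self_report, records)
print(f"Validity coefficient: {validity_coefficient:.2f}")
```

The closer the coefficient is to 1, the more closely the self-report tracks the criterion in this sample.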
16
Criterion Validity: Develop Screening Measure
Develop depression screening tool to identify persons with high levels of general distress
– Administer extensive battery to detect disorders only to those who screen positive
Criterion validity
– Extent to which the screening tool detects (predicts) those with disorder
17
Sensitivity and Specificity: Criterion Validity?
Sensitivity: proportion of true positive cases (have disorder) correctly detected by screener
– Goal: minimize missed cases (false negatives)
– Also called the true positive rate (not the same as positive predictive value)
Specificity: proportion of true negative cases (no disorder) correctly identified by screener
– Goal: minimize false positives (cases without disorder who screen positive)
18
Sensitivity and Specificity: Depend on Screener Cutpoint and “Criterion”
Screener cutpoint
– Lower screener cutpoint (requires fewer depressive symptoms) increases sensitivity, reduces specificity
– Higher cutpoint (requires more symptoms) increases specificity, reduces sensitivity
Criterion
– Stringent DIS definition (e.g., lifetime depression)
– Standard DIS definition (e.g., recent depression)
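The cutpoint trade-off can be sketched with a small example. The screener scores, diagnoses, cutpoints, and the helper `sens_spec` below are all hypothetical, and "screen positive" is assumed to mean a score at or above the cutpoint.

```python
def sens_spec(scores, has_disorder, cutpoint):
    """Return (sensitivity, specificity), screening positive when score >= cutpoint."""
    pairs = list(zip(scores, has_disorder))
    tp = sum(1 for s, d in pairs if d and s >= cutpoint)       # true positives
    fn = sum(1 for s, d in pairs if d and s < cutpoint)        # missed cases
    tn = sum(1 for s, d in pairs if not d and s < cutpoint)    # true negatives
    fp = sum(1 for s, d in pairs if not d and s >= cutpoint)   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screener scores and criterion diagnoses (True = has disorder)
scores = [5, 9, 12, 14, 17, 18, 20, 22, 25, 30]
has_disorder = [False, False, False, True, False, True, False, True, True, True]

low = sens_spec(scores, has_disorder, cutpoint=12)    # lower cutpoint
high = sens_spec(scores, has_disorder, cutpoint=20)   # higher cutpoint
print("cutpoint 12:", low)
print("cutpoint 20:", high)
```

In this made-up sample the lower cutpoint catches every case but flags more people without the disorder; the higher cutpoint does the reverse.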
19
Sensitivity
                        DIS Depression
                        +                     -                  Total
Dep screener   +        a                     b                  a + b
Dep screener   -        c                     d                  c + d
Total                   a + c = true cases    b + d = no dep
Sensitivity: a / (a + c), or proportion of true cases detected by screener
20
Specificity
                        DIS Depression
                        +                     -                  Total
Dep screener   +        a                     b                  a + b
Dep screener   -        c                     d                  c + d
Total                   a + c = true cases    b + d = no dep
Specificity: d / (b + d), or proportion of “truly” non-depressed cases correctly identified as negative by screener
21
Sensitivity and Specificity Data
                      Major depression
CES-D score           Yes     No      Total
Depressed (>16)       14      79      93
Not depressed         2       213     215
Total                 16      292
Breslau N, Psychiatry Research, 1985; 15: 219-229.
22
Sensitivity Data
Sensitivity: 14 / 16 = 88%
23
Specificity Data
Sensitivity: 14 / 16 = 88%
Specificity: 213 / 292 = 73%
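The arithmetic on these slides can be reproduced directly from the CES-D cell counts (Breslau, 1985), using the a, b, c, d cell labels from the earlier 2 x 2 layout. This is a sketch for checking the numbers, not part of the original analysis.

```python
# Cell counts from the CES-D by major depression table (Breslau, 1985)
a = 14    # screener positive, major depression
b = 79    # screener positive, no major depression
c = 2     # screener negative, major depression
d = 213   # screener negative, no major depression

sensitivity = a / (a + c)   # 14 / 16
specificity = d / (b + d)   # 213 / 292

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```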
24
Construct Validity
The major form of validation for measures of health and its determinants
25
Construct Validity Thoughts: Lee Sechrest
There is no point at which construct validity is established
Construct validity can only be established incrementally
– Our attempts to measure constructs help us better understand and revise these constructs
Sechrest L, Health Serv Res, 2005;40(5 part II), 1596
26
Construct Validity
“An impression of construct validity emerges from examining a variety of empirical results that together make a compelling case for the assertion of construct validity” (Sechrest, 2005)
27
Construct Validity: The Basic Question
Does measure relate to other measures in hypothesized ways?
– Do measures “behave as expected”?
28
Construct Validity Basics
A three-step process:
What is the hypothesis?
– State hypotheses regarding direction and magnitude of relationship
Test the hypothesis with data
Do the results support (confirm) the hypothesis?
29
Source of Hypotheses in Construct Validity
Prior literature in which specific associations between constructs have been observed
– e.g., other samples, with other measures of constructs you are testing
Theory that specifies how constructs should be related
30
Who Tests for Validity?
When measure is being developed, researchers should test construct validity with carefully chosen hypotheses
As more people use a measure, you can collect evidence of construct validity from results of various studies
– You determine if a particular pattern should exist, interpret results in light of your hypothesis
31
Validity Coefficient in Construct Validity
Coefficient usually is a correlation between one measure and “validity” measures
Validity criteria: no consensus
– McHorney:
» > .70 is strong
» .30 to < .70 is substantial
» < .30 is weak
– Nunnally:
» .30 to .40 is moderate
32
Four Basic Types of Construct Validity
Convergent validity
Discriminant validity
Convergent/discriminant validity
Factorial validity
33
Four Basic Types of Construct Validity
Convergent validity
Discriminant validity
Convergent/discriminant validity
ALL of these can be
– Concurrent
– Predictive
34
Convergent Validity: Definitions
Strength of association between two methods of measuring the same construct
– Two methods of measuring a construct should be correlated (measures should “converge” on the construct)
The extent to which two or more measures that purport to be measuring the same concept agree
35
Convergent Validity of Measures
Hypotheses are stated as expected direction and magnitude of correlations
“We expect X measure of depression to be positively and moderately correlated with two measures of psychosocial problems”
– The higher the depression, the higher the level of problems on both measures
36
Interpersonal Processes of Care (IPC) Measures and Satisfaction with Care
Hypothesis: IPC measures will be positively associated with satisfaction with care
Satisfaction
– 7-item scale focused on satisfaction with interpersonal aspects of care
37
Results: 6 of 7 Scales Independently Associated with Satisfaction
B
Decided together (+) .10 (p<.001)
Compassionate, respectful (+) .16 (p<.001)
Interactions with race/ethnicity:
Lack of clarity p < .01
Elicited concerns, responded p < .01
Explained results p < .05
Disrespectful office staff p < .05
Discriminated due to race/ethnicity ns
R square .56
38
Testing Validity of Expectations Regarding Aging Measure
Hypothesis 1: ERA-38 would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age
Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity
Sarkisian CA et al. Gerontologist. 2002;42:534
39
ERA-38 Validity Results
                        ERA-38      ERA Functional Independence
ADL                     .19**       .20***
PCS-12                  .27**       .32***
MCS-12                  .35**       .30**
Comorbidity             -.09*       ns
Depressive symptoms     -.33**      -.28**
Age                     -.24**      -.14**
40
ERA-38 Validity Results: Hypothesis 1
42
ERA-38 Validity Results: Hypothesis 2
44
Known Groups Validity: A Type of Discriminant Validity
Does the measure distinguish between groups known to differ in the concept being measured?
Known groups validity tests for mean differences between groups “known” to differ on the construct being measured
45
Evidence of Known-Groups Validity: Hypothesis Supported
Group                            Average Score* on Health Perceptions
General population               86
Patients visiting doctors        78
Patients with heart disease      65
All pairwise comparisons statistically significant, p < .01
*Scores range from 0-100, higher is better
46
Another Example of a Known Groups Validity Hypothesis
Among three groups:
– General population
– Patients visiting providers
– Patients in a public health clinic
Hypothesis: Physical functioning will be the best in a general population and the worst in patients in a public health clinic
47
Mean Scores on MOS 20-item Short Form in Three Groups
                        General        MOS          Public Health
                        Population     Patients     Patients
Physical function       91             78           50
Role function           88             78           39
Pain                    --             69           42
Mental health           78             73           59
Health perceptions      74             63           41
Bindman AB et al., Medical Care 1990;28:1142
48
Also Convergent Validity of Items
Part of scale construction methods
– “Item convergence”
An item will correlate substantially with the scale it is hypothesized to represent
– An item will correlate at least 0.30 with the sum of the other items in a scale
» item-scale correlation corrected for overlap
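A minimal sketch of an item-scale correlation corrected for overlap: the item is correlated with the sum of the other items, so it does not contribute to both sides of the correlation. The response data and the helper names `pearson_r` and `corrected_item_scale_r` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def corrected_item_scale_r(responses, index):
    """Correlate one item with the sum of the OTHER items in the scale."""
    item = [row[index] for row in responses]
    rest = [sum(row) - row[index] for row in responses]
    return pearson_r(item, rest)

# Hypothetical responses: 5 respondents x 3 items (1-5 response scale)
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]

r_item0 = corrected_item_scale_r(responses, 0)
meets_criterion = r_item0 >= 0.30   # the 0.30 threshold named on the slide
print(f"Corrected item-scale r for item 0: {r_item0:.2f}")
```

Without the correction the item would be correlated with a total that includes itself, which inflates the coefficient.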
49
Types of Construct Validity
Convergent validity
Discriminant or divergent validity
Convergent/discriminant validity
Factorial validity
50
Convergent/Discriminant Validity: A Type of Construct Validity
Does measure correlate lower with measures it is not expected to be related to … than to measures it is expected to be related to?
The extent to which the pattern of correlations conforms to hypothesis is confirmation of construct validity
51
Basis for Convergent/Discriminant Hypotheses
All measures of health will correlate to some extent
Hypothesis is in terms of relative magnitude
52
Example of Convergent/Discriminant Validity Hypothesis
Specify expected pattern of relationships
A measure of physical functioning is “hypothesized” to be more highly related to a measure of mobility than to a measure of depression
53
Example of Convergent/Discriminant Validity Evidence
Pearson correlation:
Mobility Depression
Physical functioning .57 .25
54
Convergent/ Discriminant Validity of Items
Item correlates higher with the scale it is intended to measure (convergent) than with another scale (discriminant)
– Correlation of the item with the sum of the other items in its own scale is at least 2 standard errors higher than its correlation with other scales
55
Types of Construct Validity
Convergent validity
Discriminant or divergent validity
Convergent/discriminant validity
Factorial validity
56
Factorial Validity
Conduct a factor analysis on items
– Do you obtain same number of factors as in original study?
– Do you observe any new factors?
Useful when taking a measure into a new sample
57
Example of Factorial Validity
Factorial validity of the CES-D (Center for Epidemiologic Studies Depression Scale)
Probability sample of households, 2 communities
– 4 factors for 20-item CES-D scale in Caucasian adults
Radloff LS, Applied Psychol Measurement, 1977;1:385-401.
58
Example of Factorial Validity Test
Replicated same factor structure in various subgroups
» Blacks
» 3 age groups
» 3 educational groups
Radloff LS, Applied Psychol Measurement, 1977;1:385-401.
59
Example of Factorial Validity Test
Hertzog et al.
– Confirmed same four factors in two cross-sectional samples
» Community dwelling persons age 20-80
» Canadian adults age 55-78
Found high inter-factor correlations
Support for one “depression” score
Hertzog C et al., J Consulting Clin Psychol, 1990;2:64-72.
60
Construct Validity Thoughts: Lee Sechrest
There is no point at which construct validity is established
Construct validity can only be established incrementally
– Our attempts to measure constructs help us better understand and revise these constructs
Sechrest L, Health Serv Res, 2005;40(5 part II), 1596
61
“Validity of Measures is No Simple Matter” (Sechrest, 2005)
“Construct validity cannot be assumed simply because a measure correlates with some other measure or in a factor analysis it seems to have an appropriate factor structure”
These are “prerequisites”
62
“Validity of Measures is No Simple Matter” (cont)
Because of the wide range of constructs in the social sciences, many of which cannot be exactly defined…
– …once measures are developed and in use, we must continue efforts to understand them and their relationships to other measured variables.
63
Overview
Validity
– Including bias
Responsiveness, sensitivity to change
– Meaningfulness of change
64
Revisiting Bias: How Bias Affects Validity
Bias affects “mean level” of true score
– Not “variation” of true score
65
Components of an Individual’s Observed Item Score (from Class 3)
Observed item score = true score + error
Error = random error + systematic error
66
Random versus Systematic Error
Observed item score = true score + error
Error = random error + systematic error
– Random error: relevant to reliability
– Systematic error: relevant to validity
67
Bias – systematic error
Affects validity of scores
– If scores contain systematic error, cannot know the “true” mean score
– Will obtain an observed score that is either systematically higher or lower than the “true” score
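A small simulation can illustrate the point that bias shifts the mean level of scores without changing their variation. It assumes, for simplicity, that the systematic error is the same constant for every respondent; all distributions, the sample size, and the bias value are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical true scores and random error for 1,000 respondents
true_scores = [random.gauss(50, 10) for _ in range(1000)]
random_error = [random.gauss(0, 5) for _ in range(1000)]
bias = -4.0  # assumed constant systematic error, e.g. consistent under-reporting

# Observed score = true score + random error (+ systematic error)
unbiased = [t + e for t, e in zip(true_scores, random_error)]
biased = [t + e + bias for t, e in zip(true_scores, random_error)]

mean_shift = statistics.mean(biased) - statistics.mean(unbiased)
sd_unbiased = statistics.stdev(unbiased)
sd_biased = statistics.stdev(biased)

print(f"Shift in mean: {mean_shift:.2f}")
print(f"SD without bias: {sd_unbiased:.2f}, SD with bias: {sd_biased:.2f}")
```

The observed mean moves by the amount of the bias while the spread of scores stays the same, so the bias cannot be seen from the observed scores alone.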
68
Sources of Bias in Observed Scores of Individuals
Respondent
– Socially desirable responding
– Acquiescent response bias
– Cultural beliefs (e.g., not reporting distress)
– Endpoint aversion
– Halo effects
Observer
– Belief that respondent is ill
Instrument
69
Socially Desirable Responding
Tendency to respond in socially desirable ways
– to present oneself favorably
Affects observed score
– observed score is consistently lower or higher than true score in the direction of a more socially acceptable score
70
Socially Desirable Response Set – Looking “good”
After coming up with an answer to a question, respondent “screens” the answer
– Will this make the person like me less?
May “edit” their answer to be more desirable
Example: a woman has 2 drinks of alcohol a day, but responds that she drinks a few times a week
– Systematic underreporting of “risk” behavior
71
Ways to Minimize Socially Desirable Responding
Can write items to increase “acceptability” of an “undesirable” response
– “Have you followed your doctor’s recommendations?”
– “Have you had any problems following your doctor’s recommendations?”
Can write instructions to encourage candid reporting
72
Example of Bias Due to Cultural Norms or Beliefs
A person feels sad “most of the time”
Unwilling to admit this to the interviewer so answers “a little of the time”
– Not culturally appropriate to admit to negative feelings
– Always presents a positive personality
Observed response reflects less sadness than “true” sadness of respondent
73
Acquiescent Response Set - Yea Saying
Tendency to
– agree with statements regardless of content
– give “positive” response such as yes, true, satisfied
Extent and nature of bias depends on direction of wording of the questions
Minimizing acquiescence:
– Include positively- and negatively-worded items in the same scale
74
Endpoint Aversion
Some people less likely to use extreme ends of any scale
– believe that nothing is so absolute
Reduces range of observed scores
75
Minimizing Endpoint Aversion
Label endpoints in “non-absolute” terms
– “usually” or “almost always” instead of “always”
Add “throwaway” endpoints
– If a 5-point scale is desired, create a 7-point scale with extreme endpoints
76
Halo Effects
Judgments on a specific question or statement about a person (e.g., physician) are based on attitudes about person as a whole
– not discriminating each separate item
Especially in patient satisfaction measures
– Natural tendency to not complain
– People report high satisfaction with most items regardless of true experiences
77
Observer Sources of Bias
Different “proxies” provide consistent sources of bias
In reporting on a patient’s well-being
– Patients report highest levels
– Clinicians report levels in the middle
– Family members report the lowest levels
78
Multiple Sources of Information: Bias or Difference in Perspective?
In reporting on a patient’s well-being
– Patients report highest levels
– Clinicians report levels in the middle
– Family members report the lowest levels
No way to know which is the “true” score
– to say one score is “biased” implies another one is the “true score”
– in this situation, scores are simply different
79
Overview
Validity
– Including bias
Responsiveness, sensitivity to change
– Meaningfulness of change
80
Two “Meanings” of Sensitivity and Responsiveness to Change
Measure is able to detect true changes
One can know how much change in the measure is meaningful
– regardless of statistical significance
– change scores are interpretable in terms of meaningfulness
– clinical importance
81
Sensitivity to Change
Sensitive to true differences or changes in the attribute being measured
Sensitive enough to measure differences in outcomes that might be expected given the relative effectiveness of treatments
– Ability of a measure to detect change statistically
82
Importance of Sensitivity
Need to know measure can detect change if planning to use it as outcome of intervention
Approaches for testing sensitivity are often simultaneous tests of
– effectiveness of an intervention
– sensitivity or responsiveness of measures
Thus, may look for changes in a self-reported measure that parallel changes in clinical measures in a randomized trial
86
Sensitivity to True Change as a Function of Number of Scale Levels
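11-level scale: 0 (No pain) 1 2 3 4 5 6 7 8 9 10 (A lot of pain)
3-level scale: 0 (No pain), 1 (Some pain), 2 (A lot of pain)
[Figure: Time 1 and Time 2 scores marked on both scales, with Time 2 lower than Time 1]
On this 3-level scale, the reduction occurred within the “some pain” response choice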
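The same idea as a sketch: a change that is visible on the 0-10 scale can disappear once ratings are collapsed to three levels. The cut-offs used to collapse the scale (0, 1-7, 8-10) and the Time 1 and Time 2 ratings are hypothetical, chosen only to reproduce the situation on the slide.

```python
def to_three_levels(pain_0_to_10):
    """Collapse a 0-10 rating to 0 = no pain, 1 = some pain, 2 = a lot of pain.

    The cut-offs (0, 1-7, 8-10) are an assumption for illustration.
    """
    if pain_0_to_10 == 0:
        return 0
    if pain_0_to_10 <= 7:
        return 1
    return 2

time1, time2 = 6, 3  # hypothetical ratings before and after treatment

change_11_level = time2 - time1
change_3_level = to_three_levels(time2) - to_three_levels(time1)

print("Change on 0-10 scale:", change_11_level)
print("Change on 3-level scale:", change_3_level)
```

Both ratings fall inside "some pain", so the coarser scale records no change even though the finer scale records a 3-point reduction.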
87
Measuring Sensitivity
Measure must be stable in those who are not changing
Measure must change in those who are actually changing (true change)
Not easy to quantify
– can administer multiple measures of same concept in intervention
» see which measures change the most
88
Responsiveness to Change:
Used DSM-IV criteria to classify patients with major depression at baseline into:
– Persistent depression
– Partial remission
– Full remission
Examined change scores in the PHQ-9 in relation to these “criteria”
– PHQ-9: a short screener for depression
Löwe B et al. Med Care, 2004;42:1194-1201
89
Changes in PHQ-9 Scores by Criteria of Change in Depression
Mean change Effect size
3-mo follow-up
Persistent -3.2 -0.6
Partial remission -8.7 -1.9
Full remission -12.4 -2.2
6-mo follow-up
Persistent -4.4 -0.9
Partial remission -8.8 -1.8
Full remission -13.0 -2.6
Löwe et al, 2004, p. 1200
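The slide does not say how the effect sizes were computed. One common definition, assumed here, is the mean change divided by the standard deviation of baseline scores. The scores below are hypothetical and are not the Löwe et al. data.

```python
import statistics

def effect_size(baseline, follow_up):
    """Mean change divided by the SD of baseline scores (one common definition)."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(baseline)

# Hypothetical PHQ-9 scores for five patients
baseline = [18, 15, 20, 12, 16]
follow_up = [10, 9, 8, 7, 11]

es = effect_size(baseline, follow_up)
print(f"Effect size: {es:.2f}")
```

A negative value means scores went down, which on a depression screener indicates improvement.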
90
Relevant or Meaningful Change
“Is the observed change important?”
To clinician:
– meaningful to clinician
– change might influence patient management
To patient:
– patient notices change
– amount of change matters
91
Significance and Importance
Statistical significance is not sufficient for clinical importance
– Depends on sample size
– Can obtain statistical significance of a very small change
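To see why significance depends on sample size, the sketch below computes the t statistic for a mean change score, t = mean / (SD / sqrt(n)), for the same small change at two sample sizes. The mean change, SD, and sample sizes are hypothetical.

```python
import math

def t_for_mean_change(mean_change, sd_change, n):
    """t statistic for a mean change score: mean / (SD / sqrt(n))."""
    return mean_change / (sd_change / math.sqrt(n))

# Hypothetical: the same 1-point change (SD of change = 10) at two sample sizes
t_small = t_for_mean_change(1.0, 10.0, n=25)
t_large = t_for_mean_change(1.0, 10.0, n=2500)

print(f"n = 25:   t = {t_small:.2f}")
print(f"n = 2500: t = {t_large:.2f}")
```

The change is identical in both cases; only the larger sample pushes t past the usual two-sided 0.05 threshold of about 1.96, which says nothing about whether a 1-point change matters clinically.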
92
“Minimal Important Difference” (MID)
MID = the minimal difference that is clinically important
Smallest difference considered to be worthwhile or important
– Context specific
93
Anchor-Based Approaches to MID
Anchor: external information on amount of change
Identify group that has changed by a “minimal” amount
– Clinical change
– Patient reported change
MID is the change in the health measure for this group
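A minimal sketch of the anchor-based approach, assuming a patient-reported anchor in which "a little better" defines the minimal-change group. The change scores, rating labels, and choice of anchor category are all hypothetical.

```python
import statistics

# Hypothetical data: (change in health measure, patient-rated change)
patients = [
    (12, "much better"),
    (5, "a little better"),
    (4, "a little better"),
    (6, "a little better"),
    (0, "no change"),
    (1, "no change"),
    (15, "much better"),
]

# Anchor group: patients who changed by a "minimal" amount
minimal_group = [change for change, rating in patients
                 if rating == "a little better"]

# MID estimate: mean change in the health measure for the anchor group
mid_estimate = statistics.mean(minimal_group)
print("MID estimate:", mid_estimate)
```

The estimate depends on which anchor and which category are treated as "minimal", which is one reason the MID is context specific.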
94
Other Approaches to Assess Meaning of Change (Relative to a Measured Change)
Patient “noticed” change
– Since …., how would you rate the amount of change in your physical functioning?
» 7-point scale: very much better ……. very much worse
Patient “satisfied” with change
– How satisfied are you with the amount of change in physical functioning?
» 7-point scale: extremely satisfied … not at all satisfied
95
Other Measures of Perceived “Success”
Study of patients with hip or knee replacement
How successful was your (hip, knee) replacement in..
– allowing you to return to your normal daily activities?
– relieving your pain?
Response choices: extremely, very, moderately, slightly, not at all successful
KB Bayley et al. Medical Care 1995:33:AS226
96
Next Class (Class 7)
Measurement issues in health disparities research and research with diverse populations
97
Readings for Next Week: Teresi and Holmes, 2001
Methodological issues in cross-cultural research
Jeanne Teresi is a leading methodologist in item response theory
– Ways of examining bias in items
An introductory article in a special issue of the Journal of Mental Health & Aging on measurement issues in diverse populations
98
Readings for Next Week: Stewart and Nápoles-Springer 2003
Summarizes discussions of a small conference convened by six RCMAR centers in 2001
Reviews measurement issues for research in diverse race/ethnic and language groups
Suggestions for improving measures of socioeconomic status, discrimination, acculturation/enculturation, and quality of care
– 4 key determinants of health disparities
99
Homework
Complete rows 24-29 in matrix for your two measures
– Validity
– Responsiveness and sensitivity to change