Reliability and Validity
Introduction to Study Skills & Research Methods (HL10040)
Dr James Betts
Lecture Outline:
• Definition of Terms
• Types of Validity
• Threats to Validity
• Types of Reliability
• Threats to Reliability
• Introduction to Measurement Error.
Commonly used terms…
“She has a valid point”
“My car is unreliable”
…in science…
“The conclusion of the study was not valid”
“The findings of the study were not reliable”.
Some definitions…
• Validity
“The soundness or appropriateness of a test or instrument in measuring what it is designed to measure”
(Vincent 1999)
Some definitions…
• Validity
“Degree to which a test or instrument measures what it purports to measure”
(Thomas & Nelson 1996)
Some definitions…
• Reliability
“…the degree to which a test or measure produces the same scores when applied in the same circumstances…”
(Nelson 1997)
Some definitions…
• Objectivity
“…the degree to which different observers agree on measurements…”
(Atkinson & Nevill 1998)
Types of Experimental Validity
• Internal
– Is the experimenter measuring the effect of the independent variable on the dependent variable?
• External
– Can the results be generalised to the wider population?
[Diagram: Validity divides into Logical (Face, Content) and Statistical, AKA Criterion (Concurrent, Predictive), with Construct spanning both. Consistency: Reliability, Objectivity.]
Logical Validity
• Face Validity
– Infers that a test is valid by definition
– It is clear that the test measures what it is supposed to
e.g. If you want to assess reaction time, measuring how long it takes an individual to react to a given stimulus would have face validity
Externally Valid?
Assessing face validity is therefore a subjective process.
i.e. Would assessing 15 m sprint time be a valid means of assessing reaction time?
Logical Validity
• Content Validity
– Infers that the test measures all aspects contributing to the variable of interest
…also a subjective process.
e.g. Who is the most physically fit?
VO2 max test?
Wingate test?
1 RM?
Overall:
A logically valid test simply appears to measure the right variable in its entirety?
Statistical Validity
• Concurrent Validity
– Infers that the test produces similar results to a previously validated test
e.g. VO2 max
Incremental Treadmill Protocol with expired gas analysis vs. Multi-Stage Fitness (Beep) Test
Statistical Validity
• Predictive Validity
– Infers that the test provides a valid reflection of future performance using a similar test
e.g. Can performance during test A be used to predict future performance in test B?
http://www.youtube.com/watch?v=vdPQ3QxDZ1s
Overall:
A statistically valid test produces results that agree with other similar tests?
Logical/Statistical Validity
• Construct Validity
– Infers not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically
– Therefore relates to hypothetical or intangible constructs
e.g. Team Rivalry, Sportsmanship.
– This makes assessment difficult, i.e. if what should exist cannot be detected, this could mean:
a) Test Invalid? b) Theory Incorrect? c) Sensitivity/Specificity Issues?
Interesting Example: Breast Cancer
• Incidence: ~1 % (0.8 %)
(i.e. approximately 1 in every 100 women tested has breast cancer, so an ideal test would return a positive result for 1 in every 100)
• Sensitivity: ~90 % (87 %)
(the mammogram is sensitive enough that approximately 90 in every 100 breast cancer patients will receive a positive result)
• Specificity: ~90 % (93 %)
(the mammogram is specific enough that approximately 90 in every 100 healthy patients will receive a negative result).
Data from Kerlikowske et al. (1996)
Quick Test
• What is the probability that a patient receiving a positive result actually has breast cancer?
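One way to work through the Quick Test is Bayes' theorem: the probability of disease given a positive result is the share of true positives among all positives. The sketch below is only an illustration. The function name is hypothetical, and the inputs are the incidence, sensitivity and specificity figures quoted on the previous slide.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of disease given a positive result (Bayes' theorem)."""
    true_positives = prevalence * sensitivity               # diseased and test positive
    false_positives = (1 - prevalence) * (1 - specificity)  # healthy but test positive
    return true_positives / (true_positives + false_positives)


# Rounded figures from the slide: 1 % incidence, 90 % sensitivity, 90 % specificity
rounded = positive_predictive_value(0.01, 0.90, 0.90)

# Figures in brackets on the slide: 0.8 %, 87 %, 93 %
reported = positive_predictive_value(0.008, 0.87, 0.93)

print(f"Rounded figures:  {rounded:.1%}")
print(f"Reported figures: {reported:.1%}")
```

With either set of figures the answer is below 10 %, because the healthy majority generates far more false positives than the small diseased group generates true positives.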
Threats to Validity (and possible solutions?)
Threats to Internal Validity
• Maturation
– Changes in the DV over time irrespective of the IV
Threats to Internal Validity
• Maturation
e.g. One Group Pre-test Post-test
O1 T O2
Threats to Internal Validity
• Maturation (possible solution)
Time series
O1 O2 O3 T O4 O5 O6
Threats to Internal Validity
• Maturation (possible solution)
Pre-test Post-test Randomised Group Comparison
R  O1 T O2
R  O3 P O4
n.b. RCT
Threats to Internal Validity
• Maturation (possible solution)
Repeated measures designs can occasionally be an inappropriate solution, even when randomised and counterbalanced
e.g.
Muscle Damage (repeated bout effect)
Vitamin Supplementation (wash-out period)
In which case independent measures designs could be used.
Threats to Internal Validity
• History
– Unplanned events between measurements
Threats to Internal Validity
• History
O1 T O2
e.g. exercise?
Therefore, solution = control extraneous variables!
Threats to Internal/External Validity
• Pre-testing
– Interactive effects due to the pre-test (e.g. learning, sensitisation, etc.)
– Also influences External Validity
Threats to Internal/External Validity
• Pre-testing
e.g.
R  O1 T O2
R  O3 P O4
Assessing muscle mass at the pre-test (O1, O3) could make them train harder in both trials…
…but then respond better to the T than the P…
…so it is actually T+O1 that is better than P, not T alone.
Threats to Internal/External Validity
• Pre-testing (possible solution)
Solomon Four-Group Design
R  O1 T O2
R  O3 P O4
R     T O5
R     P O6
Threats to Internal Validity
• Statistical Regression
– AKA regression to the mean
– An initial extreme score is likely to be followed by less extreme subsequent scores
e.g.
Sophomore Slump & SI ‘Cover Jinx’
Training has the greatest effect on untrained individuals.
Therefore, solution = effective sampling.
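Regression to the mean can be illustrated with a small simulation. This is a sketch under assumed conditions, not data from the lecture: each simulated person has a stable "true" score plus independent random noise on each test, and all numbers (means, spreads, sample size, seed) are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

N = 10_000
# Hypothetical model: observed score = stable true score + random noise on the day
true_scores = [random.gauss(50, 5) for _ in range(N)]
test1 = [t + random.gauss(0, 5) for t in true_scores]
test2 = [t + random.gauss(0, 5) for t in true_scores]

# Select only the people with extreme (top 10 %) scores on the first test
cutoff = sorted(test1)[int(0.9 * N)]
selected = [i for i in range(N) if test1[i] >= cutoff]

mean_all = statistics.mean(test1)
mean_sel_1 = statistics.mean(test1[i] for i in selected)
mean_sel_2 = statistics.mean(test2[i] for i in selected)

print(f"Whole group, test 1:    {mean_all:.1f}")
print(f"Extreme group, test 1:  {mean_sel_1:.1f}")
print(f"Extreme group, test 2:  {mean_sel_2:.1f}")
```

The extreme group scores closer to the whole-group mean on the second test even though no treatment was applied, which is why a sample selected for extreme initial scores can appear to change on its own.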
Threats to Internal Validity
• Instrumentation
– A difference in the way 2 comparable variables were measured
e.g.
Uncalibrated equipment
Therefore, solution = calibrate!
Threats to Internal Validity
• Selection Bias
– The groups for comparison are not equivalent
Threats to Internal Validity
• Selection Bias
e.g. Groups not randomly assigned
Static Group Comparison
T O1
P Oa
i.e. Group T were resistance trained to start with
Threats to Internal Validity
• Selection Bias (possible solution)
T O1
P Oa
Either:
- Randomise group assignment,
- Pre-test and post-test difference,
- Repeated Measures Design.
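Random group assignment, the first option above, can be sketched in a few lines. The function name, participant labels and seed are hypothetical and only serve the illustration.

```python
import random


def randomise_groups(participants, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(participants)  # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant codes P01 to P12
participants = [f"P{i:02d}" for i in range(1, 13)]
treatment, placebo = randomise_groups(participants, seed=42)

print("Treatment (T):", treatment)
print("Placebo (P):  ", placebo)
```

Because chance alone decides who ends up in T or P, pre-existing differences such as training status should be spread evenly across groups on average.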
Threats to Internal/External Validity
• Experimental Mortality
– Missing data due to subject drop-out
– Reduced n = reduced statistical power
– Not only challenges quality of data gathered (Internal Validity) but also our ability to generalise (External Validity).
Therefore, solution = recruit sufficient participants (young?)
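The link between drop-out and statistical power can be sketched with a normal approximation for a two-group comparison. Everything here is an assumption made for illustration: the effect size, the group sizes, the alpha level and the function name are hypothetical, and the approximation ignores the negligible chance of a significant result in the wrong direction.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                         # critical value, about 1.96
    noncentrality = effect_size * (n_per_group / 2) ** 0.5    # expected test statistic
    return z.cdf(noncentrality - z_crit)


# Hypothetical study: standardised effect of 0.8, 20 recruited per group,
# falling to 12 per group after drop-out
power_full = approx_power(0.8, 20)
power_after_dropout = approx_power(0.8, 12)

print(f"Power with n = 20 per group: {power_full:.2f}")
print(f"Power with n = 12 per group: {power_after_dropout:.2f}")
```

Under these assumptions, power falls from roughly 0.7 to roughly 0.5, which is the reason for recruiting more participants than the minimum required.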
Threats to External Validity
• Inadequate description
– 5th characteristic of research… should be replicable
If nobody can replicate the methods of a given study, then it is irrefutable and therefore lacks external validity.
Therefore, solution = comprehensive methodology
Threats to External Validity
• Biased sampling
– Linked to statistical regression
– Sample does not reflect target population
– n ≠ N
Results generalised across gender
Therefore, solution = random sample (of target population).
Threats to External Validity
• Hawthorne Effect
– DV is influenced by the fact that it is being recorded
e.g.
Fastest sprint when professor enters lab
Therefore, solution = control the lab environment.
Threats to External Validity
• Demand Characteristics
– Participants detect the purpose of the study and behave accordingly
e.g. CHO vs. H2O
Sports Science students already know that the carbohydrate drink is supposedly superior
Therefore, solution = double or single blinding.
Threats to External Validity
• Operationalisation
– AKA Ecological Validity
– The DV must have some relevance in the ‘real world’
e.g. TTE (time to exhaustion) has no Olympic equivalent
Therefore, solution = choose your DV carefully.
Reliability
• Reliability is a pre-requisite of validity
e.g. Direct versus Indirect measures of VO2 max
Direct: Gold Standard (i.e. valid and reliable), Expensive, Complex
Indirect: Predictive, Cheap, Easy
Reliability
Subject 1 60 ml.kg-1.min-1 60 ml.kg-1.min-1 60 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 55 ml.kg-1.min-1 55 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 70 ml.kg-1.min-1 70 ml.kg-1.min-1
Valid and Reliable
Reliability
Subject 1 60 ml.kg-1.min-1 65 ml.kg-1.min-1 65 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 60 ml.kg-1.min-1 60 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 75 ml.kg-1.min-1 75 ml.kg-1.min-1
Not Valid but Reliable
5 ml.kg-1.min-1 correction?
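The correction idea can be sketched numerically. The slide does not label its columns, so this sketch assumes the first value for each subject is the criterion (direct) measure and the other two are repeated trials of the test being evaluated. The variable names are hypothetical.

```python
from statistics import mean

# Values from the slide (ml.kg-1.min-1); the column meanings are an assumption
criterion = [60, 55, 70]      # assumed: criterion (direct) measure
test_trial_1 = [65, 60, 75]   # assumed: first trial of the test being evaluated
test_trial_2 = [65, 60, 75]   # assumed: second trial of the same test

# Reliable: the test gives the same score on both trials
is_consistent = test_trial_1 == test_trial_2

# Not valid: the test sits a constant amount above the criterion
bias = mean(t - c for t, c in zip(test_trial_1, criterion))

# Subtracting the constant bias recovers the criterion values
corrected = [t - bias for t in test_trial_1]

print("Consistent across trials:", is_consistent)
print("Systematic bias:", bias)
print("Corrected scores:", corrected)
```

A correction like this only works because the error is the same for every subject. It would not rescue the scattered scores on the following slide.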
Reliability
Subject 1 60 ml.kg-1.min-1 72 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 61 ml.kg-1.min-1 52 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 40 ml.kg-1.min-1 84 ml.kg-1.min-1
Not Valid and not Reliable
i.e. a test can never be valid without being reliable?
Types of Reliability
• Relative
• Absolute
• Rater reliability (Objectivity)
– Intrarater reliability
– Interrater reliability.
Relative Reliability
Subject 1 60 ml.kg-1.min-1 63 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 56 ml.kg-1.min-1 48 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 65 ml.kg-1.min-1 66 ml.kg-1.min-1
Relatively Reliable
i.e. Individuals maintain position in the group
Absolute Reliability
Subject 1 60 ml.kg-1.min-1 63 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 56 ml.kg-1.min-1 48 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 65 ml.kg-1.min-1 66 ml.kg-1.min-1
Not Absolutely Reliable
i.e. Test-Retest within individuals
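The contrast between the two slides can be checked directly from the scores. This sketch assumes the three values per subject are three repeated trials (the slides do not label the columns). The helper name is hypothetical.

```python
# Scores from the slides (ml.kg-1.min-1); assumed to be three repeated trials
scores = {
    "Subject 1": [60, 63, 57],
    "Subject 2": [55, 56, 48],
    "Subject 3": [70, 65, 66],
}


def rank_order(trial_index):
    """Subjects ordered from highest to lowest score on one trial."""
    return sorted(scores, key=lambda s: scores[s][trial_index], reverse=True)


# Relative reliability: do individuals keep their position in the group?
orders = [rank_order(i) for i in range(3)]
same_order = all(order == orders[0] for order in orders)

# Absolute reliability: how much does each individual's own score vary?
ranges = {subject: max(values) - min(values) for subject, values in scores.items()}

print("Rank order maintained:", same_order)
print("Within-subject range:", ranges)
```

The rank order is identical on every trial, yet each subject's own score moves by 5 to 8 ml.kg-1.min-1, so the same data are relatively reliable but not absolutely reliable.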
Rater Reliability
• Intrarater reliability
– The consistency of a given observer or measurement tool on more than one occasion
Rater Reliability
• Interrater reliability
– The consistency of a given measurement from more than one observer or measurement tool
e.g.
Score for the American Gymnast
British Judge = 9.9
French Judge = 4.4
Japanese Judge = 7.0
Threats to Reliability
• Fatigue
Subject 1: 60 ml.kg-1.min-1 (8 am), 55 ml.kg-1.min-1 (9 am), 50 ml.kg-1.min-1 (10 am)
Therefore, solution = increase time between tests.
Threats to Reliability
• Habituation
Subject 1 60 ml.kg-1.min-1 65 ml.kg-1.min-1 70 ml.kg-1.min-1
Therefore, solution = familiarise prior to test.
Threats to Reliability
• Standardisation of Procedures
– Control of extraneous variables
• Precision of Measurements
– i.e. if we are happy to measure VO2 max to the nearest 10 ml.kg-1.min-1, then it could probably be reliably predicted from your training volume and age.
Measurement Errors
• Ultimately, reliability is dependent on the degree of measurement error in a given study
• The overall error in any measurement comprises both systematic and random error
• We will address measurement error further next week…
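As a preview, one common way to separate the two components in test-retest data is to treat the mean of the differences as the systematic error and the spread (standard deviation) of the differences as the random error. This is a sketch of that idea only, not necessarily the method covered next week, and the scores below are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical test-retest scores (ml.kg-1.min-1) for five subjects
test = [60, 55, 70, 48, 65]
retest = [63, 56, 74, 50, 69]

differences = [r - t for t, r in zip(test, retest)]

systematic_error = mean(differences)   # consistent shift between occasions (bias)
random_error = stdev(differences)      # unpredictable scatter around that shift

print(f"Systematic error (mean difference): {systematic_error:.2f}")
print(f"Random error (SD of differences):   {random_error:.2f}")
```

A systematic shift such as a learning effect can in principle be corrected or designed out, whereas the random component sets the limit on how small a real change the test can detect.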
Literature Search Assignment
• The handout lists 8 questions which can be answered through retrieving the corresponding source articles
• Answer as many as possible and bring them to next week’s lecture
• DO NOT contact author or order articles.
Selected Reading
• Atkinson, G. and A. M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine. 26: 217-238, 1998.
• Holmes, T. H. Ten categories of statistical errors: a guide for research in endocrinology and metabolism. American Journal of Physiology. 286: E495-501.
• Thomas, J. R. and J. K. Nelson. Research Methods in Physical Activity, 4th edition. Champaign, Illinois: Human Kinetics, 2001.