Measurement Theory

DAVID E. KREBS

The usefulness and truthfulness of assessments, whether performed in the clinic on one patient or in the research laboratory on many subjects, depend on valid measurements. Measurement theory is the thought process and interrelated body of knowledge that form the basis of valid measurements. Translation of measurement theory to behaviors helps to ensure the integrity and relevancy of tests and the data that result from them. In the final analysis, useful and truthful data depend for their existence on scalable and detectable events being translated into pertinent, valid, and reliable measurements. The rules by which numbers are assigned to events form the basis of measurement theory.

Key Words: Physical therapy, Research.

Measurement theory is the conceptual foundation of all scientific decisions. If the measurements are erroneous, no amount of statistical or verbal sophistry can right them. For example, if goniometric joint positioning rules are not obeyed, then research or clinical decisions based on those range-of-motion measurements probably will lead to incorrect, possibly even harmful, conclusions.

Measurement is the assignment of numbers to events according to rules.1 Measurement theory primarily concerns the rules of measurement, because these rules link the clinician's or researcher's ideas to the concrete arithmetic notations that typically are reported as "data." Measurement juxtaposes science and philosophy, because only through measurement does science approach real life.2 Measurement theory must reflect real life and conform to the tenets of epistemology, the philosophy of meaning and knowledge. Thus, common sense and clear thinking often are more important than technical knowledge in generating useful measurements.

The purpose of this article is to lay the conceptual foundation on which concrete measurement applications build. It will present principles that determine valid rules by which numbers are assigned to events. The most important factors are the type of event to be measured, the logical criteria of measurement, the measurement's reliability, the event's stability, and ultimately the way in which the measurements are to be used.

DATA CLASSIFICATION

The characteristics of the events to be measured determine in large part the types or classes of data that the measurement generates. Data are recordings of observations. Datum, the singular form of data, is Latin for "the given." Thus, a recording of an observation sometimes is accepted naively as an alias for the event itself, as a given. Quantitative data are numerical recordings of observations of events whose characteristics can be counted, ranked, or assigned a position along a proportionate scale. Physical therapy research usually focuses on quantitative data because they can be summarized and analyzed statistically, so this article will not discuss non-numeric, or qualitative, verbal data.

The extent to which data are affected by factors that are idiosyncratic to the machine or person performing the measurements is denoted in part by the data order. The closer the data are to the life event, the less they may be affected by the measurement process.

Zero-order data are the life events themselves. The term "zero order" is used to indicate that the term "data" is not used in the usual sense, because life events cannot be recordings of observations; the term, however, does serve to remind researchers that we never report zero-order data, but rather an observation based on life events. First-order data are descriptions of the life event; they are the fundamental level of measurement. Second-order data are inferences based on the life event descriptions; they are conjectures deduced from the lower level data and may be assigned some truth level or statistical probability. Third-order data are reactions based on one or more pieces of inferential data; they are derived from accumulated experience conceptually associated with the life event in question.
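As a minimal illustration (the scenario and values below are hypothetical, not drawn from the article), the four data orders might be annotated in code as follows:

```python
# Hypothetical illustration of data orders in a gait assessment.
# Zero-order "data" are the life event itself and cannot be recorded directly.
life_event = "patient flexes the knee during gait"   # zero order: the event itself

first_order = 58.0                     # description: goniometer reading, in degrees
second_order = first_order < 60.0      # inference: flexion is limited (a conjecture, not the event)
third_order = ("emphasize knee flexion exercises"    # reaction: derived from the inference
               if second_order else "maintain current program")

print(first_order, second_order, third_order)
```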

By and large, measurement in physical therapy deals with first-order data. Some social sciences have few first-order data; that is, naturalistic observational measurements such as might occur in sociology or anthropology by necessity intermix inference and reaction with attempts to gauge the life event. The naturalistic observer explicitly records his or her perceptions, inferences, and reactions to the events, which unavoidably generate second- and third-order data.3 Meta-analyses and literature research rely on second- and third-order data for their raw material, from which they generate new third-order data, that is, other results, opinions, and literature.4,5

Dr. Krebs is Associate Professor, Massachusetts General Hospital Institute of Health Professions, 15 River St, Boston, MA 02108-3402 (USA). He was Associate Research Scientist, New York University Post-Graduate Medical School, Prosthetics and Orthotics, 317 E 34th St, New York, NY 10016, at the time this article was written.

This work was supported in part by a grant from the Shriners Hospitals for Crippled Children.

First-order data can be reanalyzed and reinterpreted validly by another researcher or clinician if the original rules of measurement were specified sufficiently. These reanalyses and reinterpretations may agree or disagree with the original researcher's inferences and reactions regarding the same life events. For example, if a therapist measures electromyographic activity of the vastus medialis muscle during gait, these first-order data (ie, raw EMG data) might be used by one researcher to characterize some aspect of muscle physiology. Another researcher might average or integrate the same first-order EMG data to reach conclusions about locomotor control.6 A third investigator, however, might suspect that what actually is being measured is motion artifact generated by the moving limbs, unless the original researcher specified how such artifacts were controlled. The important message is that the researcher must provide reasons justifying the descriptions and inferences based on the measurements. That is, the experimenter must provide for all to inspect the rules by which the numbers were assigned to experimental events. At a minimum, the clinician or researcher must be aware that data can be qualitative or quantitative, close to or far from the life event.
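A minimal sketch of the reanalysis point, using invented numbers rather than any data reported in the article: the same first-order EMG recording can be summarized differently by different investigators, and each summary is a higher-order product whose meaning depends on the rules used to obtain it.

```python
import numpy as np

# Hypothetical raw (first-order) EMG samples from the vastus medialis, in microvolts.
raw_emg_uv = np.array([12.0, 15.5, 40.2, 38.7, 22.1, 9.8, 11.3, 35.0])
sampling_interval_s = 0.01

# One investigator averages the rectified signal to describe overall activity.
mean_activity = np.mean(np.abs(raw_emg_uv))

# Another integrates the rectified signal over time to study locomotor control.
integrated_emg = np.trapz(np.abs(raw_emg_uv), dx=sampling_interval_s)

# A third might first remove a suspected constant motion-artifact offset; whether that
# step is justified depends on how artifacts were controlled during data collection.
artifact_corrected = np.abs(raw_emg_uv - np.median(raw_emg_uv))

print(mean_activity, integrated_emg, artifact_corrected.mean())
```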

OPERATIONALIZATION

The conceptual rules of measurement must be operationalized into behaviors that maximize the chances that the data actually represent the life event. Minimization of artifact and other specific biases is the particular task of operationalization. Some examples of rules governing most clinical assessment are: "What are the test beginning and end points?" "What are the patient and tester positions during the assessment?" "What instructions are given to the patient and raters?" "What constitutes a successful trial, and when should a trial be judged invalid?"
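One way to keep such rules explicit and open to inspection is to record them alongside every observation. The sketch below is a hypothetical protocol for a knee-flexion test, not a protocol prescribed by the article; every criterion shown is an assumption for illustration.

```python
# Hypothetical operationalization of measurement rules for a knee-flexion test.
knee_flexion_protocol = {
    "start_point": "knee in full available extension, goniometer axis at lateral epicondyle",
    "end_point": "maximum active flexion held for 2 seconds",
    "patient_position": "supine, hip in neutral rotation",
    "tester_position": "standing on the test side, stabilizing the femur",
    "instructions": "Bend your knee as far as you can and hold it.",
    "valid_trial": "no hip hiking or pelvic rotation observed; otherwise repeat",
    "trials_recorded": 3,
}

def record_trial(protocol: dict, reading_deg: float, trial_valid: bool) -> dict:
    """Attach the governing rules to each recorded observation."""
    return {"reading_deg": reading_deg, "valid": trial_valid, "rules": protocol}

trial = record_trial(knee_flexion_protocol, reading_deg=112.0, trial_valid=True)
print(trial["reading_deg"], trial["valid"])
```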

The operationalization of measurement rules into rater behaviors determines what is recorded and finally accepted as data. To convince the reader that measurements are valid, the logical connections between the numbers analyzed and all important events that took place during data collection must be established clearly.

LOGICAL AND STATISTICAL CONSIDERATIONS

Statistics are used to summarize and analyze numerical data. Any statistical procedure assumes that its data satisfy at least two logical criteria: 1) The event must be detectable, and 2) the event must be scalable.

To be detectable, the event at a minimum must be definable. Counting the number of subjects in a group is a familiar example involving detection; this variable is measurable because we can agree on what to count: We count people. Specifically, we develop a rule that uniquely characterizes the group's attributes and tally the number of subjects who share the group's distinguishing features. A more subtle example, however, is the definition of spasticity. Although physical therapists have an empirical understanding of what a spastic limb feels like, we have yet to agree on a variable that represents degree of spasticity meaningfully. Thus, quantitative spasticity data have eluded therapists because the concepts and constructs of spasticity have not been operationalized, which in turn, I suspect, is because we cannot agree on what to detect.
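A minimal sketch of the counting example: detection requires a rule that states unambiguously what is to be counted. The membership criteria below are hypothetical.

```python
# Hypothetical roster; the detection rule is the predicate, not the tally itself.
subjects = [
    {"name": "A", "diagnosis": "CVA", "age": 67},
    {"name": "B", "diagnosis": "TKA", "age": 71},
    {"name": "C", "diagnosis": "CVA", "age": 59},
]

def in_group(subject: dict) -> bool:
    """Rule that uniquely characterizes the group's attributes (invented criteria)."""
    return subject["diagnosis"] == "CVA" and subject["age"] >= 60

group_size = sum(1 for s in subjects if in_group(s))  # we count people who satisfy the rule
print(group_size)  # 1
```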

To be scalable, the event must be detectable and definable along at least one dimension, which can have many or few gradations. For example, the unit of mass can be kilograms or micrograms, but the resulting data will be statistically and logically equivalent. Scales generally are classified as nominal, ordinal, interval, or ratio.7

Types of Scales

Nominal scales assign numerals that merely name or classify events. For example, no one would consider one football player better than another simply because his jersey had a higher number. Nominal scales can only differentiate; one would only say that one player was different from another.

Ordinal scales classify and ascribe a hierarchy to events. A manual muscle test score of 4 (Good) denotes a muscle contraction that is stronger, higher, or better than a score of 2 (Poor), but not necessarily twice as high, good, or strong, nor even the same actual difference in muscle function as between 3 (Fair) and 1 (Trace).

Interval scales classify, ascribe a hierarchy, and denote numerical differences that are isomorphically equal to real differences. Isomorphic measurements are those measurements that literally take the same shape as reality (Figure). Thus, a difference of 2° on the Celsius scale represents the same real heat difference regardless of whether it lies between 100° and 98°C or between 12° and 10°C. Thus, addition and subtraction may be used freely on interval scales, but not multiplication or division.

Ratio scales classify; ascribe a hierarchy; denote isomorphically equal intervals; and possess a real, naturally occurring zero value. Thus, not only do 20° and 60° on the Kelvin temperature scale differ by 40°, but 20°K also isomorphically represents one third as much real heat as 60°K. Thus, all arithmetic operations will yield meaningful results on ratio data.
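A worked illustration of these arithmetic restrictions, with temperature values chosen arbitrarily: differences behave the same on the Celsius and Kelvin scales, but ratios are meaningful only on the Kelvin (ratio) scale, because only it has a naturally occurring zero.

```python
# Differences are preserved between interval (Celsius) and ratio (Kelvin) scales.
c1, c2 = 10.0, 12.0
k1, k2 = c1 + 273.15, c2 + 273.15
print(c2 - c1, k2 - k1)                    # 2.0 and 2.0: the same real heat difference

# Ratios are meaningful only where zero is naturally occurring.
print(60.0 / 20.0)                         # 60 K represents three times as much heat as 20 K
print((60.0 + 273.15) / (20.0 + 273.15))   # but 60 degrees C is not three times 20 degrees C (about 1.14)
```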

Measurement-Scale Statistical Restrictions

Arithmetic operations on data from a scale are restricted because we want the results of such operations to have meaning. Statistical operations are subject to restrictions by the same logic, but measurement-scale properties do not determine the type of descriptive and inferential statistical analyses that may be applied except for the arithmetic restrictions noted previously; statistical theory and measurement theory are quite independent in this regard.8 For example, a mean of manual muscle test (MMT) scores would only be a useful description of a group of patients' muscle strength if the MMT data had a normal distribution. If the data have a normal distribution, a known probability interval can be established between any points on the normal curves, thus elevating MMT to interval-scale status in this instance.

The choice of inferential statistic (eg, parametric vs nonparametric, or distribution-free) likewise is determined chiefly by the underlying distribution and by the degree of homogeneity and independence of error variances of the sample data gathered.9 For example, scales such as EMG microvolts or isokinetic torque, which appear to have a naturally occurring zero value, may not satisfy the isomorphically equal gradation requirement of interval scales, so they may not possess ratio scales. Such data, however, may be analyzed with parametric statistics, such as correlations and analyses of variance (ANOVAs), so long as the data gathered satisfy the mathematical requirements of the statistical test. As another example, a nominal scale has, in one circumstance, ratio-scale properties: if the trait being measured has only two possible states (eg, a dichotomous trait such as sex), then its "male or female" or "yes or no" data can be represented statistically as "0,1" and used for ANOVAs or correlations.10 The take-home message here is that numerical measurement indexes should most importantly correspond explicitly to the phenomena being scaled and are concerned only secondarily with the statistical analyses that later may be applied.
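A minimal sketch of the dichotomous-trait point, with made-up numbers: coding a two-state trait as 0/1 lets it enter a correlation (or ANOVA) like any other variable, provided the data meet the test's mathematical requirements.

```python
import numpy as np

# Hypothetical data: sex coded 0/1 and peak isokinetic torque in newton-meters.
sex = np.array([0, 1, 0, 1, 1, 0, 1, 0])                       # dichotomous trait coded "0,1"
torque_nm = np.array([95, 141, 102, 150, 138, 99, 148, 105], dtype=float)

# The point-biserial correlation is simply the Pearson correlation with 0/1 coding.
r = np.corrcoef(sex, torque_nm)[0, 1]
print(round(r, 3))
```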


RELIABILITY

Reliability is simply the extent to which measurements yield similar results consistently. Ideally, variability in test data should result only from real differences in the underlying factor being examined. In reality, however, measurement error is always a component of test scores. As Stanley points out, "If the unit of measurement is fine enough in relation to the accuracy of the measurement, discrepancies will always appear."11 Indeed, as measurement technology improves, even the fundamental physics constants, such as Planck's constant and the speed of light, must be adjusted periodically.12 Measurement error can be subdivided into systematic and random errors.13

Systematic errors are consistent biases built into the test situation. For example, some persons believe they can increase cranial suture motion in adults. If such a person attempts to measure cranial suture motion by palpation, that person will be more likely to believe that motion indeed has increased than an unbiased observer. (Measuring cranial suture motion in adults, of course, also involves problems of detection.) Systematic errors also may be subtle: An EMG telemeter that causes a 5-mV direct-current gain generates a systematic error, systematically overestimating actual EMG voltage. Thus, the data may be reliable only if the same telemeter is used repeatedly, but not if different telemeters are used interchangeably. Systematic unreliability also affects data validity.

Random errors occur by chance. If a measurement were conducted many times, random errors would be as likely to affect one datum as any other. Unlike systematic bias, random error is essentially "noise" in the system that prevents accurate, precise data collection. For example, a telemeter that is sensitive to random events such as stray radio signals will be inaccurate under all conditions.
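The distinction can be made concrete with a small simulation (all values are hypothetical): a constant 5-mV offset is perfectly repeatable yet always wrong, whereas random noise scatters the readings around the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_signal_mv = np.full(5, 20.0)                  # the "real" EMG level on five trials

biased = true_signal_mv + 5.0                      # systematic error: reliable but invalid
noisy = true_signal_mv + rng.normal(0.0, 3.0, 5)   # random error: "noise" in the system

print(biased.std(), biased.mean())   # near-zero spread, but the mean is 5 mV too high
print(noisy.std(), noisy.mean())     # larger spread, with the mean near the true 20 mV
```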

Precise data are the progeny of precise measurement rules. Measurement error can be minimized by providing fair and unambiguous instructions and a consistent physical environment to the test subjects and to the data collectors. In short, all components of the test environment must be controlled to the maximum extent possible to decrease the chances that some factor other than the one of interest will affect the measurements.

PERFORMANCE STABILITY

A special problem of measurement reliability is presented by empirical studies such as clinical research. Subjects must be able to perform similar tasks similarly if they are to be retested. The idiosyncratic contributions of the subject's mood and other elements unrelated to the factor of interest must be acknowledged in any research involving humans.4 During repeated testing, however, differential motivation, learning and practice effects, inattention, and contrivances of the test environment that interact with the subject in an uncontrolled fashion may affect the test results as much as the factor of interest.14 If such confounding variables cannot be controlled through the research design, they at least should be measured and reported. For example, motor control researchers frequently attempt to gain insight into neurophysiological reflexes underlying gait perturbation recovery. No matter how well-controlled the test otherwise might be, only the first exposure of a given subject could possibly represent "reflex" activity; ensuing trials necessarily reflect, to some extent, exposure to the stimulus and consequent motor learning. As a more clinical example, attempts to characterize maximum muscle power will be unreliable if the patient is more motivated during one test than another. No matter how good the data collection apparatus, if the performance being assessed is not stable, the measurements will not be stable.

TEMPERATURE SCALE RELATIONSHIPS

Figure. Relationship between nominal, ordinal, interval, and ratio scales of temperature. The nominal scale (abscissa axis), using verbal descriptive measures, has been arranged purposely to parallel the general order of the quantitative measurement scales (ordinate axis). Note that degrees Celsius (an interval scale) parallel the real-life event, absolute heat in degrees Kelvin (a ratio scale), but that 0°C is not the naturally occurring zero-heat value. The Fahrenheit scale (an ordinal scale) shares neither the same slope nor the same zero value as the Celsius and Kelvin scales because the Fahrenheit scale's between-integer intervals and its zero value are not isomorphic with reality.

UNITS AND DIMENSIONS OF MEASUREMENT

Most research events, like life, are complex and multidimensional, and often more than one measurement is required to understand the life event. For example, to study the effect of a prosthetic hand on activities-of-daily-living skills, a researcher might analyze both the quantity and quality of performance during an ADL task, such as tying a shoelace into a bow. The units of measurement of these two performance dimensions, quantity and quality, might be defined operationally as duration in seconds and degree of normal appearance, respectively.15,16 Measurement dimensions are concepts that are operationalized into variables; variables are operationalized into measurement units. Michels describes the process well:

Identify a dimension...of interest; operationally define the dimension to make it publicly observable; and operationally define two or more categories or units on the dimension in such a way that they are mutually exclusive and exhaustive.17

The unit of measurement, strictly speaking, is arbitrary. The meaningfulness of a measurement unit is ultimately a function of the concept and construct interpretations of the researcher, which in turn are based loosely or explicitly on other measurements. For example, an inch was defined in 1324 by King Edward II as being equal to three average barleycorns' length when laid end to end. Thus, a 1-in unit of measurement originally was quite arbitrary, although it now has a commonly accepted definition.

The dimensions of measurement sometimes are defined as the apparent aspects or characteristics of the concept under study, but dimensions also may be latent, or not directly observable.18 The cause of low back pain probably is multidimensional, involving psychosocial characteristics in addition to mechanical aspects; the complex interaction of these two factors may form a latent third factor. Measures of all three dimensions may be required to categorize back pains according to the treatments to which they respond. To measure a latent dimension, statistical techniques such as factor analysis are used to generate new numerical combinations or patterns of empirically measured variables.

New measures are derived most commonly from prior measures, because the new measures automatically have some a priori relationship to those currently accepted. In the prosthetic example cited previously, the amputee's ADL quality could be judged according to criteria usually applied to nonamputees, but nonamputees may perform ADL, particularly tying a shoe, differently than do amputees. To avoid missing important dimensions of ADL skill assessment, a prosthetic test properly would modify the normal ADL tests' definitions in light of these differences. The unit of measurement for both amputees and nonamputees could be a six-point rating scale, with 0 being "poor performance" and 5 being "performs activity normally," but poor and normal could be defined differently depending on whether amputees or nonamputees were being tested.
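As a sketch of how the same six-point unit could carry group-specific definitions, with the wording of the criteria below invented for illustration:

```python
# Hypothetical operational definitions for a 0-5 ADL quality rating.
# The unit (0-5) is shared; what counts as "poor" or "normal" differs by group.
shoe_tying_criteria = {
    "nonamputee": {
        0: "poor performance: bow not completed",
        5: "performs activity normally: secure bow, both hands, typical speed",
    },
    "amputee": {
        0: "poor performance: bow not completed with prosthesis engaged",
        5: "performs activity normally for a prosthesis user: secure bow, prosthetic prehension used",
    },
}

def describe(group: str, score: int) -> str:
    anchors = shoe_tying_criteria[group]
    return anchors.get(score, f"intermediate performance (score {score})")

print(describe("amputee", 5))
```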

UNITS OF ANALYSIS

The unit of analysis is not necessarily the same as the unit of measurement; analyses can be based on raw, summarized, or transformed data. Which unit of analysis most accurately represents the characteristic under study usually is decided according to current theory that relates to the event being investigated (Appendix). If the purpose of measurement in the prosthetic ADL example discussed previously was to characterize the below-elbow (BE) amputees' skill in relation to all upper limb amputees, an explicitly derived score might be more meaningful than the raw (0-5) scores. One such derived score would be a standardized score (ie, where the BE amputees' raw scores fell in relation to the mean of the group of similar subjects). Common examples of standardized scores are intelligence quotients, Scholastic Aptitude Test scores, and developmental test scores.

More typical in physical therapy research is the use of averaged scores, where the mean of two or more performance measures is analyzed. For example, EMG microvolts may be averaged over two trials, and data analysis is performed on the resulting average. Summarized or standardized scores often are easier to interpret than raw scores and may be less sensitive to random unreliability of the test situation, test form, or other idiosyncrasies specific to the unique circumstances of the test. Of course, the transformation criteria must be specified clearly or critical information will be lost.19
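A minimal worked example of both derived scores, using invented numbers: a standardized score locates one amputee's raw rating relative to the group mean, and an averaged score summarizes repeated trials of the same measure.

```python
import numpy as np

# Hypothetical raw 0-5 ADL ratings for a group of below-elbow amputees.
group_scores = np.array([2, 3, 3, 4, 5, 1, 3, 4], dtype=float)
one_subject = 4.0

# Standardized (z) score: where this subject falls relative to the group.
z = (one_subject - group_scores.mean()) / group_scores.std(ddof=1)

# Averaged score: mean of two trials of the same measure (eg, EMG microvolts).
trial_1, trial_2 = 31.2, 28.8
averaged = (trial_1 + trial_2) / 2.0

print(round(z, 2), averaged)
```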

MEASUREMENT MEANINGFULNESS

Meaning, in scientific inquiry, depends on the context of the measurement and the assumptions that will permit data to be useful.20 If the rules by which measurements are made are valid, then logical inferences based on these measurements can lead to valid conclusions. Logical questions of measurement largely examine the meaningfulness of events and variables and how to measure them validly.17,21

Event Meaningfulness

Substantial controversy exists regarding what to measure during clinical trials, particularly in randomized trials.22 The necessity to define clearly the event being measured and to relate this event to the theoretical notion being tested distinguishes good science from sloppy, haphazard empiricism.

Measurement in physical therapy primarily is clinical measurement. In the casual clinical setting, most measurements are performed solely on the basis of convenience and a presumed relevance to the information sought; unfortunately, test assumptions rarely are examined. Nothing about the clinical setting, however, inhibits critical thinking, clear definitions, and theoretically sound measurement rules. Clinical measurements, therefore, can be as reliable, valid, and elegant as measurements used in controlled experiments. In controlled experiments, the logical connection between real life and the numbers assigned to the clinical events often determines whether the experiment's conclusions will be meaningful, or indeed relevant, to clinical practice.


For example, correct nosologic classifications of types of patients and diseases are an important step in sound clinical science. To conclude that a certain therapeutic intervention works well for patients with multiple sclerosis (MS), for example, the diagnostic criteria for MS must be sufficiently well-defined to permit such inferences. The meaningfulness of the events that determine the nosologic grouping of MS must be established clearly and accepted generally before clinical or experimental results can be presumed to be applicable to patients with MS.

Event meaningfulness also determines how the measurement results can be used, that is, their theoretical value. The theoretical evidence pertinent to the investigation (eg, previously published reports on the fate of patients with similar diagnoses and symptoms, anticipated value of the measurements to be taken, instruments to be used) should be examined thoroughly and taken into account before data collection.

For example, suppose that subjects with Buerger's disease are assigned randomly to either exercise or thermal biofeedback treatment groups. After treatment, their foot blood flow is measured using plethysmography and Doppler devices to determine vascular differences "caused" by the treatments and to predict amputation incidence. The six-week postexperiment results might reveal significant blood flow increases favoring the biofeedback group, but suppose a limb count at the five-year follow-up showed that the exercise group fared better. Logically, we might conclude that either the biofeedback failed or the plethysmographic and Doppler measures are unrelated to blood flow and hence to amputation. Because the second conclusion is unlikely to be true, however, we might assert that biofeedback failed. Or has it? Perhaps some confounding, unmeasured event unrelated to biofeedback or to the randomization process, such as differential cessation of cigarette smoking, by chance occurred more in the exercise group than in the biofeedback group. This confounding influence would affect amputation incidence independently, thus vitiating the meaningfulness of the short-term blood flow measures. The increased rate of amputation, therefore, should not be attributed to the effects of biofeedback, at least until the study can be replicated with better controls and with more inferential data. At a minimum, we can conclude that biofeedback was not a sufficiently strong treatment to prevent amputation in these subjects. This example, thus, demonstrates why "the data" should not be accepted facilely as "the given," because the measurements are not sufficiently relevant to the factor of interest.

The real dilemma of what events to charge to which treatment's "account" is rarely so clear. The premise of the preceding example is that one should measure only those events that reasonably seem to be connected to the experiment. Unfortunately, investigators who hold diametrically opposite views rarely agree on what events are important. For example, most physical therapists dismiss the small irregularities in isokinetic torque curves as noise, but some researchers believe that these "bumps" in the torque curve represent legitimate neurophysiological events and responses.23 Thus, at the opposite pole from the researchers in the previous example are those who argue that all events should be accounted for and charged to the treatment during whose tenure that event occurred. This entirely deterministic view is equally impractical because it compels the measurement of and accounting for an infinite number of events.

A reasonable, logical course would seem to be measurement of, and therefore researcher responsibility for, events that most fairly (ie, with least bias) reflect the clinical or experimental conditions, as might be judged by an impartial arbiter. Sackett and Gent list four criteria to help in deciding which events to measure: "the nature of the question posed; the perspective from which the question is posed; the consideration of why the experimental [or clinical] maneuver might be abandoned or violated; and avoidance of specific bias."22

Validity

Validity is the extent to which an item actually measures what the researcher purports the item measures. Measurement validity is the paramount goal of data collection. Rothstein presents a variety of validity assessment techniques,24 all of which attempt to answer the underlying question: "Does this measurement really capture what I think I am measuring?" Validity assessment, because it queries the extent of a measurement's satisfaction of the assumptions of reliability, scalability, and detectability, is chiefly a logical function. Measures, for example, can be reliably incorrect: A bathroom scale that invariably reads 5 lb over true weight will yield perfectly reliable (if systematically biased) data, but never valid weight data. Just as verbal statements may be true or false, valid or invalid, so behaviors and their measurement may express a true underlying state or they may express invalid, false representations.

Some theorists assert that measurement validity depends fundamentally on whether a test or measurement and its results can be replicated by others. Merely because a finding can be replicated, however, does not mean that the measurements are valid. The same factors that determine whether a scientist or clinician will investigate a problem are frequently the same factors that guide, and limit, the measurements taken. These factors may be related more to sociology and current customs than to ultimate truth and validity. The prevailing ideology may be right or wrong, but its effects are pernicious unless the observer acknowledges the force of such factors. For example, when Aristotelian "science" was the prevailing ideology, an accumulating body of research demonstrated the usefulness and predictability of its measurement theories. Many Aristotelian theories, including teleology, or internal impetus, were rejected by Sir Isaac Newton, and teleology long since has been displaced in modern science's lexicon by terms including gravity and the mechanical laws of physics.

Measurement validity depends on all of the factors described in this article, but ultimately it depends on the use to which the data will be put. If data are intended merely to describe an event, they may be valid ipso facto. Descriptive data must be accepted at face value. Most research, however, attempts to generalize beyond the test situation to real life.

Inferences from data are limited primarily by the representativeness of the measured event relative to real life. This concept of measurement veridicality, or test isomorphism with reality, is assumed in all inferential assessments. For example, therapists rarely wish merely to describe muscle force output by using an isokinetic device; most often, they wish to make an inference about the ability of the patient to generate sufficient muscle power during ADL or ambulation outside the clinic or laboratory. Currently, despite all the isokinetic data that have been reported, little evidence relates these test results to other clinically meaningful factors. To the extent that the isokinetic readings portray the subjects' motivation, tester bias, or any other nonphysiological characteristic of a muscle's performance, the test cannot permit physiological inferences. The resulting measurements have not been demonstrated to be isomorphic with reality and so should not be treated as inferential data.

Causal explanations require inferential data. For example, to say a certain treatment caused a clinically relevant improvement in muscle power requires a measure that characterizes isomorphically muscle work over time that is meaningfully related to real, clinical situations. Thus, isokinetic testing of the lower limbs of patients with hemiplegia, for example, certainly produces numbers, but what is the meaning of isokinetic values when reported without associated evidence of their relationship to other meaningful measures? Patients with hemiplegia typically lack motor control. Do the data, therefore, reflect lower limb motor control or muscle power during ambulation? Valid data can be inferred to reflect reality and not merely fallacious results that stem from muddled thinking or uncontrolled observation. Causal explanations in clinical research require careful linking of the rules by which numbers are assigned to events or situations that can be replicated in other clinics. When the words "caused by" or "because" are used in association with measurements, notice thereby is given that valid, inferential data are being reported. Anything less is fallacy.

SUMMARY

The ultimate purpose of measurement is to produce truthful, valid data. Validity assumes reliability, reliability assumes scalability, and scalability assumes detectability. The rules by which numbers are assigned to events determine the meaning and ultimate utility of interpretations based on the measurements. Fallacy grants no more clemency to measurements from a single case in a clinical setting than to dozens of cases in structured systematic research.

Observations cannot be analyzed statistically and interpreted unless they are translated first into arithmetic notations. Statistics, in turn, are simply a convenient method of summarizing and analyzing measurements. Therefore, if the measurements are flawed, analyses and interpretations based on these measurements are fundamentally and irreparably fallacious.

Acknowledgments. I thank Suzann K. Campbell, PhD, PT; Mary Day, MA, PT; Alan Jette, PhD, PT; and Jules Rothstein, PhD, PT, for their critical comments during the development of this article. Special gratitude is due Dean Currier, PhD, PT, for his seminal contributions during the formulation of the ideas presented.

REFERENCES

1. Stevens SS: Mathematics, measurement and psychophysics. In Stevens SS (ed): Handbook of Experimental Psychology. New York, NY, John Wiley & Sons Inc, 1951, p 22

2. Jones LV: The nature of measurement. In Thorndike RL (ed): Educational Measurement, ed 2. Washington, DC, American Council on Education, 1971, pp 335-355

3. Krebs D: The fallacy of qualitative performance tests. J Rehabil Res Dev [Clin Suppl] 22:106-108, 1985

4. Ottenbacher KJ, Biocca Z, DeCremer G, et al: Quantitative analysis of the effectiveness of pediatric therapy: Emphasis on the neurodevelopmental treatment approach. Phys Ther 66:1095-1101, 1986

5. Ottenbacher KJ, Petersen P: Meta-analysis of applied vestibular stimulation research. Physical and Occupational Therapy in Pediatrics 5(2/3):119-134, 1985

6. Yang JF, Winter DA: Electromyographic amplitude normalization methods: Improving their sensitivity as diagnostic tools in gait analysis. Arch Phys Med Rehabil 65:517-521, 1984

7. Stevens SS: On the theory of scales of measurement. Science 103:667-680, 1946

8. Gaito J: Measurement scales and statistics: Resurgence of an old misconception. Psychol Bull 87:564-567, 1980

9. Siegel S: Nonparametric Statistics for the Behavioral Sciences. New York, NY, McGraw-Hill Book Co, 1956

10. Krebs DE, Malgady RG: Understanding correlation coefficients and regression. Phys Ther 66:110-120, 1986

11. Stanley JC: Reliability. In Thorndike RL (ed): Educational Measurement, ed 2. Washington, DC, American Council on Education, 1971, p 356

12. Robinson AL: Values of fundamental constants adjusted. Science 235:633-634, 1987

13. Carmines EG, Zeller RA: Reliability and Validity Assessment. Beverly Hills, CA, Sage Publications Inc, 1979

14. Heise DR: Separating reliability and stability in test-retest correlations. Am Sociol Rev 34:93-101, 1969

15. Bridgman PW: Dimensional Analysis. New Haven, CT, Yale University Press, 1931

16. Coombs CH: Psychological scaling without a unit of measurement. Psychol Rev 57:145-158, 1950

17. Michels E: Measurement in physical therapy: On the rules for assigning numerals to observations. Phys Ther 63:209-215, 1983

18. Kyburg HE Jr: Theory and Measurement. Cambridge, England, Cambridge University Press, 1984, pp 113-142

19. Standards for Educational and Psychological Testing. Washington, DC, American Psychological Association, 1985, p 31

20. Ellis B: Measurement. In Edwards P (ed): Encyclopedia of Philosophy. New York, NY, Macmillan Publishing Co, 1967, vol 5, pp 241-250

21. Michels E: Evaluation and research in physical therapy. Phys Ther 62:828-834, 1982

22. Sackett DL, Gent M: Controversy in counting and attributing events in clinical trials. N Engl J Med 301:1410-1412, 1979

23. Hart DL, Miller LC, Stauber WT: Effects of cooling on force oscillations during maximal voluntary eccentric exercises. Exp Neurol 90:73-80, 1985

24. Rothstein JM: Measurement and clinical practice: Theory and application. In Rothstein JM (ed): Measurement in Physical Therapy: Clinics in Physical Therapy. New York, NY, Churchill Livingstone Inc, 1985, vol 7, pp 1-46

APPENDIX
Schematic Illustration of Paradigmatic Approach to Quantitative Measurement of Activities of Daily Living in Subjects with a Prosthetic Hand

General description: abstract thinking → concrete measurement

Formal description: theory
Thought process: What ideas and physiological evidence relate to prosthetic hand performance?
Behavior: Think about and examine prior literature on amputees' real-life ADL requirements.

Formal description: concept
Thought process: What variables should I measure?
Behavior: Choose a common ADL event; define "prosthetic prehension" and "tie a shoelace."

Formal description: construct
Thought process: How should I measure them?
Behavior: Only if the prosthesis is used for prehension, measure duration of tying a shoelace bow, starting with first touch of the shoelace and ending with a bow that keeps the shoe on the foot.

Formal description: operationalized variable
Thought process: What units of measurement should be used?
Behavior: Measure time in seconds.
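A sketch of the appendix's operationalized variable in code form, with invented trial values: the duration is recorded only when the stated precondition, prosthetic prehension, is met, and time is reported in seconds.

```python
# Hypothetical implementation of the appendix's operationalized ADL measure.
def shoelace_tying_duration_s(prosthesis_used_for_prehension: bool,
                              first_touch_time_s: float,
                              secure_bow_time_s: float):
    """Return duration in seconds, or None if the trial does not satisfy the rule."""
    if not prosthesis_used_for_prehension:
        return None  # rule: measure only if the prosthesis is used for prehension
    return secure_bow_time_s - first_touch_time_s

print(shoelace_tying_duration_s(True, first_touch_time_s=3.2, secure_bow_time_s=21.7))  # 18.5
```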

