Chapter 6: Selecting Measurement Instruments
Educational Research: Competencies for Analysis and Application, 9th edition. Gay, Mills, & Airasian.
© 2009 Pearson Education, Inc. All rights reserved.
Chapter 6: Selecting Measurement Instruments
Objectives
State the relation between a variable and a construct, and distinguish among categories of variables (e.g., categorical and quantitative; dependent and independent) and the scales to measure them (e.g., nominal, ordinal, interval, and ratio).
Define measurement, and describe ways to interpret measurement data.
Selecting Measurement Instruments
Objectives
Describe the types of measuring instruments used to collect data in qualitative and quantitative studies (e.g., cognitive, affective, and projective tests).
Define validity, and differentiate among content, criterion-related, construct, and consequential validity.
Selecting Measurement Instruments
Objectives
Explain how to measure reliability, and differentiate among stability, equivalence, equivalence and stability, internal consistency, and scorer/rater reliability.
Identify useful sources of information about specific tests, and provide strategies for test selection.
Provide guidelines for test construction and test administration.
Data & Constructs
Data are the pieces of information you collect and use to examine your topic.
You must determine what type of data to collect.
A construct is an abstraction that cannot be observed directly but is invented to explain behavior (e.g., intelligence, motivation, ability).
Constructs & Variables
Constructs must be operationally defined to be observable and measurable.
Variables are operationally defined constructs.
Variables are placeholders that can assume any one of a range of values.
Variables may be measured by instruments.
Measurement Scales
The measurement scale is a system for organizing data.
Knowing your measurement scale is necessary to determine the type of analysis you will conduct.
Measurement Scales
Nominal variables describe categorical data (e.g., gender, political party affiliation, school attended, marital status). Nominal variables are qualitative.
Quantitative variables fall on a continuum and include ordinal, interval, and ratio variables.
Measurement Scales
Ordinal variables describe rank order with unequal units (e.g., order of finish, ranking of schools or groups as levels).
Interval variables have equal intervals between values but no true zero point (e.g., achievement, attitude, and test scores).
Measurement Scales
Ratio variables have all of the characteristics of the other levels and also include a true zero point (e.g., total number of correct items on a test, time, distance, weight).
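The practical consequence of the four scales is which summary statistics are meaningful. A minimal Python sketch with hypothetical data (the variable names and values below are illustrative only):

```python
from statistics import mean, median, mode

# Hypothetical samples, one per measurement scale
party = ["Dem", "Rep", "Dem", "Ind", "Dem"]   # nominal: unordered categories
finish_order = [1, 2, 2, 3, 4]                 # ordinal: ranks with unequal units
attitude_scores = [34, 40, 38, 45, 43]         # interval: equal units, no true zero
items_correct = [18, 20, 15, 19, 17]           # ratio: true zero point exists

# Nominal data support only the mode; ordinal data add the median;
# interval and ratio data support the mean as well.
print(mode(party))            # most frequent category
print(median(finish_order))   # middle rank
print(mean(attitude_scores))  # arithmetic mean
print(mean(items_correct))    # mean; ratio statements ("twice as many") are valid
```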
Independent & Dependent Variables
Dependent variables are those believed to depend on or to be caused by another variable. Dependent variables are also called criterion variables.
Independent variables are the hypothesized cause of the dependent variable. An independent variable must have at least two levels. Independent variables are also called experimental variables, manipulated variables, or treatment variables.
Characteristics of Instruments
There are three major ways for researchers to collect data.
A researcher can administer a standardized test (e.g., an achievement test).
A researcher can administer a self-developed instrument (e.g., a survey you might develop).
A researcher can record naturally occurring events or use already available data (e.g., recording the off-task behavior of a student in a classroom).
Instruments
Using standardized instruments takes less time than developing an instrument.
With standardized instruments, results from different studies that use the same instrument can be compared.
At times researchers may need to develop their own instruments. Designing an effective instrument requires expertise and time.
Instruments
A test is a formal, systematic procedure for gathering information about people. Tests may measure:
Cognitive characteristics (e.g., thinking, ability)
Affective characteristics (e.g., feelings, attitudes)
Instruments
A standardized test is administered, scored, and interpreted the same way across administrations (e.g., the ACT, SAT, or Stanford Achievement Test).
Instruments
Assessment refers to the process of collecting, synthesizing, and interpreting information, including data from tests as well as from observations. Assessment may be formal or informal, and its data numerical or textual.
Measurement is the process of quantifying or scoring assessment information. Measurement occurs after data collection.
Instruments
Qualitative researchers often use interviews and observations.
Quantitative researchers often use paper-and-pencil (or electronic) methods.
Selection methods: The respondent selects from possible answers (e.g., multiple-choice items).
Supply methods: The respondent has to provide an answer (e.g., essay items).
Instruments
Performance assessments emphasize student process and require creation of a product (e.g., completing a project).
Interpreting Instrument Data
Raw score: The number or point value of items correct (e.g., 18 of 20 items correct).
Norm-referenced scoring: The student's performance is compared with the performance of others (e.g., grading on a curve).
Interpreting Instrument Data
Criterion-referenced scoring: The student's performance is compared to a preset standard (e.g., class tests).
Self-referenced scoring: How an individual student's scores change over time is measured (e.g., speeded math facts tests).
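The three scoring interpretations can be sketched in a few lines of Python; the student names, scores, and the cutoff below are hypothetical:

```python
# Hypothetical raw scores on one 20-item test
scores = {"Ana": 18, "Ben": 12, "Cal": 15, "Dee": 9}

def percentile_rank(name):
    """Norm-referenced: percent of the other students this student outscored."""
    others = [s for n, s in scores.items() if n != name]
    return 100 * sum(1 for s in others if s < scores[name]) / len(others)

def meets_criterion(name, cutoff=14):
    """Criterion-referenced: compare the score to a preset standard."""
    return scores[name] >= cutoff

def self_referenced_gain(earlier, later):
    """Self-referenced: change in the same student's score over time."""
    return later - earlier

print(percentile_rank("Ana"))       # the highest scorer outscored everyone
print(meets_criterion("Ben"))       # 12 is below the hypothetical cutoff of 14
print(self_referenced_gain(9, 15))  # an improvement of 6 points
```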
Types of Instruments
Cognitive tests measure intellectual processes (e.g., thinking, memorizing, calculating, analyzing).
Standardized tests measure an individual's current proficiency in given areas of knowledge or skill.
Standardized tests are often given as a test battery (e.g., the Iowa Test of Basic Skills, CTBS).
Types of Instruments
Diagnostic tests provide scores to facilitate identification of strengths and weaknesses (e.g., tests given for diagnosing reading disabilities).
Aptitude tests measure potential and are used for prediction, rather than measuring what has already been learned (e.g., the Wechsler Scales).
Affective Instruments
Affective tests measure affective characteristics (e.g., attitude, emotion, interest, personality).
Attitude scales measure what a person believes or feels.
Likert scales measure agreement on a scale.
Strongly agree, Agree, Undecided, Disagree, Strongly disagree
Affective Instruments
Semantic differential scales require the individual to indicate an attitude by marking a position on a scale anchored by bipolar adjectives:
Fair 3 2 1 0 -1 -2 -3 Unfair
Rating scales may require a participant to check the most appropriate description (e.g., 5 = always; 4 = almost always; 3 = sometimes…).
The Thurstone and Guttman scales are also used to measure attitudes.
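Likert-type items are usually scored by summing the coded responses. The sketch below assumes a 5-point coding and reverse-codes a negatively worded item, which is a common scoring convention rather than something the slides prescribe:

```python
# One hypothetical participant's answers to four items, coded
# 1 = strongly disagree .. 5 = strongly agree
responses = [4, 5, 2, 4]
negatively_worded = {2}   # zero-based index of items to reverse-code

def likert_total(answers, reversed_items, points=5):
    """Sum item codes, flipping negatively worded items (1<->5, 2<->4)."""
    total = 0
    for i, answer in enumerate(answers):
        total += (points + 1 - answer) if i in reversed_items else answer
    return total

print(likert_total(responses, negatively_worded))  # 4 + 5 + (6 - 2) + 4 = 17
```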
Additional Inventories
Interest inventories assess personal likes and dislikes (e.g., occupational interest inventories).
Values tests assess the relative strength of a person’s values (e.g., Study of Values instrument).
Additional Inventories
Personality inventories present participants with statements describing behaviors characteristic of given personality traits, and the participant responds to each statement (e.g., the MMPI).
Projective tests were developed to eliminate some of the concerns with self-report measures. These tests are ambiguous so that presumably the respondent will project true feelings (e.g., Rorschach).
Criteria for Good Instruments
Validity refers to the degree to which a test measures what it is supposed to measure.
Validity is the most important test characteristic.
Criteria for Good Instruments
There are several established forms of validity:
Content validity
Criterion-related validity: concurrent validity and predictive validity
Construct validity
Consequential validity
Content Validity
Content validity addresses whether the test measures the intended content area.
Content validity is an initial screening type of validity.
Content validity is sometimes referred to as face validity.
Content validity is determined by expert judgment (content validation).
Content Validity
Content validity is concerned with both:
Item validity: Do the test items measure the intended content?
Sampling validity: Do the items sample the total content area being tested?
One example of a lack of content validity is a math test with heavy reading requirements: it may measure not only math but also reading ability, and is therefore not a valid math test.
Criterion-Related Validity
Criterion-related validity is determined by relating performance on a test to performance on an alternative test or other criterion measure.
Correlation coefficients are used to quantify this relationship.
Criterion-Related Validity
There are two types of criterion-related validity:
Concurrent: Scores on a test are correlated with scores on an alternative test given at the same time (e.g., two measures of reading achievement).
Predictive: The degree to which a test predicts how well a person will do in a future situation (e.g., the GRE, with the predictor represented by the GRE score and the criterion represented by success in graduate school).
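Both forms of criterion-related validity come down to a correlation between two lists of scores. A minimal sketch of the Pearson validity coefficient, using hypothetical predictor (admissions-test) and criterion (later GPA) scores:

```python
import math

# Hypothetical scores for six people
predictor = [150, 160, 145, 170, 155, 165]   # e.g., admissions-test scores
criterion = [2.8, 3.2, 2.5, 3.8, 3.0, 3.5]   # e.g., later graduate GPA

def pearson_r(x, y):
    """Pearson product-moment correlation, the usual validity coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(predictor, criterion)
print(round(r, 3))  # a high validity coefficient for this made-up data
```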
Construct Validity
Construct validity is the most important form of validity.
Construct validity assesses what the test is actually measuring.
It is very challenging to establish construct validity.
Construct Validity
Construct validity requires both confirmatory and disconfirmatory evidence.
Scores on a test should relate to scores on similar tests and should NOT relate to scores on tests of unrelated constructs.
For example, scores on a math test should be more highly correlated with scores on another math test than with scores on a reading test.
Consequential Validity
Consequential validity refers to the extent to which an instrument creates harmful effects for the user.
Some tests may harm the test taker. For example, a measure of anxiety may make a person more anxious.
Validity
Some factors that threaten validity include:
Unclear directions
Confusing or unclear items
Vocabulary or required reading ability too difficult for test takers
Subjective scoring
Cheating
Errors in administration
Self-Report Instruments
There are some concerns with data derived from self-report instruments.
One concern is response set, the tendency for a participant to respond in a certain way (e.g., social desirability).
Bias may also play a role in self-report instruments (e.g., cultural norms).
Reliability
Reliability refers to the consistency with which an instrument measures a construct.
Reliability is expressed as a reliability coefficient based upon a correlation.
Reliability coefficients should be reported for all measures.
Reliability affects validity. There are several forms of reliability.
Reliability
Test-retest (stability) reliability measures the stability of scores over time.
To assess test-retest reliability, a test is given to the same group twice and the two sets of scores are correlated.
The correlation is referred to as the coefficient of stability.
Reliability
Alternate-forms (equivalence) reliability measures the relationship between two versions of a test that are intended to be equivalent.
To assess alternate-forms reliability, both tests are given to the same group and the scores on the two forms are correlated.
The correlation is referred to as the coefficient of equivalence.
Reliability
Equivalence-and-stability reliability is represented by the relationship between equivalent versions of a test given at two different times.
To assess it, one form is given first, an equivalent form is given after a time interval, and the two sets of scores are correlated.
The correlation is referred to as the coefficient of stability and equivalence.
Reliability
Internal consistency reliability represents the extent to which the items in a test are consistent with one another.
Split-half: The test is divided into halves and a correlation is taken between the scores on the two halves.
Coefficient alpha and the Kuder-Richardson formulas measure the relationships among all items and the total score of a test.
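Both internal-consistency indices can be computed directly from an item-score matrix. A sketch with hypothetical 0/1 item scores; the Spearman-Brown step-up shown for the split-half approach is the conventional correction, not something the slides spell out:

```python
# Hypothetical scores: 5 examinees x 4 items, each scored 0 or 1
items = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(matrix[0])
    item_vars = [variance([row[i] for row in matrix]) for i in range(k)]
    total_var = variance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def spearman_brown(half_r):
    """Steps a split-half correlation up to the full test length."""
    return 2 * half_r / (1 + half_r)

print(round(cronbach_alpha(items), 3))
print(round(spearman_brown(0.6), 3))
```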
Reliability
Scorer and rater reliabilities reflect the extent to which independent scorers, or a single scorer over time, agree on a score.
Interjudge (inter-rater) reliability: Consistency between two or more independent scorers.
Intrajudge (intra-rater) reliability: Consistency of one scorer over time.
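The simplest inter-rater index is percent agreement: the proportion of observations on which two scorers made the same judgment. A sketch with hypothetical on-task/off-task codings (chance-corrected indices such as Cohen's kappa are often preferred in practice):

```python
# Two raters' codings of the same ten classroom observations (hypothetical)
rater_a = ["on", "off", "on", "on", "off", "on", "off", "on", "on", "off"]
rater_b = ["on", "off", "on", "off", "off", "on", "off", "on", "on", "on"]

def percent_agreement(a, b):
    """Proportion of observations the two raters coded identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(percent_agreement(rater_a, rater_b))  # 8 of 10 codings agree -> 0.8
```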
Reliability
The standard error of measurement (SEM) is an estimate of how often one can expect errors of a given size in an individual's test score.
SEM = SD × √(1 − r)
SEM = standard error of measurement
SD = standard deviation of the test scores
r = the reliability coefficient
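The formula above translates directly into code. A sketch with a hypothetical standard deviation of 10 and reliability of .91:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 10, reliability r = .91
sem = standard_error_of_measurement(10, 0.91)
print(round(sem, 2))  # about 3 points; an observed score of 50 suggests
                      # a true score roughly in the band 47-53
```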
Selecting a Test
Once you have defined the purpose of your study:
1. Determine the type of test that you need.
2. Identify and locate appropriate tests.
3. Determine which test to use after a comparative analysis.
Selecting a Test
There are several sources of information and reviews about available tests. These are a good place to start when selecting a test:
MMY: The Mental Measurements Yearbook is the most comprehensive source of test information.
Pro-Ed publications
ETS Test Collection database
Professional journals
Test publishers and distributors
Selecting a Test
When comparing the tests you have located and deciding which to use, attend to each of the following:
First, examine validity.
Next, consider reliability.
Consider ease of test use.
Ensure participants have not previously been exposed to the test.
Ensure sensitive information is not unnecessarily included.