
Page 1

RELIABILITY

BASED ON:

FUNDAMENTAL CONSIDERATIONS IN LANGUAGE TESTING BACHMAN (1990)

CHAPTER 6

Prepared by: Amirhamid Foroughameri, [email protected], November 2015

Page 2

INTRODUCTION

A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.

We must be concerned about errors of measurement, or unreliability, as we know that test performance is affected by factors other than the abilities we want to measure.

Such factors include poor health, fatigue, lack of interest or motivation, test-wiseness, and test method facets. Minimizing the effects of these factors → minimizing measurement error → maximizing reliability.

These factors may be either unsystematic (unpredictable) or systematic.

Page 3

INTRODUCTION

A necessary condition for validity: in order for a test score to be valid, it must be reliable.

Reliability and validity are not two distinct concepts; they are complementary aspects of a common concern in measurement.

Reliability answers the question: ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’ and is concerned with minimizing the effects of these factors on test scores.

Validity answers the question: ‘How much of an individual’s test performance is due to the language abilities we want to measure?’ and is concerned with maximizing the effects of these abilities on test scores.

Page 4

INTRODUCTION

The investigation of reliability involves both logical analysis and empirical research. We must identify sources of error and estimate the magnitude of their effects on test scores.

To identify sources of error, we need to distinguish the effects of the language abilities we want to measure from the effects of other factors, which is a complex problem due to:

1. The interaction between components of language ability and test method facets, which makes it difficult to mark a clear ‘boundary’ between the ability being measured and the method facets → e.g. a particular topic of a conversational interaction in an oral interview.
2. Other characteristics of test takers, such as sex, age, cognitive style, and native language.

Estimating the magnitude of the effects of different sources of error, once these sources have been identified, is a matter of empirical research, and is a major concern of measurement theory.

Page 5

FACTORS THAT AFFECT LANGUAGE TEST SCORES

Thorndike (1951) and Stanley (1971) begin their treatments of reliability with general frameworks for describing the factors that cause test scores to vary from individual to individual: general and specific lasting characteristics, general and specific temporary characteristics, and systematic and chance factors related to test administration and scoring.

Page 6

FACTORS THAT AFFECT LANGUAGE TEST SCORES

Factors that affect language test scores (Bachman 1990):

Note: In a ‘path diagram’, rectangles represent observed variables, ovals represent unobserved variables, and straight arrows represent causal relationships.

[Figure: path diagram in which the observed Test Score is influenced by communicative language ability, personal attributes, and test method facets (systematic factors), and by random factors (unsystematic).]

Page 7

FACTORS THAT AFFECT LANGUAGE TEST SCORES

Test method facets are systematic to the extent that they are uniform from one test administration to the next. That is, if the input format facet is multiple-choice, this will not vary, whether the test is given in the morning or afternoon.

Attributes of individuals include individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background. These are also systematic in the sense that they are likely to affect a given individual’s test performance regularly.

An individual’s test score will be affected to some degree by unsystematic, or random, factors: unpredictable and largely temporary conditions, such as his mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idiosyncratic differences in the way different test administrators carry out their responsibilities.

Page 8

FACTORS THAT AFFECT LANGUAGE TEST SCORES

Random factors and test method facets are generally considered to be sources of measurement error, and have thus been the primary concern of approaches to estimating reliability.

Personal attributes that are not considered part of the ability tested, such as sex, ethnic background, cognitive style and prior knowledge of content area, on the other hand, have traditionally been discussed as sources of test bias, or test invalidity.

Page 9

CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY

This theory consists of a set of assumptions about the relationships between actual, or observed test scores and the factors that affect these scores.

Assumption 1: an observed score on a test comprises two factors or components: a true score, which is due to an individual’s level of ability, and an error score, which is due to factors other than the ability being tested:

x = xt + xe

where x is the observed score, xt is the true score, and xe is the error score. Correspondingly, the variance of a set of test scores consists of two components:

s²x = s²t + s²e

where s²x is the observed score variance, s²t is the true score variance component, and s²e is the error score variance component.
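As an illustration of this decomposition, here is a minimal simulation sketch (hypothetical data; NumPy assumed available): when true scores and random errors are generated independently, the observed score variance comes out close to the sum of the two variance components.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of test takers

true = rng.normal(50, 8, n)    # xt: true scores, reflecting ability
error = rng.normal(0, 4, n)    # xe: random error, uncorrelated with xt
observed = true + error        # x = xt + xe

# Because cov(xt, xe) is near zero, s2x is approximately s2t + s2e:
print(observed.var(ddof=1))                  # ~ 80
print(true.var(ddof=1) + error.var(ddof=1))  # ~ 80 (= 64 + 16)
```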

Page 10

Assumption 2: the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.

CTS model’s definition of measurement error: that variation in a set of test scores that is unsystematic or random

In CTS → two sources of variance:
• True score variance, due to differences in ability
• Measurement error (unsystematic)

Page 11

PARALLEL TESTS

In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual’s true score on one test will be the same as his true score on the other.

Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.

Operational definition: parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.

We virtually never have strictly parallel tests, so we treat two tests as if they were parallel if the differences between their means and variances are not statistically significant. Such tests are called equivalent tests, or alternate forms.

Page 12

RELIABILITY AS THE CORRELATION BETWEEN PARALLEL TESTS

Since we never know what the true or error scores are, we cannot know the reliability of the observed scores. To be able to estimate the reliability of observed scores, then, we must define reliability operationally in a way that depends only on observed scores.

Thus, if the observed scores on two parallel tests are highly correlated, this indicates that the effects of the error scores are minimal, and that the scores can be considered reliable indicators of the ability being measured.

The definition of reliability (the basis for all estimates of reliability within CTS theory): the correlation between the observed scores on two parallel tests, which we can symbolize as rxx’.

Assumption: the observed scores on the two tests are experimentally independent. That is, an individual's performance on the second test should not depend on how she performs on the first.

Page 13

CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON PARALLEL TESTS

[Figure: path diagram of two parallel tests, x and x’, each determined by the same underlying ability through its true score (xt, x’t) plus an error score (xe, x’e); the correlation between the two observed scores is rxx’.]

Page 14

RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF OBSERVED SCORE VARIANCE

If an individual’s observed score on a test is composed of a true score and an error score, the greater the proportion of true score, the less the proportion of error score, and thus the more reliable the observed score.

Thus, one way of defining reliability is as the proportion of the observed score variance that is true score variance:

rxx’ = s²t / s²x

Note: reliability refers to the test scores, and not the test itself.
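The two definitions agree, as a short simulation can suggest (hypothetical data, NumPy assumed): two parallel forms that share true scores but have independent, equal-variance errors correlate at about s²t / s²x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

true = rng.normal(50, 8, n)           # shared true scores
x = true + rng.normal(0, 4, n)        # form 1 observed scores
x_prime = true + rng.normal(0, 4, n)  # form 2, equal error variance

# Reliability as the correlation between parallel tests ...
print(np.corrcoef(x, x_prime)[0, 1])      # ~ 0.80
# ... matches the proportion of observed variance that is true variance
print(true.var(ddof=1) / x.var(ddof=1))   # ~ 0.80
```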

Page 15

APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS

Internal consistency: concerned primarily with sources of error from within the test and scoring procedures,

Stability: how consistent test scores are over time,

Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.

• The estimates of reliability that these approaches yield are called reliability coefficients.

Page 16

INTERNAL CONSISTENCY

Internal consistency is concerned with how consistent test takers’ performances on the different parts of the test are with each other.

Inconsistencies in performance on different parts of tests can be caused by a number of factors, including the test method facets.

SPLIT-HALF RELIABILITY ESTIMATES

One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other.

In so doing, we are treating the halves as parallel tests, and so we must make certain assumptions about the equivalence of the two halves, specifically that they have equal means and variances. In addition, we must also assume that the two halves are independent of each other.

Page 17

INTERNAL CONSISTENCY

In some cases, where we are not sure that the items are measuring the same ability or that they are independent of each other, the test-retest and parallel forms methods are more appropriate for estimating reliability.

The Spearman-Brown split-half estimate

Once the test has been split into halves, it is rescored, yielding two scores - one for each half - for each test taker.

In one approach to estimating reliability, we then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the halves are; however, we are interested in the reliability of the whole test. In general, a long test will be more reliable than a short one, assuming that the additional items correlate positively with the other items in the test. The Spearman-Brown formula therefore adjusts the half-test correlation upward to estimate the reliability of the full-length test:

rxx’ = 2rhh’ / (1 + rhh’)

where rhh’ is the correlation between the two halves.
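A minimal sketch of the procedure (an odd/even split is assumed here as one common way of forming halves; the function name and data are hypothetical):

```python
import numpy as np

def spearman_brown_split_half(items: np.ndarray) -> float:
    """Split-half reliability with the Spearman-Brown correction.

    items: (n_persons, n_items) matrix of item scores.
    """
    odd = items[:, 0::2].sum(axis=1)     # score on one half
    even = items[:, 1::2].sum(axis=1)    # score on the other half
    r_hh = np.corrcoef(odd, even)[0, 1]  # half-test correlation
    return 2 * r_hh / (1 + r_hh)         # correct to full test length

# Hypothetical usage with simulated dichotomous items:
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, (500, 1))
items = (ability + rng.normal(0, 1, (500, 20)) > 0).astype(int)
print(spearman_brown_split_half(items))
```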

Page 18

INTERNAL CONSISTENCY

Two assumptions must be met in order to use this method: First, since we are in effect treating the two halves as parallel tests, we must assume that they have equal means and variances (an assumption we can check). Second, we must assume that the two halves are experimentally independent of each other (an assumption that is very difficult to check). That is, that an individual's performance on one half does not affect how he performs on the other.

Page 19

INTERNAL CONSISTENCY

The Guttman split-half estimate

Another approach to estimating reliability from split-halves is that developed by Guttman (1945), which does not assume equivalence of the halves, and which does not require computing a correlation between them. This split-half reliability coefficient is based on the ratio of the sum of the variances of the two halves to the variance of the whole test:

rxx’ = 2 (1 - (s²h1 + s²h2) / s²x)

where s²h1 and s²h2 are the variances of the two halves and s²x is the variance of the whole test.

Since this formula is based on the variance of the total test, it provides a directestimate of the reliability of the whole test. Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not require an additional correction for length.
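A sketch of the computation, under the same hypothetical odd/even split used above (function name illustrative):

```python
import numpy as np

def guttman_split_half(items: np.ndarray) -> float:
    """Guttman (1945) split-half estimate.

    Based on the variance of the whole test, so no separate
    correction for length is required."""
    half1 = items[:, 0::2].sum(axis=1)
    half2 = items[:, 1::2].sum(axis=1)
    total = items.sum(axis=1)
    return 2 * (1 - (half1.var(ddof=1) + half2.var(ddof=1))
                / total.var(ddof=1))
```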

Page 20

INTERNAL CONSISTENCY

Reliability estimates based on item variances

1. Kuder-Richardson reliability coefficients

There is a way of estimating the average of all the possible split-half coefficients on the basis of the statistical characteristics of the test items. This approach, developed by Kuder and Richardson (1937), involves computing the means and variances of the items that constitute the test.

The mean of a dichotomous item (one that is scored as either right or wrong) is the proportion, symbolized as p, of individuals who answer the item correctly. The proportion of individuals who answer the item incorrectly is equal to 1 - p, and is symbolized as q.

The variance of a dichotomous item is the product of these two proportions, or pq.

Page 21

INTERNAL CONSISTENCY

The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20), based on the ratio of the sum of the item variances to the total test score variance, is as follows:

rxx’ = (k / (k - 1)) (1 - Σpq / s²x)

where k is the number of items. Assumption: the items are of nearly equal difficulty and independent of each other. If the items are of equal difficulty, the reliability coefficient can be computed by using Kuder-Richardson formula 21 (KR-21):

rxx’ = (k / (k - 1)) (1 - x̄(k - x̄) / (k s²x))

where x̄ is the mean of the total test scores.

This formula will generally yield a reliability coefficient that is lower than that given by KR-20. The Kuder-Richardson formulae are based on total score variance, and thus they do not require any correction for length.
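Both formulae are straightforward to compute; the following is a minimal sketch for dichotomous (0/1) item matrices (function names and data hypothetical):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20: k/(k-1) * (1 - sum(pq) / s2x) for 0/1 items."""
    k = items.shape[1]
    p = items.mean(axis=0)  # proportion answering each item correctly
    q = 1 - p               # proportion answering incorrectly
    s2x = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - (p * q).sum() / s2x)

def kr21(items: np.ndarray) -> float:
    """KR-21: uses only the mean and variance of total scores."""
    k = items.shape[1]
    total = items.sum(axis=1)
    m, s2x = total.mean(), total.var(ddof=1)
    return k / (k - 1) * (1 - m * (k - m) / (k * s2x))
```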

Page 22

INTERNAL CONSISTENCY

2. Coefficient alpha

Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate reliability on the basis of ratios of the variances of test components - halves and items - to total test score variance.

Cronbach (1951) developed a general formula for estimating internal consistency which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s alpha’:

α = (k / (k - 1)) (1 - Σs²i / s²x)

where k is the number of components, s²i is the variance of component i, and s²x is the total test score variance.

It can thus be shown that all estimates of reliability based on the analysis of variance components can be derived from this formula and are special cases of coefficient alpha.
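A sketch of the general formula; the components may be items, halves, or ratings (function name hypothetical). For dichotomous items this reduces to KR-20, consistent with the point that the other variance-component estimates are special cases of alpha.

```python
import numpy as np

def cronbach_alpha(parts: np.ndarray) -> float:
    """Coefficient alpha over an (n_persons, n_parts) score matrix."""
    k = parts.shape[1]
    part_vars = parts.var(axis=0, ddof=1).sum()  # sum of component variances
    total_var = parts.sum(axis=1).var(ddof=1)    # total score variance
    return k / (k - 1) * (1 - part_vars / total_var)
```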

Page 23

INTERNAL CONSISTENCY

In summary, the reliability of a set of test scores can be estimated on the basis of a single test administration only if certain assumptions about the characteristics of the parts of the test are satisfied.

[Table: assumptions for internal consistency reliability estimates, and effects of violating assumptions]

Page 24

RATER CONSISTENCY

In test scores that are obtained subjectively, such as ratings of compositions or oral interviews, a source of error is inconsistency in these ratings.

In the case of a single rater, we need to be concerned about the consistency within that individual’s ratings, or with intra-rater reliability.

When there are several different raters, we want to examine the consistency across raters, or inter-rater reliability.

In both cases, the primary causes of inconsistency will be either the application of different rating criteria to different samples or the inconsistent application of the rating criteria to different samples.

Page 25

RATER CONSISTENCY

Intra-rater reliability

Factors introducing inconsistency:
• Sequence of paying attention to different kinds of errors
• Sequence of scoring from the first to the last person

We need to obtain at least two independent ratings from a rater for each individual language sample.

This is typically accomplished by rating the individual samples once and then re-rating them at a later time in a different, random order.

Once the two sets of ratings have been obtained, the reliability between them can be estimated in two ways.

Page 26

RATER CONSISTENCY

One way is to treat the two sets of ratings as scores from parallel tests and compute the appropriate correlation coefficient (commonly the Spearman rank-order coefficient) between the two sets of ratings, interpreting this as an estimate of reliability.

Another approach to examining the consistency of multiple ratings is to compute a coefficient alpha, treating the independent ratings as different parts: with two ratings, α = 2 (1 - (s²r1 + s²r2) / s²x), where s²r1 and s²r2 are the variances of the two sets of ratings and s²x is the variance of their sum.
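Both estimates in one short sketch (hypothetical ratings; SciPy's spearmanr is assumed available for the rank-order correlation):

```python
import numpy as np
from scipy.stats import spearmanr

rating1 = np.array([4, 3, 5, 2, 4, 3, 5, 1])  # first rating of each sample
rating2 = np.array([4, 2, 5, 3, 4, 3, 4, 2])  # later re-rating, random order undone

# Approach 1: Spearman rank-order correlation between the two ratings
rho, _ = spearmanr(rating1, rating2)

# Approach 2: coefficient alpha treating the two ratings as parts (k = 2)
parts = np.column_stack([rating1, rating2])
k = parts.shape[1]
alpha = k / (k - 1) * (1 - parts.var(axis=0, ddof=1).sum()
                       / parts.sum(axis=1).var(ddof=1))
print(rho, alpha)
```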

Page 27

RATER CONSISTENCY

Inter-rater reliability

Factors introducing inconsistency:
• Different criteria for rating
• Different interpretations of the same rating criteria

The reliability can be estimated in two ways: We can compute the correlation between two different raters and interpret this as an estimate of reliability.

When more than two raters are involved, however, rather than computing correlations for all different pairs, a preferable approach is that recommended by Ebel (1979), in which we sum the ratings of the different raters and then estimate the reliability of these summed ratings by computing a coefficient alpha.
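A sketch of this summed-ratings approach, treating each rater as one part in the alpha computation (data and function name hypothetical):

```python
import numpy as np

def summed_rating_reliability(ratings: np.ndarray) -> float:
    """Alpha over an (n_examinees, n_raters) matrix of ratings."""
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1).sum()
    summed_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - rater_vars / summed_var)

ratings = np.array([[4, 5, 4],
                    [2, 3, 2],
                    [5, 5, 4],
                    [3, 3, 3],
                    [1, 2, 2]])  # three raters scoring five examinees
print(summed_rating_reliability(ratings))
```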

Page 28

STABILITY (TEST-RETEST RELIABILITY)

This approach can be used in three cases:

For tests such as cloze and dictation we cannot appropriately estimate the internal consistency of the scores because of the interdependence of the parts of the test.

There are also testing situations in which it may be necessary to administer a test more than once as part of a time-series design.

This might also be the concern of a language program evaluator who is interested in relating changes in language ability to teaching and learning activities in the program.

This approach to reliability provides an estimate of the stability of the test scores over time.

Page 29

STABILITY (TEST-RETEST RELIABILITY)

In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.

Two sources of inconsistency - differential practice effects and differential changes in ability - pose a dilemma for the test-retest approach.

That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.

Practice effects may occur if certain individuals remember some of the items or feel more comfortable with the test method, and consequently perform better on the second administration of the test.

If, on the other hand, there is a considerable time lapse between test administrations, some individuals’ language ability may actually improve or decline more than that of others, causing them to perform differently the second time.

Page 30

STABILITY (TEST-RETEST RELIABILITY)

Possible solution: There is no single length of time between test administrations that is best for all situations. In each situation, the test developer or user must attempt to determine the extent to which practice and learning are likely to influence test performance, and choose the length of time between test and retest so as to optimize reduction in the effects of both.

Page 31

EQUIVALENCE (PARALLEL FORMS RELIABILITY)

Like the test-retest approach, this is an appropriate means of estimating the reliability of tests for which internal consistency estimates are either inappropriate or not possible.

It is of particular interest in testing situations where alternate forms of the test may be actually used, either for security reasons, or to minimize the practice effect.

Assumption: that the different forms of the test are equivalent, particularly that they are at the same difficulty level and have similar standard deviations.

Page 32

EQUIVALENCE (PARALLEL FORMS RELIABILITY)

To estimate the reliability of alternate forms of a given test, the procedure used is to administer both forms to a group of individuals. One way to minimize the possibility of an ordering effect is to use a ‘counterbalanced’ design, in which half of the individuals take one form first and the other half take the other form first.

The means and standard deviations for each of the two forms can then be computed and compared to determine their equivalence, after which the correlation between the two sets of scores (Form A and Form B) can be computed. This correlation is then interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.
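The two-step procedure - check equivalence of means and standard deviations, then correlate - might look like this in outline (hypothetical simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(60, 10, 200)
form_a = ability + rng.normal(0, 5, 200)  # scores on Form A
form_b = ability + rng.normal(0, 5, 200)  # scores on Form B

# Step 1: compare means and standard deviations for equivalence
print(form_a.mean(), form_b.mean())            # should be close
print(form_a.std(ddof=1), form_b.std(ddof=1))  # should be close

# Step 2: correlate the two sets of scores and interpret the result as
# the equivalence of the forms, or the reliability of either one
print(np.corrcoef(form_a, form_b)[0, 1])
```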

Page 33

Note:

o As a matter of procedure, we generally attempt to estimate the internal consistency of a test first, since if a test is not reliable in this respect, it is not likely to be equivalent to other forms or stable over time.

o This is because measurement error is random and therefore not correlated with anything.

o Thus, the greater the proportion of measurement error in the scores from a given test, the lower the correlation of those scores with other scores will be.

Page 34

PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL

1. Different sources of error may interact with each other, even when we carefully design our reliability study. One problem with the CTS model, then, is that it treats error variance as homogeneous in origin. Each of the estimates of reliability addresses one specific source of error, and treats other potential sources either as part of that source, or as true score.

In the classical model, therefore, different sources of error may be confused, or confounded, with each other and with true score variance, since it is not possible to examine more than one source of error at a time, even though performance on any given test may be affected by several different sources of error simultaneously.

Page 35

PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL

2. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error. Factors other than the ability being measured that regularly affect the performance of some individuals and not others can be regarded as sources of systematic error, or test bias.

Sources of systematic error:
• Test method
• Cultural content
• Psychological task set
• Guessing, or what Carroll (1961b) called ‘topastic error’