Principles of Language Assessment Brown(2004): Chapter 2

Upload: geoffrey-horn

Post on 28-Dec-2015



Principles of Language Assessment

Brown(2004): Chapter 2


How do you know if a test is effective? Five cardinal criteria for “testing a test” (p.19):

1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback


Practicality

An effective test is practical. This means that it:

1. Is not excessively expensive
2. Stays within appropriate time constraints
3. Is relatively easy to administer
4. Has a scoring/evaluation procedure that is specific and time-efficient (p.19)


Reliability

A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. (Brown, 2004, p.20)

…a number of factors that may contribute to the unreliability of a test…fluctuation in the student, in scoring, in test administration, and in the test itself. (pp. 20-21)
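As an illustration of this consistency idea (not part of Brown's text), test-retest reliability is often estimated as the correlation between scores from two administrations of the same test. The sketch below uses invented scores and a hand-rolled Pearson coefficient:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical scores for five students on two sittings of the same test.
first_sitting = [72, 85, 90, 64, 78]
second_sitting = [70, 88, 91, 66, 75]

# A coefficient near 1.0 suggests the test ranks students consistently
# across occasions; a low coefficient signals unreliability.
r = pearson(first_sitting, second_sitting)
```

Here the two sittings agree closely, so the coefficient is high; the fluctuation factors listed above (illness, scoring drift, administration conditions) would show up as a lower correlation.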


Student-related reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a “bad” day, anxiety, and other physical or psychological factors…test-takers’ “test-wiseness” or strategies for efficient test taking (Mousavi, 2002, p. 804)…(Brown, 2004, p.21)


Rater reliability

Inter-rater unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly for lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases.

Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular “good” and “bad” students, or simply carelessness. (p.21)
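One simple (and admittedly crude) way to quantify inter-rater consistency, not given in Brown, is the proportion of items on which two raters assign identical rubric scores. The ratings below are invented for the sketch:

```python
def exact_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave identical scores."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical 1-5 rubric scores given by two raters to six essays.
rater_a = [4, 3, 5, 2, 4, 3]
rater_b = [4, 3, 4, 2, 4, 2]

agreement = exact_agreement(rater_a, rater_b)  # raters match on 4 of 6 essays
```

Low agreement would point back to the causes listed above: vague scoring criteria, inexperience, or bias. (More refined indices, such as Cohen's kappa, also correct for chance agreement.)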


Test Administration reliability

Unreliability may also result from the conditions in which the test is administered.

E.g. tape recorder, photocopying variations, the amount of light, temperature, the conditions of the chairs and desks…(p.21)


Test reliability

The nature of the test itself can cause measurement errors.

E.g. test is too long; timed test; poorly written test items…


Validity

the degree to which the test actually measures what it is intended to measure. A valid test of reading ability is one that actually measures reading ability… (Brown, 1994, p.254)

Different types of Validity: Content validity; Criterion-related Validity; Construct Validity; Consequential Validity; Face Validity


Content validity

If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as Content Validity. (p.22)

You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring.

And if a course has perhaps ten objectives but only two are covered in a test, then content validity suffers.


Cont.

It is possible to contend, for example, that standard language proficiency tests, with their context-reduced, academically oriented language and limited stretches of discourse, lack content validity since they do not require the full spectrum of communicative performance on the part of the learner…nevertheless, what such proficiency tests lack in content-related evidence they may gain in other forms of evidence, not to mention practicality and reliability.


Direct and Indirect Content Validity

Direct Testing: involves the test-taker in actually performing the target task.

Indirect Testing: learners are not performing the task itself but rather a task that is related in some way.


Criterion-related Validity

The extent to which the “criterion” of the test has actually been reached…in such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached. (p.24)


Concurrent vs. Predictive Criterion Validity

Concurrent validity: A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.

Predictive validity: The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, language aptitude tests, etc. The assessment criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker’s likelihood of future success.
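As a rough illustration (not from Brown, with all numbers invented), the predictive validity of a placement cutoff can be checked after the fact by asking how often the cutoff's prediction matched each student's later pass/fail outcome:

```python
def predictive_hit_rate(scores, passed, cutoff):
    """Fraction of test-takers whose later outcome (pass/fail) matched
    the prediction implied by the placement cutoff."""
    hits = sum((score >= cutoff) == outcome
               for score, outcome in zip(scores, passed))
    return hits / len(scores)

# Hypothetical placement scores and whether each student later passed
# the course they were placed into.
placement_scores = [55, 78, 62, 90, 48, 70]
passed_course = [False, True, False, True, False, True]

# The cutoff of 60 correctly predicts 5 of the 6 outcomes here.
hit_rate = predictive_hit_rate(placement_scores, passed_course, cutoff=60)
```

A more careful validation study would correlate scores with a continuous criterion measure, but even this simple hit rate makes the idea of "predicting future success" concrete.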


Construct Validity

A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions…”Proficiency” and “communicative competence” are linguistic constructs; “Self-esteem” and “motivation” are psychological constructs…

Does this test actually tap into the theoretical construct as it has been defined?


Consequential Validity

Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.

McNamara (2000, p.54) cautions against test results that may reflect socioeconomic conditions such as opportunities for coaching that are “differentially available to the students being assessed.”


Face Validity

The extent to which “students view the assessment as fair, relevant, and useful for improving learning.”

Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, etc…


Cont.

Face Validity:

1. A well-constructed, expected format with familiar tasks
2. A test that is clearly doable within the allotted time limit
3. Items that are clear and uncomplicated
4. Directions that are crystal clear
5. Tasks that relate to their course work
6. A difficulty level that presents a reasonable challenge


Authenticity

Bachman and Palmer (1996) define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task” and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.


Cont.

Authenticity:

1. The language in the test is as natural as possible
2. Items are contextualized rather than isolated
3. Topics are meaningful
4. Some thematic organization to items is provided, such as through a story line or episode
5. Tasks represent, or closely approximate, real-world tasks


Washback

The effect of testing on teaching and learning (Hughes, 2003, p.1). In large-scale assessment, washback generally refers to the effect the tests have on instruction in terms of how students prepare for the test. E.g. “cram” courses and “teaching to the test.”

In classroom assessment, washback is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses.

Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback.

Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score.


Cont.

Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment…

Note: An interlanguage is an emerging linguistic system that has been developed by a learner of a second language (or L2) who has not become fully proficient yet but is only approximating the target language: preserving some features of their first language (or L1) in speaking or writing the target language and creating innovations. An interlanguage is idiosyncratically based on the learners’ experiences with the L2. It can fossilize in any of its developmental stages. The interlanguage consists of: L1 transfer, transfer of training, strategies of L2 learning (e.g. simplification), strategies of L2 communication (e.g. not thinking about grammar while talking), and overgeneralization of the target language patterns. (http://en.wikipedia.org/wiki/Interlanguage)


Cont.

Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals. But teachers might be tempted to feel that summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback. Such an attitude is unfortunate because the end of every language course or program is always the beginning of further pursuits, more learning, more goals, and more challenges to face. Even a final exam in a course should carry with it some means for giving washback to students. (p.30)


Cont.

Washback also implies that students have ready access to you to discuss the feedback and evaluation you have given…for learning to continue, students need to have a chance to feed back on your feedback, to seek clarification of any issues that are fuzzy, and to set new and appropriate goals for themselves for the days and weeks ahead. (p.30)


Applying Principles to the Evaluation of Classroom Tests

1. Are the test procedures practical?

2. Is the test reliable?

3. Does the procedure demonstrate content validity?

4. Is the procedure face valid and “biased for the best”?

5. Are the test tasks as authentic as possible?

6. Does the test offer beneficial washback to the learner?


Are the test procedures practical?

Practicality checklist (p.31). If you need to tailor a test to fit your own time frame, as teachers often do, you need to accomplish this without damaging the test’s validity and washback…yet good teaching almost always implies an investment of the teacher’s time in giving feedback (comments and suggestions) to students on their tests.


Is the test reliable?

Physical context (p.31)

Intra-rater reliability (p.32)


Does the procedure demonstrate content validity?

1. Are the classroom objectives identified and appropriately framed? The first measure of an effective classroom test is the identification of objectives. Example on pp. 32-33.

2. Are lesson objectives represented in the form of test specifications?

Test specifications…simply means that a test should have a structure that follows logically from the lesson or unit you are testing. (Guidelines See p.33)

The content validity of an existing classroom test should be apparent in the form of the content of items, clusters of items, and item types.


Is the procedure face valid and “biased for the best”?

Guidelines for Face Validity, see p.33.

Biased for Best: a term that goes a little beyond how the student views the test to a degree of strategic involvement on the part of student and teacher in preparing for, setting up, and following up on the test itself. (Guidelines, see p.34)

Test-taking Strategies: Before the test, during the test, and after the test (see p.34)


Are the test tasks as authentic as possible?

Evaluate the extent to which a test is authentic by asking the following questions: (see p.35)

Two examples on pp. 35-36.


Does the test offer beneficial washback to the learner?

Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort…self-assessment may sometimes be an appropriate way to challenge students to discover their own mistakes…peer discussion of the test results may also be an alternative to simply listening to the teacher tell everyone what they got right and wrong and why. Journal writing may offer students a specific place to record their feelings, what they learned, and their resolutions for future effort. (p.37)