INVESTIGATING THE RELIABILITY AND VALIDITY OF HIGH-STAKES ESL TESTS

Dr. Paula Winke, Michigan State University


March 22, 2013, TESOL Dallas

Background

Tests must be valid.

Validity = evidence that a test actually measures what it is intended to measure.

I can fly planes!

Background

Validity can be measured:

Quantitatively: psychometric measures of reliability

Qualitatively: social, ethical, practical, and consequential considerations (e.g., Messick, 1980, 1989, 1994; Moss, 1998; McNamara & Roever, 2006; Ryan, 2002; Shohamy, 2001, 2006)

Narrow

Tests should be reliable and consistent.

Broad

Tests should be fair, meaningful, cost efficient, developmentally appropriate, and do no harm.


K-12, NCLB

ELPA in Michigan

Who determines reliability & validity?

• Test developers
• Researchers
• Test stakeholders

Validity is a multifaceted concept that includes value judgments and social consequences. (Messick, 1980; 1989; 1994; 1995; McNamara & Roever, 2006)

“Justifying the validity of test use is the responsibility of all test users.” (Chapelle, 1999, p. 258)

Study 1: The ELPA in Michigan

The purpose of this study was to evaluate the perceived effectiveness of the English Language Proficiency Assessment (ELPA), which is used in the state of Michigan to fulfill the No Child Left Behind (NCLB) requirements.

In particular, I wanted to look at the test validity of the ELPA.

Goal of this Study

Investigate the views of teachers and school administrators (educators) on the administration of the ELPA in Michigan

ELPA in Michigan

Administered to 70,000 English Language Learners (ELLs) in K-12 annually since spring of 2006

Fulfillment of No Child Left Behind (NCLB) and Federal Title I and Title III requirements

The ELPA

Based on standards adopted by the State

Subtests: Listening, Reading, Writing, Speaking, Comprehension

Scoring: Basic, Intermediate, Proficient

Levels of the 2007 ELPA

Level I for kindergarteners
Level II for grades 1 and 2
Level III for grades 3 through 5
Level IV for grades 6 through 8
Level V for grades 9 through 12

ELPA Technical Manual

The 2006 MI-ELPA Technical Manual claimed the ELPA is valid:
A. Item writers were trained
B. Items and test blueprints were reviewed by content experts
C. Item discrimination indices were calculated
D. Item response theory was used to measure item fit and correlations among items and test sections

The validity argument did not consider the test’s consequences, fairness, meaningfulness, or cost and efficiency, all of which are part of a test’s validation criteria (Linn et al., 1991)

Research Questions

1. What are educators’ opinions about the ELPA and its administration?

2. Do educators’ opinions vary according to the demographic or teaching environment in which the ELPA was administered?

267 Participants

Educator Type                                     Number   Percentage
Teachers of ESL & Language Arts                   166      62.2%
Teachers of English Literature & Other Subjects   5        1.9%
School Principals & Administrators                62       23.2%
Others                                            30       11.2%
Not Identified                                    4        1.5%
Total                                             267      100%

Materials

3-part online survey
Part 1: 6 items on demographic information
Part 2: 40 belief statements + comments
Part 3: 5 open-ended questions

Procedure

ELPA Testing Window: 3/19-4/27

Online Survey Window: 3/29-5/20

MITESOL Listserv email sent 3/29

Names and emails culled off Web 3/30-4/7

Reminder email to MITESOL and culled list sent 5/14

Analysis

Quantitative: factor analysis of the data generated from the 40 belief items; derived factor scores and used those to see if the ELL demographic or teaching environment affected survey outcomes

Qualitative: deductive-analytic, qualitative analysis of answers to the open-ended questions and comments on the 40 closed items.


How Factor Analysis Works

Factor analysis runs correlations among the answered items to see what items are related—that is, what questions tap into the same underlying construct.

[Figure: Items 1 through 10]

• After items are clustered together with other items that were answered in the same way, we can look at the items in the cluster and label the cluster—what is the theme of the cluster?

[Figure: items grouped into two clusters, Factor 1 and Factor 2]

• Items that don’t correlate with any larger factor (that don’t fit into a cluster) can be dropped from further analysis or discussion.

[Figure labels: Speaking ability, Listening ability]
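The clustering idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the responses are randomly generated (not survey data), the two underlying traits and the 0.5 threshold are hypothetical, and the grouping is a simple correlation-threshold rule standing in for a real factor extraction. Items are numbered from 0 in the code.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cluster_items(items, threshold=0.5):
    """Greedily group items that correlate above the threshold with
    every item already in a cluster (a stand-in for factor extraction)."""
    clusters = []
    for i, col in enumerate(items):
        for cluster in clusters:
            if all(pearson(col, items[j]) > threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster fits this item
    return clusters

rng = random.Random(0)
n = 500  # simulated respondents

# Two hypothetical underlying constructs
speaking = [rng.gauss(0, 1) for _ in range(n)]
listening = [rng.gauss(0, 1) for _ in range(n)]

def noisy(trait):
    """An item score = underlying trait + item-specific noise."""
    return [t + rng.gauss(0, 0.5) for t in trait]

# Items 0-2 tap speaking, items 3-5 tap listening, item 6 taps nothing
items = [noisy(speaking) for _ in range(3)]
items += [noisy(listening) for _ in range(3)]
items.append([rng.gauss(0, 1) for _ in range(n)])

clusters = cluster_items(items)
factors = [c for c in clusters if len(c) > 1]        # clusters to label
dropped = [c[0] for c in clusters if len(c) == 1]    # items fitting no cluster
print("factors:", factors)
print("dropped:", dropped)
```

Items driven by the same simulated trait end up in one cluster, which the researcher would then label; the unrelated item forms no cluster and would be dropped.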

Results: 1. Factor Analysis

Factor                         Alpha   Mean   SE     SD
1. Reading and writing tests   0.9     5.59   0.15   2.54
2. Effective administration    0.88    7.34   0.15   2.77
3. Impacts                     0.88    5.43   0.15   2.61
4. Speaking test               0.9     5.88   0.15   2.49
5. Listening test              0.88    5.64   0.15   2.62
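The Alpha column reports how consistently the items within each factor were answered. Assuming the coefficient is Cronbach's alpha (the slides do not name it), a minimal sketch of the calculation looks like this. The scores are made up for illustration and are not the survey data; the function name is hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a set of items.

    item_scores: one list of respondent scores per item, all the same length.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]  # per-respondent sums
    item_var = sum(variance(col) for col in item_scores)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Hypothetical responses from five educators to three belief statements
scores = [
    [2, 4, 5, 7, 9],
    [3, 4, 6, 7, 8],
    [1, 5, 5, 8, 9],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

Values near 1 mean the items in a factor rise and fall together across respondents, which is what the 0.88 to 0.9 values in the table suggest.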

ANOVA results: Factor 2 (effective administration) by ELL concentration

[Figure: chart of Factor 2 scores by ELL concentration. Categories: Less than 5%, 5 to 25%, More than 25%. Series: UB, Mean, LB. Data labels: -2.14, -1.99, -0.97.]

ANOVA results: Factor 4 (speaking test) by ELL concentration

[Figure: chart of Factor 4 scores by ELL concentration. Categories: Less than 5%, 5 to 25%, More than 25%. Series: UB, Mean, LB. Data labels: -0.71, -0.27, 0.19.]
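These comparisons ask whether mean factor scores differ across the three ELL-concentration groups. A minimal sketch of the one-way ANOVA F statistic follows. The factor scores below are invented for illustration (they are not the study's data), and the function name is hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: a list of lists, one list of scores per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical factor scores for three ELL-concentration groups
less_than_5 = [1, 2, 3]
from_5_to_25 = [2, 3, 4]
more_than_25 = [5, 6, 7]

f_stat, df_b, df_w = one_way_anova_f([less_than_5, from_5_to_25, more_than_25])
print(f_stat, df_b, df_w)
```

A large F relative to its degrees of freedom indicates that the group means differ by more than the spread within groups would explain.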

Qualitative Results

A. Perceptions on the ELPA subtests
Factor 1. Reading and writing
Factor 4. Speaking
Factor 5. Listening

B. Logistics
Factor 2. Effective Administration

C. Test impacts
Factor 3. Impacts

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

65 wrote that the test was too difficult for lower grade levels.

Out of the 145 responders who administered ELPA Level I, the test for kindergarteners, 30 commented that it was too hard or inappropriate.

Most of these comments centered on the reading and writing portions of the test.


Example 1: Having given the ELPA grades K-4, I feel it is somewhat appropriate at levels 1-4. However I doubt many of our American, English speaking K students could do well on the K test. This test covered many literacy skills that are not part of our K curriculum. It was upsetting for many of my K students. The grade 1 test was very difficult for my non-readers who often just stopped working on it. I felt stopping was preferable to having them color in circles randomly. Educator 130, ESL teacher, administered Levels I, II, & III.


Speaking subsection of the ELPA was too subjective.

31 educators wrote the rubrics did not contain example responses or did not provide enough levels to allow for an accurate differentiation among student abilities.

4 educators wrote ELLs who were shy or did not know the test administrator did poorly on the speaking test.

15 mentioned problems with two-question prompts that were designed to elicit two-part responses; some learners only answered the second question, resulting in low scores.


Example 2: [I]t [the rubric] focuses on features of language that are not important and does not focus on language features that are important; some items test way too many features at a time; close examination of the rubric language makes it impossible to make a decision to give 2 or 3 points because it is too subjective and not quantifiable enough because too many features are being assessed at a time. Educator 79, ESL teacher, administered Levels I, II, & III.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

Example 3: For some this was adequate but for those students who are shy it didn't give an accurate measure at all of their true ability. I have a student who is an excellent English speaker but painfully shy and she scored primarily zeros because she was nervous and too shy to speak. Educator 30, ESL teacher, administered Levels III & IV.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

Example 4: When you say in succession: "What would you say to give your partner directions? What pictures would be fun to make?" The students are only going to answer the last question and 100% of mine did! So it is already half wrong. Too many questions had a two part answer but only one bubble to fill in how they did. Not very adequate to me! Educator 133, ESL and bilingual teacher, administered Levels I & II.


30 of the 151 educators who administered lower levels of the ELPA commented that the listening section was unable to hold the students’ attention because it was too long and repetitive or had bland topics.

9 complained of the reliance on memory and reading skills to answer questions.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

Example 5: Some of the stories were very long for kindergarten and first graders to sit through and then they were repeated! The kids didn't listen well the second time it was read and had trouble with the questions because of the length of the story. Educator 176, Reading recovery and literacy coach, administered Levels I, II, & III.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

Example 6: The kindergarten level is where I saw the most difference between actual ability and test results. Here students could give me the correct answer orally but mark the wrong answer in the test booklet. Educator 80, ESL teacher, administered Levels I, II, & III.

B. Effective Administration (Factor 2)

There was not adequate space or time for testing.

Test materials did not arrive on time or were missing.

Other national and state tests were being conducted at the same time, which overburdened the test administrators and the ELLs.

B. Effective Administration (Factor 2)

Example 7: There are only six teachers to test over 600 students in 30 schools. We had to [administer the ELPA] ourselves because it was during IOWA testing and that was where the focus was. Because we are itinerant in our district we were given whatever hallway or closet was available. Educator 249, ESL teacher, administered all Levels.

B. Effective Administration (Factor 2)

Example 8: The CD's and cassettes were late. Several of the boxes of materials were also late and I spent a lot of time trying to track them. NONE of the boxes came to the ESL office as requested. It was difficult with four grades in a building to find the right time to pull students and not mess with classwork and schedules. Educator 151, school administrator, administered all Levels.

C. Impacts (Factor 3)

49 educators wrote that the test did not directly impact the ESL curricular content.

86 educators wrote that the test reduced the quantity and quality of ESL services during all or part of the ELPA test window.

79 educators commented on the negative psychological impact on students.

7 out of the 79 wondered about how much money and resources the ELPA cost Michigan.

C. Impacts (Factor 3)

Example 9: I feel it makes those students stand out from the rest of the school because they have to be pulled from their classes. They thoroughly despise the test. One student told me and I quote, "I wish I wasn't Caldean* so I didn't have to take this" and another stated she would have her parents write her a note stating that she was not Hispanic (which she is). Educator 71, Title 1 teacher, administered Levels I, II, III, & IV.

*Chaldeans speak Aramaic and come from the part of Iraq historically called Mesopotamia. As Christians, they are a minority among Iraqis. Many have immigrated to Michigan.

C. Impacts (Factor 3)

Example 10: This test took too long to administer and beginning students, who did not even understand the directions, were very frustrated and in some cases crying because they felt so incapable. Educator 109, ELL and resource room teacher, administered Levels II & III.

C. Impacts (Factor 3)

Example 11: Just that the ELPA is a really bad idea and bad test... we spend the entire year building students confidence in using the English language, and the ELPA successfully spends one week destroying it again. Educator 171, ESL teacher, administered Level IV.

Discussion

Research question 1: What are educators’ opinions about the ELPA and its administration?

Answers
Q1. 1. Difficult for lower grade students
Q1. 2. Speaking tests are problematic for youngsters
Q1. 3. Logistic problems with the administration

Q1. 1. Difficult for lower grade students

The educators may be right. What do we know about young language learners?

The attention span of young learners in the early years of schooling is short, 10 to 15 minutes.

They are easily diverted and distracted.

They may drop out of a task when they find it difficult, though they are often willing to try a task in order to please the teacher (McKay, 2006, p. 6).


Children under 8 are unable to use language to talk about language. (They have no metalanguage.) (McKay, 2006)

Children do not develop the ability to read silently to themselves until between the ages of 7 and 9 (Puckett & Black, 2000). Before that, children are just starting to understand how writing and reading work.


Children between 5 and 7 are only beginning to develop feelings of independence. They may become anxious when separated from familiar people and places. Having unfamiliar adults administer tests in unfamiliar settings might be introducing a testing environment that is not “psychologically safe” (McKay, 2006) for children.

Children between 5 and 7 are still developing their gross and fine motor skills (McKay, 2006). They may not be able to fill in bubble answer sheets.

Recommendations

Test administrators should read directions aloud and be allowed to clarify them if necessary.

Students should be allowed to give verbal answers in the listening and reading subsections.

Better still would be a test that the educator could stop whenever he or she judged that it should stop.


Computer Adaptive Testing for kids? May be extremely problematic… http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320/


Q1. 2. Speaking tests are unreliable

Recommendations

Stop the standardized, high-stakes, oral proficiency testing of young children.

If it must be done, use a puppet to reduce test anxiety and stranger-danger.

Allow non-verbal responses to count as responses.

Q1. 3. Effective Administration

What do test logistics have to do with validity?

A test cannot be reliable if the physical context of the exam is not in order (Brown, 2004). Materials and equipment should be ready, audio should be clear, and the classroom should be quiet, well lit, and a comfortable temperature.

If a test is not reliable, it is not valid (Bachman, 1990; Brown, 2004; Chapelle, 1999; McNamara, 2000).

Q1. 3. Effective Administration

Recommendations

Test administrators should be required to flag tests given under unfavorable conditions and provide information to the testing agency about the situation.

This could spur improvements in the conditions of future test administrations, especially if districts and states were held accountable for providing adequate evidence of fair test conditions.

Discussion

The unintended effect of reduction in ELL services raises the question of whether the ELPA program in Michigan is feasible (Nevo & Shohamy, 1986).

Feasibility standards are intended to “ensure that a testing method will be realistic, prudent and frugal” (Shohamy, 2001, p. 152).

It may be unrealistic to expect ESL teachers with limited resources to administer the existing ELPA within the given time frame.

Discussion

Language tests can affect students’ psychological well-being by mirroring back to students a sense of how society views them and how they should view themselves (Crooks, 1998; Shohamy, 2000, 2001, 2006; Spratt, 1995; Taylor, 1994).

The test does not recognize the ELLs’ heritage languages or cultural identities, perhaps communicating that these are unimportant to their education and thus to society (Schmidt, 2000).

Recommendation 1 of 4

Collect validity data anonymously from educators immediately after the test is administered.

Educators can complete a survey about the test logistics, their perceptions of the test, and the test’s impacts on the curriculum and students’ psyches. Sampling issues can be avoided if all educators are required to complete the surveys. If educators know in advance that they will be asked about the validity of the test, they may feel more ownership over the testing process.


Recommendation 2 of 4

Use an outside evaluator to construct and conduct the validity survey.

States and for-profit test agencies such as Harcourt Assessment or Pearson, like any testing body, have an incentive to avoid criticizing the NCLB tests they manage. It is better to have a neutral party conduct the survey, summarize results and present them to the public, and perhaps suggest ways to improve the test (Ryan, 2002).


Recommendation 3 of 4

Use the validity data to form research questions.

These questions should be answered by literature reviews and additional data collection. The answers will shed light on why certain aspects of the test went wrong or were viewed negatively by the educators.


Recommendation 4 of 4

Publish all validity evidence publicly.

Presenting the data in open forums, on the web, at conferences, and in the test’s technical manual will encourage discussion about the uses of the test and the inferences that can be drawn from it.

Disseminating validity data may also increase trust in states and in the corporate and non-profit testing agencies they hire.

The way forward in Michigan?

The ELPA is not feasible. Michigan just joined WIDA. The ELPA is being given now (right now!) in Michigan, but this will be the last year of its administration.

WIDA assessments do not give reading or writing tests to young children still working on pre-literacy skills. (Yeah!)

Yet still, the ELL WIDA tests are not normed on like-aged children who are native speakers of English.

There is much work to be done.


Questions? THANK YOU!

Dr. Paula Winke, Michigan State University


Clarification on two kinds of scales, with correlated or uncorrelated indices! http://www.stat-help.com/factor.pdf (This is the overview paper I gave out a few weeks ago.)

[Figure: two path diagrams. Diagram 1: Self-esteem, with indicators Worth, Much proud, and Good as others. Diagram 2: Stress, with indicators No coping skills, Home probs, and Job probs.]

1. These three observed variables are effect indicators. They are appropriate for FA. Self-esteem causes the scores on the three indicator variable scales. This is modeled with factor analysis.

2. These three indicator variables are what cause stress. Note that these three things may not correlate. One can use Principal Components Analysis to investigate this type of structure.

We expect these three to correlate because they are all caused by self-esteem.
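The contrast between the two kinds of indicators can be seen in a small simulation. This is a sketch under stated assumptions: all scores are randomly generated, the noise levels are arbitrary, and stress is modeled simply as the sum of three independent causes.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)
n = 1000  # simulated respondents

# Case 1, effect indicators: one latent trait drives all three scales
self_esteem = [rng.gauss(0, 1) for _ in range(n)]
effect = [[s + rng.gauss(0, 0.5) for s in self_esteem] for _ in range(3)]

# Case 2, causal indicators: three independent problems add up to stress
causes = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
stress = [sum(vals) for vals in zip(*causes)]

pairs = [(0, 1), (0, 2), (1, 2)]
effect_corrs = [pearson(effect[i], effect[j]) for i, j in pairs]
cause_corrs = [pearson(causes[i], causes[j]) for i, j in pairs]
stress_corrs = [pearson(c, stress) for c in causes]

print("effect indicators with each other:", [round(r, 2) for r in effect_corrs])
print("causal indicators with each other:", [round(r, 2) for r in cause_corrs])
print("causal indicators with stress:    ", [round(r, 2) for r in stress_corrs])
```

The effect indicators correlate strongly with one another because they share a cause, while the causal indicators each relate to stress without needing to correlate among themselves.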


Factor Analysis How-to Papers

Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research & Evaluation, 10(7). Available online: http://pareonline.net/getvn.asp?v=10&n=17

DiStefano, C., Zhu, M., & Mîndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20). Available online: http://pareonline.net/getvn.asp?v=14&n=20
