Slide 1
Research Findings on the Impact of Language Factors on the Assessment and Instruction of English Language Learners
Jamal Abedi
University of California, Davis
National Center for Research on Evaluation, Standards, and Student Testing
UCLA Graduate School of Education
January 23, 2007
Slide 2
Why is assessment mentioned first in this title?
For ELL students, assessment starts before instruction.
Assessment results affect ELL students in the following areas:
Classification
Instruction
Accountability (the NCLB issues)
Promotion
Graduation
Thus, assessment of ELL students is very high stakes.
Slide 3
How do ELL students perform on assessments compared with non-ELL students?
ELL students generally perform lower than non-ELL students.
The performance gap between ELL and non-ELL students increases as the language demands of test items increase.
The performance gap approaches zero in content areas with minimal linguistic complexity (e.g., math computation).
Slide 4
Site 2 Grade 7 SAT 9 Subsection Scores

Subgroup             Reading    Math       Language   Spelling
LEP status
  LEP       Mean     26.3       34.6       32.3       28.5
            SD       15.2       15.2       16.6       16.7
            N        62,273     64,153     62,559     64,359
  Non-LEP   Mean     51.7       52.0       55.2       51.6
            SD       19.5       20.7       20.9       20.0
            N        244,847    245,838    243,199    246,818
SES
  Low SES   Mean     34.3       38.1       38.9       36.3
            SD       18.9       17.1       19.8       20.0
            N        92,302     94,054     92,221     94,505
  High SES  Mean     48.2       49.4       51.7       47.6
            SD       21.8       21.6       22.6       22.0
            N        307,931    310,684    306,176    312,321
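To put these gaps on a common scale, here is a minimal sketch (using only the Reading means, SDs, and Ns from the table above) that computes the LEP vs. non-LEP gap in NCE points and as a pooled-SD effect size:

```python
import math

# Means, SDs, and Ns from the Site 2 Grade 7 SAT 9 table (Reading column).
lep = {"mean": 26.3, "sd": 15.2, "n": 62_273}
non_lep = {"mean": 51.7, "sd": 19.5, "n": 244_847}

# Raw gap in Normal Curve Equivalent (NCE) points.
gap = non_lep["mean"] - lep["mean"]  # 25.4 NCE points

# Pooled standard deviation, then Cohen's d as a standardized effect size.
pooled_var = (
    (lep["n"] - 1) * lep["sd"] ** 2 + (non_lep["n"] - 1) * non_lep["sd"] ** 2
) / (lep["n"] + non_lep["n"] - 2)
d = gap / math.sqrt(pooled_var)

print(f"Reading gap: {gap:.1f} NCE points (d = {d:.2f})")
```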
Slide 5
Normal Curve Equivalent Means and Standard Deviations for Students in Grades 10 and 11, Site 3 School District

                     Reading        Science        Math
                     M      SD      M      SD      M      SD
Grade 10
  SWD only           16.4   12.7    25.5   13.3    22.5   11.7
  LEP only           24.0   16.4    32.9   15.3    36.8   16.0
  LEP & SWD          16.3   11.2    24.8    9.3    23.6    9.8
  Non-LEP/SWD        38.0   16.0    42.6   17.2    39.6   16.9
  All students       36.0   16.9    41.3   17.5    38.5   17.0
Grade 11
  SWD only           14.9   13.2    21.5   12.3    24.3   13.2
  LEP only           22.5   16.1    28.4   14.4    45.5   18.2
  LEP & SWD          15.5   12.7    26.1   20.1    25.1   13.0
  Non-LEP/SWD        38.4   18.3    39.6   18.8    45.2   21.1
  All students       36.2   19.0    38.2   18.9    44.0   21.2
Slide 6
Are Standardized Achievement Tests Appropriate for ELLs?
The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) elaborated on this issue:
For all test takers, any test that employs language is, in part, a measure of their language skills. This is of particular concern for test takers whose first language is not the language of the test. Test use with individuals who have not sufficiently acquired the language of the test may introduce construct-irrelevant components to the testing process. (p. 91)
Slide 7
Are Standardized Achievement Tests Reliable and Valid for These Students?
The reliability coefficients of test scores for ELL students are substantially lower than those for non-ELL students.
ELL students' test outcomes show lower criterion-related validity.
Structural relationships between test components and across measurement domains are weaker for ELL students.
Slide 8
Site 2 Stanford 9 Sub-scale Reliabilities (Alpha), Grade 9

Sub-scale (items)       Hi SES   Low SES  Eng Only  FEP      RFEP     LEP
Reading, N =            205,092  35,855   181,202   37,876   21,869   52,720
  Vocabulary (30)       .828     .781     .835      .814     .759     .666
  Reading Comp (54)     .912     .893     .916      .903     .877     .833
  Average reliability   .870     .837     .876      .859     .818     .750
Math, N =               207,155  36,588   183,262   38,329   22,152   54,815
  Total (48)            .899     .853     .898      .898     .876     .802
Language, N =           204,571  35,866   180,743   37,862   21,852   52,863
  Mechanics (24)        .801     .759     .803      .802     .755     .686
  Expression (24)       .818     .779     .812      .804     .757     .680
  Average reliability   .810     .769     .813      .803     .756     .683
Science, N =            163,960  28,377   144,821   29,946   17,570   40,255
  Total (40)            .800     .723     .805      .778     .716     .597
Social Science, N =     204,965  36,132   181,078   38,052   21,967   53,925
  Total (40)            .803     .702     .805      .784     .722     .530

Note. The first five columns (Hi SES through RFEP) are non-LEP subgroups; the last column is LEP students.
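One way to read these alphas: the standard error of measurement, SEM = SD·√(1 − α), grows as reliability drops. A small sketch using the Social Science alphas above; the SD of 20 NCE points is an assumed round value for illustration, not a figure from this table:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

sd = 20.0  # assumed score SD, for illustration only
# Social Science Total (40 items) alphas from the table above.
for group, alpha in [("Hi SES non-LEP", 0.803), ("LEP", 0.530)]:
    print(f"{group}: SEM = {sem(sd, alpha):.1f} points")
```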
Slide 9
Why are these tests less reliable for ELL students?
There must be additional sources of measurement error affecting assessment outcomes for these students.
These sources include:
Linguistic complexity of test items
Cultural factors
Interactions of linguistic and cultural factors with other student background variables
Slide 10
Assumptions of Classical True-Score Test Theory
1. X = T + E (the observed score is the sum of the true score and the error score)
2. E(X) = T (the expected value of the observed score is the true score)
3. ρ(E, T) = 0 (error scores and true scores are uncorrelated)
4. ρ(E1, E2) = 0 (error scores on two parallel tests are uncorrelated)
5. ρ(E1, T2) = 0 (error scores on one test are uncorrelated with true scores on a parallel test)
Slide 11
Classical Test Theory: Reliability
σ²_X = σ²_T + σ²_E
X: observed score; T: true score; E: error score
ρ_XX′ = σ²_T / σ²_X
ρ_XX′ = 1 − σ²_E / σ²_X
Textbook examples of sources of measurement error: rater, occasion, item, test form.
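A minimal simulation of these definitions (all distributions assumed for illustration): draw true scores and two independent error draws to form parallel forms, then check that the parallel-forms correlation matches σ²_T / σ²_X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True scores and two independent error draws (parallel forms X and X').
T = rng.normal(50, 10, n)          # sigma^2_T = 100
E1 = rng.normal(0, 5, n)           # sigma^2_E = 25
E2 = rng.normal(0, 5, n)
X1, X2 = T + E1, T + E2            # X = T + E

# Theoretical reliability: sigma^2_T / sigma^2_X = 100 / 125 = 0.8
print(np.corrcoef(X1, X2)[0, 1])   # ~0.8, the parallel-forms estimate
print(T.var() / X1.var())          # ~0.8, the definition sigma^2_T / sigma^2_X
```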
12/27CRESST/UCLA
Gneralizability Theory:Language as an Additional Source of Measurement Error
• 2(Xprl) = 2p + 2
r + 2l + 2
pr + 2pl + 2
rl + 2prl,e
p: Personr: Raterl: Language
Are there any sources of measurement error that may specifically influence ELL performance?
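A sketch of this decomposition with language as a facet (all variance-component sizes invented for illustration, and the pr and rl interaction terms omitted for brevity): simulate person, rater, and language effects and check that the observed-score variance is roughly the sum of the components:

```python
import numpy as np

rng = np.random.default_rng(1)
P, R, L = 2_000, 20, 2  # persons, raters, languages

# Simulated variance components (values assumed for illustration).
p = rng.normal(0, 8, (P, 1, 1))    # person (the object of measurement)
r = rng.normal(0, 2, (1, R, 1))    # rater
l = rng.normal(0, 4, (1, 1, L))    # language: an extra error source for ELLs
pl = rng.normal(0, 3, (P, 1, L))   # person x language interaction
e = rng.normal(0, 5, (P, R, L))    # residual

X = p + r + l + pl + e             # X_prl

# Total variance should approximate 64 + 4 + 16 + 9 + 25 = 118.
print(X.var())
```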
Slide 13
How can we improve the reliability of assessments for ELL students?
Add more test items
Control for the random sources of measurement error
Control for systematic sources of measurement error
Slide 14
Add More Test Items
• "As a general rule, the more items in a test, the more reliable the test" (Salvia & Ysseldyke, 1998, p. 149).
• "The fact that a longer assessment tends to provide more reliable results was implied earlier..." (Linn, 1995, p. 100).
• "However, longer tests are generally more reliable, because, under the assumptions of classical true-score theory, as N increases, true-score variance increases faster than error variance" (Allen & Yen, 1979, p. 87).
Slide 15
Add More Test Items
Formula: N = ρ_XX′(1 − ρ_YY′) / [ρ_YY′(1 − ρ_XX′)]
For example, to raise the reliability of a 25-item test from .6 (ρ_YY′) to .8 (ρ_XX′), the lengthening factor is N = 2.67, so the test needs about 67 items in total, i.e., roughly 42 more.
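A minimal implementation of that Spearman-Brown formula, reproducing the 25-item example (the function name is ours, for illustration):

```python
import math

def lengthening_factor(rho_current: float, rho_target: float) -> float:
    """Spearman-Brown: the factor by which a test must be lengthened
    to raise its reliability from rho_current to rho_target."""
    return rho_target * (1 - rho_current) / (rho_current * (1 - rho_target))

n_items = 25
factor = lengthening_factor(0.6, 0.8)        # 2.67
total = math.ceil(n_items * factor)          # 67 items
print(f"factor = {factor:.2f}; add {total - n_items} items for {total} total")
```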
Slide 16
A Research Example Showing the Effects of Increasing Test Items
Source: O'Neil, H. F., & Abedi, J. (1996). Reliability and validity of a state metacognitive inventory: Potential for alternative assessment. Journal of Educational Research, 89(4), 234-245.

Subscale             N of items   Alpha (α)
Effort               31           0.84
Effort               17           0.90
Effort               7            0.90
Worry                14           0.83
Worry                11           0.90
Cognitive Strategy   14           0.81
Cognitive Strategy   8            0.81
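For reference, a compact Cronbach's alpha implementation (the data here are random, just to exercise the function; they are not the inventory's items):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=(500, 1))
scores = ability + rng.normal(scale=1.5, size=(500, 20))  # 20 noisy items
print(cronbach_alpha(scores))
```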
Slide 17
Increasing the number of test items for ELL students may cause further complexities:
If the new items are more linguistically complex, they may add construct-irrelevant variance/measurement error and reduce validity and reliability even further.
The cognitive load of the added items may be greater than the cognitive load of the original items.
Providing extra time for the new items, on top of already extended time, may cause more logistical problems.
Slide 18
Does the reliability of a test affect its validity?
Reliability sets the upper limit of a test's validity, so reliability is a necessary but not sufficient condition for valid measurement (Salvia & Ysseldyke, 1998, p. 177).
Reliability is a necessary but not sufficient condition for validity (Linn, 1995, p. 82).
Reliability limits validity, because ρ_XY ≤ √ρ_XX′ (Allen & Yen, p. 113).
For example, the upper limit of the validity coefficient for a test with a reliability of 0.530 is √0.530 ≈ 0.73.
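That ceiling is simply the square root of the reliability; a one-function sketch reproducing the 0.530 example:

```python
import math

def validity_ceiling(reliability: float) -> float:
    """Upper bound on a validity coefficient: rho_xy <= sqrt(rho_xx')."""
    return math.sqrt(reliability)

print(round(validity_ceiling(0.530), 2))  # 0.73
```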
Slide 19
Grade 11 Stanford 9 Reading and Science Structural Modeling Results (df = 24), Site 3

                      All Cases   Even Cases  Odd Cases   Non-LEP     LEP
                      (N=7,176)   (N=3,588)   (N=3,588)   (N=6,932)   (N=244)
Goodness of fit
  Chi-square          1786        943         870         1675        81
  NFI                 .931        .926        .934        .932        .877
  NNFI                .898        .891        .904        .900        .862
  CFI                 .932        .928        .936        .933        .908
Factor loadings
Reading variables
  Composite 1         .733        .720        .745        .723        .761
  Composite 2         .735        .730        .741        .727        .713
  Composite 3         .784        .779        .789        .778        .782
  Composite 4         .817        .722        .712        .716        .730
  Composite 5         .633        .622        .644        .636        .435
Math variables
  Composite 1         .712        .719        .705        .709        .660
  Composite 2         .695        .696        .695        .701        .581
  Composite 3         .641        .628        .654        .644        .492
  Composite 4         .450        .428        .470        .455        .257
Factor correlation
  Reading vs. Math    .796        .796        .795        .797        .791

Note. NFI = Normed Fit Index; NNFI = Non-Normed Fit Index; CFI = Comparative Fit Index.
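The "even cases" and "odd cases" columns reflect a split-sample cross-validation. A minimal sketch of that split (the DataFrame here is random stand-in data, not the Site 3 scores):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Site 3 Stanford 9 composite scores.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(7_176, 9)),
                  columns=[f"read_{i}" for i in range(1, 6)]
                          + [f"math_{i}" for i in range(1, 5)])

even_cases = df.iloc[::2]   # "even cases" half (n = 3,588)
odd_cases = df.iloc[1::2]   # "odd cases" half (n = 3,588)
# The same two-factor model is then fit to each half; similar factor
# loadings and fit indices (NFI, NNFI, CFI) indicate a stable structure.
```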
Slide 20
Language of Assessment
Clear and concise language is a requirement for reliable and valid assessment of ELL students.
It may also be an important consideration for students with disabilities, since a large majority of them are in the learning disability category.
Students in the learning disability category may have difficulty processing complex language in assessments.
Simplifying the language of test items will therefore also help students with disabilities, particularly those with learning disabilities.
Slide 21
Example
Original:
A certain reference file contains approximately six billion facts. About how many millions is that?
A. 6,000,000  B. 600,000  C. 60,000  D. 6,000  E. 600
Modified:
Mack's company sold six billion pencils. About how many millions is that?
A. 6,000,000  B. 600,000  C. 60,000  D. 6,000  E. 600
(Both versions keep the same correct answer, D. 6,000, since six billion is 6,000 million.)
Slide 22
Example
Original:
The census showed that three hundred fifty-six thousand, ninety-seven people lived in Middletown. Written as a number, that is:
A. 350,697  B. 356,097  C. 356,907  D. 356,970
Modified:
Janet played a video game. Her score was three hundred fifty-six thousand, ninety-seven. Written as a number, that is:
A. 350,697  B. 356,097  C. 356,907  D. 356,970
(Again, the correct answer, B. 356,097, is unchanged.)
Slide 23
CRESST Studies on the Assessment and Accommodation of ELL Students:
Impact of Language Factors on Assessment of ELLs
A Chain of Events
Fourteen studies on the assessment, and three on the instruction (opportunity to learn, OTL), of ELL students.
Slide 24
Study #1
Analyses of extant data (Abedi, Lord, & Plummer, 1995).
Used existing data from the NAEP 1992 assessments in math and science.
SAMPLE: ELL and non-ELL students in the grades 4, 8, and 12 main assessment.
NAEP test items were grouped into long vs. short and linguistically complex vs. less complex items.
Findings
ELL students performed significantly lower on the longer test items.
ELL students had higher proportions of omitted and/or not-reached items.
ELL students had higher scores on the linguistically less complex items.
Slide 25
Study #2
Interview study (Abedi, Lord, & Plummer, 1997).
37 students were asked to express their preference between original NAEP items and linguistically modified versions of the same items. Math test items were modified to reduce the level of linguistic complexity.
Findings
Over 80% of the students interviewed preferred the linguistically modified items over the original versions.
Slide 26
Many students indicated that the language in the revised item was easier:
"Well, it makes more sense."
"It explains better."
"Because that one's more confusing."
"It seems simpler. You get a clear idea of what they want you to do."
Slide 27
Study #3
Impact of linguistic factors on students' performance (Abedi, Lord, & Plummer, 1997).
Two studies: testing performance and speed.
SAMPLE: 1,031 grade 8 ELL and non-ELL students; 41 classes from 21 southern California schools.
Findings
ELL students who received a linguistically modified version of the math test items performed significantly better than those receiving the original test items.
Slide 28
Study #4
The impact of different types of accommodations on students with limited English proficiency (Abedi, Lord, & Hofstetter, 1997).
SAMPLE: 1,394 grade 8 students; 56 classes from 27 California schools.
Findings
Spanish translation of the NAEP math test: Spanish speakers taking the Spanish translation performed significantly lower than Spanish speakers taking the English version. We believe this is due to the impact of the language of instruction on assessment.
Linguistic modification: contributed to improved performance on 49% of the items.
Extra time: helped grade 8 ELL students on NAEP math tests, but also aided non-ELL students, so it has limited potential as an assessment accommodation.
Slide 29
Study #5
Impact of selected background variables on students' NAEP math performance (Abedi, Hofstetter, & Lord, 1998).
SAMPLE: 946 grade 8 ELL and non-ELL students; 38 classes from 19 southern California schools.
Findings
Four different accommodations were used: linguistically modified items, a glossary only, extra time only, and a glossary plus extra time.
Linguistic modification of test items was the only accommodation that reduced the performance gap between ELL and non-ELL students.
Slide 30
Study #6
The effects of accommodations on the assessment of LEP students in NAEP (Abedi, Lord, Kim, & Miyoshi, 2000).
SAMPLE: 422 grade 8 ELL and non-ELL students; 17 science classes from 9 southern California schools. A customized dictionary was used.
Findings
The customized dictionary included only non-content words from the test and was easier to use than a published dictionary. ELL students showed significant improvement in performance; there was no impact on non-ELL performance.
Slide 31
Study #7
Language accommodation for large-scale assessment in science (Abedi, Courtney, Leon, Mirocha, & Goldberg, 2001).
SAMPLE: 612 grade 4 and 8 students; 25 classes from 14 southern California schools.
Findings
A published dictionary was both ineffective and administratively difficult as an accommodation. Different bilingual dictionaries had different entries, different content, and different formats.
Slide 32
Study #8
Language accommodation for large-scale assessment in science (Abedi, Courtney, & Leon, 2001).
SAMPLE: 1,856 grade 4 and 1,512 grade 8 ELL and non-ELL students; 132 classes from 40 school sites in four cities across three states.
Findings
Results suggested that linguistic modification of test items improved the performance of ELLs in grade 8, with no change in the performance of non-ELLs on the modified test. The validity of the assessment was therefore not compromised by providing the accommodation.
Slide 33
Study #9
Impact of students' language background on content-based performance: analyses of extant data (Abedi & Leon, 1999).
Analyses were performed on extant data, such as Stanford 9 and ITBS.
SAMPLE: Over 900,000 students from four different sites nationwide.
Study #10
Examining ELL and non-ELL student performance differences and their relationship to background factors (Abedi, Leon, & Mirocha, 2001).
Data were analyzed for the impact of language on the assessment and accommodation of ELL students.
SAMPLE: Over 700,000 students from four different sites nationwide.
Findings
The higher the language demands of the test items, the larger the performance gap between ELL and non-ELL students. There was a large performance gap between ELL and non-ELL students in reading, science, and math problem solving (about 15 NCE score points); this gap was zero in math computation.
Slide 34
Recent publications summarizing findings of our research on the assessment of ELLs:
• Abedi, J., & Gándara, P. (2007). Performance of English language learners as a subgroup in large-scale assessment: Interaction of research and policy. Educational Measurement: Issues and Practice, 26(5), 36-46.
• Abedi, J. (in press). Utilizing accommodations in the assessment of English language learners. In Encyclopedia of Language and Education. Heidelberg, Germany: Springer Science+Business Media.
• Abedi, J. (2006). Psychometric issues in the ELL assessment and special education eligibility. Teachers College Record, 108(11), 2282-2303.
• Abedi, J. (2006). Language issues in item development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of Test Development. New Jersey: Lawrence Erlbaum Associates.
• Abedi, J. (in press). English language learners with disabilities. In C. Cahlan & L. Cook (Eds.), Accommodating students with disabilities on state assessments: What works? New Jersey: Educational Testing Service.
• Abedi, J. (2005). Assessment: Issues and consequences for English language learners. In J. L. Herman & E. H. Haertel (Eds.), Uses and Misuses of Data in Accountability Testing. Malden, Massachusetts: Blackwell Publishing.
Slide 35
Conclusions and Recommendations
Assessment for ELL students:
Must be based on sound psychometric principles
Must control for all sources of nuisance or confounding variables
Must be free of unnecessary linguistic complexity
Must include a sufficient number of ELLs in the development process (field testing, standard setting, etc.)
Must be free of biases, such as cultural bias
Must be sensitive to students' linguistic and cultural needs