Slide 1
Research Findings on the Impact of Language Factors on the Assessment and Instruction of English Language Learners
Jamal Abedi
University of California, Davis
National Center for Research on Evaluation, Standards, and Student Testing
UCLA Graduate School of Education
January 23, 2007
Slide 2
Why is assessment mentioned first in this title?
For ELL students, assessment starts before instruction.
Assessment results affect ELL students in the following areas:
Classification
Instruction
Accountability (the NCLB issues)
Promotion
Graduation
Thus, assessment of ELL students is very high stakes.
Slide 3
How do ELL students perform on assessments compared with non-ELL students?
ELL students generally perform lower than non-ELL students.
The performance gap between ELL and non-ELL students increases as the language demands of test items increase.
The performance gap approaches zero in content areas with minimal linguistic complexity (e.g., math computation).
Slide 4
Site 2 Grade 7 SAT 9 Subsection Scores

Subgroup             Reading    Math       Language   Spelling
LEP status
  LEP       Mean     26.3       34.6       32.3       28.5
            SD       15.2       15.2       16.6       16.7
            N        62,273     64,153     62,559     64,359
  Non-LEP   Mean     51.7       52.0       55.2       51.6
            SD       19.5       20.7       20.9       20.0
            N        244,847    245,838    243,199    246,818
SES
  Low SES   Mean     34.3       38.1       38.9       36.3
            SD       18.9       17.1       19.8       20.0
            N        92,302     94,054     92,221     94,505
  High SES  Mean     48.2       49.4       51.7       47.6
            SD       21.8       21.6       22.6       22.0
            N        307,931    310,684    306,176    312,321
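To put these gaps on a common scale, here is a minimal sketch (using only the Reading means, SDs, and Ns from the table above) that computes the LEP vs. non-LEP gap in NCE points and as a pooled-SD effect size:

```python
import math

# Means, SDs, and Ns from the Site 2 Grade 7 SAT 9 table (Reading column).
lep = {"mean": 26.3, "sd": 15.2, "n": 62_273}
non_lep = {"mean": 51.7, "sd": 19.5, "n": 244_847}

# Raw gap in Normal Curve Equivalent (NCE) points.
gap = non_lep["mean"] - lep["mean"]  # 25.4 NCE points

# Pooled standard deviation, then Cohen's d as a standardized effect size.
pooled_var = (
    (lep["n"] - 1) * lep["sd"] ** 2 + (non_lep["n"] - 1) * non_lep["sd"] ** 2
) / (lep["n"] + non_lep["n"] - 2)
d = gap / math.sqrt(pooled_var)

print(f"Reading gap: {gap:.1f} NCE points (d = {d:.2f})")
```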
Slide 5
Normal Curve Equivalent Means and Standard Deviations for Students in Grades 10 and 11, Site 3 School District

                     Reading        Science        Math
                     M      SD      M      SD      M      SD
Grade 10
  SWD only           16.4   12.7    25.5   13.3    22.5   11.7
  LEP only           24.0   16.4    32.9   15.3    36.8   16.0
  LEP & SWD          16.3   11.2    24.8    9.3    23.6    9.8
  Non-LEP/SWD        38.0   16.0    42.6   17.2    39.6   16.9
  All students       36.0   16.9    41.3   17.5    38.5   17.0
Grade 11
  SWD only           14.9   13.2    21.5   12.3    24.3   13.2
  LEP only           22.5   16.1    28.4   14.4    45.5   18.2
  LEP & SWD          15.5   12.7    26.1   20.1    25.1   13.0
  Non-LEP/SWD        38.4   18.3    39.6   18.8    45.2   21.1
  All students       36.2   19.0    38.2   18.9    44.0   21.2
Slide 6
Are Standardized Achievement Tests Appropriate for ELLs?
The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) elaborated on this issue:
For all test takers, any test that employs language is, in part, a measure of their language skills. This is of particular concern for test takers whose first language is not the language of the test. Test use with individuals who have not sufficiently acquired the language of the test may introduce construct-irrelevant components to the testing process. (p. 91)
Slide 7
Are Standardized Achievement Tests Reliable and Valid for These Students?
The reliability coefficients of test scores for ELL students are substantially lower than those for non-ELL students.
ELL students' test outcomes show lower criterion-related validity.
Structural relationships between test components and across measurement domains are weaker for ELL students.
Slide 8
Site 2 Stanford 9 Sub-scale Reliabilities (Alpha), Grade 9

Sub-scale (items)       Hi SES   Low SES  Eng Only  FEP      RFEP     LEP
Reading, N =            205,092  35,855   181,202   37,876   21,869   52,720
  Vocabulary (30)       .828     .781     .835      .814     .759     .666
  Reading Comp (54)     .912     .893     .916      .903     .877     .833
  Average reliability   .870     .837     .876      .859     .818     .750
Math, N =               207,155  36,588   183,262   38,329   22,152   54,815
  Total (48)            .899     .853     .898      .898     .876     .802
Language, N =           204,571  35,866   180,743   37,862   21,852   52,863
  Mechanics (24)        .801     .759     .803      .802     .755     .686
  Expression (24)       .818     .779     .812      .804     .757     .680
  Average reliability   .810     .769     .813      .803     .756     .683
Science, N =            163,960  28,377   144,821   29,946   17,570   40,255
  Total (40)            .800     .723     .805      .778     .716     .597
Social Science, N =     204,965  36,132   181,078   38,052   21,967   53,925
  Total (40)            .803     .702     .805      .784     .722     .530

Note. The first five columns (Hi SES through RFEP) are non-LEP subgroups; the last column is LEP students.
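One way to read these alphas: the standard error of measurement, SEM = SD·√(1 − α), grows as reliability drops. A small sketch using the Social Science alphas above; the SD of 20 NCE points is an assumed round value for illustration, not a figure from this table:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

sd = 20.0  # assumed score SD, for illustration only
# Social Science Total (40 items) alphas from the table above.
for group, alpha in [("Hi SES non-LEP", 0.803), ("LEP", 0.530)]:
    print(f"{group}: SEM = {sem(sd, alpha):.1f} points")
```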
Slide 9
Why are these tests less reliable for ELL students?
There must be additional sources of measurement error affecting assessment outcomes for these students.
These sources include:
Linguistic complexity of test items
Cultural factors
Interactions of linguistic and cultural factors with other student background variables
Slide 10
Assumptions of Classical True-Score Test Theory
1. X = T + E (the observed score is the sum of the true score and the error score)
2. E(X) = T (the expected value of the observed score is the true score)
3. ρ(E, T) = 0 (error scores and true scores are uncorrelated)
4. ρ(E1, E2) = 0 (error scores on two parallel tests are uncorrelated)
5. ρ(E1, T2) = 0 (error scores on one test are uncorrelated with true scores on a parallel test)
Slide 11
Classical Test Theory: Reliability
σ²_X = σ²_T + σ²_E
X: observed score; T: true score; E: error score
ρ_XX′ = σ²_T / σ²_X
ρ_XX′ = 1 − σ²_E / σ²_X
Textbook examples of sources of measurement error: rater, occasion, item, test form.
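A minimal simulation of these definitions (all distributions assumed for illustration): draw true scores and two independent error draws to form parallel forms, then check that the parallel-forms correlation matches σ²_T / σ²_X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True scores and two independent error draws (parallel forms X and X').
T = rng.normal(50, 10, n)          # sigma^2_T = 100
E1 = rng.normal(0, 5, n)           # sigma^2_E = 25
E2 = rng.normal(0, 5, n)
X1, X2 = T + E1, T + E2            # X = T + E

# Theoretical reliability: sigma^2_T / sigma^2_X = 100 / 125 = 0.8
print(np.corrcoef(X1, X2)[0, 1])   # ~0.8, the parallel-forms estimate
print(T.var() / X1.var())          # ~0.8, the definition sigma^2_T / sigma^2_X
```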
12/27CRESST/UCLA
Gneralizability Theory:Language as an Additional Source of Measurement Error
• 2(Xprl) = 2p + 2
r + 2l + 2
pr + 2pl + 2
rl + 2prl,e
p: Personr: Raterl: Language
Are there any sources of measurement error that may specifically influence ELL performance?
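A sketch of this decomposition with language as a facet (all variance-component sizes invented for illustration, and the pr and rl interaction terms omitted for brevity): simulate person, rater, and language effects and check that the observed-score variance is roughly the sum of the components:

```python
import numpy as np

rng = np.random.default_rng(1)
P, R, L = 2_000, 20, 2  # persons, raters, languages

# Simulated variance components (values assumed for illustration).
p = rng.normal(0, 8, (P, 1, 1))    # person (the object of measurement)
r = rng.normal(0, 2, (1, R, 1))    # rater
l = rng.normal(0, 4, (1, 1, L))    # language: an extra error source for ELLs
pl = rng.normal(0, 3, (P, 1, L))   # person x language interaction
e = rng.normal(0, 5, (P, R, L))    # residual

X = p + r + l + pl + e             # X_prl

# Total variance should approximate 64 + 4 + 16 + 9 + 25 = 118.
print(X.var())
```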
Slide 13
How can we improve the reliability of assessments for ELL students?
Add more test items
Control for the random sources of measurement error
Control for systematic sources of measurement error
Slide 14
Add More Test Items
• "As a general rule, the more items in a test, the more reliable the test" (Salvia & Ysseldyke, 1998, p. 149).
• "The fact that a longer assessment tends to provide more reliable results was implied earlier..." (Linn, 1995, p. 100).
• "However, longer tests are generally more reliable, because, under the assumptions of classical true-score theory, as N increases, true-score variance increases faster than error variance" (Allen & Yen, 1979, p. 87).
Slide 15
Add More Test Items
Formula: N = ρ_XX′(1 − ρ_YY′) / [ρ_YY′(1 − ρ_XX′)]
For example, to raise the reliability of a 25-item test from .6 (ρ_YY′) to .8 (ρ_XX′), the lengthening factor is N = 2.67, so the test needs about 67 items in total, i.e., roughly 42 more.
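A minimal implementation of that Spearman-Brown formula, reproducing the 25-item example (the function name is ours, for illustration):

```python
import math

def lengthening_factor(rho_current: float, rho_target: float) -> float:
    """Spearman-Brown: the factor by which a test must be lengthened
    to raise its reliability from rho_current to rho_target."""
    return rho_target * (1 - rho_current) / (rho_current * (1 - rho_target))

n_items = 25
factor = lengthening_factor(0.6, 0.8)        # 2.67
total = math.ceil(n_items * factor)          # 67 items
print(f"factor = {factor:.2f}; add {total - n_items} items for {total} total")
```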
Slide 16
A Research Example Showing the Effects of Increasing Test Items
Source: O'Neil, H. F., & Abedi, J. (1996). Reliability and validity of a state metacognitive inventory: Potential for alternative assessment. Journal of Educational Research, 89(4), 234-245.

Subscale             N of items   Alpha (α)
Effort               31           0.84
Effort               17           0.90
Effort               7            0.90
Worry                14           0.83
Worry                11           0.90
Cognitive Strategy   14           0.81
Cognitive Strategy   8            0.81
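For reference, a compact Cronbach's alpha implementation (the data here are random, just to exercise the function; they are not the inventory's items):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=(500, 1))
scores = ability + rng.normal(scale=1.5, size=(500, 20))  # 20 noisy items
print(cronbach_alpha(scores))
```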
Slide 17
Increasing the number of test items for ELL students may cause further complexities:
If the new items are more linguistically complex, they may add construct-irrelevant variance/measurement error and reduce validity and reliability even further.
The cognitive load of the added items may be greater than the cognitive load of the original items.
Providing extra time for the new items, on top of already extended time, may cause more logistical problems.
Slide 18
Does the reliability of a test affect its validity?
Reliability sets the upper limit of a test's validity, so reliability is a necessary but not sufficient condition for valid measurement (Salvia & Ysseldyke, 1998, p. 177).
Reliability is a necessary but not sufficient condition for validity (Linn, 1995, p. 82).
Reliability limits validity, because ρ_XY ≤ √ρ_XX′ (Allen & Yen, p. 113).
For example, the upper limit of the validity coefficient for a test with a reliability of 0.530 is √0.530 ≈ 0.73.
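That ceiling is simply the square root of the reliability; a one-function sketch reproducing the 0.530 example:

```python
import math

def validity_ceiling(reliability: float) -> float:
    """Upper bound on a validity coefficient: rho_xy <= sqrt(rho_xx')."""
    return math.sqrt(reliability)

print(round(validity_ceiling(0.530), 2))  # 0.73
```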
Slide 19
Grade 11 Stanford 9 Reading and Science Structural Modeling Results (df = 24), Site 3

                      All Cases   Even Cases  Odd Cases   Non-LEP     LEP
                      (N=7,176)   (N=3,588)   (N=3,588)   (N=6,932)   (N=244)
Goodness of fit
  Chi-square          1786        943         870         1675        81
  NFI                 .931        .926        .934        .932        .877
  NNFI                .898        .891        .904        .900        .862
  CFI                 .932        .928        .936        .933        .908
Factor loadings
Reading variables
  Composite 1         .733        .720        .745        .723        .761
  Composite 2         .735        .730        .741        .727        .713
  Composite 3         .784        .779        .789        .778        .782
  Composite 4         .817        .722        .712        .716        .730
  Composite 5         .633        .622        .644        .636        .435
Math variables
  Composite 1         .712        .719        .705        .709        .660
  Composite 2         .695        .696        .695        .701        .581
  Composite 3         .641        .628        .654        .644        .492
  Composite 4         .450        .428        .470        .455        .257
Factor correlation
  Reading vs. Math    .796        .796        .795        .797        .791

Note. NFI = Normed Fit Index; NNFI = Non-Normed Fit Index; CFI = Comparative Fit Index.
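The "even cases" and "odd cases" columns reflect a split-sample cross-validation. A minimal sketch of that split (the DataFrame here is random stand-in data, not the Site 3 scores):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Site 3 Stanford 9 composite scores.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(7_176, 9)),
                  columns=[f"read_{i}" for i in range(1, 6)]
                          + [f"math_{i}" for i in range(1, 5)])

even_cases = df.iloc[::2]   # "even cases" half (n = 3,588)
odd_cases = df.iloc[1::2]   # "odd cases" half (n = 3,588)
# The same two-factor model is then fit to each half; similar factor
# loadings and fit indices (NFI, NNFI, CFI) indicate a stable structure.
```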
Slide 20
Language of Assessment
Clear and concise language is a requirement for reliable and valid assessment of ELL students.
It may also be an important consideration for students with disabilities, since a large majority of them are in the learning disability category.
Students in the learning disability category may have difficulty processing complex language in assessments.
Simplifying the language of test items will therefore also help students with disabilities, particularly those with learning disabilities.
Slide 21
Example
Original:
A certain reference file contains approximately six billion facts. About how many millions is that?
A. 6,000,000  B. 600,000  C. 60,000  D. 6,000  E. 600
Modified:
Mack's company sold six billion pencils. About how many millions is that?
A. 6,000,000  B. 600,000  C. 60,000  D. 6,000  E. 600
(Both versions keep the same correct answer, D. 6,000, since six billion is 6,000 million.)
Slide 22
Example
Original:
The census showed that three hundred fifty-six thousand, ninety-seven people lived in Middletown. Written as a number, that is:
A. 350,697  B. 356,097  C. 356,907  D. 356,970
Modified:
Janet played a video game. Her score was three hundred fifty-six thousand, ninety-seven. Written as a number, that is:
A. 350,697  B. 356,097  C. 356,907  D. 356,970
(Again, the correct answer, B. 356,097, is unchanged.)
Slide 23
CRESST Studies on the Assessment and Accommodation of ELL Students:
Impact of Language Factors on Assessment of ELLs
A Chain of Events
Fourteen studies on the assessment, and three on the instruction (opportunity to learn, OTL), of ELL students.
Slide 24
Study #1
Analyses of extant data (Abedi, Lord, & Plummer, 1995).
Used existing data from the NAEP 1992 assessments in math and science.
SAMPLE: ELL and non-ELL students in the grades 4, 8, and 12 main assessment.
NAEP test items were grouped into long vs. short and linguistically complex vs. less complex items.
Findings
ELL students performed significantly lower on the longer test items.
ELL students had higher proportions of omitted and/or not-reached items.
ELL students had higher scores on the linguistically less complex items.
Slide 25
Study #2
Interview study (Abedi, Lord, & Plummer, 1997).
37 students were asked to express their preference between original NAEP items and linguistically modified versions of the same items. Math test items were modified to reduce the level of linguistic complexity.
Findings
Over 80% of the students interviewed preferred the linguistically modified items over the original versions.
Slide 26
Many students indicated that the language in the revised item was easier:
"Well, it makes more sense."
"It explains better."
"Because that one's more confusing."
"It seems simpler. You get a clear idea of what they want you to do."
Slide 27
Study #3
Impact of linguistic factors on students' performance (Abedi, Lord, & Plummer, 1997).
Two studies: testing performance and speed.
SAMPLE: 1,031 grade 8 ELL and non-ELL students; 41 classes from 21 southern California schools.
Findings
ELL students who received a linguistically modified version of the math test items performed significantly better than those receiving the original test items.
Slide 28
Study #4
The impact of different types of accommodations on students with limited English proficiency (Abedi, Lord, & Hofstetter, 1997).
SAMPLE: 1,394 grade 8 students; 56 classes from 27 California schools.
Findings
Spanish translation of the NAEP math test: Spanish speakers taking the Spanish translation performed significantly lower than Spanish speakers taking the English version. We believe this is due to the impact of the language of instruction on assessment.
Linguistic modification: contributed to improved performance on 49% of the items.
Extra time: helped grade 8 ELL students on NAEP math tests, but also aided non-ELL students, so it has limited potential as an assessment accommodation.
Slide 29
Study #5
Impact of selected background variables on students' NAEP math performance (Abedi, Hofstetter, & Lord, 1998).
SAMPLE: 946 grade 8 ELL and non-ELL students; 38 classes from 19 southern California schools.
Findings
Four different accommodations were used: linguistically modified items, a glossary only, extra time only, and a glossary plus extra time.
Linguistic modification of test items was the only accommodation that reduced the performance gap between ELL and non-ELL students.
Slide 30
Study #6
The effects of accommodations on the assessment of LEP students in NAEP (Abedi, Lord, Kim, & Miyoshi, 2000).
SAMPLE: 422 grade 8 ELL and non-ELL students; 17 science classes from 9 southern California schools. A customized dictionary was used.
Findings
The customized dictionary included only non-content words from the test and was easier to use than a published dictionary. ELL students showed significant improvement in performance; there was no impact on non-ELL performance.
Slide 31
Study #7
Language accommodation for large-scale assessment in science (Abedi, Courtney, Leon, Mirocha, & Goldberg, 2001).
SAMPLE: 612 grade 4 and 8 students; 25 classes from 14 southern California schools.
Findings
A published dictionary was both ineffective and administratively difficult as an accommodation. Different bilingual dictionaries had different entries, different content, and different formats.
Slide 32
Study #8
Language accommodation for large-scale assessment in science (Abedi, Courtney, & Leon, 2001).
SAMPLE: 1,856 grade 4 and 1,512 grade 8 ELL and non-ELL students; 132 classes from 40 school sites in four cities across three states.
Findings
Results suggested that linguistic modification of test items improved the performance of ELLs in grade 8, with no change in the performance of non-ELLs on the modified test. The validity of the assessment was therefore not compromised by providing the accommodation.
Slide 33
Study #9
Impact of students' language background on content-based performance: analyses of extant data (Abedi & Leon, 1999).
Analyses were performed on extant data, such as Stanford 9 and ITBS.
SAMPLE: Over 900,000 students from four different sites nationwide.
Study #10
Examining ELL and non-ELL student performance differences and their relationship to background factors (Abedi, Leon, & Mirocha, 2001).
Data were analyzed for the impact of language on the assessment and accommodation of ELL students.
SAMPLE: Over 700,000 students from four different sites nationwide.
Findings
The higher the language demands of the test items, the larger the performance gap between ELL and non-ELL students. There was a large performance gap between ELL and non-ELL students in reading, science, and math problem solving (about 15 NCE score points); this gap was zero in math computation.
Slide 34
Recent publications summarizing findings of our research on the assessment of ELLs:
• Abedi, J., & Gándara, P. (2007). Performance of English language learners as a subgroup in large-scale assessment: Interaction of research and policy. Educational Measurement: Issues and Practice, 26(5), 36-46.
• Abedi, J. (in press). Utilizing accommodations in the assessment of English language learners. In Encyclopedia of Language and Education. Heidelberg, Germany: Springer Science+Business Media.
• Abedi, J. (2006). Psychometric issues in the ELL assessment and special education eligibility. Teachers College Record, 108(11), 2282-2303.
• Abedi, J. (2006). Language issues in item development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of Test Development. New Jersey: Lawrence Erlbaum Associates.
• Abedi, J. (in press). English language learners with disabilities. In C. Cahlan & L. Cook (Eds.), Accommodating students with disabilities on state assessments: What works? New Jersey: Educational Testing Service.
• Abedi, J. (2005). Assessment: Issues and consequences for English language learners. In J. L. Herman & E. H. Haertel (Eds.), Uses and Misuses of Data in Accountability Testing. Malden, Massachusetts: Blackwell Publishing.
Slide 35
Conclusions and Recommendations
Assessment for ELL students:
Must be based on sound psychometric principles
Must control for all sources of nuisance or confounding variables
Must be free of unnecessary linguistic complexity
Must include a sufficient number of ELLs in the development process (field testing, standard setting, etc.)
Must be free of biases, such as cultural bias
Must be sensitive to students' linguistic and cultural needs