
Slide 1

Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment

Irene Hunting - Arizona Department of Education
Yuan D'Antilio - Pearson
Erica Baltierra - Pearson

June 24, 2015

Arizona English Language Learner Assessment (AZELLA)

AZELLA is Arizona's own English Language Proficiency Assessment.
AZELLA has been in use since school year 2006-2007.
Arizona revised its English Language Proficiency (ELP) Standards due to the adoption of the Arizona College and Career Ready Standards in 2010. AZELLA had to be revised to align with the new ELP Standards.
Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. Revisions to the Speaking portion of the AZELLA are particularly notable.

AZELLA Speaking Test Administration
Prior to School Year 2012-2013
Administered orally by test administrator
One-on-one administration
Scored by test administrator
Immediate scores
Training for test administrators: minimal, not required

AZELLA Speaking Test Concerns
Prior to School Year 2012-2013
Inconsistent test administration: not able to standardize test delivery
Inconsistent scoring: not able to replicate or verify scoring

AZELLA Speaking Test Desires
For School Year 2012-2013 and beyond
Consistent test administration: every student has the same testing experience
Consistent and quick scoring: record student responses, reliability statistics for scoring
Minimal burden for schools: no special equipment, no special personnel requirements or trainings, similar amount of time to administer

AZELLA Speaking Test Administration
For School Year 2012-2013 and beyond
Consistent test administration: administered one-on-one via speaker telephone
Consistent and quick scoring: student responses are recorded, reliable machine scoring
Minimal burden for schools: requires a landline speaker telephone, no special personnel requirements or training, slightly longer test administration time

Proposed Solution

----- Meeting Notes (6/16/15 15:03) -----
In order to provide a consistent test administration experience to all ELL students and consistent scoring for all speaking tests, Pearson worked with the Department to implement a telephone-based speaking assessment solution. This solution includes automated delivery of the speaking assessment and automated scoring of the test responses.

Here is a quick walk-through of our solution. Tests were administered one-on-one to students. The test administrator dialed a toll-free number and entered a test identification number to access the right test form. The speaking test items were delivered through a speaker telephone. The timing for item presentation is controlled and standardized. Students' oral responses are collected through the phone, and the audio data are transferred back to our database for grading. A machine scoring algorithm processes the audio responses to produce a score for each student response.

Development of Automated Scoring Method
[Workflow diagram with components: Test Developers, Test Spec, Item Text, Recorded Items, Testing System, Field testing data, Human Transcribers, Human Raters, Validation, Automated Scores]

----- Meeting Notes (6/16/15 15:03) -----
Next we're going to talk about how we developed the automated scoring for AZELLA Speaking and what it takes to set up a solution like this for states.

Why does automated scoring of speaking work?
The acoustic models used for speech recognition are optimized for various accents: young children's speech, foreign accents.

The test questions have been modeled from field test data.
The system anticipates the various ways that students respond.
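To make the standardized delivery described in the walk-through above concrete, here is a minimal sketch in Python of one way a fixed-timing test form and response capture could be represented. It is illustrative only, under the assumption of a fixed response window per item; the names (SpeakingItem, TestForm, deliver, record_fn) are hypothetical and do not describe Pearson's actual telephony platform.

```python
# Illustrative sketch only: a speaking-test form in which every student hears
# the same prompts with the same fixed response windows, as the walk-through
# describes. Class and function names here are hypothetical.
from dataclasses import dataclass

@dataclass
class SpeakingItem:
    item_id: str
    prompt_text: str        # read aloud to the student over the speaker telephone
    response_seconds: int   # fixed response window, identical for every student

@dataclass
class TestForm:
    form_code: str          # the test identification number keyed in by the administrator
    items: list

def deliver(form, record_fn):
    """Play each prompt in order, then record for the item's fixed window.
    record_fn(item_id, seconds) stands in for the audio capture a real
    telephony platform would perform."""
    responses = {}
    for item in form.items:
        print(f"[form {form.form_code}] prompt: {item.prompt_text}")
        responses[item.item_id] = record_fn(item.item_id, item.response_seconds)
    return responses

# Toy usage with a fake recorder that just labels the audio it would capture.
form = TestForm("1234", [
    SpeakingItem("W1", "What is your first and last name?", 10),
    SpeakingItem("S1", "Tell how to get ready for school in the morning.", 45),
])
audio = deliver(form, lambda item_id, secs: f"<{secs}s of audio for {item_id}>")
print(audio)
```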

The test questions have been modeled from field test data; the system anticipates the various ways that students respond.

e.g. What is in the picture?

Field Tested Items
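The language-model slide that follows hints at how this anticipation is done: likely responses and their probabilities are estimated from field-test data. Below is a minimal sketch in Python of the frequency-to-probability idea for the "What is in the picture?" item; the transcription counts are invented, and a real system would model full word sequences rather than whole-response strings.

```python
# Toy illustration: estimate how likely each anticipated response is for one
# item from (invented) field-test transcription counts. This only shows the
# frequency-to-probability idea, not Pearson's language-modeling method.
from collections import Counter

# Hypothetical field-test transcriptions for "What is in the picture?"
field_test_transcripts = (
    ["a protractor"] * 60
    + ["it's a protractor"] * 25
    + ["a compass"] * 10
    + ["i don't know"] * 5
)

counts = Counter(field_test_transcripts)
total = sum(counts.values())
response_probs = {resp: n / total for resp, n in counts.items()}

for resp, prob in sorted(response_probs.items(), key=lambda kv: -kv[1]):
    print(f"P({resp!r}) = {prob:.2f}")
```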

Language models
[Diagram: word lattice of anticipated responses, with tokens such as "I," "don't," "know," "it's," "a," "protractor," "compass."]

The system estimates the probability of each of those possible responses based on field test data. The responses from field tests were rated by human graders with the rubrics, so we know for each response what score a human grader would assign. We build the scoring algorithm based on those responses and human scores, so that the algorithm can perform like a human grader.

Field Testing and Data Preparation
Two field tests: 2011-2012
Number of students: 31,685 (1st-12th grade), 13,141 (Kindergarten)

Stage | Total tests | Used for building models | Used for validation
I | 13,184 | 1,200 | 333
II | 10,646 | 1,200 | 300
III | 9,369 | 1,200 | 300
IV | 6,439 | 1,200 | 300
V | 5,231 | 1,200 | 300

Item Type for Automated Scoring

Item Type | Score Points | Domain
Syllabification | 0-1 | Oral Reading
Wordlist | 0-1 | Oral Reading
Repeat | 0-6 | Speaking
Questions about an image | 0-4 | Speaking
Similarities and differences | 0-4 | Speaking
Give directions from a map | 0-4 | Speaking
Questions about a statement | 0-4 | Speaking
Give instructions to do something | 0-4 | Speaking
Open questions about a topic | 0-4 | Speaking
Detailed responses to a topic | 0-4 | Speaking

Automated scoring can handle a variety of item types. The item types range from constrained types, such as wordlist, to more open-ended types, such as picture description and giving instructions.

Sample Speaking Rubric: 0-4 Point Item

4 points: Student formulates a response in correct, understandable English using two or more sentences based on given stimuli. Student responds in complete declarative or interrogative sentences. Grammar errors are not evident and do not impede communication. Student responds with clear and correct pronunciation. Student responds using correct syntax.
3 points: Student formulates a response in understandable English using two or more sentences based on given stimuli. Student responds in complete declarative or interrogative sentences. Sentences have minor grammatical errors. Student responds with clear and correct pronunciation.
2 points: Student formulates an intelligible English response based on given stimuli. Student does not respond in two complete declarative or interrogative sentences. Student responds with errors in grammar. Student attempts to respond with clear and correct pronunciation.
1 point: Student formulates erroneous responses based on given stimuli. Student does not respond in complete declarative or interrogative sentences. Student responds with significant errors in grammar. Student does not respond with clear and correct pronunciation.

The human rating rubric is a holistic rubric that captures both the content of the speech production (what the student says) and the manner of production (how they say it), in terms of pronunciation, fluency, etc.

Sample student responses

Item: Next, please answer in complete sentences. Tell how to get ready for school in the morning. Include at least two steps.
Response transcript: first you wake up and then you put on your clothes # and eat breakfast
Human score: 3
Machine score: 3.35
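As the earlier notes explain, the scoring algorithm is built from field-test responses and their human scores so that it can perform like a human grader; a continuous machine score such as the 3.35 above comes from a model fit to those ratings. Here is a minimal sketch in Python of that idea, fitting a least-squares model from two toy transcript features to human scores. The features and most of the training data are invented for illustration and are not the features Pearson's engine uses.

```python
# Illustration only: fit a least-squares model from simple transcript features
# to human rubric scores (0-4), then score a new response on a continuous
# scale. Real engines use much richer features (content, pronunciation,
# fluency); the data below are invented.
import numpy as np

def features(transcript):
    words = transcript.split()
    n_clauses = transcript.count(" and ") + transcript.count(" then ") + 1
    return [1.0, len(words), n_clauses]   # bias term plus two toy features

# Hypothetical human-rated field-test responses for the "get ready for school" item.
rated = [
    ("first you wake up and then you put on your clothes and eat breakfast", 3),
    ("you brush your teeth then you get dressed and then you eat breakfast", 4),
    ("wake up eat", 1),
    ("you get dressed", 2),
    ("i don't know", 0),
]

X = np.array([features(t) for t, _ in rated])
y = np.array([s for _, s in rated], dtype=float)
w, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit

new_response = "first you eat breakfast and then you walk to the bus"
machine_score = float(np.clip(np.array(features(new_response)) @ w, 0.0, 4.0))
print(f"machine score: {machine_score:.2f}")
```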

Validity evidence: Are machine scores comparable to human scores?
Measures we looked at:
Reliability (internal consistency)
Candidate-level (or test-level) correlations
Item-level correlations
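As a sketch of how the comparability measures listed above can be computed, here is a short Python example of Cronbach's alpha over a candidates-by-items score matrix and a Pearson correlation between human and machine totals. The data are simulated; only the formulas correspond to the measures named on the slides.

```python
# Illustrative computation of the evidence listed above: Cronbach's alpha for
# internal consistency and the Pearson correlation between human and machine
# total scores. The data are simulated, not AZELLA data.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = candidates, columns = items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(0.0, 1.0, size=200)                         # latent ability
items = ability[:, None] + rng.normal(0.0, 0.5, size=(200, 10))  # 10 noisy item scores

print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

human_total = items.sum(axis=1)
machine_total = human_total + rng.normal(0.0, 1.0, size=200)     # noisy machine totals
r = np.corrcoef(human_total, machine_total)[0, 1]
print(f"human-machine correlation: {r:.2f}")
```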

Structural reliability

Stage | Human Cronbach's alpha | Machine Cronbach's alpha
I | 0.98 | 0.99
II | 0.99 | 0.99
III | 0.96 | 0.94
IV | 0.95 | 0.94
V | 0.96 | 0.95
Average | 0.97 | 0.96

Scatterplot by Stage

[Scatterplots of machine vs. human scores for Stages II, III, IV, and V]

Item-level performance: by item type

Item Type (Stage II) | Human-human correlation | Machine-human correlation
Questions about an image | 0.87 | 0.86
Give directions from a map | 0.82 | 0.84
Open questions about a topic | 0.75 | 0.72
Give instructions to do something | 0.83 | 0.80
Repeat | 0.95 | 0.85

The human-human correlation gives us a baseline. Machine performance very closely approximates human raters' performance. For some item types, when human raters don't agree with each other on scoring an item, machine-human agreement goes down as well.

Item-level performance: by item type

Item Type (Stage IV) | Human-human correlation | Machine-human correlation
Questions about an image | 0.84 | 0.84
Give directions from a map | 0.84 | 0.90
Open questions about a topic | 0.82 | 0.82
Detailed response to a topic | 0.85 | 0.87
Give instructions to do something | 0.87 | 0.90
Repeat | 0.96 | 0.89

In some cases, machine grading outperforms human raters in terms of consistency.

Summary of Score Comparability
Machine-generated scores are comparable to human ratings

Reliability (internal consistency)
Test-level correlations
Item-type-level correlations

Test Administration
Preparation
One-on-one practice: student and test administrator
Demonstration Video
Landline Speaker Telephone for one-on-one administration
Student Answer Document
Unique Speaking Test Code

Test Administration

Test Administration
Warm Up Questions:
What is your first and last name?
What is your teacher's name?
How old are you?
Purpose of the Warm Up Questions:
Student becomes more familiar with prompting
Sound check for student voice level and equipment
Capture demographic data to resolve future inquiries
Responses are not scored

Challenges

Challenge: Landline speaker telephone availability
Solution: ADE purchased speaker telephones for the first year of administration

Challenge: Difficulty scoring the young population
Solutions: Additional warm up questions; added beeps to prompt the student to respond; adjusted the acceptable audio threshold; rubric update and scoring engine recalibration

Challenge: Incorrect Speaking Codes
Solutions: Captured demographics from warm up questions; updated the Speaking code key entry process; documented test administrator name and time of administration

Summary
Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments

Standardized test delivery
Minimal test set-up and training required
Consistent scoring
Availability of test data for analysis and review

Questions