

Marti A. Hearst, University of California, Berkeley

hearst@sims.berkeley.edu

TRENDS & CONTROVERSIES

Beyond Automated Essay Scoring
Karen Kukich, Educational Testing Service

The ability to communicate in natural language has long been considered a defining characteristic of human intelligence. Furthermore, we hold our ability to express ideas in writing as a pinnacle of this uniquely human language facility—it defies formulaic or algorithmic specification. So it comes as no surprise that attempts to devise computer programs that evaluate writing are often met with resounding skepticism. Nevertheless, automated writing-evaluation systems might provide precisely the platforms we need to elucidate many of the features that characterize good and bad writing, and many of the linguistic, cognitive, and other skills that underlie the human capacity for both reading and writing.

Using computers to increase our understanding of the textual features and cognitive skills involved in creating and comprehending written text will have clear benefits. It will help us develop more effective instructional materials for improving reading, writing, and other human communication abilities. It will also help us develop more effective technologies, such as search engines and question-answering systems, for providing universal access to electronic information.

A sketch of the brief history of automated writing-evaluation research and its future directions might lend some credence to this argument.

Pioneering research

Ellis Page set the stage for automated writing evaluation (see the timeline in Figure 1).1 Recognizing the heavy demand placed on teachers and large-scale testing programs in evaluating student essays, Page developed an automated essay-grading system called Project Essay Grader. He started with a set of student essays that teachers had already graded. He then experimented with a variety of automatically extractable textual features and applied multiple linear regression to determine an optimal combination of weighted features that best predicted the teachers' grades. His system could then score other essays using the same set of weighted features. PEG's scores showed a multiple R correlation with teachers' scores of .78—almost as strong as the .85 correlation between two or more teachers.

In the 1960s, the kinds of features we could automatically extract from text were limited to surface features. Some of the most predictive features Page found included average word length, essay length in words, number of commas, number of prepositions, and number of uncommon words—the latter being negatively correlated with essay scores. Page called these features proxies for some intrinsic qualities of writing competence. He had to use indirect measures because of the computational difficulty of implementing more direct measures.
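To make the regression idea concrete, here is a minimal sketch, not Page's actual code: the proxy features follow the list above, but the word lists, training essays, and grades are invented, and a real system would be fit to hundreds of teacher-graded essays.

```python
import re
import numpy as np

COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}

def proxy_features(essay):
    """Surface proxies in the spirit of PEG: no parsing, just counting."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    n = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n,                    # average word length
        float(len(words)),                                 # essay length in words
        float(essay.count(",")),                           # number of commas
        float(sum(w in PREPOSITIONS for w in words)),      # number of prepositions
        float(sum(w not in COMMON_WORDS for w in words)),  # "uncommon" words (toy list)
    ]

# Invented training data; a real model is fit to a large sample of graded essays.
training = [
    ("The committee deliberated at length, and its conclusions were persuasive.", 5.0),
    ("I like dogs. Dogs are good.", 2.0),
    ("Economic policy, in my considered view, depends on several interacting factors.", 4.0),
]

X = np.array([[1.0] + proxy_features(e) for e, _ in training])  # prepend intercept
y = np.array([grade for _, grade in training])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)                 # least-squares fit

new_essay = "Cats, on the other hand, are independent and need little attention."
predicted = np.array([1.0] + proxy_features(new_essay)) @ weights
print(round(float(predicted), 2))
```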

Despite its impressive success at predicting teachers' essay ratings, the early version of PEG received only limited acceptance in the writing and education community, precisely because it used indirect measures of writing skill. Critics argued that using indirect measures left the system vulnerable to cheating, because students could artificially enhance their scores using tricks—they could simply write a longer essay, for example. Another, more important, criticism was that because indirect measures did not capture important qualities of writing such as content, organization, and style, they couldn't provide instructional feedback to students.


The debate on automated essay grading

In this installment of Trends & Controversies, we look at a controversy: the use of computers for automated and semiautomated grading of exams. I am very pleased to have a well-rounded discussion of the topic, because in addition to three technical contributions, we have a commentary from Robert Calfee, the Dean of the School of Education at UC Riverside and an expert in the field of educational testing.

First, Karen Kukich, the director of the Natural Language Processing group at Educational Testing Service, provides us with an insider's view of the history of the field of automated essay grading and describes how ETS is currently using computer programs to supplement human judges in the grading process. Then, Tom Landauer, Darrell Laham, and Peter Foltz describe the use of Latent Semantic Analysis in a commercial essay-scoring system called IEA. They also address important ethical questions. Lynette Hirschman, Eric Breck, John Burger, and Lisa Ferro report on MITRE's current efforts toward automated grading of short-answer questions and discuss the ramifications for the design of general question-answering systems. Finally, Robert Calfee places these developments in the framework of current educational theory and practice.

After three years editing the Trends & Controversies feature, it is time for me to pass the column on to others. Thank you for reading, and I hope our trendy and controversial contributors' essays have enhanced your understanding of the shape of the field.

— Marti Hearst


Although the general approach—identifying textual features correlated with good writing—was sound, a significant research challenge remained: identifying and automatically extracting more direct measures of writing quality.

In the early 1980s, the Writer's Workbench tool set took a first step toward this goal.2 WWB was not an essay-scoring system. Instead, it aimed to provide helpful feedback to writers about spelling, diction, and readability. In addition to its spelling program—one of the first spelling checkers—WWB included a diction program that automatically flagged commonly misused and pretentious words, such as irregardless and utilize. It also included programs for computing some standard readability measures based on word, syllable, and sentence counts, so in the process it flagged lengthy sentences as potentially problematic. Although WWB programs barely scratched the surface of text, they were a step in the right direction for the automated analysis of writing quality.
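As an illustration of the two kinds of checks just described (this is not the Writer's Workbench code): the diction list below is a two-entry stand-in and the syllable counter is a crude heuristic, but the readability score is the standard Flesch Reading Ease formula built from word, syllable, and sentence counts.

```python
import re

FLAGGED_DICTION = {"irregardless": "regardless", "utilize": "use"}

def count_syllables(word):
    # crude vowel-group heuristic; real tools use pronunciation dictionaries
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def review(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    flesch = 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)
    print(f"Flesch Reading Ease: {flesch:.1f}")
    for w in words:
        if w.lower() in FLAGGED_DICTION:
            print(f"consider replacing '{w}' with '{FLAGGED_DICTION[w.lower()]}'")

review("Irregardless of the outcome, we will utilize every available resource. The plan is sound.")
```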

Recent research

By the 1990s, progress in the fields of natural-language processing and information retrieval encouraged researchers to apply new computational tools and techniques to the challenge of automatically extracting from essays more direct measures of writing quality.

Finding more direct measures. Essay-scoring guidelines for the Analytical Writing Assessment portion of the Graduate Management Admission Test specify a set of general qualities of writing to evaluate (see www.gmat.org). Examples include syntactic variety, topic content, and organization of ideas. A team of ETS researchers, led by Jill Burstein, hypothesized a set of linguistic features that might more directly measure these general qualities—features they could automatically extract from essays using NLP and IR techniques.

For example, the ETS researchers could measure syntactic variety using features that quantify types of sentences and clauses found in essays, and they could approximate values for these features using syntactic processing tools available in the NLP community. They could measure topic content using vocabulary content analyses, deriving values for these features using vector space modeling techniques now common in IR. They used these techniques to compute similarity measures between documents based on weighted frequencies of vocabulary terms occurring in documents.
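A minimal sketch of that vector-space comparison, using scikit-learn's TF-IDF weighting and cosine similarity (the essays and the particular weighting scheme are stand-ins; the ETS work predates this library):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

essays = [
    "The argument assumes that falling sales were caused by poor advertising.",
    "Poor advertising is blamed for the decline in sales, but other causes exist.",
    "My favorite season is autumn because the weather is cool.",
]

# Documents as weighted term-frequency vectors, compared pairwise by cosine.
vectors = TfidfVectorizer(stop_words="english").fit_transform(essays)
print(cosine_similarity(vectors).round(2))  # the two on-topic essays score highest together
```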

However, the researchers needed more sophisticated techniques to identify the essays' individual arguments and to evaluate their rhetorical structure. So, they devised a technique for approximating values for these features by first partitioning an essay into individual arguments using NLP techniques based on the identification of specific lexical and syntactic cues. They then applied vocabulary content analysis to each argument.

The e-rater prototype. A pilot version of the computerized GMAT Analytical Writing Assessment provided the data for a series of preliminary automated essay scoring studies. The AWA requires each student to write two essays, one to analyze an argument presented in a short text and another to express an opinion on a specific issue presented in a brief statement. Preliminary studies began with two essay sets, one for each essay type. Each set contained over 400 essays, and all the essays in each set addressed the same topic. Two writing experts using the GMAT guidelines scored each essay on a six-point scale. If the scores they assigned differed by more than one point, which happened in approximately 10% of the cases, a third expert reader resolved the discrepancy.

Figure 1. A timeline of research developments in writing evaluation. (This timeline is not comprehensive. This article focuses mainly on research and development at Educational Testing Service.)

Illustration by Sally Lee


ETS researchers defined more than 100 automatically extractable essay features—including the linguistic features mentioned earlier and a variety of proxy features. Then, they implemented computer algorithms to extract values for every feature from each essay. For both essay topics, they subjected various subsets of features to stepwise linear regression to determine optimal scoring models, or sets of weighted features, predictive of the scores the experts assigned.

They then tested each scoring model on an additional set of essays written on one of the same two topics. They extracted model-relevant features from the new essays and summed the weighted feature values for each essay to predict the score the writing experts assigned to that essay. Many models built in this manner achieved excellent results. The scores they assigned had the same level of agreement as the two writing experts—that is, they agreed approximately 90% of the time. The most important result was that models consisting of mainly linguistic features worked as well as those containing only proxy features, thus providing evidence that we could automatically score essays using more direct measures of writing quality.
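The scoring protocol and the agreement figure are easy to make concrete. A toy sketch, with invented score pairs, of the "agree within one point, otherwise call a third reader" rule used above:

```python
def needs_third_reader(a, b):
    # two readings on the six-point scale "agree" if they differ by at most one point
    return abs(a - b) > 1

pairs = [(4, 5), (3, 3), (6, 4), (2, 2), (5, 6)]            # invented score pairs
agree = sum(not needs_third_reader(a, b) for a, b in pairs)
print(f"adjacent agreement: {agree / len(pairs):.0%}")       # 80% for these pairs

for a, b in pairs:
    if needs_third_reader(a, b):
        print(f"scores {a} and {b} differ by more than one point: send to a third reader")
```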

ETS patented the resulting automated essay scoring system, CAEC (Computer Analysis of Essay Content). Subsequent studies refined the linguistic features and their algorithms and tested the system on numerous other essay sets, each addressing a different topic. These studies demonstrated that the automated scoring technique, renamed e-rater, generalized across essay topics.3 Since then, research has confirmed the psychometric validity of e-rater scores, in terms of external measures of students' writing abilities, cultural and second-language differences, and subject-specific applications such as advanced-placement tests in US history and English literature. (Visit www.ets.org/research/erater.html for more information on e-rater studies.)

PEG. Meanwhile, the PEG system was also undergoing transformations to include more direct measures of writing quality. In 1995, Page reported that PEG's "current programs explore complex and rich variables, such as searching each sentence for soundness of structure and weighing these ratings across the essay."4 Other publications also discuss PEG's performance, although the specific features it now measures remain undisclosed.

The Intelligent Essay Assessor. At the same time, Tom Landauer and his colleagues were developing another approach to more direct measures of writing quality, called Latent Semantic Analysis. LSA aims at going beneath the essay's surface vocabulary to quantify its deeper semantic content.5 Its main advantage is that it captures transitivity relations and collocation effects among vocabulary terms, thereby letting it accurately judge the semantic relatedness of two documents regardless of their vocabulary overlap (see "The Intelligent Essay Assessor" on page 27).


Karen Kukich is a principal research scientist and Director of the Natural Language Processing Research Group at Educational Testing Service in Princeton, N.J. Highlights of her career in NLP research and development include creating the first fully automated natural language report generation system, publishing an award-winning article on automated spelling correction, deploying a documentation generator for telephone network planning, and spearheading the operational e-rater team. She holds a BS in mathematics and philosophy and an MS and PhD in information science, all from the University of Pittsburgh. Contact her at [email protected].

Lynette Hirschman is the chief scientist at MITRE's Information Technology Division in Bedford, Mass. Her research interests include natural language processing for both written and spoken language, as well as conversational interaction. She has had a long-standing interest in the evaluation of natural-language systems and has written extensively on the subject. She is responsible for the Human Language Technology research at MITRE and is leading a long-term research project focused on building an automated system that can take and pass elementary school reading comprehension tests. She received her PhD from the University of Pennsylvania in formal linguistics. Contact her at the MITRE Corp., Bedford, MA; [email protected].

Eric Breck is a senior artificial intelligence engineer at MITRE's Information Technology Center in Bedford, Mass. At MITRE, he has worked extensively on automated scoring methods for question answering and reading comprehension. He is also interested in reverse engineering. He received his BS in linguistics and mathematics from the University of Michigan at Ann Arbor. Contact him at the MITRE Corp., Bedford, MA; [email protected].

Marc Light is a lead scientist at the MITRE Corporation. He leads MITRE's research in question answering, creating a search engine that can respond to factual questions with concise answers instead of documents. He also works on MITRE's reading comprehension research and led the Johns Hopkins Summer Workshop on Reading Comprehension during July and August 2000. His research focuses on empirically driven approaches to human-language processing, specifically those involving statistical methods. He received a PhD in computer science from the University of Rochester. Contact him at the MITRE Corp., Bedford, MA; [email protected].

John Burger is a lead scientist at the MITRE Corporation. He has been involved in MITRE's natural language processing research since 1988. His research interests include the application of statistical approaches to language processing and information retrieval, as well as machine learning in general. He received a BS in mathematics and computer science from Carnegie-Mellon University. Contact him at the MITRE Corp., Bedford, MA; [email protected].

Lisa Ferro is a senior artificial intelligence engineer at MITRE's Information Technology Center in Bedford, Mass. She has recently been focusing on producing resources for the evaluation of natural language systems and for the development of corpus-based systems. She received her PhD in linguistics from the University of Connecticut. Contact her at the MITRE Corp., Bedford, MA; [email protected].


Landauer and his colleagues recognized that LSA could provide a novel contribution to writing evaluation applications. They carried out research and development to test LSA's potential to score essays, evaluate summaries students wrote, and even evaluate college students' classroom writing assignments.6,7 Their work culminated in the development of the Intelligent Essay Assessor system. They report essay-scoring accuracy similar to e-rater and PEG using IEA's measures of semantic quality and quantity, providing additional evidence that we can automatically derive more direct measures of writing quality.

Operational writing-evaluation systems

The availability of more direct and defensible measures of writing quality, along with a growing need for grading assistance for teachers and large-scale testing programs, ultimately opened minds and doors to the feasibility of automated writing evaluation. In the late 1990s, several automated writing-evaluation systems, including e-rater, PEG, and IEA, made the transition from research prototypes into fully operational systems.

In February 1999, e-rater became fully operational within ETS's Online Scoring Network for scoring GMAT essays. Each time ETS test developers introduce a new essay topic, OSN sends examinees' essays to two or more ETS writing experts to be scored in the usual manner. Once a sufficiently large sample of expertly scored essays accumulates, OSN invokes e-rater's automated model builder to create and cross-validate a scoring model for that essay topic. Thus, new e-rater scoring models are certified in the same way new writing experts are certified. Once certified, a new e-rater scoring model automatically becomes one of the first two "experts" to score subsequent essays on that topic. None of OSN's mechanics change with the introduction of e-rater. All essays still receive at least two readings and require a third human expert to resolve scores that differ by more than one point. With almost half a million GMAT essays being scored each year, using e-rater clearly relieves a significant portion of the load on human scoring experts.

For high-stakes assessments, such as the GMAT exam, at least one human is always in the scoring loop. This safeguard helps prevent any radically creative or otherwise anomalous essays from slipping through the system unnoticed.

For low-stakes writing-evaluation applications, such as a Web-based practice essay system, a single reading by an automated system is often acceptable and economically preferable. For this purpose, ETS Technologies, a new subsidiary of ETS, has developed a fully automated service called Criterion. (See www.etstechnologies.com for more information about ETS Technologies products and services.) This service incorporates a set of safeguards for detecting off-topic and statistically anomalous essays. In addition, research is currently underway to enhance Criterion with additional writing-evaluation features that will provide students and teachers not only with holistic scores but also with diagnostic feedback about the specific strengths and weaknesses of the essays. Producing such feedback requires more basic research to identify even deeper, more direct measures of writing quality. Fortunately, ETS and other NLP researchers are now well positioned to pursue this challenge.

Current ETS writing research

Clearly, we've made progress toward identifying and automatically extracting more direct measures of writing quality. Research on e-rater, PEG, and IEA has identified automatically extractable semantic, syntactic, and rhetorical structure features that correlate with writing quality. These features are all measured holistically—that is, in terms of statistical averages over the whole essay text. But holistic scores do not tell the whole story.8 A student who receives a low score wants to know precisely where specific problems occurred in the essay. To give students and teachers useful feedback, automated systems must identify and extract finer-grained features of writing.

Thomas K. Landauer is a professor of psychology and a fellow of the Institute of Cognitive Science at the University of Colorado. He is also the president of Knowledge Analysis Technologies, a small company doing R&D and applications of computational simulations of human cognition. His research is devoted to developing, testing, and applying mathematical models of learning and language. He received his PhD from Harvard. Contact him at [email protected].

Darrell Laham is a cofounder and the CTO of Knowledge Analysis Technologies. He leads several ongoing LSA-based R&D projects on technologies for education, personnel management, and cooperative learning and decision-making. He holds an MA in educational measurement and a PhD in cognitive science from the University of Colorado. Contact him at Knowledge Analysis Technologies, 4001 Discovery Dr., Ste. 2110, Boulder, CO, 80303; [email protected].

Peter W. Foltz is an associate professor at New Mexico State University and also a cofounder of Knowledge Analysis Technologies. His research interests are in computational models of text and discourse, human memory, and techniques for information retrieval and filtering. He has a PhD in cognitive psychology from the University of Colorado. Contact him at the Dept. of Psychology, New Mexico State Univ., Box 30001, Las Cruces, NM, 88003; [email protected].

Robert Calfee is professor and the dean of the School of Education at the University of California, Riverside. He is a cognitive psychologist with research interests in the effects of schooling on the intellectual potential of individuals and groups. He earned his degrees at UCLA. Contact him at [email protected].


This is a much greater research challenge, but three recent ETS studies have already made progress toward this goal.

One study demonstrated the feasibility of a novel technique for detecting lexical-grammatical errors in essays,9 including word-specific usage errors such as "pollutions" or "knowledge at math," as well as general grammar violations such as "I concentrates" or "this conclusions." This technique, called ALEK (assessment of lexical knowledge), employs statistical models of probabilities of occurrences of word and part-of-speech bigrams and trigrams to detect unexpected words and word sequences such as the usage errors noted in the previous sentence (bigrams and trigrams are two-word and three-word sequences, such as "in the" and "in the beginning"). In a study that focused on 20 words, 79% of the usages that ALEK flagged were errors. However, a human reader detected many more errors than ALEK did (ETS researchers are working to address this problem). Furthermore, the total number of lexical-grammatical errors ALEK detected showed an inverse correlation with essay scores, indicating we could employ this feature as a more direct measure of writing quality and use it to provide explicit diagnostic feedback to the essay writer.
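A heavily simplified sketch of the underlying idea, not ETS's ALEK implementation: estimate bigram probabilities from a reference corpus and flag low-probability word sequences. ALEK also uses part-of-speech bigrams and trigrams and is trained on much larger corpora; the tiny corpus here over-flags, so a valid phrase like "on this" also gets reported.

```python
from collections import Counter

reference = (
    "i concentrate on my work . this conclusion is sound . "
    "knowledge of math helps . pollution is a problem ."
).split()

unigrams = Counter(reference)
bigrams = Counter(zip(reference, reference[1:]))

def bigram_prob(w1, w2, alpha=0.1):
    # add-alpha smoothing over the observed vocabulary
    vocab = len(unigrams)
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

student = "i concentrates on this conclusions".split()
for w1, w2 in zip(student, student[1:]):
    p = bigram_prob(w1, w2)
    if p < 0.05:
        print(f"unexpected sequence: '{w1} {w2}' (p={p:.3f})")
```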

Another study demonstrated the feasibility of using a current linguistic theory called Centering Theory to detect rough shifts in topic within essays.10 Centering Theory posits four types of transitions between sentences, ranging from easy to difficult (rough), based on the salience of entities referred to in succeeding sentences. The syntactic role an entity plays in a sentence—for example, subject, indirect object, direct object, and so forth—determines salience. A study of 100 GMAT essays showed that the ratio of rough shifts detected in essays was inversely correlated with essay scores. So, not only could we employ a rough-shift feature as a component in scoring models to measure incoherence in essays, we could also use it to direct essay writers to specific sentences that need improvement.
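A rough sketch of how such a rough-shift ratio could be computed, assuming the hard parts (parsing and coreference resolution) have already reduced each sentence to a salience-ranked entity list; the transition labels follow Centering Theory's four types, and the example entities are invented.

```python
def rough_shift_ratio(sentences):
    """sentences: list of salience-ranked entity lists (subject first), one per sentence."""
    labels = []
    prev_cb = None                      # backward-looking center of the previous sentence
    for prev, curr in zip(sentences, sentences[1:]):
        # Cb: highest-ranked entity of the previous sentence realized in the current one
        cb = next((e for e in prev if e in curr), None)
        cp = curr[0] if curr else None  # Cp: most salient entity of the current sentence
        if prev_cb is None or cb == prev_cb:
            labels.append("continue" if cb == cp else "retain")
        else:
            labels.append("smooth shift" if cb == cp else "rough shift")
        prev_cb = cb
    return labels, labels.count("rough shift") / max(len(labels), 1)

essay = [
    ["mary", "book"],      # "Mary bought a book."
    ["mary", "store"],     # "She carried it to the store."
    ["weather", "rain"],   # "The weather turned to rain."   <- abrupt topic change
    ["rain", "streets"],   # "The rain flooded the streets."
]
print(rough_shift_ratio(essay))  # one rough shift out of three transitions
```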

Yet another study focused on using automatically generated summaries to improve essay-scoring performance and to provide feedback.11 This study generated summaries based on the essays' rhetorical relations—implicit relations between sentences or clauses such as cause, contrast, or elaboration—found in essays. Rhetorical relations are sometimes cued by transition words such as because, however, furthermore, and so forth. This study showed that using a rhetorical-structure-based summarizer to extract an essay's salient content could not only enhance a scoring model's performance but also point essay writers directly to their salient content—or inform them of the lack thereof.
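At its shallowest, cue-word spotting can be sketched as a simple lookup (the full rhetorical-structure parsing used in the study is far more involved); the cue words and relation labels below follow the examples in the text, and the essay is invented.

```python
import re

CUE_WORDS = {
    "because": "cause",
    "however": "contrast",
    "but": "contrast",
    "furthermore": "elaboration",
    "moreover": "elaboration",
}

essay = ("The policy failed because funding was cut. However, enrollment rose. "
         "Furthermore, test scores improved.")

for sentence in re.split(r"(?<=[.!?])\s+", essay):
    cues = [CUE_WORDS[w] for w in re.findall(r"[a-z]+", sentence.lower()) if w in CUE_WORDS]
    print(f"{sentence}  ->  {cues or ['(no explicit cue)']}")
```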

Lexical-grammatical errors, rough shifts, and rhetorical relations are just three examples of finer-grained measures of writing quality that have proven to be statistically correlated with essay scores. However, we need further basic NLP research before any of these measures become operational. Although fully automated techniques are available for detecting lexical-grammatical errors and rhetorical relations, we need research to improve their accuracy. Rough-shift detection is partially automated; coreference resolution is the big challenge. We also need to identify other text features and cognitive skills correlated with writing quality. A parallel route to identifying these features and skills is through reading-comprehension research.

Future research and applications

Many of the features that affect writing quality also affect ease of text comprehension. For example, lexical-grammatical errors, rough shifts, and inappropriate cue markers for rhetorical relations will likely increase the difficulty of understanding an essay or any text passage. Researchers studying reading comprehension have suggested additional features, such as ease of locating antecedents of pronouns, use of literal versus abstract or metaphorical language, use of infrequent word senses, and ease of identifying topic chains.

One way to determine whether features such as these play a role in writing quality is to determine the role they play in students' abilities to answer questions about written passages. Fortunately, large databases of question-difficulty statistics based on student responses on reading comprehension exams are available. Just as we can extract features from essays and submit them to statistical analysis to determine which ones are most predictive of essay scores, we can also extract features from reading-comprehension passages and submit them to statistical analysis to determine which ones are most predictive of question difficulty. Some research studies using linear-regression techniques12 and tree-based-regression techniques13,14 have already demonstrated that features such as those mentioned earlier are predictive of question difficulty. So, additional NLP research into developing tools for automatically evaluating question difficulty is likely to apply equally to evaluating writing quality.
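As a sketch of the tree-based-regression idea, using scikit-learn with invented passage features and difficulty values (the real studies draw on large banks of student-response statistics):

```python
from sklearn.tree import DecisionTreeRegressor

# features per item: [passage length, pronoun count, rare word senses, abstract-language markers]
X = [
    [250, 3, 1, 2],
    [400, 9, 4, 6],
    [180, 1, 0, 1],
    [520, 12, 6, 8],
    [300, 5, 2, 3],
]
y = [0.25, 0.62, 0.18, 0.75, 0.40]  # empirical difficulty: proportion answering incorrectly

model = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(model.predict([[350, 7, 3, 4]]))  # predicted difficulty of a new item
```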

Currently, the NLP research community has expressed much interest in the challenge of automated question answering.15 People would like to be able to submit a question to a Web-based search engine and receive a short answer instead of an extensive list of more or less relevant documents. Unfortunately, automated question answering poses at least as great a challenge as short-answer scoring. Furthermore, as MITRE NLP researchers point out in this issue (see the essay "Automated Grading of Short-Answer Tests" on page 31) and elsewhere,16 we need short-answer scoring systems to manage the task of evaluating automated question-answering systems.

To the uninitiated, it might seem counterintuitive that scoring short answers poses a greater challenge than scoring essays. But it should be clear from the preceding sections that automated essay-scoring techniques can rely on statistical averages of general features such as overall vocabulary content and syntactic variety to derive evidence of writing quality. In contrast, short answers seem to provide little textual evidence of the writer's underlying meaning, hence the need for finer-grained measures and deeper analysis.

Researchers at ETS Technologies have been exploring techniques for scoring students' short-answer responses to end-of-chapter textbook questions. They have found this seemingly simple task requires a great deal of NLP power. Pronoun and other coreference resolution tools are essential because anaphora abounds in students' free-form responses.


In addition to general lexical databases, we need subject-specific thesauri. We also need special tokenizing and tagging techniques to derive parts of speech and partial parses from incomplete sentences and clauses. And all this computational machinery must derive some approximation of underlying predicate-argument or propositional structure to ultimately make a judgment about how much a short answer's text represents the target concepts that constitute a "correct answer."
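The end goal can be caricatured in a few lines: score a short answer by how many target concepts it covers. The concept lists below are invented stand-ins for the thesauri, parsers, and coreference tools the paragraph says are actually required.

```python
import re

TARGET_CONCEPTS = {
    "photosynthesis": {"photosynthesis", "photosynthesize", "photosynthesise"},
    "sunlight":       {"sunlight", "light", "solar"},
    "glucose":        {"glucose", "sugar", "sugars"},
}

def concept_coverage(answer):
    """Fraction of target concepts matched by at least one acceptable word form."""
    tokens = set(re.findall(r"[a-z]+", answer.lower()))
    covered = sum(bool(forms & tokens) for forms in TARGET_CONCEPTS.values())
    return covered / len(TARGET_CONCEPTS)

print(concept_coverage("Plants use light to turn water and CO2 into sugar."))  # ~0.67
```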

So, while the research challenges remain great, the benefits of using computers to increase our understanding of features and processes involved in creating and comprehending written text seem clearly worthwhile. The abilities to devise student-centered instructional systems for reading and writing, more effective search engines and question-answering systems, and universal access to electronic information will be just the beginning.

Acknowledgments

Contributors to this work from the ETS NLP Research Group include Magdalena Wolska, Daniel Zuckerman, Ramin Hemat, Dennis Quardt, Slava Andreyev, and Lisa Hemat; consultant Krishna Jha; graduate students Eleni Miltsakaki, Krishna Prasad, Konrad Szczesniak, Hoa Trang Dang, and Tom Morton; and former group members Susanne Wolff and Probal Tahbildar.

Contributors to this work from the ETS Technologies NLP Research Group include Jill Burstein, Claudia Leacock, Chi Lu, and Eleanor Bolge; consultant Martin Chodorow; and summer student Irek Szczesniak.

References

1. E.B. Page, "The Use of the Computer in Analyzing Student Essays," Int'l Rev. Education, Vol. 14, 1968, pp. 210–225.
2. N. MacDonald et al., "The Writer's Workbench: Computer Aids for Text Analysis," IEEE Trans. Comm., Vol. COM-30, No. 1, 1982, pp. 105–110.
3. J. Burstein et al., "Automated Scoring Using a Hybrid Feature Identification Technique," Proc. Ann. Meeting Association of Computational Linguistics, Montreal, Canada, 1998; www.ets.org/research/erater.html (current Nov. 2000).
4. E.B. Page and N.S. Petersen, "The Computer Moves into Essay Grading: Updating the Ancient Test," Phi Delta Kappan, Vol. 76, 1995, pp. 561–565.
5. T. Landauer and S. Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge," Psychological Rev., Vol. 104, 1997, pp. 211–240.
6. P. Foltz, W. Kintsch, and T. Landauer, "The Measurement of Textual Coherence with Latent Semantic Analysis," Discourse Processes, Vol. 25, 1998, pp. 285–308.
7. D. Laham, Automated Content Assessment of Text Using Latent Semantic Analysis to Simulate Human Cognition, PhD dissertation, Univ. of Colorado, Boulder, Colo., 2000.
8. R.E. Bennett and I.I. Bejar, "Validity and Automated Scoring: It's Not Only the Scoring," Educational Measurement: Issues and Practice, Vol. 17, No. 4, 1998, pp. 9–17.
9. M. Chodorow and C. Leacock, "An Unsupervised Method for Detecting Grammatical Errors," Proc. First Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-2000), Morgan Kaufmann, San Francisco, 2000, pp. 140–147.
10. E. Miltsakaki and K. Kukich, "Automated Evaluation of Coherence in Student Essays," Proc. LREC-2000, Linguistic Resources in Education Conf., Athens, Greece, 2000; www.ling.upenn.edu/~elenimi/grad.html (current Nov. 2000).
11. J. Burstein and D. Marcu, "Toward Using Text Summarization for Essay-Based Feedback," Proc. TALN 2000 Conf., Lausanne, Switzerland, 2000; www.isi.edu/~marcu/papers.html (current Nov. 2000).
12. R. Freedle and I. Kostin, "The Prediction of TOEFL Reading Item Difficulty: Implications for Construct Validity," Language Testing, Vol. 10, 1993, pp. 133–170.
13. K.M. Sheehan, "A Tree-Based Approach to Proficiency Scaling and Diagnostic Assessment," J. Educational Measurement, Vol. 34, 1997, pp. 333–352.
14. K. Sheehan, A. Ginther, and M. Schedl, Development of a Proficiency Scale for the TOEFL Reading Comprehension Section, TOEFL Research Report, ETS, Princeton, N.J., 1999.
15. M. Light, ed., Proc. Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, Association for Computational Linguistics, New Brunswick, 2000.
16. E.J. Breck et al., "How to Evaluate Your Question Answering System Every Day … and Still Get Real Work Done," Proc. LREC-2000, Linguistic Resources in Education Conf., Athens, Greece, 2000.

The Intelligent Essay Assessor
Thomas K. Landauer and Darrell Laham, University of Colorado and Knowledge Analysis Technologies
Peter W. Foltz, New Mexico State University and Knowledge Analysis Technologies

There is a widespread belief that most students have inadequate studying, critical thinking, and writing skills. Two likely causes of these deficiencies are an overreliance on multiple-choice testing and too few opportunities to assess verbalized knowledge. Although textbooks have long supplemented face-to-face student–teacher interaction with independent learning for acquiring knowledge, students might also profit from tools for independent learning for expressing knowledge. One such tool might be an intelligent system that quickly and consistently gives useful feedback on freely expressed knowledge.

The Intelligent Essay Assessor's core technology

The Intelligent Essay Assessor (IEA), an essay-analysis, scoring, and tutorial-feedback system, is one of many current and potential applications of Latent Semantic Analysis. LSA is a machine-learning technology for simulating the meaning of words and passages.1–4 The fundamental idea is that the aggregate of all the contexts in which words appear provides an enormous system of simultaneous equations that determines the similarity of meaning of words and passages to each other. LSA uses the matrix algebra technique of singular value decomposition to analyze a corpus of ordinary text of the same size and content as that from which students learn the vocabulary, concepts, and knowledge needed to write an expository essay. LSA represents every word and passage as a point in a high-dimensional semantic space. Relative position in the space estimates the similarity of meaning between any two words or passages. Simulations of many linguistic, psycholinguistic, and learning phenomena, as well as several other educational and personnel applications, show that LSA closely reflects corresponding similarities of meaning for humans.5–7


As measured by simulations of human performance on standardized vocabulary and domain-knowledge multiple-choice tests, LSA is always significantly more accurate—sometimes by factors of three or more times—than traditional keyword approaches that rely on the occurrence of the same words or word stems in two passages.
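A compact sketch of the LSA machinery itself (not the IEA code): build a term-passage matrix, take a truncated singular value decomposition, and compare passages in the reduced space. Real LSA spaces use a few hundred dimensions derived from a textbook-sized corpus; this toy uses four passages and two dimensions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the heart pumps blood through the arteries",
    "blood is carried from the heart by arteries and returned by veins",
    "the ventricles contract and push blood into circulation",
    "photosynthesis converts sunlight into chemical energy in plants",
]

counts = CountVectorizer().fit_transform(corpus)     # term-passage matrix
lsa = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions
passage_vectors = lsa.fit_transform(counts)

# Passages about the heart cluster together even with modest word overlap.
print(cosine_similarity(passage_vectors).round(2))
```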

LSA is the basis of IEA's assessment and tutorial feedback concerning the knowledge content of essays. It deals effectively with the fact that there are an unlimited number of ways to express nearly the same meaning in different words. LSA also lets IEA base its scores primarily on the opinions of human experts about similar essays, rather than relying on specific key words and other index variables that correlate with human scores on other essays.

Using LSA, IEA always keeps knowledge content the dominant factor in its scores. Because expressing knowledge well requires good writing, graders cannot completely isolate the content of an essay from its stylistic and mechanical qualities. However, making content primary has favorable consequences for face validity, immunity to coaching and counterfeiting, utility for diagnosis and advice at a conceptual level, and the potential to encourage valuable study and thought. IEA measures the content, style, and mechanics components separately, and whenever possible computes each component in the same way, so that score interpretation is comparable across applications. Because LSA is based on machine learning from ordinary text rather than, for example, from coding of language-dependent rules, we can automatically apply IEA's content measures with nearly equal facility for any language, including ones that don't use the Latin alphabet.

How IEA works

The user first trains IEA on a corpus of domain-representative text (for example, when scoring biology essays, a biology textbook, or when scoring creative narrative essays, a sample representing the lifetime reading of a typical test-taker). LSA characterizes student essays by representing their meaning and compares them with highly similar texts of known quality. It adds corpus-statistical writing-style and mechanics measures to help determine overall scoring, validate an essay as appropriate English (or other language), detect plagiarism or attempts to fool the system, and provide tutorial feedback.

IEA computes and combines the three major components (content, style, and mechanics), plus two or more accessory measures, as illustrated in Figure 2. For customized application, it can adjust the rule for combining the components within defined limits. The default application is constrained multiple regression on human scores in a training sample. IEA always computes self-validation, confidence, and counterfeiting measures, such as the plagiarism detection measure depicted in Figure 2.

The main technical difference between IEA and other approaches is this: Other systems work primarily by finding essay features they can count and that correlate with ratings human graders have assigned. They determine a formula for choosing and combining the variables that produces the best results on the training data. They then apply this formula to every to-be-scored essay. What principally distinguishes IEA is its LSA-based direct use of evaluations by human experts of essays that are very similar in semantic content. This method, called vicarious human scoring, lets the implicit criteria for each individual essay differ. Thus, different students can focus on different aspects of a question, using different words and styles, and get the same score if expert opinions so dictate.
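Vicarious human scoring can be sketched as a nearest-neighbor lookup in the semantic space: a new essay inherits a similarity-weighted average of the human scores of its most similar pre-scored neighbors, and anything too unlike the training essays is flagged for a human. The vectors, scores, and thresholds below are invented placeholders; IEA's actual combination rule is not public in this detail.

```python
import numpy as np

# (LSA vector, human score) for pre-scored training essays -- invented values
scored_essays = [
    (np.array([0.9, 0.1, 0.0]), 6),
    (np.array([0.8, 0.3, 0.1]), 6),
    (np.array([0.4, 0.6, 0.2]), 4),
    (np.array([0.1, 0.2, 0.9]), 2),
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vicarious_score(essay_vec, k=3, min_similarity=0.5):
    sims = sorted(((cosine(essay_vec, v), s) for v, s in scored_essays), reverse=True)
    top = sims[:k]
    if top[0][0] < min_similarity:
        return None  # too unlike anything seen before: flag for a human grader
    # similarity-weighted average of the neighbours' human scores
    return sum(sim * s for sim, s in top) / sum(sim for sim, _ in top)

print(vicarious_score(np.array([0.85, 0.2, 0.05])))  # about 5.5, pulled toward the "6" essays
```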

IEA in use. The Web-based version of IEA supplies instantaneous evaluations and, when implemented, tutorial advice. As reviewed later, IEA's overall scores are as reliable as those of a teacher or professional essay reader. The detail in its comments and suggestions is potentially unlimited, although not necessarily the same as what a human tutor would provide. For example, it can tell students what important content is missing from their papers and point them to relevant sources in their textbooks. It can also identify irrelevant and redundant sentences and report on conceptual coherence and other organizational qualities. However, as yet it cannot tell students whether they have made a specific point logically or persuasively, or that an independent clause should have been set off by a comma, and so forth.

Scoring calibration. IEA's use of LSA-based training on background text lets it analyze and score essays with few or even no prescored examples. Our research shows that IEA can usually be optimally calibrated with 100 prescored essays, and sometimes with as few as 20. However, by extending the LSA technology, IEA can train itself to give accurate rankings without using any human grades. Here it uses the varying knowledge the essays themselves express to align them on a continuum of quality. To score in this way, IEA typically needs 200 or more student essays on the same well-defined topic.

IEA can also be calibrated by comparing essays either to ideal answers or to sections of a textbook that students should have studied. These are the least desirable approaches, because students don't usually write answers similar to those of professors or authors and often write equally good answers in many different ways.

In its tutorial feedback implementations, IEA is typically incorporated into online courseware accompanying a textbook. Students write essays or summaries of sections or chapters. IEA provides immediate evaluation and points the student back to pages or sections of the text containing information that should have been used in the essay.


Figure 2. Schematic illustration of the computation of an Intelligent Essay Assessor score based on a customized combination of its three main components plus accessory measures.


The system also identifies irrelevant or redundant sentences and overall conceptual coherence. A well-controlled experiment has shown that the Summary Street version for middle schools produces significant improvement in relevant skills.8,9 Another experimental application has used IEA as an embedded assessment measure for online discussion-based learning environments, where it cumulatively measures the knowledge contained in individual students' contributions.

Reliability and validity of IEA

The standard way to evaluate the accuracy of a set of essay grades is to measure how well two independent scorings agree with each other—either scores two human judges gave, or one by an automatic grader and others by humans. This criterion is not objective or absolute because human judges have legitimate differences of opinion. It is best interpreted simply as an estimate of how well the score in question will predict additional human opinions.
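In practice this amounts to comparing correlation coefficients; a toy check with invented scores:

```python
import numpy as np

human_1 = np.array([4, 5, 3, 6, 2, 5, 4, 3])            # invented grades from reader 1
human_2 = np.array([4, 4, 3, 6, 3, 5, 5, 3])            # invented grades from reader 2
machine = np.array([4.2, 4.8, 3.1, 5.7, 2.4, 4.9, 4.4, 3.3])  # invented automatic scores

r_hh = np.corrcoef(human_1, human_2)[0, 1]  # human vs. human reliability
r_mh = np.corrcoef(machine, human_1)[0, 1]  # machine vs. one human

print(f"human-human r = {r_hh:.2f}, machine-human r = {r_mh:.2f}")
```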

There is no reason why an automatic grader cannot predict a human score better than another human score does. One way is for the automatic grader to be more consistent in evaluating some factor. However, we need to make sure that the factor the machine is measuring better is one we want to stress. Otherwise, relying on this form of validation might lead students to focus on the wrong things—for example, simply using more rare, "trigger," or topic-specific words.

There are other possible criteria for essay-scoring accuracy, such as correlating scores with other measures of knowledge or better agreement with more expert judges. We will report on such measures later. First we review studies of IEA's reliability versus human graders as compared to the reliability between human graders. For validation, we prefer to deal with correlations between the continuous IEA scores and whatever scores the human graders use. This method gives an unbiased estimate of how well one predicts the other, while avoiding the complication of classifying the scores into discrete "grade" or "score" groups, a matter that involves instruction and policy issues largely irrelevant to validity. (However, when desired, IEA statistically predicts the discrete human classifications.)

In each case, we collected numerous essays students wrote to the same prompt in a real examination. Either large national or international professional testing organizations, such as ETS and CRESST, or professors at major universities provided these prompts. At least two graders independently graded each essay. These graders were knowledgeable in the test's content domain and quality criteria and trained in its scoring according to either holistic rubrics or analytic components. They were blind to the IEA scores and, in the case of professional scoring, uninformed that an automatic scoring system would be used. The student groups taking the tests included sixth graders, high school and college students, graduate students in psychology, medical school students, and applicants to graduate business management programs. The 15 different topics included heart and circulatory anatomy and physiology (the same prompt at all student levels in various studies), neural conduction, Pavlovian and operant conditioning, aphasia, attachment in children, Freudian concepts, the history of the Great Depression, the history of the Panama Canal, ancient American civilizations, alternative energy sources, business and marketing problems, and a creative narrative composition task in which students were given a two-sentence beginning of a story and asked to complete it. In all cases, the test essays differed from the essays used to train the system.

Figure 3 shows averaged results, and Figures 4 and 5 show scatter plots for two cases—a GMAT "argument" essay and a story-completion narrative.

IEA scores on average correlated with a human score as well as one human score correlated with another. There was some variation across student groups, tests, and human graders. As expected, the more reliable the human graders were, the better IEA predicted their scores.

Using the professional grades, we analyzed the contribution of the three components to overall scores. When combined by linear regression, they predicted human grades with a correlation of .85. Alone, the content, style, and mechanics scores predicted human grades with correlations of .83, .68, and .66, respectively. When optimally weighted by linear regression, the relative contributions were .75, .13, and .11, respectively.

Other empirical validations of IEA accuracy. In three different ways, IEA essay scores have proven modestly more valid than human essay scores. First, in studies of essays on heart anatomy, LSA's scores predicted short-answer test scores over the same material better than did expert human scores on the same essays. Second, for a set of student essays on neural conduction, three sets of scores were obtained, one from undergraduate teaching assistants, one from graduate-student teaching assistants, and one from the professor. Using the same combined set for training, IEA agreed best with the professor, least with the undergraduates. The results of these two studies were statistically significant. Third, in a study involving GMAT essays, IEA was trained using all the pregraded scores of just one of the two graders for each essay. The IEA score predicted the second grader's score very slightly better than did the scores on which it was trained. This is attributable to IEA's comparison of a to-be-scored essay with all other essays, a kind of vicarious multiple human scoring.

Figure 3. Average Latent Semantic Analysis-based Intelligent Essay Assessor results: summary of reliability results for 3,296 essays on 15 diverse topics. The measure (reliability coefficient) is the correlation between two human experts (left bars) or between Intelligent Essay Assessor scores and one human grader, averaged over the two humans (right bars).


IEA's internal validity and oddity checks. IEA includes a battery of programs that estimate confidence in the accuracy of the system's score for a particular essay and check that the student wrote the essay in normal English word order. They also ascertain that the student is not trying to fool the system by larding it with rare or topical words; that an essay is not highly unusual, either by being very original or off topic; and that it is not a copy, paraphrase, or rearrangement of another essay. In all such cases, IEA flags the essay for special attention. For example, if comparison essays are insufficiently similar to the to-be-scored essay, or are too variable in their implications for content quality, the essay is passed to a human grader.

We know from empirical user testing that it is very difficult to trick the system into an incorrect grade. As in other systems, it is possible to compose a good essay and then do something to make it abnormal (for example, reordering some of its words or sentences), without incurring much penalty. However, we know of no way to get a high grade from IEA without knowing the material well. As an added precaution, we periodically add or substitute new counterfeit-detection routines that we do not reveal.

IEA responses to some social and philosophical issues

People sometimes worry that computer-based essay grading will fail to credit novel creative answers or answers that reflect greater knowledge than the system was taught. IEA is typically aimed at factual knowledge; we usually don't want highly creative essays on anatomy or jet engine repair. Nonetheless, even with topics in which creativity can be desirable, our experience with IEA has been favorable. For example, in opinion essays on the GMAT, there is ample opportunity for crediting creativity, yet IEA was as reliable as the professional readers. Similarly, as shown in Figure 5, on creative narratives, IEA scores agreed with highly trained expert grader scores as well as the latter agreed with each other. How could this happen? One hypothesis is that a constant setting permits only a limited variety of story themes, plots, and characters—ones that draw upon common knowledge and experience. LSA can capture the similarity of texts that differ only in irrelevant details. For example, IEA treats the theme of "a boy searching for his horse" as very similar to that of "a girl looking for her pony." If good and bad themes are spelled out well or poorly in typical ways, IEA will measure their quality in the same way that it assesses more obviously focused expression of knowledge.

There are other reasons for IEA's success with novel essays.


Figure 4. Scatter plot for Intelligent Essay Assessor scoring of two GMAT topics versus independent professional scores from ETS on a sample provided by ETS. On this sample, IEA and the ETS e-rater obtained the same reliabilities versus human readers. All comparisons were blind, and IEA was trained on different essays from those used for the test results shown. [Axes: IEA score (x) vs. human grade (y).]

Figure 5. Scatter plot for 900 creative narrative essays comparing Intelligent Essay Assessor scores and averaged scores for two highly trained professional readers from an international assessment organization. All comparisons were blind, and IEA was trained on different essays from the test results shown. The corresponding correlation coefficient is .90, identical to that between two expert human raters. [Axes: IEA score (x) vs. human grade (y).]


Because IEA is based on human judgments of similar essays, the range of performance that the system can measure is unlimited. For example, essays on heart anatomy and function were reliably scored, whether written by sixth graders, college undergraduates, or medical students. Essays better (or worse) than any seen during system training can be scored higher (or lower) than any in the training set. Assume that humanly scored "6" essays are on average highly similar to seven others scored "6" and three scored "5." If IEA encounters a new essay that is highly similar to ten essays scored "6," it might give it a "6.3." In self-calibrated scoring, IEA could, in principle, determine that a particular essay was better than any seen before, by as much as three standard deviations or more. (Of course, any essay that unique is more likely to be way off topic or psychopathic. In any case, it would be flagged.)

We discussed this issue because critics of machine essay grading often assume that computer systems must be slavishly measuring overlap with a finite model and are alarmed by an imagined lack of sensitivity to creativity and genius. On the contrary, it now appears possible that automatic methods can be superior to humans in this regard.

Unique contributions of IEA

To our knowledge, the Intelligent Essay Assessor is unique among commercially available systems in the following ways:

• It is always based primarily on semantic content, which it measures at a conceptual level rather than by the occurrence of selected words.
• It explicitly embodies holistic human judgments.
• It can be automatically applied to new topics and in new languages without manual construction of new rule sets or the like.
• It can provide useful tutorial commentary on missing content, semantic coherence, redundancy, and irrelevance.
• It detects plagiarism and "system gaming" and computes validity self-checks.
• It has been validated against double expert human scores across a wide variety of different topics and student populations.

This is not to imply that other systems are not capable of equal scoring accuracy—at least some are—or that none are using similar accessory measures. Nor would we claim that IEA is better than others for all purposes. For example, for a composition assignment in which each essay is on a different topic, or in which content is secondary to form, or for evaluating sentence structure, grammar, spelling, or the logic of an argument independent of what it is about, some other methods have greater face validity, and probably greater accuracy and utility.

The future of automatic scoring methods

All present technologies for automatic scoring of essays, including IEA, leave considerable room for improvement. More specifically, methods must evaluate and give critical feedback and suggestions for improvement in detailed matters of logic, syntax, and expression at the sentence level, and of clarity, comprehensibility, and affective qualities (such as humor, suspense, and evocativeness) at sentence, paragraph, and organizational levels. Doing those things will take much better articulated understanding and modeling of human language than we now have.


Automated Grading of Short-Answer Tests
Lynette Hirschman, Eric Breck, Marc Light, John D. Burger, and Lisa Ferro, The MITRE Corporation

The educational community and the public have accepted the idea of having a computer grade tests—provided that the tests are structured in such a way that there is no subjective judgment involved, as in grading tests with multiple-choice or yes-or-no answers. It is far more controversial to have computers grading essays, as in the e-rater system from Educational Testing Service1 or—as we discuss here—short-answer tests.

Why short-answer tests? Why automatic evaluation?

Why use short-answer tests instead of multiple choice? First of all, they are more "authentic": answering real-world questions is more like taking a short-answer test than taking a multiple-choice test. Another motivation is economic; constructing high-quality multiple-choice test items is expensive. Finally, multiple-choice tests, unlike short-answer tests, lend themselves to test-taking strategies that let students answer without really understanding the question.

Why automatic evaluation for short-answer tests? For the educational-testing community, one motivation is economic: if you can replace two human graders with one human and one system, you can reduce the cost of grading the examination. This substitution seems acceptable, as long as you can demonstrate that it won't affect the final grade and that human judges make the final decision, should the human and system disagree. A second, more important, motivation is that automated grading of short-answer questions provides students with much more immediate feedback—there is no need to wait for an instructor to provide a "ruling" on the correctness of the answer. This immediacy supports interactive drills and testing, including diagnostic feedback for intelligent tutoring. However, an automated grading system's success ultimately depends on its ability to closely approximate the kinds of judgments a human grader would make.

Natural language research

As natural language system developers, our perspective on this issue differs from that of educators. Our long-term research goal is to develop systems that can read and understand common types of articles, stories, or news reports, to help people gather and digest vast amounts of information. For the past several years, we have been working specifically on developing systems that can take and pass reading-comprehension examinations—both multiple-choice and short-answer tests.2 Figure 6 shows a sample reading-comprehension story and the related test questions.

How can we measure how well a system understands what it reads? One way is to have it answer questions about an article or story that it has read—that is, to have it take (and pass) the same kinds of tests we give to people, namely reading-comprehension tests. Our hypothesis is that if we can build systems that can pass general reading-comprehension tests, we can build commercially useful systems—systems that will provide factual answers to users' informational queries. We can track the progress of our research by "grading" these automated systems using the same tests that we use on people.

System development for any automated language-processing system requires constant testing with feedback: did the latest change make the system perform better or worse? This cycle becomes even more important when the modules of a reading-comprehension system rely on statistical methods, such as hidden Markov models, or various kinds of machine-learning techniques. In fact, the cycle of testing, feedback, and improvement isn't much different from what a student needs when learning new subject matter. As we noted earlier, students also benefit from a tight loop of studying, doing drills, receiving diagnostic feedback, and taking tests.

To support our research, we have created several reading-comprehension test corpora and an evaluation infrastructure. We also hope to engage other groups in building and evaluating reading-comprehension systems3 (see www.clsp.jhu.edu/ws2000/groups/reading for information on the Johns Hopkins Summer Workshop devoted to reading comprehension). There is a related research effort, namely the Text Retrieval Conference's open evaluation of question-answering systems, sponsored by the National Institute of Standards and Technology. The TREC question-answering track evaluates systems that answer factual questions from information in a multi-gigabyte document collection. For the 1999 evaluation, the systems were presented with 200 questions to answer; for the 2000 evaluation, they were given 700 questions. The systems must provide a short (50 characters) or long (250 characters) answer to each question.4,5 Human judges then review the question and each system's response to determine whether the system response constitutes a correct answer.

Figure 6. Sample reading-comprehension passage with questions. News story courtesy of the Canadian Broadcasting Corporation 4 Kids site, http://cbc4kids.com/general/whats-new/daily-news.

Mars Polar Lander—Where Are You?

(January 18, 2000) After more than a month of searching for a signal from NASA's Mars Polar Lander, mission controllers have lost hope of finding it. The Mars Polar Lander was on a mission to Mars to study its atmosphere and search for water, something that could help scientists determine whether life ever existed on Mars. Polar Lander was to have touched down December 3 for a 90-day mission. It was to land near Mars' south pole. The lander was last heard from minutes before beginning its descent. The last effort to communicate with the three-legged lander ended with frustration at 8 a.m. Monday.

"We didn't see anything," said Richard Cook, the spacecraft's project manager at NASA's Jet Propulsion Laboratory. The failed mission to the Red Planet cost the American government more than $200 million. Now, space agency scientists and engineers will try to find out what could have gone wrong. They do not want to make the same mistakes in the next mission. Controllers have been testing dozens of different scenarios to try and explain what might have happened to the lander. (Sources: Associated Press, CBC News Online, CBC Radio news, NASA) Copyright CBC/SRC, 1997. All Rights Reserved.

A. Who is the Polar Lander's project manager?
B. What was the mission of the Mars Polar Lander?
C. When did the controllers lose hope of communicating with the lander?
D. Where on Mars was the spacecraft supposed to touch down?
E. Why did NASA want the Polar Lander to look for water?

Figure 7. Answer keys and answer-word recall and precision calculations.

Question:
B. What was the mission of the Mars Polar Lander?

Correct sentence key:
Sentence 3: The Mars Polar Lander was on a mission to Mars to study its atmosphere and search for water, something that could help scientists determine whether life ever existed on Mars.

Answer key:
to study Mars' atmosphere and to search for water | to help scientists determine whether life ever existed on Mars

Sample system answer:
to study its atmosphere

Answer-word recall:
key 1 (alternative 1): [study, Mars, atmosphere, search, water]
system: [study, atmosphere]
Recall: 2/5 = 40%
Precision: 2/2 = 100%

Between the multi-site work on reading comprehension and the TREC conference on question answering, there is now an active natural language research community that needs a method for rapid automated grading of short-answer questions. In each development cycle of an automated question-answering system, it is necessary to run the system and test its performance, to make sure it is improving. For rapid development, this needs to be done many times a day. It simply isn't possible to wait for an expert human to grade each output the system produces before doing more development.

Strategies for automated short-answer grading

If we want to build an automated evaluation for short-answer questions, the first question is, what constitutes a correct answer? An easy ad hoc answer might be, whatever the human judge says is correct. One approach would be to examine many answers judged correct and build a classifier that we could train to produce similar judgments. Unfortunately, we don't have millions of "judged answers" available—although we do have the answer judgments from some 10 to 15 systems for the 200 TREC questions used in the 1999 evaluation.

A second approach—the one we are currently taking—is to compare the system answer to an answer key. The answer key might just come from the answers in the back of the book—if the book provides such answers. Otherwise, we have a human expert create a set of appropriate answers.6

An answer key consists of one or more acceptable answers for each question. For example, Figure 7 shows question B from Figure 6 together with its answer key. We see that there are two alternate answers listed, separated by a vertical bar:

to study Mars' atmosphere and to search for water | to help scientists determine whether life ever existed on Mars

Once we have the answer key, we can develop an automated comparison technique that measures the closeness of the system answer to the answer key.

Sentence correctness. Initially, we experimented with two measures: sentence correctness and answer-word recall. For sentence correctness, a human expert creates a correct sentence key, consisting of the sentence(s) from the passage that best answered the question. Determining correctness then simply requires comparing the sentence chosen by the system to the sentence in the answer key—or, if the sentences are numbered, comparing sentence numbers. Thus, in Figure 6, the correct answer sentence for question B ("What was the mission of the Mars Polar Lander?") is the third sentence: "The Mars Polar Lander was on a mission to Mars to study its atmosphere and search for water, something that could help scientists determine whether life ever existed on Mars." If the system returns that sentence, it is correct. If it returns another sentence, it is incorrect. However, there can be cases (a little over 10% in our simple initial corpus) where there is no single sentence that provides the answer. In addition, a sentence often contains more information than what is needed to answer the question: compare the length of the correct sentence (30 words) to the length of the answer key (9 or 10 words) in Figure 7. Clearly, we need some other measures of answer correctness.
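A minimal sketch of how such a sentence-correctness check might be tallied over a test set follows; the function name and the set-of-acceptable-sentence-numbers representation of the key are my assumptions, chosen only for illustration.

def sentence_accuracy(system_choices, key_choices):
    """Fraction of questions for which the system picked a keyed answer sentence.
    system_choices: sentence number chosen by the system for each question
    key_choices: set of acceptable sentence numbers for each question (usually one)"""
    if not system_choices:
        return 0.0
    hits = sum(1 for chosen, key in zip(system_choices, key_choices) if chosen in key)
    return hits / len(system_choices)

# For question B of Figure 6 the key is sentence 3, so:
# sentence_accuracy([3, 5], [{3}, {2}]) == 0.5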

Answer-word recall (and precision). Our long-term goal is to build a system that returns a concise answer, not just a sentence from the text. So we have also developed more fine-grained measures of answer correctness, based on overlap of stemmed content words between the answer key and the system answer. Answer-word recall measures coverage: it is the overlap between the system answer and answer key, divided by the number of words in the answer key. Perfect (100%) recall means all the answer key words appear in the system answer (possibly along with other words); 0% recall means none of the answer key words are in the system answer. Answer-word precision measures conciseness: it is the overlapping words divided by the number of words in the system answer. Perfect (100%) precision means that all words in the system answer appear in the answer key; 0% precision means that none of the system answer words appear in the answer key. In Figure 7, if the system returned as its answer "to study its atmosphere," there are two stemmed content words in the system answer (study, atmosphere), five content words in the first key (study, Mars, atmosphere, search, water), and eight in the second (help, scientist, determine, whether, life, ever, exist, Mars). The first answer key alternative provides a better score, based on the two overlapping content words: it yields a recall of 2/5 (40%) and a precision of 2/2, or 100%.
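The following sketch reproduces the answer-word recall and precision calculations of Figure 7. The stopword list, the crude stemmer, and the rule of scoring against the best-matching key alternative are simplifying assumptions of mine, not the authors' exact procedure.

# Answer-word recall and precision against a key with "|"-separated alternatives.
import re

STOP = {"a", "an", "the", "to", "and", "or", "for", "of", "in", "on", "its", "is", "was"}

def content_words(text):
    """Lowercased, stopword-filtered, crudely stemmed content words."""
    out = set()
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in STOP:
            continue
        if w.endswith("ed") and len(w) > 4:
            w = w[:-2]
        elif w.endswith("s") and len(w) > 3 and not w.endswith("ss"):
            w = w[:-1]
        out.add(w)
    return out

def recall_precision(system_answer, key_alternatives):
    """Return (recall, precision) against the best-matching key alternative."""
    sys_words = content_words(system_answer)
    best = (0.0, 0.0)
    for alt in key_alternatives.split("|"):
        key_words = content_words(alt)
        overlap = len(sys_words & key_words)
        r = overlap / len(key_words) if key_words else 0.0
        p = overlap / len(sys_words) if sys_words else 0.0
        best = max(best, (r, p))
    return best

key = ("to study Mars' atmosphere and to search for water | "
       "to help scientists determine whether life ever existed on Mars")
print(recall_precision("to study its atmosphere", key))  # -> (0.4, 1.0), as in Figure 7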

Our initial experiments used answer-word recall only, because our early systems returned an entire sentence. Using precision was less informative, because all the responses were sentences of roughly equal length. Also, because the sentences contained many extraneous words, the precision would have been very low. In these experiments, we found that the correct answer sentence and answer-word recall were reasonably well correlated. We compared performance with versions of the system containing different modules and found that if a module (for example, a proper name tagger) helped under one metric, it generally helped under the other. However, we also wanted to see how well these methods (particularly answer-word recall) corresponded to human judgments.

Figure 8. ROC curve for answer-word recall versus the human answer.

To compare our answer-word recall measure to human judgments, we used the answers returned from various automated systems in the TREC evaluation, together with the associated human judgments of answer correctness. We asked our expert, Lisa Ferro of the MITRE Corporation, to create an answer key for the TREC data, and we plotted the answer-word recall against the recorded human judgments, giving the receiver operating characteristic (ROC) curve shown in Figure 8. Specifically, we wanted to see the effect of selecting different thresholds for answer-word recall: to call an answer correct, should we require 25% answer-word recall or 100%? The ROC curve and the detailed figures in Table 1 show the trade-offs in selecting this threshold.7 For example, if we call an answer correct when it has a word recall of over 25%, we get a hit rate of 93.6% (the automated system scores the answer as correct, given that the human assessor judges the system correct) and a false alarm rate of 6.6% (where the automated scorer judges the answer correct, although the human judge has called it incorrect). In general, there is good correlation between positive answer-word recall and the human assessors judging the answer correct.
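A small sketch of this thresholding analysis follows, assuming a list of answer-word recall values paired with the corresponding human correct/incorrect judgments; the strict greater-than comparison and the variable names are my assumptions.

def hit_and_false_alarm(recalls, human_correct, threshold=0.25):
    """Hit rate and false-alarm rate of 'recall > threshold' as a correctness test.
    recalls: answer-word recall per response; human_correct: parallel booleans."""
    hits = misses = false_alarms = rejections = 0
    for r, correct in zip(recalls, human_correct):
        auto_correct = r > threshold
        if correct:
            hits += auto_correct
            misses += not auto_correct
        else:
            false_alarms += auto_correct
            rejections += not auto_correct
    hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
    fa_rate = false_alarms / (false_alarms + rejections) if (false_alarms + rejections) else 0.0
    return hit_rate, fa_rate

# Sweeping the threshold from 0 to 1 and plotting (fa_rate, hit_rate) pairs traces
# out the ROC curve; at a 25% threshold the article reports roughly a 93.6% hit rate
# and a 6.6% false-alarm rate.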

Based on this correlation, do we have a satisfactory automated method to grade short answers? Not yet. First, there are some obvious problems with our methodology. Word overlap is far too simple. For example, in this experiment, wrong answers predominate. If we had an automated scoring method that said all answers were wrong, it would agree about 85% of the time with the human judges; at a word recall threshold of 25%, the automated system and the human judge agree 93.6% of the time. Furthermore, there are limitations with word recall as a metric, regardless of threshold. Almost 6% of correct answers have 0% word recall (no word overlap between the answer key and a correct answer), and 1.7% of the incorrect answers have 100% word recall (all the answer key content words are found in the answer, but it's still wrong).

To understand the discrepancies between the human judges and the automated comparison, we randomly selected 990 responses; out of these, we examined the 72 responses for which the automated system differed from the human judge. The results are shown in Table 2. In 7 cases, the human grader appears to have made a mistake; in 27 cases, it is unclear whether the human or the automated grading system was correct—so in almost half the cases (47%), it wasn't even clear if the automated system was wrong. For the remaining 38 cases, the automated decision based on thresholded word recall was clearly wrong, but many of these are easily fixed. Half of these errors (19/38) could be fixed by normalizing comparisons across different kinds of numerical expressions. This would fix the mismatch between an answer of "three" versus the answer key "3," or "Tuesday" and "April 3." Other discrepancies (7/38) were due to differences in answer granularity or answer phrasing—for example, if the answer key says "George Washington" and the system returns "Washington." And the remaining 12 discrepancies were due to other problems. So, with some additional work, answer-word recall might approximate human judgment reasonably well—or at least well enough to support rapid system development.
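A hedged sketch of the kind of normalization that would repair some of these mismatches, for example mapping spelled-out numbers to digits and accepting a partial-name match; the lookup table and matching rule are illustrative assumptions, not the authors' implementation.

# Illustrative normalization before comparing a system answer to a key answer.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
                "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10"}

def normalize(text):
    """Lowercase and map spelled-out numbers to digits."""
    return [NUMBER_WORDS.get(w, w) for w in text.lower().split()]

def loose_match(system_answer, key_answer):
    """Exact match after normalization, or a partial-name match in which the
    system answer covers part of a multiword key (e.g., a bare surname)."""
    sys_words, key_words = normalize(system_answer), normalize(key_answer)
    if sys_words == key_words:
        return True
    return bool(sys_words) and set(sys_words) <= set(key_words)

print(loose_match("three", "3"))                       # True
print(loose_match("Washington", "George Washington"))  # True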

But aside from these detailed concerns, answer-word recall is a very limited measure. It ignores many important dimensions of what makes an answer correct. These include intelligibility (the coherence of the answer, to make sure it isn't just a "bag of words"); conciseness, or absence of extraneous material (measured perhaps by answer precision); and justifiability (providing some evidence from the relevant passage that the entity has the appropriate characteristics and the system didn't just guess correctly). Moreover, appropriate measures must be sensitive to the specific instructions that the student receives. Is the student asked to provide a minimal answer, the best phrase, or a sentence from the text? A conciseness measure, such as precision, is appropriate if the student is instructed to provide a short answer—but it is irrelevant if the student is asked to identify the best sentence from the text as an answer, where the student has no control over conciseness. We need to develop task-appropriate measures that capture these additional dimensions of answer correctness.

We suspect that, in the long run, building a system that can grade a short-answer test might be almost as hard as building a system that can take (and pass) a short-answer test. If we succeed, we will have created a useful tool that will help both developers of natural-language understanding systems and educational-test developers. Ultimately, these methods will benefit both students and teachers—making drill and self-test materials more readily available to students and providing them with better feedback, while removing some of the drudgery from teaching by providing help with routine grading.

Table 1. Human judgments compared to thresholded answer-word recall.

                    Human judged as incorrect          Human judged as correct
Recall              Frequency      Incorrect (%)       Frequency      Correct (%)
0.00                29,709         92.4                336            5.8
0.01-0.25           325            1.0                 36             0.6
0.26-0.50           1,399          4.4                 747            13.0
0.51-0.75           173            0.5                 109            1.9
0.76-0.99           5              0.0                 61             1.1
1.00                548            1.7                 4,479          77.7
Total               32,159         100.0               5,768          100.0

Table 2. Error analysis comparing human to automated judgments for 72 discrepancies.

Source of discrepancy                           #      %
TREC assessor (arguably) wrong                  7      10%
Responses seemed relevant ("tough call")        27     37%
Thresholded word recall score wrong             38     53%
Total                                           72     100%

References
1. J. Burstein et al., "Computer Analysis of Essays," Proc. NCME Symp. Automated Scoring, 1998; www.ets.org/research/ncmefinal.pdf (current Nov. 2000).

2. L. Hirschman et al., "Deep Read: A Reading Comprehension System," Proc. 37th Ann. Meeting Assoc. Computational Linguistics, Assoc. for Computational Linguistics, College Park, Md., 1999, pp. 325-332.
3. M. Light et al., Proc. Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, Assoc. for Computational Linguistics, New Brunswick, 2000.
4. E.M. Vorhees and D.K. Harman, The Eighth Text Retrieval Conf., NIST Special Publications, Feb. 2000.
5. E.J. Breck et al., "A Sys Called Qanda," Proc. Eighth Text Retrieval Conf., Nat'l Inst. Standards and Technology, Gaithersburg, Md., 1999, pp. 499-506.
6. L. Ferro, "Reading Comprehension Tests: Guidelines for Question and Answer Writing," internal working paper, The MITRE Corp., Bedford, Mass., 2000.
7. E.J. Breck et al., "How to Evaluate Your Question Answering System Every Day…and Still Get Real Work Done," Proc. Second Int'l Conf. Language Resources and Evaluation (LREC-2000), European Language Resources Association, Paris, 2000.

To Grade or Not to Grade
Robert Calfee, University of California, Riverside

If, during this new millennium, we hope to encourage more groups in our society to participate—in education, work, and politics—then we need to better develop our capacity to communicate effectively. This outcome will require working on not only oral and written communication but also critical-thinking skills—and knowledge and skill in technical writing are touchstones for effective communication.

Student compositions reveal the mind in remarkable ways; unlike with spoken discourse, the reader can retrace the author's reasoning and rethink his or her argument. Creative writing and casual communication certainly have a place in society, but success in business and industry depends on clarity, which is best learned and assessed through writing. Unfortunately, instructors don't always have sufficient time or resources to effectively grade student compositions or provide feedback on their reading-comprehension skills. This is where automated essay-grading systems such as those described in this issue can help.

For instance, the Intelligent Essay Assessor described by Thomas Landauer and his colleagues provides the novice writer with instant feedback about the match between his composition and a "semantic core" created from a related set of readings and compositions prepared by other writers. IEA also provides structural information: "You captured idea x quite well, but need to look more carefully at y and z."

The current system

Research suggests that becoming an effective writer requires at least these three elements: guided practice, effective instruction, and informed feedback.1 Today's classrooms, from elementary grades through graduate studies, rely mostly on the guided-practice element—more specifically, instructors give writing assignments but offer only limited guidance.

The tempting topic of effective instruction must wait for another time; suffice it to say that high school English classes tend to emphasize literature and grammar in roughly equal proportions, neither of which contributes to the knowledge and skill needed to design the technical reports that I emphasize in this essay. The conditions remain the same during the college years (especially for students identified as requiring remediation), with the exception of the occasional technical-writing course found in engineering and business schools. The typical research paper needs to define a problem, lay out a few analytic points, and reach a conclusion—something akin to the five-paragraph essay. In high school and college, the emphasis is more often on creativity.

Regarding informed feedback, ideally the student receives a close and careful reading of one or more drafts, with attention to several elements of effective composition—usually organization and coherence, style and usage, and mechanics (grammar, spelling, and so on). Of these principles, textual integrity clearly matters most for the reader (until the mechanics interfere with understanding, micro-level idiosyncrasies annoy but do not detract). Granted, most writing instructors spend a good deal of energy on editorial corrections of surface nonconventions such as incorrect punctuation and grammatical errors. This approach makes sense when you consider that the instructors must grade hundreds of papers each week. They don't have time to think about individual compositions. In addition, students might become argumentative when challenged on a composition's high-level features: "What do you mean it's not coherent? It includes all the facts!" Grammar and spelling, in contrast, are either right or wrong, assuming acceptance of certain conventions. Transforming this state of affairs so that teachers can change their approach would require more teachers (which is unlikely) and shifting the viewpoint of what matters from an emphasis on style to an emphasis on substance (which would be difficult).

The last few decades have seen numerous efforts to automate the intricacies of assessing written material, but few advances deal with scoring. For hunt-and-peckers like me, the appearance of spelling and grammar checkers was a blessing. They don't solve every problem or guarantee an "A," but at least you don't have to worry about minutiae (such as how to spell minutiae) while composing, or even when revising and polishing. But what about automating the "big stuff"?

Automating the big stuff

The three other essays in this installment of Trends & Controversies also address this issue of increased automation capabilities, and they all share certain conclusions.

First, virtually every automated system generates scores that correlate with the ratings of human judges as closely as human judges agree with one another. The high correlations might reflect the interrelatedness of different elements in naturally occurring compositions; writers who produce well-organized passages also use a rich vocabulary and carefully revise mechanics. Experiments with test passages in which the various elements are independently varied (that is, well organized but with poor mechanics, or strong vocabulary but with lots of misspellings) would show how the different systems in this issue respond to different elements.2

Second, all three essays take for granted the reading-writing connection.3 In most nonacademic settings, this connection undergirds writing tasks. For example, when an engineer prepares an evaluation report on a new widget, she first learns about the widget, then outlines the report's main points (often using a model), studies other documents for background, and finally prepares a draft. Reading and writing intertwine continuously. In school, however, traditions separate reading and writing in all but a few settings, mostly in the later grades and college-bound tracks.

Third, all the essays focus on scoring but also mention the need for more detailed feedback. In "Automated Grading of Short-Answer Tests," the authors suggest the value of prompt and detailed feedback, but only for microlevel questions ("You don't understand the point of this specific question"). The emphasis on scoring links to summative evaluation—to feedback at the end of the writing process. Automated systems can potentially convey formative information along the way, but the other three essays offer little information on this possibility.

I contend that automation's value in helping novices become expert writers rests more on the beginning than on the end of the learning process. Let me offer a metaphor. The area north of Monterey, California, is famed for its artichokes, and one of my favorite road stops is an old barn displaying a "grader," an inclined plane with crossbars narrow at the top and wide at the bottom. The tiny artichokes fall through first, and the giant artichokes don't fall through until the bottom. These farmers evidently know how to both develop and grade their products. The question that comes to mind is, how might we use the power and cost-effectiveness of automated text-evaluation systems to provide inexperienced writers with feedback and support for improving performance, thus developing or growing—and not just grading—our students?

The Intelligent Essay Assessor

In the effort to understand ways of better developing students' reading and writing skills, I rely on the IEA system as the touchstone, mostly because I am more familiar with it than with the other systems, both conceptually and operationally. In "The Intelligent Essay Assessor," Thomas Landauer, Darrell Laham, and Peter Foltz introduce the IEA and discuss its effectiveness. However, several IEA elements not highlighted in their essay offer considerable promise for supporting student growth. Specifically, IEA can employ automated strategies that support both students and teachers by providing informed feedback through an interactive assessment process that intertwines curriculum, instruction, and assessment. This ability to offer informed feedback assumes greater importance than inter-rater consistency.

To develop this point, I first expand on Landauer, Laham, and Foltz's essay. What outcomes (other than scoring) does IEA deal with, and how well does it handle these matters? To what degree is it cost-effective and time-efficient in these ancillary domains? What about practical matters of acceptability and feasibility; will teachers find it helpful?

What first attracted me to IEA is the reading-writing connection. School success from the middle-school grades onward requires students to read a body of material, analyze its content, and write a response. Unfortunately, today's curricula do not help students learn this rather demanding task. Reading and writing are separate parts of the curriculum, with different times, textbooks, and tests. Often, in science and social studies, teachers are poorly prepared to handle either reading or writing. These lacunae have several consequences for curriculum, instruction, and assessment. If the teacher's curricular goal is to help students appreciate the concept of energy (different forms of energy, the conservation principle), and the teacher doesn't know how to help the students with vocabulary and comprehension, then the curricular goal falls through the cracks. The teacher will also be at a loss as to how to provide adequate instruction and conduct proper assessment.

Imagine a middle-school science teacher who presents to the class a project-based assignment on deserts:

This month we are going to study the desert. I've brought in many books and found some related Web sites. Next week, we'll visit the Jurupa Science Museum, where you can see plants and animals, along with rocks and other items typically found in a desert environment. Earlier this year, a builder asked the Town Council to build 500 new homes in Canyon Crest, up the hill from where most of you live. It's just desert, and many people think we need more homes. Our job is to study what's going on and write letters to the Town Council about what we think they should do.

This assignment goes well beyond the typical middle-school curriculum and requires extraordinary instructional capabilities. But suppose that the project moves ahead; the letters to the Town Council begin to take shape. How can the teacher judge the formative adequacy of the initial drafts and the progress toward the final summative goal? The teacher and his young charges should eventually know whether their project produced a collection of exemplary arguments or a set of mundane and unconvincing sputterings. Furthermore, along the way, the teacher should know how to provide assessments, as well as suggestions for enhancing the works in progress. The provision of this along-the-way input—the essence of all levels of professional development—is surely important for students.

If applied to such an assignment, IEA would perform some tasks very well, others indifferently, and a few not at all. Of course, IEA (along with the other systems described in this issue) is still evolving, so its full potential remains to be seen.

IEA's strong points. IEA goes to the crux of the reading-writing connection. To what degree can a student read about a topic, analyze it from a defined perspective, and write a response? The IEA strategy—conceptual and technical—is to construct a canonical template around the focal topic (the desert, for example). Latent Semantic Analysis, a theoretical technique with origins in the cognitive revolution of the 1970s, starts with word-concept networks, adds a dash of mathematical technology to crunch the information, producing the template, and then presents an interface for translating theory and technology into practical outcomes. The process begins with input and output texts. Input texts include various readings on the topic (books and other artifacts); student essays—which experienced judges rate as more or less adequate—are the output texts. LSA begins by getting to know the territory, but then learns the difference between "knowing well" and "knowing not so well"—or "expressing well" and "not so well."

Using LSA, IEA grades responses particularly well. The data that Landauer and his colleagues present in "The Intelligent Essay Assessor" show that this system does an excellent job of scoring essays in which students read a text and respond to a related prompt. The technique is both time-efficient and cost-effective; students and teachers receive information quickly and precisely, and the cost promises to be reasonable. IEA also provides informed feedback by examining student work as it emerges, offering suggestions about overall quality (the likely grade) and identifying places for improvement. For example, IEA's Summary Street application (my favorite) checks a reader's capacity to extract the key information in a text by means of a written summary. Unlike the MITRE system, IEA assesses comprehension by asking the reader to decide what is important. Feedback immediately tells the summarizer which text segments have been neglected and identifies material that is irrelevant or redundant. The program's potential for enhancing comprehension of expository passages merits special attention.

An average performance. However, even Summary Street could be enhanced in a couple of ways. First, it could include teaching texts that pose increasingly defined and difficult challenges to the reader. Science textbooks, for instance, often contain seductive distracters—information included to spark interest but irrelevant to the main points. Second, the program could inform the student about missing elements in a chain of reasoning. Feedback from the main IEA program tends to be rather generic, although some specific indicators are quite valuable. The internal coherence and essay validation measures, for instance, point the student to text-level problems but do not offer much help about how to address the issues.

In addition, IEA provides various ancillary supports, including the usual grammatical and spelling backups, with the usual pros and cons. The developers should leave these matters to others, remaining aware that some audiences might value information about writing mechanics. The weighted contributions in Figure 2 (see the first essay) seem about right: 75% content, 15% style, and 10% mechanics. These statistics provide reasonable guidelines for further system development.

Also, although the current system offers students (and teachers) more detailed and instructive information than other automated systems with which I am familiar, it still lacks the overarching framework that can respond to the question, "What does the novice reader and writer need to know to master this task, and when during the writing process does he or she need to know it?" This article is not the place to fully explore this question, but I offer a tentative answer in the next section.

Room for improvement. IEA does not perform some pedagogical tasks at all—for example, assessing text structure, which is important in both reading and writing.4 LSA depends on an inductive strategy to handle content. The program inputs a huge amount of information and calculates dimensional cosines. It's amazing how well this strategy works for grading. But other, less sophisticated methods grade well too, perhaps because of the interrelatedness mentioned earlier. Text structures do more than replicate content organization, however; they shape, amplify, and transform the content. The Greeks produced not just geometry but also rhetoric, the structural principles that undergird modern communications. As the authors of "Beyond Automated Essay Scoring" explain, adding rhetorical structures to existing systems will be nontrivial (and, for some purposes, perhaps unnecessary). But for IEA and LSA to serve as an instructional support system that offers efficient guidance, as well as informed feedback, incorporating text frameworks might be significant for future developments.

Vocabulary also matters, and IEA does not measure this skill. To avoid dings for misspelling, the novice writer often relies on everyday words, but the sophisticated rater sees a composition replete with plebeian words and clichéd phrases. IEA provides a readability index that reflects unusual usages but does not indicate to the student (or teacher) the level of skill in expressive vocabulary. The importance of this matter recently captured my attention when colleagues and I discovered that today's middle-school students tend to eschew a risky lexicon, staying with terms that are tried, true, high frequency, and easy to spell.

Practical concerns

Many of today's classrooms, especially those serving poor communities, lack the resources needed to implement these programs. A discussion of the disparities in human capital would take me beyond this essay's scope, but it is clear that schools in underserved neighborhoods do not have the equipment needed for effective access to net-based systems. Pay a visit to such a school and you will discover a handful of computers from the early 1990s, maybe with a phone-line modem. This situation has evoked well-publicized but generally scattershot responses. The harm from inadequate resources might be more substantial than is generally recognized. Mark Russell and Walt Haney, for instance, found that students preparing their written assignments on computers rather than with paper and pencil improved on average from "Needs Improvement" (a delicate way of saying "failed") to "Passing."5 This is because students who compose on a computer write more and spend more time revising their drafts.

Imagine the impact of combining the best of what we know with what we can do—imagine if we could do everything, from giving immediate and informative feedback to ensuring that every student spends time tinkering with a keyboard. Computer technologies—including better equipment and the kinds of response systems described in this issue—will not put teachers out of business, as some seem to fear. Rather, they will provide them with tools that amplify their professional skill and knowledge.

Acknowledgment

The Office of Educational Research and Improvement (IERT-9979834) supported this essay's preparation.

References
1. C.R. Cooper and L. Odell, eds., Research on Composing: Points of Departure, Nat'l Council of Teachers of English, Urbana, Ill., 1978.
2. S.W. Freedman, "Student Characteristics and Essay Test Writing Performance," Research in the Teaching of English, Vol. 17, 1983, pp. 313-324.
3. N.N. Nelson and R.C. Calfee, eds., The Reading-Writing Connection: The Yearbook of the Nat'l Soc. for the Study of Education, Univ. of Chicago Press, Chicago, 1998.
4. M.J. Chambliss and R.C. Calfee, Textbooks for Learning: Nurturing Children's Minds, Blackwell, Oxford, UK, 1998.
5. M. Russell and W. Haney, "Testing Writing on Computers: An Experiment Comparing Student Performance in Tests Conducted via Computer and via Paper-and-Pencil," Education Policy Analysis Archives, Vol. 5, No. 3, May/June 1997; http://epaa.asu.edu/epaa/v5n3.html (current Nov. 2000).