Automated Essay Scoring: My Way or the Highway!

Alexander Hurtado, Vamsi Saladi
Department of Computer Science
Stanford University
Stanford, CA 94305
[email protected], [email protected]
https://github.com/alexanderjhurtado/nlp_aes

Abstract

The following research approaches the problem of automated essay scoring, a long-standing goal in the world of natural language processing. We approached this problem using deep learning techniques, rather than more commonly used machine learning techniques like bag-of-words logistic regression or support vector machines. We implemented and trained our own single-layer unidirectional LSTM network, multi-layer unidirectional LSTM network, word-level recurrent highway network, and word-to-sentence-level recurrent highway network. We used a dataset provided by the platform Kaggle, which hosted a competition sponsored by the Hewlett Foundation, the creator of the dataset. We allocated essay scores into one of four buckets (0, 1, 2, 3) to account for different grading schemes. After training our models on virtual machines, we found that the multi-layer unidirectional LSTM outperformed the rest of the models, producing an accuracy of approximately 0.63. The recurrent highway network and the single-layer unidirectional LSTM also performed reasonably well, with accuracies around 0.54 and 0.55, respectively.

1 Introduction

Automating the process of essay scoring has been a long-standing wish in the world of NLP. As a natural avenue of research in natural language processing, automated essay scoring became a hot topic as the popularity of sentiment analysis increased. Research on automated essay scoring began as early as 1999, with the development of the CRASE automated constructed-response grader by Howard Mitzel and Sue Lottridge at Pacific Metrics. However, the research did not truly take off in academia until 2012, when Kaggle released a dataset provided by the Hewlett Foundation with over 13,000 transcribed essays along with teacher criticism and ratings.

Our goal was to use deep learning methods to address the problem, building a model by training on approximately 13,000 essays with their respective scores. There are 8 essay prompts, and we take a proportion of each prompt's essays for training, validation, and testing. We wanted to compare our results to the baselines established before us using machine learning techniques like regression, generalized linear models, and support vector machines, and then to improve upon those baselines by pursuing deep learning techniques. Using techniques like LSTMs, RNNs, and highway networks, we wanted to see whether we could improve upon the performance of non-network-based models on automated essay scoring.

Our proposed contribution to this area of research is to infuse it with deep learning techniques, which remain criminally underused today. We report the accuracy scores that result from using the methods described above.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

2 Related Works

There has been a significant amount of research in the area of automated essay scoring, varying from machine learning techniques that do not involve neural networks to those that depend entirely on them. One of the earliest papers on this topic uses logistic regression and SVMs on essay representations to learn a decision boundary, effectively treating the task as a multi-class classification problem.

Next, we see research develop as more machine learning techniques began to be applied to this area. The paper by Taghipour and Ng [4] was one of the earlier works on automated essay scoring to consider using convolutions. This idea spread to other researchers, as shown by Farag, Yannakoudakis, and Briscoe [2], who used a convolutional layer of windows to produce convolutions to pass into the main RNN.

However, we also drew inspiration from highway networks and their increased use in NLP tasks. Specifically, we drew from the work of Zilly et al. (2017) [1] on recurrent highway networks, a model structured to combine the ease of training of highway networks with the classic vanilla recurrent neural network architecture. The intuition is that recurrent highway networks can learn complex structures in the data because, like multi-layer LSTMs, they are structurally deep along the vertical axis. Unlike multi-layer LSTMs, however, recurrent highway networks are computationally easier to train (as seen in our results) because of their use of highway layers. By replacing LSTM cells with highway layers and restructuring the step-to-step transitions, a recurrent highway network is less prone to vanishing and exploding gradients than LSTMs and other deep recurrent models.

3 Approach

In order to fairly compare the effectiveness of our deep neural models, we need to observe how well a simpler machine learning model performs on the task.

First, we recognize that this is a classification task, and thus we have to use algorithms that classify. We first obtained a baseline score using a single-layer multinomial logistic regression model; from our research, we observed that others who have tackled automated essay scoring in the past have employed multinomial logistic regression as the typical baseline model. In order to input our data into the logistic regression model, we constructed a vector representation for each essay using a bag-of-words approach. This approach views an essay as a simple combination of the words that constitute it. In essence, an essay can be represented as a vector whose dimension is equivalent to the size of the vocabulary; each entry in the vector corresponds to the frequency of a certain vocabulary word in the given essay. These bag-of-words vectors are then passed through our model, which outputs the classification probabilities for each essay. These probabilities are then used to predict the score class of the essay. We use cross-entropy loss as the loss function to backpropagate prediction error.
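As a concrete illustration, a minimal scikit-learn sketch of this baseline is shown below; the variable names (train_essays, train_scores, test_essays) are hypothetical stand-ins for the ASAP data, not the authors' code.

```python
# Minimal bag-of-words + multinomial logistic regression baseline sketch.
# train_essays/test_essays are assumed lists of essay strings, and
# train_scores is an assumed list of bucketed scores in {0, 1, 2, 3}.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()                    # word-frequency vectors over the vocabulary
X_train = vectorizer.fit_transform(train_essays)  # shape: (num_essays, vocab_size)

# scikit-learn fits multinomial logistic regression by minimizing cross-entropy (log loss)
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X_train, train_scores)

probs = clf.predict_proba(vectorizer.transform(test_essays))  # per-class probabilities
predicted_scores = probs.argmax(axis=1)                       # predicted score bucket
```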

We then moved on to deep recurrent neural models. In particular, we approached the task through two primary architectures: long short-term memory (LSTM) models and recurrent highway networks (RHNs). In each recurrent model, we first convert each word in a given essay to its corresponding Global Vectors (GloVe) [3] representation. These word embeddings then serve as input to our models.
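A hedged sketch of this lookup, assuming the standard 300-dimensional pretrained GloVe text files (one token and its vector per line); the file name and whitespace tokenization are assumptions, not the authors' pipeline.

```python
# Sketch: load pretrained GloVe vectors and embed an essay as a sequence
# of 300-d word vectors; out-of-vocabulary words map to an "unknown" vector.
import numpy as np

def load_glove(path="glove.6B.300d.txt"):  # assumed file name
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove()
unk = np.zeros(300, dtype=np.float32)  # unknown-word token embedding

def embed_essay(essay):
    # returns an array of shape (num_words, 300), in word order
    return np.stack([glove.get(w.lower(), unk) for w in essay.split()])
```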

The first recurrent model that we utilized was a vanilla single-layer, unidirectional LSTM, as depicted below.

Figure 1: Uni-directional LSTM with a single cell detailed

For this LSTM model, we convert each essay into a sequence of GloVe embeddings corresponding to the sequence of words in the essay. We then pass each word embedding into the LSTM until the entire sequence is processed. We determine a vector representation for the essay by taking the sum of the hidden states output by each LSTM cell. The resulting essay vector is then passed into a fully connected layer, which maps the essay vector to output classification probabilities. The score class of the essay is predicted using these probabilities, and the prediction error is backpropagated using cross-entropy loss.
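A minimal PyTorch sketch of this architecture follows; the hidden size and names are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the single-layer unidirectional LSTM scorer: sum the per-step
# hidden states into an essay vector, then map it to four score classes.
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=128, num_classes=4, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, essay_embeddings):                # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(essay_embeddings)  # (batch, seq_len, hidden_dim)
        essay_vector = hidden_states.sum(dim=1)         # sum of hidden states over time
        return self.fc(essay_vector)                    # unnormalized class scores

model = LSTMScorer()
criterion = nn.CrossEntropyLoss()          # cross-entropy over the 4 score buckets
logits = model(torch.randn(2, 50, 300))    # e.g., 2 essays of 50 GloVe-embedded words
loss = criterion(logits, torch.tensor([1, 3]))
loss.backward()
```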

The second recurrent model we constructed was a multi-layer, unidirectional LSTM. This model works nearly identically to the vanilla single-layer LSTM described above, as can be seen in Figure 2.

Figure 2: Uni-directional, Multi-Layer LSTM Network

However, a multi-layer LSTM is structurally different in that it is made deep along the vertical axis by stacking multiple LSTMs on top of each other. By stacking LSTMs, our network can compute more complex representations, with the idea that the lower-level LSTMs compute lower-level features while the higher-level LSTMs compute complex, higher-level features. This multi-layer architecture is advantageous to the task of essay scoring, as it allows for greater model complexity, letting our model better capture the complex factors that dictate the score of an essay.
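Reusing the hypothetical LSTMScorer sketch above, stacking is a one-argument change; a depth of 3 matches the layer/stack depth listed in Section 4.3.

```python
# Stacked (multi-layer) variant: PyTorch feeds each layer's hidden-state
# sequence upward as input to the layer above it.
deep_model = LSTMScorer(num_layers=3)
```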

The third recurrent model implemented was a recurrent highway network, as described in the work of Zilly et al. (2017) [1]. We see the general structure below:

Figure 3: A single computation within the RHN layer inside the recurrent loop.

A recurrent highway network is similar in structure to a multi-layer LSTM; both are recurrent neural models that are deep in both the time dimension and the vertical dimension. However, RHN models are architecturally fundamentally different from multi-layer LSTMs. Whereas a multi-layer LSTM has a step-to-step transition in which an input is processed through a single LSTM cell before being passed to the next layer and the next cell, the step-to-step transition of a recurrent highway network processes the input through L stacked highway layers before moving on to the next input. Similarly to how LSTMs can be described as a recurrent sequence of LSTM cells, an RHN model is best described as a sequence of recurrent highway stacks, where each stack is constructed from L stacked highway layers. Let us define m as the dimension of the GloVe word embeddings, n as the dimension of the hidden layers, L as the depth of the step-to-step transition (highway stack), x ∈ R^m as the input to the highway stack, and y ∈ R^n as the output of the highway stack. We also define W_H, W_T, W_C ∈ R^{n×m} as the input weight matrices and R_{H_l}, R_{T_l}, R_{C_l} ∈ R^{n×n} as the recurrent weight matrices, with biases b_{H_l}, b_{T_l}, b_{C_l} ∈ R^n. Let s_l denote the output at depth l, with s_0^[t] = y^[t−1]. Then we can formally describe the computation done by a single highway layer in a stack at depth l ∈ {1, 2, ..., L} as follows:

h_l^[t] = tanh(W_H x^[t] 1{l = 1} + R_{H_l} s_{l−1}^[t] + b_{H_l})
t_l^[t] = σ(W_T x^[t] 1{l = 1} + R_{T_l} s_{l−1}^[t] + b_{T_l})
c_l^[t] = σ(W_C x^[t] 1{l = 1} + R_{C_l} s_{l−1}^[t] + b_{C_l})
s_l^[t] = h_l^[t] ⊙ t_l^[t] + s_{l−1}^[t] ⊙ c_l^[t]

where 1{·} is the indicator function (so the input enters only at depth l = 1), σ is the sigmoid function, and ⊙ denotes element-wise multiplication; the output of the stack is y^[t] = s_L^[t].

This process is shown in Figure 3, where the H, T, and C blocks correspond to the variables h_l^[t], t_l^[t], and c_l^[t] in the equations shown above.

Similarly to how the output of an LSTM is handled, we take the sum of the outputs s_L of each highway stack and pass the resulting vector into a fully connected layer. This fully connected layer then produces the classification probabilities for the essay, which are used to predict the essay's score. The loss function used to backpropagate error is cross-entropy loss.
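A minimal PyTorch sketch of this step-to-step transition (one depth-L highway stack), following the equations above; dimensions and names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one RHN step-to-step transition: the W_* matrices act on the
# input x only at depth l = 1, while the per-depth R_* matrices act on the
# running state s; the highway combination gates new and carried content.
import torch
import torch.nn as nn

class RHNCell(nn.Module):
    def __init__(self, input_dim=300, hidden_dim=128, depth=3):
        super().__init__()
        self.depth = depth
        self.W_H = nn.Linear(input_dim, hidden_dim)
        self.W_T = nn.Linear(input_dim, hidden_dim)
        self.W_C = nn.Linear(input_dim, hidden_dim)
        self.R_H = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(depth))
        self.R_T = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(depth))
        self.R_C = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(depth))

    def forward(self, x, s):   # x: (batch, input_dim), s: (batch, hidden_dim)
        for l in range(self.depth):
            # input terms enter only at the first highway layer (l = 0 here)
            h = torch.tanh(self.R_H[l](s) + (self.W_H(x) if l == 0 else 0))
            t = torch.sigmoid(self.R_T[l](s) + (self.W_T(x) if l == 0 else 0))
            c = torch.sigmoid(self.R_C[l](s) + (self.W_C(x) if l == 0 else 0))
            s = h * t + s * c  # highway combination
        return s               # y[t] = s_L
```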

The final recurrent model consisted of passing the output of a word-level RHN into a sentence-level RHN. In particular, an essay is first broken up into its constituent sentences. Then, the word embeddings for each word in a sentence are passed into an RHN; the output vector of this RHN is used as a sentence representation. These sentence representation vectors are then passed into another RHN, which processes the sentence vectors to output a final essay representation in the same manner as the RHN model described above. This final representation is similarly passed into a fully connected layer to obtain classification probabilities, which are then used to predict the essay's score class. The overall structure of this word-to-sentence-level RHN is similar to how character-level and word-level models interact in a hybrid NMT model.
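A hedged sketch of this hierarchical composition, reusing the RHNCell sketch above; sentence and essay vectors are formed by summing per-step outputs, mirroring the summation described earlier. Names and dimensions are assumptions.

```python
# Sketch of the word-to-sentence RHN: a word-level RHN summarizes each
# sentence, and a sentence-level RHN consumes those summaries.
import torch
import torch.nn as nn

class HierarchicalRHN(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=128, num_classes=4):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_rhn = RHNCell(embed_dim, hidden_dim)   # words -> sentence vector
        self.sent_rhn = RHNCell(hidden_dim, hidden_dim)  # sentence vectors -> essay vector
        self.fc = nn.Linear(hidden_dim, num_classes)

    def _run(self, cell, inputs):
        # run an RHN over a sequence and sum its per-step outputs
        s = torch.zeros(1, self.hidden_dim)
        total = torch.zeros(1, self.hidden_dim)
        for x in inputs:
            s = cell(x, s)
            total = total + s
        return total

    def forward(self, sentences):  # list of (num_words, embed_dim) tensors
        sent_vectors = [self._run(self.word_rhn, [w.unsqueeze(0) for w in sent])
                        for sent in sentences]
        essay_vector = self._run(self.sent_rhn, sent_vectors)
        return self.fc(essay_vector)  # logits over the 4 score buckets
```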

4 Experiments

4.1 Data

The dataset we used was provided by the Hewlett Foundation as part of the Automated Student Assessment Prize (ASAP) contest, hosted by the data science platform Kaggle. The essays are responses from students in grades 7 through 10. All essays were hand-written and double-scored, then transcribed into word documents for our purposes. Whenever a handwritten word is illegible, it is transcribed as "illegible" or "???". Because it generally makes sense that poorly written essays would score relatively worse, we decided to leave those tokens as they are. Two examples of these handwritten essays are shown below.

Figure 4: A bad essay
Figure 5: A good essay

We were given 8 different datasets of essays, each containing essays in response to a different prompt. However, the dataset was structured such that proper nouns like the names of people or organizations were replaced with placeholders like "@ORGANIZATION1" or "@NAME3". To deal with this, we decided to replace these tokens with the word vector for the noun type they represent. For example, "@NAME2" would be replaced with the word vector for "name," and "@ORGANIZATION3" would be replaced with the word vector for "organization." We felt this was the easiest way to maintain some semblance of the word's meaning without losing it completely. Additionally, some tags cannot simply be converted directly to a lowercase word, like "@CAPS" or "@NUM", which we handle manually. For @NUM, we substituted the word "number". For @CAPS and @DR, we substituted "name," because they replace capitalized proper nouns and doctors' names, respectively. Additionally, we changed the title "Dr." to "doctor" so we could represent it properly.
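A sketch of this normalization, assuming the placeholder tags follow the "@TAGn" pattern shown above; the tag table covers only the cases described in the text.

```python
# Sketch: map ASAP anonymization placeholders to ordinary words so they
# have usable GloVe embeddings.
import re

TAG_WORDS = {
    "ORGANIZATION": "organization",
    "NAME": "name",
    "CAPS": "name",   # @CAPS stands in for capitalized proper nouns
    "DR": "name",     # @DR stands in for doctors' names
    "NUM": "number",
}

def normalize_placeholders(text):
    text = text.replace("Dr.", "doctor")
    # "@NAME2" -> "name", "@ORGANIZATION3" -> "organization", etc.;
    # unlisted tags fall back to their lowercased tag word
    def sub(match):
        return TAG_WORDS.get(match.group(1), match.group(1).lower())
    return re.sub(r"@([A-Z]+)\d*", sub, text)

print(normalize_placeholders("I met Dr. @DR2 at @ORGANIZATION3"))
# -> "I met doctor name at organization"
```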

Furthermore, we had to account for the different rubrics and grading scales used for each essay set. Each dataset had its own score range, so we consolidated all the essays into one large dataset and used histograms of the score distributions to assign each essay one of 4 possible scores: {0, 1, 2, 3}. Based on each scoring scale and some frequency analysis, we were able to model the overall score distributions among essays using this method.
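The exact histogram-based rule is not pinned down above, so the following quartile-based mapping is only one plausible reading of the consolidation step, not the authors' exact procedure.

```python
# One plausible bucketing rule: map a prompt's raw scores onto {0, 1, 2, 3}
# by the quartiles of that prompt's score distribution.
import numpy as np

def bucket_scores(raw_scores):
    raw = np.asarray(raw_scores, dtype=float)
    cuts = np.quantile(raw, [0.25, 0.5, 0.75])        # per-prompt quartile cut points
    return np.searchsorted(cuts, raw, side="right")   # bucket indices 0..3

print(bucket_scores([2, 4, 6, 8, 10, 12]))  # -> [0 0 1 2 3 3]
```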

Finally, we find the proper GloVe embeddings for each of the words in our essays before inputting them into our models.

4.2 Evaluation Method

To evaluate our models, we use standard accuracy metrics. We calculate strict accuracy by iterating through each of the m essays and computing the accuracy A as follows:

A = (1/m) ∑_{j=1}^{m} 1{p_j = s_j}

where p_j is the predicted score for essay j and s_j is the true score.
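For concreteness, a one-line NumPy implementation of this metric (illustrative only):

```python
# Strict accuracy A as defined above: fraction of exact score matches.
import numpy as np

def strict_accuracy(predicted, true):
    return float(np.mean(np.asarray(predicted) == np.asarray(true)))

print(strict_accuracy([0, 2, 3, 1], [0, 2, 1, 1]))  # -> 0.75
```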

4.3 Experimental Details

The parameters we used for stochastic gradient descent were as follows (a sketch of the corresponding training loop appears after the list):

• Learning rate: 0.001

• Max epoch: 15

• Dimension size of embeddings: 300

• Frequency of validation: every 1000 iterations

• Layer/stack depth: 3
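A hedged sketch wiring these settings into a PyTorch training loop; `model` reuses the LSTMScorer sketch from Section 3, while `train_loader`, `val_loader`, and `evaluate` are hypothetical placeholders, not the authors' code.

```python
# Sketch: SGD with lr 0.001, 15 epochs, validation every 1000 iterations.
import torch

model = LSTMScorer(num_layers=3)                           # from the Section 3 sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # learning rate: 0.001
criterion = torch.nn.CrossEntropyLoss()

iteration = 0
for epoch in range(15):                     # max epoch: 15
    for essays, scores in train_loader:     # batches of 300-d GloVe sequences
        optimizer.zero_grad()
        loss = criterion(model(essays), scores)
        loss.backward()
        optimizer.step()
        iteration += 1
        if iteration % 1000 == 0:           # frequency of validation
            evaluate(model, val_loader)     # hypothetical validation helper
```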

Additionally, we ran our models on Microsoft Azure using the NV6 configuration, which consists of 6 vCPUs and 56 GB of memory. Training took up to approximately 20 hours per model on this configuration.

4.4 Results

The following table highlights the results of our work.

Table 1: Performance of the Models

Baseline Models

Model                 Training Time (hrs)   Accuracy
Logistic Regression   0.56                  0.452
Single-Layer LSTM     7.43                  0.540

Neural Network Models

Model                                          Training Time (hrs)   Accuracy
Multi-Layered LSTM                             20.39                 0.631
Recurrent Highway Network (Word-level)         16.39                 0.543
Recurrent Highway Network (Word-to-sentence)   16.68                 0.548

The multi-layered LSTM and single-layered LSTM models performed the best on our dataset, with relatively high accuracy scores (∼60%) compared to random guessing (25% chance). However, the recurrent highway network models train much faster than the multi-layered LSTM. Overall, it was surprising that the model that best balanced training time and accuracy was the single-layered LSTM.

5 Analysis

Given the scores above, there is definitely room for improvement in our model accuracy. Certain aspects of the grading scheme prevented the model from being as effective as it could be. For example, consider the following essay:

Dear local newspaper I raed ur argument on the computers and I think they are a positive effect on people. The first reson I think they are a good effect is because you can do so much with them like if you live in mane and ur cuzin lives in califan you and him could have a wed chat. The second thing you could do is look up news any were in the world you could be stuck on a plane and it would be vary boring when you can take but ur computer and go on ur computer at work and start doing work. When you said it takes away from exirsis well some people use the computer for that too to chart how fast they run or how meny miles they want and sometimes what they eat. The thrid reson is some peolpe jobs are on the computers or making computers for exmple when you made this artical you didnt use a typewriter you used a computer and printed it out if we didnt have computers it would make ur @CAPS1 a lot harder. Thank you for reading and whe you are thinking adout it agen pleas consiter my thrie resons.

The correct score for this essay was 2 (although just barely) on our scale from 0 to 3. However, the predicted score was 0. Reading the essay itself, we can see that the arguments made are not actually terrible for the age of the student writing them; the essay lost points almost entirely because of the spelling. In our model, though, misspelled words do not merely cost one point: all misspelled words that are not in the vocabulary are automatically embedded as the unknown-word token. Thus, they hurt the essay's predicted score far more than one missed point would, and the model is not very effective at dealing with this scenario.

However, the model did handle length differences relatively well. There were essays that scored remarkably well despite being on the shorter end. For example, the essay below:

Dear Newspaper People, I think that computers do benefit society for a few reasons. Computers make work easier they can do things people can't like solve difficult problems, and kids like playing games on them. Computers make work easier and neater. Typing is often faster than writing, and is always easier to read. Studies have also shown that people who use computers to do work finish faster and have up to @PERCENT1 more free time to do whatever they want. Computers also have e-mail, which allows you to send work that you have done to your boss without printing it and wasting paper. Computers can solve difficult problems that people either can't do, or don't want to waste their time doing. If you need to solve a math problem you use a calculated or a computer. So you don't have to figure it out yourself. I have to do a lot of math problems for homework and I think it is much easier to use a calculator. My last reason why computers are good is that kids like playing games on them. A lot of people say that kids shouldn't play computer games but some of them are educational. Even non-educational games help kids to have fun and have something to look forward to after their work is done. @PERCENT2 of kids say that they are happier when they have something fun to look forward to, and happier kids do better work. I hope that after reading this you will understand how much computers contribute to society.

scored a 2 on the 0-3 scale, despite being on the shorter end of the 150-550 word range at roughly 250 words. All three models predicted a 2 as well, showing that length, though important in determining quality, was handled appropriately.

6 Conclusion

Overall, the models did relatively well in getting close to the correct score, even when they did not match it exactly. However, the measurements were still not ideal. Perhaps we could improve upon these models by making the LSTM models bidirectional, which might increase the complexity of our models and improve their predictions. Additionally, it might be interesting to consider using character embeddings rather than word embeddings and to see how that would affect the accuracy of the model, particularly in how it handles misspelled words, which the current model handles poorly by automatically mapping them to the unknown token, dramatically affecting the score. Finally, there has been a great deal of research into the use of convolutions, including convolutional layers working in tandem with highway networks, which would be worth exploring in the future.

Additionally, our recurrent model in which the outputs of a word-level RHN are passed into a sentence-level RHN was perhaps a bit off the mark. It is possible that the highway cells in the second part of this model did not actually learn anything new from what was passed in, because there is not much more to learn in going from words to sentences and then to essays, rather than directly from words to essays. This may be what caused the negligible difference in accuracy between the two RHN models (a difference of 0.005). Nevertheless, the models did much better than random guessing and improved noticeably on our baseline models.

Acknowledgments

We would like to thank our professor, Christopher Manning, for helping us acquire the tools necessary to conduct this research. Additionally, we would like to thank our mentor, Amita Kamath, for her help and advice in developing our models and working through the obstacles we faced while conducting this research.

References

[1] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber, "Recurrent Highway Networks," arXiv:1607.03474, 2017.

[2] Y. Farag, H. Yannakoudakis, and T. Briscoe, "Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input," 2018. [Online]. Available: https://arxiv.org/pdf/1804.06898.pdf. [Accessed: 02-Mar-2019].

[3] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of EMNLP, 2014.

[4] K. Taghipour and H. T. Ng, "A Neural Approach to Automated Essay Scoring," in Proceedings of EMNLP, 2016.

[5] S. Valenti, F. Neri, and A. Cucchiarelli, "An Overview of Current Research on Automated Essay Grading," 2003.
