
USING RULE-BASED METHODS AND MACHINE LEARNING FOR SHORT ANSWER SCORING
BENEDITH MULONGO, FREDRIK PIHLQVIST
KTH ROYAL INSTITUTE OF TECHNOLOGY
Elektroteknik och datavetenskap
Abstract
Automatic correction of short texts is an area that spans everything from natural language processing to machine learning. The project uses machine learning to predict the correctness of free-text answers. Natural language processing is used to analyse text and extract important underlying relationships in the text.
Today there are several approximate solutions for automatic correction of short free-text answers. Two prominent methods are machine learning and rule-based methods. We present an alternative method that combines machine learning with a rule-based method to approximately solve the aforementioned problem.
The study is about implementing a rule-based method, a machine learning method and a final combination of both. The combined method is evaluated by looking at the relative changes in performance compared with the rule-based and machine learning methods.
The results obtained show that there is no increase in the accuracy of the combined method compared with the machine learning method alone. However, the combined method uses a small amount of labeled data with an accuracy almost equal to that of the machine learning method, which is positive.
Further investigation in this area is needed; this thesis is only a small contribution to new methods in automatic scoring.
Keywords: machine learning; natural language processing; automatic scoring; rule-based systems; self-learning
Abstract
Automatic correction of short text answers is an area that involves everything from natural language processing to machine learning. Our project deals with machine learning for predicting the correctness of candidate answers and natural language processing to analyse text and extract important underlying relationships in the text.
Given that today there are several approximative solutions for automatically correcting short answers, ranging from rule-based methods to machine learning methods. We intend to look at how automatic answer scoring can be solved through a clever combination of both machine learning methods and rule-based method for a given dataset.
The study is about implementing a rule-based method, a machine learning method and a final combination of both these methods. The evaluation of the combined method is done by measuring its relative performance compared to the rule-based method and machine learning method.
The results obtained have shown that there is no increase in the accuracy of the combined method compared to the machine learning method alone. However, the combined method uses a small amount of labeled data with an accuracy almost equal to the machine learning, which is positive.
Further investigation in this area is needed, this thesis is only a small contribution, with a new approaches and methods in automatic short answer scoring.
Keywords: machine learning; natural language processing; automatic answer scoring; rule-based system; self-learning
Table of Contents
1 Introduction
2 Background
3 Methodology
4 Work
5 Results
6 Analysis
7 Conclusions
1 Introduction
There is currently an ongoing change in the education system, where more courses are given online. This shift towards e-learning enables courses to reach more students worldwide than a traditional classroom setting. But how do these students get their newfound knowledge assessed?
When the response is only a number or a choice between predefined candidate answers, the assessment is relatively easy. But whenever the answer is free text and the question is relatively open, many difficulties arise. This is due to the richness of natural languages, which enables two answers with different vocabulary and word usage to be similar despite their apparent linguistic and syntactic dissimilarity.
The big problem is how to find and grasp the semantic similarity between the student answers and the reference answers, and how that similarity can be used to grade the student answers. Different approaches will be investigated in this thesis work.
1.1 Background
Automatic Short Answer Grading (ASAG) is a term that encapsulates the process of grading student answers given in free-text form. When a student answers a question in this medium, at least one person needs to grade it manually. When a computer can do this grading automatically, you have an automatic short answer system. This can be used when the question can be answered in multiple ways and every possible solution cannot be precomputed [1].
A couple of different approaches have been tested over the years to improve the results of Automatic Short Answer Grading. One of the most successful and popular ones is the machine learning approach. It relies on a large dataset of graded answers to train on. It works by extracting features from the training set and predicting the grades of ungraded answers [2].
Another method to grade short answers is to specifically design a rule set that works as a filter. A student answer goes through this rule set and the result should be the grade of the answer. This requires extensive knowledge of the question area from the constructor of the rule set [3].
There have been many attempts to combine different machine learning approaches to build a robust model. Earlier research presents a system that makes use of synonyms to build an additional training dataset. After the extension of the dataset with synonyms, different decision tree classifiers are trained on the data. The rules generated by the decision trees are then extracted and used to classify new unseen data [4].
1.2 Problem
How is a student answer automatically graded if the answer given is a short text in natural language? There is currently no fully working automatic short answer grading tool in broad use. The solutions that are in use are often limited in several respects, either in scope or in correctness [5].
The biggest issue with Automatic Short Answer Grading is the amount of work required to get it up and running. The machine learning approach needs a large amount of data to train on before a model can be used. This is most of the time hard to achieve because a reasonable amount of training data is lacking. Even when that amount of data is available, reusing the same question in real classroom situations can enable cheating and memorized answers.
Another issue is the repetition of the machine learning process whenever another exam or question needs to be assessed. The knowledge learned when building a model for one type of question or exam is not easily transferable to other questions or exams in other subjects. That is a big issue, because if this repetition is cumbersome then the usefulness of the automated scoring system is questionable.
Aside from the machine learning approach, there is another approach based on rules, where an advanced pattern matcher suited for the given subject is used to map the student answers to predefined rules and facts. Whenever a match occurs, the student gets a predefined number of points. The problem with this approach is the amount of pre-work needed to create the rules and the difficulty of anticipating the different cases needed to span all the ways an answer can be written.
The thesis will investigate how those two methods can be combined to avoid the difficulties that each of them holds.
How can machine learning and rule-based scoring be combined to automatically grade short answers?
1.3 Purpose
The purpose of the thesis is to investigate, through experiments, three different approaches to implementing a model for automatic short answer scoring. The observations and results obtained are analysed to determine whether the approaches are good solutions to short answer scoring for future work.
1.4 Goal
The goal of this project is to find out how machine learning based methods that solve the short answer scoring problem can be combined with more rule-based methods in order to decrease the amount of data needed for training and, furthermore, increase the accuracy of the predictor while alleviating the need to craft detailed hand-made rules.
Benefits, Ethics and Sustainability
Today there is a huge demand for online courses and many universities offer education online. Due to the number of students enrolled, it is almost infeasible to hire a reasonable number of graders to assess the students' answers [6]. This phenomenon has been coined massive open online courses. For this kind of system, it is very beneficial to have an automatic assessment tool that can grade not only multiple-choice questions but also free-text answers to open questions. In that respect this project is useful, even though it raises many questions of a legal and ethical character.
Impartiality and fault tolerance are ethical aspects that must be considered for automatic assessment tools. The system must be impartial and give the score that the student deserves, and at the same time be free from attacks and programming errors.
Furthermore, the system's scores might not be legally defensible; that means the system may not be held legally accountable for the score given, and the legitimacy of the grade given by the system may be questioned [7]. Many more questions of an ethical or sustainability nature may be raised, but those are outside the scope of this thesis, which studies solely the technical aspects of constructing an automatic answer scoring system.
1.5 Methodology
The thesis project uses a quantitative research approach by applying experimental methods to answer the questions stated [8]. The experiment involves building three different systems: two built on established methods and one on a combination of the two. The systems will be evaluated using standardized numerical measurements used in this field of science: accuracy, precision and recall.
To properly evaluate the experimental systems, the same dataset is used to build and test all systems. The dataset originates from earlier research in the same area. The final stage involves validating the systems and drawing conclusions based on the experimental results.
1.6 Scope
The attribute extraction algorithms implemented are often the simplest versions of the ones referenced. This is a consequence of our prior knowledge in the field.
This thesis does not answer why the machine learning algorithms used are better or worse than one another. The focus is on the features extracted from the written text. The machine learning algorithms are researched to get a strong knowledge base to work from.
The thesis does not apply error-free data handling methods. The tokenization, synonym lookup and lemmatization of words are done using already implemented methods, with the limitations that entails.
Textual entailment between the student answer and the reference answer is not considered, and a negation detection algorithm for the dataset is not implemented in this research.
A big impediment to rapid and easy deployment of short answer assessment tools is their inability to be useful outside the area the system is trained for. That means the process must be repeated every time a new subject has to be automatically assessed. We have not investigated how the short answer scoring system can be used outside the subject it is trained on; the ability to do that is called transfer learning [9].
1.7 Disposition
In section 2 the necessary background and related work will be presented and discussed. The methodologies and methods used during the thesis work will be presented in section 3. Section 4 will present the approaches used and the different choices made during the system implementation. The results for the machine learning system, the rule-based system and the combined method will be presented in section 5. The results will be analysed and discussed in section 6, and we will finally conclude in section 7, where we will also give suggestions for future work.
2 Background
In this section, the technical background necessary to understand the thesis will be presented. Machine learning and three classifiers will be presented in section 2.1 Classifiers. In section 2.2 Theoretical Background, the individual methods used in the thesis will be explained. In section 2.3 Related Work, earlier studies are presented to give a thorough understanding of the research area.
2.1 Classifiers
A classifier is a machine learning model that uses data X and labels y in order to predict the labels of unseen samples. Three classifiers are used in this thesis: Logistic Regression, Naïve Bayes and Random Forest.
Machine Learning
Machine learning is a field within computer science and artificial intelligence. Machine learning uses a large amount of data combined with mathematical and statistical techniques to give the computer the ability to learn patterns that can be used to improve a task without explicitly programming the task.
There exists a fair number of different machine learning algorithms today. The three classifiers used in this thesis work will be presented. These were picked because of their generally high performance during initial testing.
Logistic Regression
Logistic regression is a machine learning method originally from statistics. Logistic regression uses a logit or sigmoid function to calculate the probability of the outcome; the function links the predictors to the outcomes.
The logistic function is defined by the formula below and is depicted in Figure 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Figure 1: Logistic Regression Function
Let $p$ be the probability that the student answer has score 3, and $1-p$ the probability that the score differs from 3. The odds for score 3 are:

$$\text{odds} = \frac{p}{1-p}$$

The trick behind logistic regression is to take the logit of the odds and model it as a linear function of the input $\mathbf{x}$ with weights $\mathbf{w}$:

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \mathbf{w}^\top \mathbf{x}$$

Solving for $p$ gives:

$$p = \frac{e^{\mathbf{w}^\top \mathbf{x}}}{1 + e^{\mathbf{w}^\top \mathbf{x}}} = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$$
The probability $p$ is then used to predict the class of an instance given the input $\mathbf{x}$ and the weights $\mathbf{w}$. Logistic regression uses maximum likelihood estimation to classify, assigning an instance to the class for which the probability $p$ is maximal.
Gradient descent is the algorithm usually used in logistic regression to learn the weights from the data. A thorough explanation of the implementation of gradient descent and logistic regression can be found in Machine Learning in Action [20] and Primer with Matlab [21].
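As an illustration of the training procedure described above, the following is a minimal NumPy sketch of binary logistic regression fitted with gradient descent; the toy data and learning rate are illustrative, not the thesis's actual setup.

```python
# A minimal sketch of binary logistic regression trained with gradient
# descent on a small NumPy feature matrix X and 0/1 labels y.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    w = np.zeros(X.shape[1])              # one weight per feature
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)   # gradient of the log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage: two features, four samples.
X = np.array([[0.1, 1.0], [0.3, 0.8], [0.9, 0.2], [0.8, 0.1]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic_regression(X, y)
print(sigmoid(X @ w + b).round(2))        # probabilities approaching y
```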
Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes' theorem, shown below, which in short helps predict the probability of an event happening given knowledge of prior events [22]:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

The approach assigns the instance to the target class with the highest probability. This is done by letting $A$ be the target class $y$ and $B$ the feature vector $\{x_1, x_2, \dots, x_n\}$:

$$\hat{y} = \underset{y}{\arg\max}\; P(y \mid x_1, x_2, \dots, x_n) = \underset{y}{\arg\max}\; \frac{P(x_1, x_2, \dots, x_n \mid y) \cdot P(y)}{P(x_1, x_2, \dots, x_n)} = \underset{y}{\arg\max}\; P(x_1, x_2, \dots, x_n \mid y) \cdot P(y)$$

since the denominator is constant for all classes.
By looking at the feature sets and target classes of previous samples and estimating the probabilities via frequencies, a classifier can be trained. The difficulty comes with increasing the feature set: estimating the joint probability $P(x_1, x_2, \dots, x_n \mid y)$ then becomes very difficult.
The naive approach assumes that the features are conditionally independent given the target class, so the joint likelihood is just the product over all features:

$$\hat{y} = \underset{y}{\arg\max}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Training the method involves estimating the probabilities of the features given each target class. The method then uses these estimated probabilities to determine the class of a new instance.
Random Forest
This classifier is an ensemble algorithm, meaning it uses more than one instance of a base model to determine the class of the target. Specifically, random forest uses several decision trees.
A decision tree predicts the class by constructing a tree of decisions. As the sample to be predicted passes through the tree, a decision is made at each step. The outcome of the process is the predicted class of the sample.
The trees are trained using randomly selected portions of the training set. A majority vote between the different trees is held to determine the predicted class of the target [23]. A depiction of the division and voting is shown in Figure 2.
Figure 2: Random Forest
2.2 Theoretical background
The theoretical background explains the knowledge necessary to understand the process used in this thesis.
Data analysis
Data analysis is the first step in every machine learning project. A thorough understanding of the data makes it easy to detect outliers, clean the data and build a suitable model.
Data pre-processing, on the other hand, is the technique used to prepare the data, for example by removing noise, missing values and other data samples that may distort the prediction result.
In short answer scoring, the only data available is the text of the students' answers. It is difficult to analyse raw text to find interesting relations without first transforming it into a different format. Nonetheless, we can analyse the frequency of some words in the text, as depicted in Figure 3. We can also conduct more advanced analysis by looking for the most frequent bigrams, also called collocated words.
Furthermore, a grammar checker can be used to grammatically correct the students' answers. The grammar checker has some inherent problems because of its implementation in Python. The result is still better than not performing the grammar check.
Cleaning
The dataset used is full of different kinds of noise: composite words that should be separated, grammatical errors, words that do not fit in the phrases, grading errors, etc. The cleaning of the dataset is of great importance. The cleaning process involves separating composite words like “container.also” into “container also”, and furthermore using an approximate string matching algorithm to compensate for such errors.
Tokenize
The process of taking a sentence and creating a list of its words, in the same order as the sentence, is called tokenization. This technique is used to make it easier to perform operations on text. An example of tokenization is shown below.
I want to have fun → [I, want, to, have, fun]
Stemming
Stemming is a method to find the root of a word; the root of running is run. This process is performed on individual words, often on a tokenized list. Different stemming techniques exist, but the idea is the same everywhere: to get the root of the word.
Lemmatization
Lemmatization is like stemming but uses a vocabulary and morphological analysis of the word to return its lemma. The difference is that stemming does not consider the context the word is used in; lemmatization tries to do this by using the word's part of speech.
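A small illustration of tokenization, stemming and lemmatization with NLTK, the toolkit used elsewhere in this thesis; it assumes the punkt and wordnet corpora have been downloaded.

```python
# Tokenize a sentence, then stem and lemmatize its words with NLTK
# (requires nltk.download('punkt') and nltk.download('wordnet') once).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize("The students were running experiments")
print(tokens)  # ['The', 'students', 'were', 'running', 'experiments']

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # e.g. 'running' -> 'run'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # 'run', using the verb POS
```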
Bag of words
A bag-of-words (BOW) model is a common way to vectorize a phrase or text using each unique word and its occurrence count.
Given the two texts $T_1$ = “We need to know the quantity of vinegar” and $T_2$ = “We need to measure the amount of vinegar” and the dictionary $D$, two bag-of-words vectors can be created based on the dictionary:

D = [we, need, to, know, measure, the, quantity, amount, of, vinegar]
BOW(T1, D) = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
BOW(T2, D) = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

In this case a binary vector is shown as an example, but that need not be the case. It is binary only because each word occurs at most once, and it has zeroes where the word under consideration does not occur; for example, the word quantity does not occur in the second text, so there we have put a zero.
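A minimal sketch of the same idea with scikit-learn's CountVectorizer; note that the vectorizer sorts its vocabulary alphabetically, so the column order differs from the dictionary above.

```python
# Bag-of-words vectors for the two example texts via CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["We need to know the quantity of vinegar",
         "We need to measure the amount of vinegar"]
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")  # keep short words like 'of'
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # the learned dictionary
print(X.toarray())                         # one count vector per text
```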
N-grams
N-grams are a method used to represent and analyse a text. Given the sentence “We need to know the quantity of vinegar”, we can either choose to analyse the sentence by considering each word on its own, [We, need, to, know, the, quantity, of, vinegar], which is called a unigram representation, or we can consider the sentence as a sequence of word pairs, [We need, need to, to know, know the, the quantity, quantity of, of vinegar], which is called a bigram representation. We can continue the process to trigrams and in general to n-grams.
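A short sketch of producing the bigrams above with NLTK's ngrams helper:

```python
# Bigrams from a tokenized sentence, via NLTK.
from nltk.util import ngrams

tokens = "We need to know the quantity of vinegar".split()
print(list(ngrams(tokens, 2)))
# [('We', 'need'), ('need', 'to'), ..., ('of', 'vinegar')]
```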
Feature extraction
Feature extraction is a technique used before building a machine learning model. It is used to find characteristics or attributes of each sample in the dataset. For example, a feature or attribute of a text can be the length of the text, the presence or absence of some keywords, or the similarity score between the sample and some reference answers. These features are used to predict the class of unseen samples, given that we know their attributes.

$$[x_{11}, x_{12}] \rightarrow y_1$$
$$[x_{21}, x_{22}] \rightarrow y_2$$
$$[x_{31}, x_{32}] \rightarrow \;?$$

The above example shows how two labeled feature vectors can be used to infer the class of a third example given its feature vector.
Feature selection
After different features have been implemented, a feature selection algorithm can be used to find the most predictive features. The most predictive features are those that contain a lot of information about the category of interest: knowing the state of those features increases our knowledge of the class the sample belongs to. In many cases feature selection increases the performance of the classifier. It can also decrease the training time by reducing the dimension of the feature space.
One of the most well-known algorithms for feature selection is the BORUTA algorithm [11]. In this thesis we will use the feature selection from SKLEARN¹. A complete theoretical introduction to feature selection can be found in An Introduction to Variable and Feature Selection [18] and Improving Question Classification by Feature Extraction and Selection [19].
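A hedged sketch of univariate feature selection with scikit-learn's SelectKBest; synthetic data stands in for the real feature matrix, and k = 15 mirrors the number of features selected later in the thesis.

```python
# Keep the k features with the highest ANOVA F-score towards the labels.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=19, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=f_classif, k=15)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # (200, 15)
print(selector.get_support())  # boolean mask of the kept features
```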
2.3 Related work
As described in the introductory chapter, this thesis is about automatic short answer scoring. It is a very broad subject with many hard problems and no known general solutions, due to the difficulty of the subject. To build a well performing answer scoring system, expert knowledge in the domain the system is built for is required, as well as knowledge of computational linguistics, data science, computer science, machine learning and mathematics.
Plenty of research has been conducted in the area of short answer scoring, and different approaches have been tested and studied: from approaches that rely solely on full understanding of the question using advanced natural language processing techniques, to methods that rely on pattern recognition and machine learning. Some studies have also considered combinations of machine learning with rule-based approaches.
The common characteristic of the aforementioned approaches is that they are difficult to use outside the domain for which they were built. This is the main reason behind the difficulty of deploying short answer scoring systems on a large scale: every time a new exam is to be assessed, a new system needs to be built, which is very cumbersome.
Machine Learning
To apply machine learning in a given domain, the data plays a vital role; data is of paramount importance in machine learning. This is equally true in the area of short answer scoring. There has been much research dealing with the application of machine learning to short answer scoring [7]. Of all the research done in this domain, we will review only two studies here, because they are the most relevant and most related to this thesis work; furthermore, they used the same dataset as this thesis.
¹ SKLEARN (scikit-learn) is one of the most popular open source machine learning libraries, written in Python. See http://scikit-learn.org/stable/index.html for more information.
The interesting attributes implemented in the first paper are the following:
Latent Dirichlet Allocation (LDA) is an algorithm for topic modelling that automatically finds the topics of unlabelled texts [10]. The authors built a model by constructing two LDA topic spaces. The similarity between the student's answer and the two LDA topic spaces was used as a feature of the student's answer.
Well-formedness features were used to check the grammatical correctness of the answer.
Length features: the word and character lengths of each answer were used as attributes.
Language model features: two language models were trained, and the perplexity of each answer under those language models was used as a feature.
The results obtained by the paper are presented in Table 1.
Table 1: Prognosis Essay Scoring and Article Relevancy Using Multi-Text Features and Machine Learning
In the paper Prognosis Essay Scoring [11], the authors followed almost the same approach as the one explained above. The only difference is that these authors used a restricted number of features: word2vec, regular expressions, text statistics and N-grams. Those features were used to train a random forest and a gradient boosting classifier with the Boruta selection algorithm. Their results, labelled proposed, are presented in Table 2.
This thesis takes inspiration from the two aforementioned approaches but uses different classifiers and feature algorithms. A detailed explanation of the methods used and the results obtained is presented in the following chapters.
Rule-based systems
A rule-based system can be described as a system which consists of a working memory or knowledge base, a rule base, an inference engine and an execution engine [12]. The knowledge base or working memory describes the facts and conditions. The rule base describes relations between premises and conclusions. The inference engine has a pattern matcher that applies rules given the facts, and an agenda that lists all relevant and applicable rules. The execution engine decides which rules to apply given the input.
Two main inference algorithms are used in rule-based systems: forward chaining and backward chaining [13]. Forward chaining uses a top-down approach: it begins with the facts and uses the rules to infer conclusions or trigger actions given the facts. Backward chaining uses a bottom-up approach: it starts with some hypothesis or goal and searches the rule space for rules that could be used to prove the observed hypothesis, setting new subgoals to prove as the process moves forward.
In this thesis we will not use the rigid, logical definition of a rule-based system described above. A rule-based system in the context of this thesis is a system that understands the student answer. This is done by building a model of the reference answers and using a pattern matcher to compute the relation between the student's answers and the reference answers. This process can also be described, to some extent, as an information extraction technique. The reference answers and the patterns formed can be considered rules, the student answer can be represented as a fact, and the system infers whether the fact follows the rules.
Figure 4: Illustration of a simple mark scheme template
T. Mitchell used the above template to match the student answer to the reference answers [14]. The author also used some text pre-processing and natural language processing techniques with sentence analysis.
S. Pulman and J. Sukkarieh follow almost the same approach as T. Mitchell in their paper, but with a big emphasis on synonym construction [15]. They use an information extraction approach with handmade patterns as rules. They also tried an inductive programming approach but did not find it very promising.
A more complete resource about information extraction and rule-based approaches can be found in “A systematic approach to the automated marking of short-answer questions” [16]. The authors begin by spell checking the student answers and parsing them with the Stanford parser. In the second step, they use the Stanford parser partly to find the part-of-speech (POS) tags of the student's answer and partly to find the typed-dependency parse tree of the student's answer text. The obtained POS tags are used to analyse the answer syntactically with the Question Answer Language, an EBNF grammar developed by the authors to describe the different patterns that correct answers must follow. The typed-dependency parse is used to analyse the grammatical relations in the text, in order to ensure that the given text follows the correct predefined grammatical patterns.
A different approach uses rules to match the student's answer with predefined correct model answers or reference answers [3]. In that paper, a scoring algorithm is described and given in pseudocode. The algorithm automatically scores a student answer given some predefined patterns constructed from the model answers: if the student's answer matches one of those patterns, the student's score is increased by one.
Combined approach
There have been many attempts to combine different machine learning approaches in order to build a robust model. In Hybrid approach for automatic short answer marking, the authors describe a system that uses synonyms to build additional training datasets and builds a decision tree classifier for each model answer. The rules extracted from the decision trees are then used to classify new unseen text answers [4].
Although not covered in the related work above, there is some previous work on transfer learning for short answer scoring done by different researchers [9]. Transfer learning is a very promising field, especially for short answer scoring, because it would alleviate the need to repeat the same training and/or information extraction process for each new question or exam, which is cumbersome and undesirable if we wish for broad deployment of this kind of system.
The approach used in this thesis is different from the aforementioned methods: it uses a semi-supervised approach combined with a rule-based system. A semi-supervised algorithm is an algorithm that uses a combination of labeled and unlabeled data to predict new examples. The semi-supervised algorithm used in this work is self-learning [17].
The self-learning method used in the thesis combines a machine learning method and a rule-based method. The combination lies in how the machine learning method trains itself. This is done to increase performance and decrease the amount of data needed during training.
3 Methodology
In order to answer the research question presented in this thesis, a research method needs to be applied. There are two basic research methodologies: quantitative and qualitative research [8]. Quantitative research implies that the research question can be answered through quantifiable results. This is usually applied in experiments, testing and computer systems. This method requires the use of statistical methods to validate the results.
Qualitative research is the opposite and focuses on non-numerical methods. The methodology focuses on understanding meaning, opinion and behaviour to reach a result.
3.1 Research Approaches
The two most common research approaches are inductive and deductive [24]. The inductive approach is based on analysing the data and formulating views and opinions of the phenomenon. The deductive approach verifies or falsifies a hypothesis. This is done by testing a theory on a large dataset, and the result must be measurable. This thesis uses a deductive approach to verify the given hypothesis.
3.2 Data collection and analysis
The data is based on a well-known dataset from the data science competition website Kaggle [35]. The dataset is used for experimental studies in short answer scoring. It consists of student answers and two scores ranging from 0 to 3, where 3 is a full mark.
The dataset is transformed in order to work with our research methods. This is done by dividing the dataset into separate sets: a train set, a validation set and a test set. The division is done by randomly selecting elements from the original dataset. The train set is used for training an initial or base classifier for the given research problem. The validation set is used for tuning variables in the experimental system to increase its performance and accuracy. The test set is used to measure the goodness of the model: its ability to generalise well beyond known examples.
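A minimal sketch of this random split, assuming the 70/10/20 train/validation/test proportions used later for the machine learning method:

```python
# Split X, y into train (70%), validation (10%) and test (20%) sets.
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=0):
    # First carve off the 20% test set, then take 10% of the total
    # (i.e. 1/8 of the remainder) as the validation set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.125, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test
```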
3.3 Experimental System
The system is built in three parts, where each part corresponds to a method constructed to tackle the research question. Each part is evaluated separately using existing evaluation methods. Two of the methods are based on known techniques which have given quantifiable results in the past. The third is a combination of the previous two and will be tested in the same manner.
The system is developed using an incremental model, where the system is improved over iterations of development [25]. This incremental model is picked because it greatly improves the results and is relatively inexpensive to apply to a software system.
Python 3 is used to build the system [26]. The Python programming language has a lot of tools for achieving the results needed to answer the research question. A very important tool is the machine learning library Sklearn, which is used extensively in this thesis [27].
3.4 Evaluation
All the systems implemented are evaluated in the same manner. The evaluations are based on established methods in machine learning and Automatic Short Answer Grading [2]. The assessment measures are, inter alia, accuracy, precision, recall and F1-score. All the measures presented here will be used later and are important in order to assess the model, but we will focus mainly on accuracy when comparing the results.
Accuracy
Accuracy is the fraction of correct predictions of the model over the whole dataset. Here $\hat{y}_i$ is the predicted value and $y_i$ the true value of sample $i$, and $n_{\text{samples}}$ is the number of elements in the dataset that was tested. The accuracy is a value between 0 and 1 [29]:

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} 1(\hat{y}_i = y_i)$$
To remove uncertainty in the accuracy, a statistical method called cross-validation is sometimes used instead. The training set is divided into multiple separate sets and the accuracy of each of them is calculated separately. The results are then summarised by the mean score together with its standard deviation. Here $k$ is the number of divided sets, $s_i$ the true values and $\hat{s}_i$ the predicted values of fold $i$. The cross-validation score $\omega$ is given by:

$$\omega = \frac{1}{k} \sum_{i=1}^{k} \text{accuracy}(s_i, \hat{s}_i)$$
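Both measures are available in scikit-learn; a small sketch on synthetic data:

```python
# Mean accuracy and standard deviation from 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=19, random_state=0)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())  # mean accuracy +/- standard deviation
```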
Figure 5: Confusion Matrix
Figure 5 shows a confusion matrix for two classes, 0 and 1. A confusion matrix is a matrix used to assess the performance of a model by comparing the true values and the predicted ones. It shows how well the model is capable of predicting the true value of each sample.
True positive (TP): a true value is predicted to be true
True negative (TN): a false value is predicted to be false
False positive (FP): a false value is predicted to be true
False negative (FN): a true value is predicted to be false
Precision
Precision, or positive predictive value, measures the precision of a classifier. It is the fraction of relevant instances among all retrieved instances and gives a measure of how well the model performs at classifying instances into the right category. It is computed by taking the true positives (tp) and false positives (fp) and determining the ratio of tp to the whole [31]:

$$\text{precision} = \frac{tp}{tp + fp}$$
Recall
Recall, or sensitivity, measures the fraction of relevant retrieved instances over the total amount of relevant instances. It is computed by taking the true positives (tp) and false negatives (fn) and returning the ratio of true positives to the whole [32]:

$$\text{recall} = \frac{tp}{tp + fn}$$

The F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
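All of these measures can be computed with scikit-learn's metrics module; a small sketch with illustrative 0–3 score labels:

```python
# Macro-averaged precision, recall and F1 over the four score classes.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [3, 0, 2, 3, 1, 0]
y_pred = [3, 0, 2, 2, 1, 1]
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
```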
4 Work
This section will present the process used in this thesis and the way the work was conducted. The data will be presented in section 4.1, and the remaining sections 4.2–4.4 will present the work conducted in the thesis project.
4.1 Data
The dataset is from the Automated Student Assessment Prize (ASAP), a joint effort of the Hewlett Foundation and Open Educational Solutions to gather all the current approaches to automated scoring of open-ended student response tasks. The competition is available at Kaggle [35]. The dataset is one of the largest publicly available datasets for short answer scoring today [2]. Figure 6 shows the structure of the dataset.
Figure 6: Dataset Structure
The dataset is a tab-separated value file with 5 columns. The first column is the identity number of each answer, which is unique and ranges from 1 to 27 588. The second column is the essay set: there are 10 questions, and EssaySet identifies the question, ranging from 1 to 10. Score 1 is the first score and Score 2 is the second score, from two different correctors, but the final score is Score 1. The essay text is the student's text answer.
In this project, we only consider the first question and only assess our system using Score 1. The reason for limiting ourselves to the first question is due partly to time constraints and partly to the reliability of Score 1, which is the final human score. The final dataset used in this project corresponds to EssaySet 1 with Score 1 as label, which is exactly 1672 data samples.
There are some grammatical errors in the dataset and some assessment errors from the human correctors, but those errors do not affect the whole dataset.
The errors can be caused by misspellings made when pupils wrote their answers on paper. Other errors can be due to the optical character recognition system used when translating the handwritten texts to machine-encoded text.
4.2 Machine Learning Method
The matrix of features X and the corresponding labels y, as described in section 2.2, are used with three different classifiers:
• Logistic regression
• Naive Bayes
• Random forest
The machine learning model for each classifier is evaluated and assessed and the result obtained can be found in section 5.
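A hedged sketch of training and comparing the three classifiers; synthetic data stands in for the extracted feature matrix, and GaussianNB is used here since the features are continuous (the thesis does not state which Naive Bayes variant was used).

```python
# Fit the three classifiers on the same split and compare test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=19, n_classes=4,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              GaussianNB(),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, pred))
```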
Approach
Using machine learning for short answer scoring, we want to find patterns in the dataset that can help us predict the scores of student answers not present in the training dataset. The only thing available is the students' answer texts and their corresponding scores.
As stated before, machine learning algorithms work with numerical values². The main difficulty is to implement algorithms that transform the student answer into numerical values that can be fed into a machine learning algorithm. To achieve this transformation from free text to numerical values, 19 features or characteristics of the text are implemented. These features take the text as input and return a numerical value, for example a distance measure between the text and a reference answer such as a similarity score, or the number of keywords present in the text.
Feature Implementation
Features are characteristics of text. These characteristics can help us decide whether a text is worth 3 points or 0 points, depending on the numerical values of the features, sometimes called attributes.
A simple example of a feature is the number of keywords present in the student's answer text. A detailed explanation of the features implemented below can be found in [2], [7], [10], [36]–[42]. The following features are implemented and computed for each answer.
Cosine Similarity
The simplest form of cosine similarity uses the bag-of-words representations of two texts and computes the distance between their respective vectors $V_1$ and $V_2$ using the following formula:

$$\cos(V_1, V_2) = \frac{V_1 \cdot V_2}{|V_1||V_2|}$$

² Most machine learning algorithms work with numerical values. There are string kernel functions that can work directly with strings, but the final result of the kernel function is still a numerical value. Some algorithms can also use categorical values such as {sunny, cold, rainy, ...}.
The similarity score between the vector representations of the reference texts and the student answers was used as a feature in the project.
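A minimal sketch of this cosine-similarity feature over bag-of-words vectors, mirroring the formula above:

```python
# Cosine similarity between the bag-of-words vectors of two texts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def cosine_feature(student_answer, reference_answer):
    vecs = CountVectorizer().fit_transform(
        [student_answer, reference_answer]).toarray().astype(float)
    v1, v2 = vecs
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_feature("we need to know the quantity of vinegar",
                     "we need to measure the amount of vinegar"))
```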
Keywords
Keywords is the simplest feature: we have a list of keywords and simply count how many times each keyword occurs in the student answer.
We use two versions of the keyword feature, one that simply counts the occurrences of the keywords and returns the result as a feature, and one that normalizes the result. These two features are then saved in the feature matrix as explained above.
Latent Semantic Analysis
A complete explanation of latent semantic analysis (LSA) would be lengthy and out of the scope of this thesis. Nonetheless, latent semantic analysis can be described as a method to grasp the latent (hidden) semantic space of two or more texts, even if they do not necessarily share the same words.
Given the dictionary [we, need, to, know, measure, the, quantity, amount, of, vinegar] and the vector representations of four texts using this dictionary, we can build the following matrix:

$$X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 3 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 3 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 2 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$

Each column is the transposed bag-of-words vector representation of one text, using the common dictionary.
After the construction of the matrix, we compute the singular value decomposition of X:

$$\text{SVD}(X) = U \Sigma V^\top$$

The obtained matrices are truncated to a lower dimension $k$ to find an approximation, at the same time reducing the computation time needed, since

$$X_k = U_k \Sigma_k V_k^\top$$

is of lower dimension.
The terms are represented by the rows of $U$ and the texts (documents) by the columns of $V^\top$; the similarity between two terms or documents in the semantic space can be computed using the cosine similarity between the two corresponding vectors.
Two versions of the LSA model have been implemented as features. One uses a single reference answer where each creditworthy sentence is considered a document; the feature is calculated by averaging the similarity scores between the student's answer and each sentence in the reference answer.
The other LSA model is built with a set of a hundred reference answers as documents, and the highest similarity between the student's answer and the reference answer set is returned as a feature.
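A hedged sketch of such an LSA feature using scikit-learn's TruncatedSVD on the term-document matrix; the reference texts and number of components are illustrative.

```python
# Project reference answers and a student answer into a small latent
# space, then take the highest cosine similarity as the feature.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

references = ["we need to know the quantity of vinegar",
              "we need to measure the amount of vinegar",
              "the mass of each sample should be recorded"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(references)

svd = TruncatedSVD(n_components=2, random_state=0)
ref_topics = svd.fit_transform(X)                 # documents in latent space

student = vectorizer.transform(["how much vinegar do we need"])
student_topic = svd.transform(student)
print(cosine_similarity(student_topic, ref_topics).max())
```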
Partial Word Overlap
Partial word overlap compares two texts by computing the word overlap between them using the following formula:

$$\text{overlap}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1| + |T_2|}$$

We use the word partial here because it is not an exact matching of words between the two texts being compared, but an approximate matching where we allow some difference between the words; for example, vinegar and vinnagar will be matched by an approximate string matcher.
Language Model
The language model is a method commonly used for automatic word suggestion or completion, for example when typing in Google or other search engines.
A language model is used to calculate the probability of the next words given the preceding observed words. For example:

$$P(\text{lose weight} \mid \text{how to}), \quad P(\text{draw} \mid \text{how to})$$

When typing ‘how to’ in Google, the phrases ‘lose weight’ and ‘draw’ appear among the suggestions showing up first. That means that in Google there is a high probability that people search for ‘lose weight’ or ‘draw’ after typing ‘how to’. To estimate those probabilities, we need a very large corpus that is representative of the goal or application considered. To avoid zero probabilities, the probability is approximated with unigrams and the independence of words is sometimes assumed.
We have made a large corpus of acceptable answers and calculate, as a feature, the perplexity of each student answer, i.e., how probable the answer text is under the language model built from the corpus.
Latent Dirichlet Allocation
Latent Dirichlet allocation is a rather advanced algorithm using hierarchical probabilistic distributions and advanced methods such as Gibbs sampling, because exact estimation of the probabilities used in the model is NP-hard.
We use the algorithm implemented in Gensim for the LDA features. Two versions are implemented.
The first version builds an LDA model of the reference answers. The student answers are then mapped into the same topic space as the reference answers, and the topic distribution probability of the student answer is used as a feature.
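A hedged sketch of the LDA feature with Gensim, as described above; the tokenized reference answers and number of topics are illustrative.

```python
# Build an LDA topic model of the reference answers and read off the
# topic distribution of a student answer.
from gensim import corpora, models

references = [["vinegar", "mass", "sample"],
              ["temperature", "sample", "procedure"]]
dictionary = corpora.Dictionary(references)
corpus = [dictionary.doc2bow(doc) for doc in references]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0)

student = dictionary.doc2bow(["sample", "vinegar"])
print(lda[student])   # topic distribution of the student answer
```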
Word Alignment
Word alignment compares two texts by calculating or estimating how many semantically similar words the two texts have in common. It is in many ways reminiscent of partial word overlap, but here we are only interested in word-to-word semantic similarity, not necessarily syntactic similarity.
The formula to calculate the word alignment is:

$$\text{align}(T_1, T_2) = \frac{n_a(T_1) + n_a(T_2)}{n_c(T_1) + n_c(T_2)}$$

where $T_1$ and $T_2$ are the two input texts, $n_c$ is the number of content words in the input text (all word counts, without stop words), and $n_a$ is the number of aligned words in the input text (without stop words).
Corpus Similarity
Corpus similarity uses a set of keywords from the reference answer and looks for each such word in the student's answer; if a word is not found, the algorithm looks up synonyms of the word and again matches them against the student's answer. The number of matched words is used as a feature.
Jaccard
Jaccard similarity is used to calculate the similarity between the token sets $A$ and $B$ of the student answer and a reference answer using the following formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The similarity between the student's answer and each reference answer is calculated, and the average score is returned as a feature.
Dice Similarity
Dice similarity calculates the similarity between the token sets $A$ and $B$ of the student answer and a reference answer using the following formula:

$$\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$$

The similarity between the student answer and each reference answer is calculated, and the highest score is returned as a feature.
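Both set-based similarities are straightforward to compute; a small sketch over whitespace tokens, mirroring the two formulas above:

```python
# Token-set Jaccard and Dice similarities between two texts.
def jaccard(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return 2 * len(A & B) / (len(A) + len(B))

s = "we need to know the quantity of vinegar"
r = "we need to measure the amount of vinegar"
print(jaccard(s, r), dice(s, r))
```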
Bleu Score
BLEU is a machine translation evaluation method that estimates the quality of a machine translation. The quality estimate is calculated by comparing the machine translation against a set of human reference translations; the score is a way to benchmark the performance and quality of the machine translation. There has been recent research attempting to modify the BLEU scoring algorithm to fit different needs, for example in textual entailment and recently in essay scoring [36]. To use BLEU as features, we have used both an unmodified BLEU algorithm from the Natural Language Toolkit and a modified BLEU algorithm based on [36], although some steps in that paper have been disregarded or modified.
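A small sketch of the unmodified BLEU feature via NLTK; smoothing is added to avoid zero scores on short texts, and the modified variant from [36] is not shown.

```python
# BLEU between a student answer (candidate) and a reference answer.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["we", "need", "to", "measure", "the", "amount",
               "of", "vinegar"]]
candidate = ["we", "need", "to", "know", "the", "quantity",
             "of", "vinegar"]
smooth = SmoothingFunction().method1   # avoid zero scores on short texts
print(sentence_bleu(references, candidate, smoothing_function=smooth))
```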
Ngram
The reference answers are first broken up into sentences, each of them creditworthy, and every sentence is analysed independently. Using the approximate string matching algorithm implemented in Python called Ngram, we calculate the similarity between the student's answer and every sentence. The three highest values are returned as features.
Key
The feature extracted is the number of unique keywords found by comparing the answer to a selected list of keywords. The keywords are selected by modifying the initial reference answers: punctuation, stop words and duplicated words are removed. The remaining words are put into a list $k_i$; this is done for all the reference sentences, resulting in a big list

$$K = \{k_1, k_2, \dots, k_n\}$$

The extraction algorithm compares the student's answer to each element in the list $K$. The comparison is done by counting how many of the keywords in $k_i$ are found in the student's answer. The results are stored in a list

$$S = \{s_1, s_2, \dots, s_n\}, \quad s_i \in [0, |k_i|]$$

The numerical value returned is the sum of the three highest values $s_i$ in $S$.
The reference answers are divided up into sentences. This is done by selecting the nouns and their corresponding synonyms. Each sentence is put into a list. Stemming is used on all words in order to get a higher match rate. Each sentence is tested independently against the student's answer: if the student's answer contains all the words present in the reference sentence, a credit is given. The process is repeated for all sentences. The sum of all obtained points or credits is returned as a feature, where sums exceeding 3 are replaced by 3. An example of reference sentences is given in Figure 7.
Figure 7: Reference Sentences
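A hedged sketch of the Key feature described above: count keyword hits per reference keyword list and return the sum of the three best counts (the keyword lists here are illustrative).

```python
# Sum of the three highest keyword-hit counts over the lists in K.
def key_feature(student_answer, keyword_lists):
    tokens = set(student_answer.lower().split())
    scores = sorted((len(tokens & set(k)) for k in keyword_lists),
                    reverse=True)
    return sum(scores[:3])

K = [{"vinegar", "quantity"}, {"mass", "sample"}, {"temperature"}]
print(key_feature("measure the quantity of vinegar in each sample", K))
```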
Feature Extraction
Feature extraction refers here to the computation of the features for each student answer in the training set. Each feature mentioned above is computed for every student answer in order to build a matrix that a machine learning algorithm can use to make predictions.
For each student answer $a$ we compute a feature vector and a label:

$$f(a) = [f_1, f_2, f_3, \dots, f_{19}]$$
$$\text{score}(a) \in [0, 3]$$

The list of computed feature vectors and their corresponding scores are represented as the matrix X and the labels y, together forming our training set.
Feature Selection
Given that many features are implemented, some of them may be useless, inaccurate or less predictive than others. Therefore, we need a method to select the best performing features for the scores to be predicted. By best we here mean the features with the highest accuracy or highest information gain towards the prediction classes.
The process of choosing the best performing features is called feature selection, as explained in the background chapter. The feature selection algorithms implemented in the scikit-learn Python library are used.
Machine Learning System Architecture
The validation data is used for parameter tuning, in order to increase the system's performance or change its behaviour. It is therefore of great importance that this set is isolated from the test data, to avoid wrongful results or bias. A step-by-step process is shown in Figure 8.
4.3 Rule-based System Method
In this section a detailed overview of the final rule-based system will be presented. Natural language processing is used to build the rule system and the process will be explained in detail.
Rule-based System Architecture
The system is built in two main levels. The first level is a filter which decides which pattern matcher should be used in the next level. The output of the system is the predicted score of the student's answer. A step-by-step depiction of the system is shown in Figure 9.
The filter level takes two inputs. The first input is the student's answer, processed as a list of sentences $S = \{s_1, s_2, \dots, s_n\}$, and the second input is a list of keywords $K$. The output is a list of sentences $S' = \{s'_1, s'_2, \dots\}$ for each trial, given by a list $T = \{T_1, T_2, \dots\}$. Each list can contain none or all of the sentences of the original student answer.
If a sentence contains a keyword, that sentence is attached to the corresponding sentence list $T_i$. Each pattern can be different; the input of a pattern box is a list of sentences. The test is performed in the same manner for each pattern: if a sentence passes the pattern box the output is true, otherwise it is false.
The outputs of the trials are summed up into a single value $v$. The maximum number of points this system can give out is 3; therefore, the final output of the system is $\min(v, 3)$, which ensures that the final score is in the correct range [0, 3].
Figure 8: Machine Learning Process
Patterns and Rules
Following a simple rule-based method, two main techniques are used to predict the score of the student answer. The first is the keyword filter and the second is a grammatical similarity comparator based on part-of-speech tagging.
Keyword selection
From the reference answers, six keywords are extracted. The keywords selected are nouns that are important for accurately answering the given question. Other words such as temperature and sample are extracted from the analysis of the dataset. Each keyword is extended with its corresponding synonyms from WordNet [43]. Finally, each word in the student's answer is matched against the original word list in Figure 10 or its synonyms taken from WordNet.
Figure 10: Keyword List
Part of Speech structure
The system used to filter each student's answer is constructed in the same manner; the difference lies in the structure of the sentences. The grammatical structure of each sentence is analysed with the help of the part-of-speech tagging from NLTK.
Figure 11 shows how the comparison is made between the student's answer and the reference example. If the part-of-speech tags of the student's answer and the reference sentence overlap during the comparison, then the answer is worth a point. Each trial results in either 0 or 1, no matter how many sentences are tested.
Figure 11: Part of Speech Pattern Matcher
At the beginning, the student answer is broken up into sentences; a possible representation of a student answer is $S = \{s_1, s_2, \dots, s_n\}$. Every sentence is put through the first level of the system, the pattern matcher. The pattern matcher prepares each sentence for the next level by returning a list of possible patterns to test the sentence on for further analysis:

$$s_i \rightarrow \{p_1, p_2, \dots, p_m\}$$

The second level analyses and compares the student's answer against the selected patterns output by the pattern matcher. If the similarity is strong enough, the student answer gets a point; otherwise, no point.
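A hedged sketch of the two-level rule-based scorer: a keyword filter routes sentences to pattern tests and each passed pattern adds a point, clamped to the score range. The keywords and regular-expression patterns here are illustrative, not the thesis's actual rules.

```python
# Two-level rule-based scorer: keyword filter, then pattern matching.
import re

KEYWORDS = {"vinegar", "mass", "temperature", "sample", "procedure",
            "amount"}
PATTERNS = [re.compile(r"\bhow much vinegar\b"),
            re.compile(r"\bmass of (the |each )?sample"),
            re.compile(r"\bsame temperature\b")]

def score_answer(answer):
    sentences = [s.strip().lower() for s in answer.split(".") if s.strip()]
    # Level 1: keep only sentences containing at least one keyword.
    candidates = [s for s in sentences if KEYWORDS & set(s.split())]
    # Level 2: one point per pattern matched by any candidate sentence.
    points = sum(1 for p in PATTERNS
                 if any(p.search(s) for s in candidates))
    return min(points, 3)   # clamp to the score range [0, 3]

print(score_answer("You need the mass of each sample. "
                   "Keep the same temperature."))   # -> 2
```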
4.4 Combined Method
In the third part of this project, we have combined a machine learning approach with a rule-based system for the short answer scoring problem. The machine learning algorithm used is a semi-supervised algorithm called self-learning. This algorithm is selected due to its simplicity and its ability to learn from few data samples. The rule-based system is the same as the system described in section 4.3.
Combined attributes
The initial machine learning model used in the self-learning algorithm is the same as the model presented in section 4.2, with 19 different features. The feature selection implemented in Sklearn is used in order to reduce dimensionality and increase the accuracy of the classifier.
Self Training Implementation
As explained in the introductory chapter, there have been many attempts to combine different techniques for short answer scoring. We have used a self-learning method combined with a rule-based system, namely the rule-based system described above.
The pseudocode of the self-learning system is presented in Figure 12.
Figure 12: Self-Learning Rule-based Algorithm
The condition in the if statement has been changed to use the prediction from the rule-based system instead. Self-learning usually uses the probability of its own prediction and labels the unlabeled data only if the probability is above a predefined threshold. Instead of using the classifier's own confidence estimate, the rule-based system is used to compensate for some of the errors the machine learning model can make, due to the few data examples used in training and the weakness of the model.
Here the unlabeled examples are labeled only if the prediction of the rule-based system coincides with the prediction of the classifier. There is some possibility that both the classifier and the rule-based system are wrong about a given example, but the overall benefit of having them combined is much higher, and it alleviates the need to label those samples beforehand.
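A hedged sketch of this modified self-learning loop, where the classifier's prediction must agree with the rule-based score before an unlabeled sample is added to the training set; rule_scores is assumed to be precomputed with the rule-based system, and all inputs are NumPy arrays.

```python
# Self-learning with a rule-based agreement filter instead of the usual
# confidence threshold.
import numpy as np

def self_learning(clf, X_train, y_train, X_unlabeled, rule_scores,
                  max_rounds=10):
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    pool = np.arange(len(X_unlabeled))   # indices of still-unlabeled samples
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        clf.fit(X_train, y_train)
        preds = clf.predict(X_unlabeled[pool])
        agree = preds == rule_scores[pool]       # agreement filter
        if not agree.any():
            break
        # Move the agreed-upon samples into the labeled training set.
        X_train = np.vstack([X_train, X_unlabeled[pool[agree]]])
        y_train = np.concatenate([y_train, preds[agree]])
        pool = pool[~agree]
    clf.fit(X_train, y_train)
    return clf
```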
5 Results
The presentation of the results of the experiments in this chapter follows the overall structure of the thesis work. The results for the machine learning approach are presented first, followed by the results of the rule-based system, and finally the results for the combination of the two methods.
In each part, the model evaluation results will be presented, and the results of model optimization will be shown where applicable.
5.1 Machine Learning Results
Three machine learning classifiers are used to build the machine learning model, and their respective accuracy and other metrics are presented. Only Logistic Regression is presented here; the results for Naive Bayes and Random Forest are available in the appendix.
Feature correlation and importance
The features vector plays a big role in the accuracy of the machine learning model. As described in section 2, it is of great importance to find a pair of features with high predictive power in order to build an accurate model. Even if all features are not predictive, an analysis of their respective correlation can help. Figure 13 shows the correlation between every features used in the machine learning.
Reading the correlation matrix shown in Figure 13 we can infer that four features are not very correlated with the rest of features. Bingo score, align, key, and LDA are the features that are not very correlated or that are negatively correlated. This correspond to the parameter found when searching the best number of features to select using grid search. The best number of features to select varies between 15 and 16, which exactly 19 - 4 (total number of features - bad features = best features).
Figure 9: Feature Correlation Matrix
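As an illustration, such a matrix can be computed with pandas; the placeholder data and generic column names below stand in for the 19 actual features (bingo score, align, key, LDA, and so on):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    X = np.random.default_rng(0).normal(size=(120, 19))  # placeholder feature matrix
    names = ['f%d' % i for i in range(19)]
    corr = pd.DataFrame(X, columns=names).corr()         # Pearson correlation

    plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)   # heat map of the matrix
    plt.colorbar()
    plt.xticks(range(19), names, rotation=90)
    plt.yticks(range(19), names)
    plt.show()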
Logistic Regression
The logistic regression algorithm implemented in the scikit-learn library in Python is used to implement our machine learning model.
Before using the feature vector, we applied the feature selection algorithm implemented in scikit-learn to increase the performance of the model, using the validation set.
The dataset is divided into three parts: the training part is 70% of the data, the test part is 20% and the validation part is 10%. The results are presented below.
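A sketch of this split using scikit-learn's train_test_split follows, assuming X and y hold the feature vectors and the corresponding scores; the random seed and stratification are illustrative choices:

    from sklearn.model_selection import train_test_split

    # Hold out 30% first, then split it into 20% test and 10% validation.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=1/3, random_state=42, stratify=y_rest)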
Feature Selection
The feature selection algorithm is used to find the best features to use in the model in order to increase its accuracy. The best parameter found is 15 features, as shown in Figure 10.
Figure 10: Feature Selection on Logistic Regression
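A grid search over the number of selected features can be sketched as below; the ANOVA F-score ranking (SelectKBest with f_classif) is an assumption, not necessarily the exact criterion used:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([('select', SelectKBest(f_classif)),
                     ('clf', LogisticRegression())])
    grid = GridSearchCV(pipe, {'select__k': list(range(5, 20))}, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)  # around 15 features in our runs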
Confusion Matrix
The confusion matrix of the logistic regression model is shown in Figure 11. The matrix is normalized.
Figure 11: Logistic Regression Confusion Matrix
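The normalization divides each row of the matrix by its row sum, so every row shows the distribution of predictions for one true score. A minimal sketch, assuming the fitted classifier clf and the test split from above:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)  # each row now sums to 1
    print(np.round(cm, 2))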
Numerical Results
Accuracy and cross-validation for the logistic regression model are shown in Table 3. Table 4 shows the precision, recall and f1-score of the model; support is the number of samples the numbers are based on.
Table 3: Logistic Regression Accuracy
Table 4: Logistic Regression Test Scores
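These scores can be reproduced with scikit-learn along the following lines; the number of cross-validation folds is an assumption:

    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import cross_val_score

    print(accuracy_score(y_test, y_pred))            # test accuracy (Table 3)
    print(cross_val_score(clf, X, y, cv=10).mean())  # cross-validation (Table 3)
    print(classification_report(y_test, y_pred))     # precision/recall/f1/support (Table 4)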
ROC Curve
Figure 12 depicts the ROC curve for each individual class given by the implemented logistic regression model. The area under each curve is also shown in the figure.
Figure 12: Logistic Regression ROC Curve
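Each curve is obtained one-vs-rest: the four-class labels are binarized and a curve is computed per score class. A sketch, assuming the classifier exposes predict_proba and omitting the plotting itself:

    from sklearn.preprocessing import label_binarize
    from sklearn.metrics import roc_curve, auc

    classes = [0, 1, 2, 3]
    y_bin = label_binarize(y_test, classes=classes)
    proba = clf.predict_proba(X_test)
    for i, c in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
        print(c, auc(fpr, tpr))  # area under the curve for class c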
5.2 Rule-based System Results
We present here the results of the rule-based system, whose structure is explained in section 4.3.
Confusion Matrix
Figure 13 shows the confusion matrix of the rule-based system.
Figure 13: Rule-based Confusion Matrix
Numerical Results
The accuracy, precision, recall and f1-score are shown in Table 5.
Table 5: Rule-based Test Scores
5.3 Combined Results
The dataset is divided into four parts: a labeled training set, an unlabeled training set, a test set and a validation set. Two different partitions are tested. The first partition uses 20% for training, 70% for testing and 10% for validation; the second uses 30% for training, 60% for testing and 10% for validation.
Confusion Matrix
The confusion matrices for the combined method are shown in Figure 14 and Figure 15.
Figure 14: 20% for training, 70% for testing and 10% for validation
Figure 15: 30% for training, 60% for testing and 10% for validation
Comparison
This section compares plain logistic regression with the self-learning rule-based system. Tables 6, 7 and 8 show the numerical results for the two systems under the different partitions used.
Table 6: 20% for training, 70% for testing and 10% for validation
Table 7: 30% for training, 60% for testing and 10% for validation
Table 8: 60% for training, 30% for testing and 10% for validation
6 Analysis
In this section, the results presented in section 5 are analysed, following the general structure used in the thesis so far. The machine learning system is analysed in section 6.1, the rule-based system is discussed in section 6.2, and finally the hybrid system is commented on and analysed in section 6.3.
6.1 Machine Learning Analysis
The machine learning method achieves an accuracy of 55% on the test dataset, which represents 20% of the total samples. This accuracy is obtained using feature selection and parameter optimization. Given the number of features used and implemented, the accuracy is very low. One reason may be a poor implementation of particular features. Looking at Figure 10 (the feature selection graph), the increase in accuracy as the number of features grows is very small. Figure 9 (the feature correlation matrix) shows that most features have a positive correlation with each other, which indicates that the features reinforce each other's correct predictions, but also the incorrect ones.
Another reason behind the low accuracy is how the reference answers are written. The initial set of reference answers is only a subset of all correct answers, and the space of all correct answers cannot be anticipated. With a more closed set of reference answers, it would be easier to extend them with synonyms and paraphrases.
One factor that made the scores difficult to predict is that a student answer with a score of 3 can be entirely different from another student answer that also received a score of 3. This inconsistency makes it difficult for the machine learning method to determine the class boundaries for the different scores, because the features picked are mostly related to similarity with the reference answers and how this similarity differentiates the answers. Recasting the dataset as binary answers to the question would therefore greatly improve the method.
Figure 12 (the logistic regression ROC curve) indicates that scores 0 and 3 are much easier to classify than scores 1 and 2. The reason is that answers worth scores 0 and 3 are very different from each other and therefore easier to recognize. If the initial problem were changed to a binary classification problem, it would be much easier to separate probably correct answers from probably incorrect ones with very high accuracy. A binary discriminative classifier could therefore already be implemented with high accuracy in order to reduce the number of exams that need to be corrected by a human; the only answers corrected by humans would then be those having a high probability of being correct.
Regarding Table 4 (the logistic regression test scores), we can see that the precision and recall results agree completely with the results from the ROC curve: the answers with scores 0 and 3 have higher precision and recall values. Here, precision measures how well the classifier identifies answers worth score 0 as score 0, while recall measures its ability to find all such answers rather than placing them in the wrong class.
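For concreteness, with true positives (TP), false positives (FP) and false negatives (FN) counted per class c in a one-vs-rest fashion, the two measures are

    precision_c = TP_c / (TP_c + FP_c)        recall_c = TP_c / (TP_c + FN_c)

so a high precision for score 0 means that few answers are wrongly given score 0, and a high recall means that few answers actually worth score 0 are missed.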
In summary, we can infer from the results that the margin between neighbouring score classes is very small, as the confusion matrix in Figure 11 shows. From the confusion matrix it can be observed that 76% of the answers belonging to score 0 were correctly classified as score 0, while 12% were wrongly labeled as belonging to class 1. These observations show the difficulty of correctly determining which score an answer is worth when the decision has to be made between very close classes such as 0 and 1, or 2 and 3; the same behaviour can be observed for the other neighbouring classes.
6.2 Rule-based Analysis
The rule-based system is conceived as an independent solution to the short answer scoring problem and uses no training data. Its accuracy, calculated on the whole dataset, is 45%. This is lower than the machine learning method but still quite good, given that no training data is used. The rule-based method serves as a benchmark and helps the machine learning algorithm to label new examples.
The rule-based system suffers from the same problem faced when constructing a machine learning system for short answer scoring: the difficulty of writing, beforehand, a set of acceptable reference answers or patterns against which the students' answers can be corrected.
Regarding our dataset, the set of all acceptable reference answers is not a closed set. If a thorough analysis of the different possible answers were feasible, and if the set were more closed, it would have been much easier to write the rules and patterns of the rule-based system.
6.3 Combined Analysis
Using 20% of the data for training and 70% for testing, the machine learning method outperforms the self-learning rule-based method on the test data by only 2.5%. However, when considering cross-validation, the self-learning rule-based method has a higher accuracy.
Using 30% of the data for training and 60% for testing, the machine learning method still performs better than the self-learning rule-based method, by 5.3% on the test data. There is a clear decrease in test accuracy, but the cross-validation accuracy of the self-learning rule-based approach is much higher than that of the machine learning method.
The reason behind the higher cross-validation accuracy of the self-learning rule-based method may be the data used and selected during the self-learning process. The data selected by the self-learning rule-based method are more reliable, which leads to a noise reduction in the dataset.
One reason why the accuracy does not improve much is the limitation the rule-based system puts on the self-training. The input from the rule-based system should have had a weight attached to it, which could have been examined further during testing.
We must also consider how the two methods use the training data. The self-training method uses only half of the labels in the training set. In many real-world applications, reducing the required amount of labeled data has a positive impact, as it reduces the need to annotate the data; this assumes that labeling the data is a time-consuming task.
7 Conclusion
We have combined the rule-based system with the machine learning method using a semi-supervised algorithm called self-learning.
The goal of this thesis is to reduce the amount of labeled data used for training and at the same time increase the accuracy of the system. Regarding the reduction of labeled data, we can claim that the combined method uses a smaller amount of labeled data than the machine learning method. However, there is no evidence in the results that a small amount of training data increases the accuracy of the model.
As stated above, there is no evidence that a small amount of training data increases the accuracy of the combined model. We can, however, state based on the results presented in Table 8 that the combined method performs better than the machine learning approach when a considerable amount of unlabeled training data is used.
An investigation of using a small amount of labeled data together with a large amount of unlabeled training data is needed, because labeled data are difficult to obtain given that a human corrector is required. Furthermore, the methods used here may spark interest in this subject and help further research.
7.1 Future Work
The features implemented and used in this thesis need further investigation, because the accuracy obtained is not very good and these features directly affect the machine learning method's accuracy on the test data.
There is a large amount of noise in the dataset, which is natural because the answers are written by children. A good method for cleaning the data from noise and grammatical errors may increase the accuracy of the model and should be investigated.
Furthermore, a better thesaurus of synonyms should be found in order to create paraphrases of the student answers, which may also help the rule-based system. The rule-based system can also be extended by using some training data to write better rules and to learn rules from the data.
Regarding the combination of the rule-based system with self-learning, the method has some drawbacks. Although the self-learning algorithm is easy to combine with a rule-based system, it is prone to errors because the model is trained on the same features throughout, even when the rule-based system is used. An extension of the self-learning algorithm called co-training could be used in order to build a more robust model.