The Answer Question in Question
Answering Systems
André Filipe Esteves Gonçalves
Dissertação para obtenção do Grau de Mestre em
Engenharia Informática e de Computadores
Júri
Presidente: Prof. José Manuel Nunes Salvador Tribolet
Orientador: Prof.ª Maria Luísa Torres Ribeiro Marques da Silva Coheur
Vogal: Prof. Bruno Emanuel da Graça Martins
Novembro 2011
Acknowledgements
I am extremely thankful to my supervisor, Prof. Luísa Coheur, and also to my colleague Ana Cristina Mendes (who practically became my co-supervisor), for all their support throughout this journey. I could not have written this thesis without them, truly.

I dedicate this to everyone that has inspired me and motivated me on a daily basis over the past year (or years) – i.e., my close friends, my chosen family. It was worth it. Thank you, you are my heroes.
Resumo
No mundo atual, os Sistemas de Pergunta-Resposta tornaram-se uma resposta válida para o problema da explosão de informação da Web, uma vez que conseguem efetivamente compactar a informação que estamos à procura numa única resposta. Para esse efeito, um novo Sistema de Pergunta-Resposta foi criado: o JustAsk.

Apesar de o JustAsk já estar completamente operacional e a dar respostas corretas a um certo número de questões, permanecem ainda vários problemas a endereçar. Um deles é a identificação errónea de respostas ao nível da extração e do clustering, que pode levar a que uma resposta errada seja retornada pelo sistema.

Este trabalho lida com esta problemática e é composto pelas seguintes tarefas: a) criar corpora para melhor avaliar o sistema e o seu módulo de respostas; b) apresentar os resultados iniciais do JustAsk, com especial foco no seu módulo de respostas; c) efetuar um estudo do estado da arte em relações entre respostas e identificação de paráfrases, de modo a melhorar a extração das respostas do JustAsk; d) analisar erros e detetar padrões de erros no módulo de extração de respostas do sistema; e) apresentar e implementar uma solução para os problemas detetados, o que incluirá uma normalização/integração das respostas potencialmente corretas, juntamente com a implementação de novas técnicas de "clustering"; f) avaliar o sistema com as novas adições propostas.
Abstract
In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the Web, since they effectively compact the information we are looking for into a single answer. For that effect, a new QA system called JustAsk was designed.

While JustAsk is already fully operational and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and the clustering of answers, which can lead to a wrong answer being returned by the system.

This work deals with this problem and is composed of the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers in JustAsk; d) analyze errors and detect error patterns over the answer extraction stage; e) present and implement a solution to the problems detected, which will include normalization/integration of the potential answers, as well as new clustering techniques; f) evaluate the system with the new features proposed.
Keywords
JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction
Index
1 Introduction 12
2 Related Work 14
2.1 Relating Answers 14
2.2 Paraphrase Identification 15
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
2.2.2 Methodologies purely based on lexical similarity 16
2.2.3 Canonicalization as complement to lexical features 18
2.2.4 Use of context to extract similar phrases 18
2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
2.3 Overview 22
3 JustAsk architecture and work points 24
3.1 JustAsk 24
3.1.1 Question Interpretation 24
3.1.2 Passage Retrieval 25
3.1.3 Answer Extraction 25
3.2 Experimental Setup 26
3.2.1 Evaluation measures 26
3.2.2 Multisix corpus 27
3.2.3 TREC corpus 30
3.3 Baseline 31
3.3.1 Results with Multisix Corpus 31
3.3.2 Results with TREC Corpus 33
3.4 Error Analysis over the Answer Module 34
4 Relating Answers 38
4.1 Improving the Answer Extraction module: Relations Module Implementation 38
4.2 Additional Features 41
4.2.1 Normalizers 41
4.2.2 Proximity Measure 41
4.3 Results 42
4.4 Results Analysis 42
4.5 Building an Evaluation Corpus 42
5 Clustering Answers 44
5.1 Testing Different Similarity Metrics 44
5.1.1 Clusterer Metrics 44
5.1.2 Results 46
5.1.3 Results Analysis 47
5.2 New Distance Metric – 'JFKDistance' 48
5.2.1 Description 48
5.2.2 Results 50
5.2.3 Results Analysis 50
5.3 Possible new features – Combination of distance metrics to enhance overall results 50
5.4 Posterior analysis 51
6 Conclusions and Future Work 53
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations corpora in the command line 43
List of Tables
1 JustAsk baseline best results for the corpus test of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result 51
Acronyms
TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it: after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should, after all, be the main purpose of a QA system.

This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the Web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as effective as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often, in the most documents. Since we are working with the Web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to survey the state of the art on relations between answers and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module, by finding/implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers, and the respective results; Section 5 details the work done for the clustering stage, and the results achieved; finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better way to capture relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. Then, since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – answers are consistent and entail each other, whether as notational variations ('John F Kennedy' and 'John Kennedy', or '23 October 1993' and 'October 23rd, 1993') or as paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – answers are consistent but differ in specificity, one entailing the other. Inclusions can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – answers do not entail each other at all; they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W Bush', we may need semantic information, since we may be dealing with two different individuals.
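To make the taxonomy concrete, the sketch below encodes the four relation types and a naive check for notational variance (dropped middle names or initials). The enum names and the subsequence heuristic are illustrative only, not part of JustAsk.

```python
# Sketch of the four answer-relation types (terminology from Webber et al.)
# and a naive notational-variance check; names are illustrative.
from enum import Enum

class Relation(Enum):
    EQUIVALENCE = 1    # answers entail each other
    INCLUSION = 2      # one answer entails the other (hypernymy/meronymy)
    ALTERNATIVE = 3    # complementary, non-entailing answers
    CONTRADICTORY = 4  # mutually inconsistent answers

def notational_variants(a: str, b: str) -> bool:
    """Heuristic: 'John Kennedy' vs 'John F Kennedy' -- the shorter
    token sequence is an in-order subsequence of the longer one."""
    ta, tb = a.lower().split(), b.lower().split()
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    it = iter(long_)                      # membership test consumes the iterator,
    return all(tok in it for tok in short)  # enforcing in-order matching
```

Note that this heuristic deems 'George Bush' and 'George W Bush' variants, which is precisely the kind of case the text warns may require semantic information to resolve.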
2.2 Paraphrase Identification
According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each fragment entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages, we present an overview of approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed clustering highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.
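As an illustration of the primitive-feature idea, the sketch below implements only the simplest match, word co-occurrence on stems, with a toy suffix-stripping stemmer. SimFinder's real features (noun phrases, proper nouns, WordNet synonyms, verb semantic classes) and its learned combination are not reproduced here.

```python
# Illustrative sketch of a SimFinder-style primitive feature:
# word co-occurrence computed over crude stems.
def crude_stem(word: str) -> str:
    """Toy stemmer: strips a few common suffixes (an assumption, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def primitive_matches(u1: str, u2: str) -> set:
    """Words from the two text units that match on their stem."""
    s1 = {crude_stem(w) for w in u1.lower().split()}
    s2 = {crude_stem(w) for w in u2.lower().split()}
    return s1 & s2
```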
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many n-grams two sentences share (usually with N ≤ 4), i.e., how many common word sequences we can find between the two sentences. The following formula gives the measure of the simple word N-gram overlap:
NGsim(S1, S2) = (1/N) · Σ_{n=1..N} ( Count_match(n-gram) / Count(n-gram) )   (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
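A direct reading of Equation (1) can be sketched as follows; whitespace tokenization and lowercasing are assumptions, and Barzilay and Lee's actual setup differs in detail.

```python
# Simple word n-gram overlap, averaged over n = 1..N, with the
# denominator taken from the shorter sentence, as in Equation (1).
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1: str, s2: str, N: int = 4) -> float:
    t1, t2 = s1.lower().split(), s2.lower().split()
    short, long_ = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total = 0.0
    for n in range(1, N + 1):
        short_grams = ngrams(short, n)
        if not short_grams:          # shorter sentence has no n-grams of size n
            continue
        long_grams = set(ngrams(long_, n))
        matches = sum(1 for g in short_grams if g in long_grams)
        total += matches / len(short_grams)
    return total / N
```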
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining the precision and recall metrics described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed extracting richer, but also less parallel, data.
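The Levenshtein metric itself is the textbook dynamic program; a compact sketch (over characters here, though it applies to word tokens the same way):

```python
# Dynamic-programming Levenshtein distance, keeping only the previous
# row of the edit-distance table.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

For example, levenshtein("kitten", "sitting") is 3: substitute k→s, substitute e→i, insert g.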
In 2007, Cordeiro and his team [5] analyzed the previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics are not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk currently possesses, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ/x)   (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:
logSim(S1, S2) = L(x, λ),          if L(x, λ) < 1.0
                 e^(−k·L(x, λ)),   otherwise   (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and, therefore, L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure; after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu corpus3, and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two original corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
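Under the description above, the basic LogSim can be sketched as follows; counting links as a plain intersection of word sets, and the value k = 2.7, are assumptions about details the text leaves open.

```python
# Sketch of Cordeiro's LogSim (Equations 2-3): lam counts exclusive
# word links between the sentences, x is the length of the longer one.
import math

def logsim(s1: str, s2: str, k: float = 2.7) -> float:
    w1, w2 = s1.lower().split(), s2.lower().split()
    lam = len(set(w1) & set(w2))      # exclusive lexical links
    x = max(len(w1), len(w2))
    if lam == 0:
        return 0.0                    # no links at all
    l = -math.log2(lam / x)           # Equation (2)
    return l if l < 1.0 else math.exp(-k * l)  # Equation (3)
```

Note that identical sentences score 0 under this formula: the first branch deliberately keeps near-duplicate pairs low, matching the 'slow penalty for pairs that are too similar' described above.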
3 This corpus, used by Kevin Knight and Daniel Marcu, was produced manually from pairs of texts and respective summaries; it contains 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kinds of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
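A minimal sketch of the canonicalization idea, covering only number words and a couple of future-tense constructions; the patterns and mappings are illustrative, whereas Zhang and Patrick's system relies on dependency trees and a much richer rule set.

```python
# Toy canonicalizer in the spirit of Zhang and Patrick: map future-tense
# constructions to 'will' and number words to digits before matching.
import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def canonicalize(sentence: str) -> str:
    s = sentence.lower()
    # future-tense constructions -> 'will' (illustrative pattern list)
    s = re.sub(r"\b(plans to|expects to|is going to)\b", "will", s)
    # number words -> digits
    words = [NUMBER_WORDS.get(w, w) for w in s.split()]
    return " ".join(words)
```

After canonicalization, two variants such as 'The company plans to hire five workers' and 'The company will hire 5 workers' become lexically identical, so plain edit-distance or n-gram features can catch them.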
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) − mismatch_penalty   (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic-programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max of:
    s(S1, S2−1) − skip_penalty
    s(S1−1, S2) − skip_penalty
    s(S1−1, S2−1) + sim(S1, S2)
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1)
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2)
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)   (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
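The recurrence of Equation (5) translates directly into a dynamic program over two lists of sentences; sim is any local similarity (e.g. Equation (4)), and the default skip-penalty value is an assumption.

```python
# Dynamic-programming alignment score following Equation (5): para1 and
# para2 are lists of sentences, sim(s, t) is their local similarity.
def align_score(para1, para2, sim, skip=0.001):
    n, m = len(para1), len(para2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - skip,                     # skip in para2
                     s[i - 1][j] - skip,                     # skip in para1
                     s[i - 1][j - 1] + sim(para1[i - 1], para2[j - 1])]
            if j >= 2:   # one sentence of para1 vs two of para2
                cands.append(s[i - 1][j - 2] + sim(para1[i - 1], para2[j - 1])
                             + sim(para1[i - 1], para2[j - 2]))
            if i >= 2:   # two sentences of para1 vs one of para2
                cands.append(s[i - 2][j - 1] + sim(para1[i - 1], para2[j - 1])
                             + sim(para1[i - 2], para2[j - 1]))
            if i >= 2 and j >= 2:   # crossing two-by-two match
                cands.append(s[i - 2][j - 2] + sim(para1[i - 1], para2[j - 2])
                             + sim(para1[i - 2], para2[j - 1]))
            s[i][j] = max(cands)
    return s[n][m]
```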
The results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, on two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether the two sentences can be deemed paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents containing the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2)·idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1)·idf(w) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that matches it best in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these weighted matches, and normalize by the summed IDF of the segment in question. The same process is applied to T2 and, in the end, we take the average of the two directions, T1 to T2 and vice versa. The authors offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations (if we go up) between concepts.
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
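Equation (6) is straightforward to sketch once maxSim and idf are treated as pluggable functions; any of the eight word-to-word metrics above could be dropped in as max_sim, and the exact-match stand-in used below is purely illustrative.

```python
# Sketch of Equation (6): idf-weighted, directional word matching,
# averaged over both directions. max_sim(word, segment) and idf(word)
# are supplied by the caller.
def text_similarity(t1, t2, max_sim, idf):
    def directed(src, dst):
        num = sum(max_sim(w, dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

With a toy exact-match max_sim and a constant idf, two identical segments score 1.0 and segments sharing half their words score 0.5, as expected from the formula.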
Kozareva, Vázquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover, for instance, a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'). They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and to verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach, based on similarity information derived from WordNet and using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we have, therefore, two vectors a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| |b|)   (7)

with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each W_ij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned lch, Lesk, wup, Resnik, Lin and jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three approaches.
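A sketch of Equation (7), with the WordNet-derived matrix W replaced by a plain list-of-lists supplied by the caller (the identity matrix in the test reduces it to an ordinary cosine over binary vectors):

```python
# Sketch of Equation (7): binary word vectors a, b and a word-similarity
# matrix W with entries in [0, 1].
import math

def matrix_similarity(a, b, W):
    num = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / norm if norm else 0.0
```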
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence, a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
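The decision rule of formula (8) can be sketched as follows (the threshold value here is purely illustrative; in the original approach T is obtained by optimization on training data):

```python
def paraphrase_score(sim, diss):
    """Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)  (formula 8)."""
    return sim / diss

def is_paraphrase(sim, diss, threshold=1.5):
    """Sentences are deemed paraphrases when the score exceeds T."""
    return paraphrase_score(sim, diss) > threshold

print(is_paraphrase(6.0, 2.0))  # True: score 3.0 is above the threshold
print(is_paraphrase(2.0, 4.0))  # False: score 0.5 is below it
```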
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). To these were attached several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs aninterpreted question In this stage the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
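A minimal sketch of the keyword-based strategy described above (the stopword list is an illustrative assumption, not JustAsk's actual one):

```python
import re

# Hypothetical stopword list; a real formulator would use a curated one.
STOPWORDS = {'what', 'who', 'when', 'where', 'which', 'how', 'is', 'was',
             'the', 'a', 'an', 'of', 'did', 'does', 'do'}

def keyword_query(question):
    """Keyword-based strategy: the query contains all the content words
    of the given question."""
    tokens = re.findall(r"\w+", question.lower())
    return ' '.join(t for t in tokens if t not in STOPWORDS)

print(keyword_query('When was Mozart born?'))  # mozart born
```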
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, the latter with normalizing, clustering and filtering the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation between answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
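Such a date normalizer can be sketched as follows (the month table and the token heuristics are assumptions for illustration; JustAsk's actual normalizer may differ):

```python
import re

MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5,
          'june': 6, 'july': 7, 'august': 8, 'september': 9,
          'october': 10, 'november': 11, 'december': 12}

def normalize_date(answer):
    """Convert 'March 21, 1961' or '21 March 1961' to 'D21 M3 Y1961'."""
    day = month = year = None
    for t in re.findall(r"\w+", answer.lower()):
        if t in MONTHS:
            month = MONTHS[t]
        elif t.isdigit() and len(t) == 4:
            year = int(t)
        elif t.isdigit():
            day = int(t)
    if None in (day, month, year):
        return answer  # leave unrecognized patterns untouched
    return f'D{day} M{month} Y{year}'

print(normalize_date('March 21, 1961'))   # D21 M3 Y1961
print(normalize_date('21 March 1961'))    # D21 M3 Y1961
```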
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer) and a score, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = (2 x precision x recall) / (precision + recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) x sum_{i=1}^{N} (1 / rank_i)   (14)
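The MRR computation of formula (14) can be sketched in a few lines (queries for which no correct answer is returned contribute 0, as is standard for MRR):

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum(1 / rank_i), where rank_i is the position of the
    first correct answer for query i (None if no correct answer appears)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Four queries: correct answer at ranks 1, 2, never, and 4.
print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```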
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]: a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content': our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question, creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs over the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest city in Somalia and the nation's capital...' and we insert 'Mogadishu', the system recognizes it as a true reference, since it is included in the content section, and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, so we could write any word from the content that has no relation whatsoever to the actual answer; but it helps one find and write down correct answers, compared with the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora on the command line
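The core acceptance rule of the application can be sketched as follows (function and variable names are illustrative; the real script reads snippets from the corpus file and the candidate reference from the command line):

```python
def accept_reference(candidate, snippet_content, references):
    """A typed reference is accepted only if it literally occurs in the
    snippet content -- which is also why any substring of the content,
    related to the answer or not, would pass the check."""
    if candidate and candidate in snippet_content:
        references.append(candidate)
        return True
    return False

refs = []
content = "Mogadishu is the largest city in Somalia and the nation's capital."
accept_reference('Mogadishu', content, refs)  # accepted: found in content
accept_reference('Nairobi', content, refs)    # rejected: not in content
print(refs)  # ['Mogadishu']
```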
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions): answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, when the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references, i.e., pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his', plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
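The pronoun filter can be sketched as follows (the pronoun list mirrors the one above; the dead-end questions were removed by hand):

```python
import re

PRONOUNS = {'he', 'she', 'it', 'her', 'him', 'hers', 'his'}

def has_unresolved_reference(question):
    """Flag questions whose only referent lives in a previous question."""
    tokens = set(re.findall(r"\w+", question.lower()))
    return bool(tokens & PRONOUNS)

questions = ['When did she die?', 'Who wrote Hamlet?']
print([q for q in questions if not has_unresolved_reference(q)])
# ['Who wrote Hamlet?']
```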
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, being composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, in several variations, as they do in Google.
We also wondered whether the big problem lay within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages (8, 32 or 64) retrieved from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
                   Google - 8 passages           Google - 32 passages
Extracted          Questions  CAE      AS        Questions  CAE      AS
Answers                       success  success              success  success
0                  21         0        -         19         0        -
1 to 10            73         62       53        16         11       11
11 to 20           32         29       22        13         9        7
21 to 40           11         10       8         61         57       39
41 to 60           1          1        1         30         25       17
61 to 100          0          -        -         18         14       11
> 100              0          -        -         5          4        3
All                138        102      84        162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                   Google - 64 passages
Extracted          Questions  CAE      AS
Answers                       success  success
0                  17         0        0
1 to 10            18         9        9
11 to 20           2          2        2
21 to 40           16         12       8
41 to 60           42         38       25
61 to 100          38         32       18
> 100              35         30       17
All                168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
           Google - 8 passages   Google - 32 passages   Google - 64 passages
     Total   Q    CAE   AS         Q    CAE   AS          Q    CAE   AS
abb    4     3    1     1          3    1     1           4    1     1
des    9     6    4     4          6    4     4           6    4     4
ent    21    15   1     1          16   1     0           17   1     0
hum    64    46   42    34         55   48    36          56   48    31
loc    39    27   24    19         34   27    22          35   29    21
num    63    41   30    25         48   35    25          50   40    22
All    200   138  102   84         162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category, such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results.
These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus
Later, we tried further experiments with this corpus, increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, they also take more than three times as long to run as 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these options is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module: mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted: for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:
- 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
- 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls | Okavango River | Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns: several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
- 'Queen Elizabeth' vs. 'Elizabeth II'
- 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
- 'River Glomma' vs. 'Glomma'
- 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
- 'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words in each answer: LogSim gives us approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
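The proposed token-level metric can be sketched as follows (a minimal sketch; the thresholding for imperfect alignments is left open, as discussed above):

```python
def same_entity(a, b):
    """Token-level match sketch for person names: requires at least one
    shared token, the same token count, and matching initials (in order)
    for the remaining tokens, e.g. 'J. F. Kennedy' vs 'John Fitzgerald
    Kennedy'."""
    ta = a.replace('.', '').split()
    tb = b.replace('.', '').split()
    shared = {t.lower() for t in ta} & {t.lower() for t in tb}
    if len(ta) != len(tb) or not shared:
        return False
    return all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))

print(same_entity('J. F. Kennedy', 'John Fitzgerald Kennedy'))  # True
print(same_entity('J. F. Kennedy', 'Robert F. Kennedy'))        # False
```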
3. Places/dates/numbers that differ in specificity: some candidates over the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
- 'Mt Kenya' vs. 'Mount Kenya National Park'
- 'Bavaria' vs. 'Oberbayern, Upper Bavaria'
- '2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (and being the more specific answer); but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues: many entities have completely different names for the same entity; we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs. 'Glama' example.
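A standard Soundex implementation illustrates why it helps with this pair (this is the classic algorithm, not a JustAsk component):

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits from consonant
    classes; adjacent duplicates collapse, h/w are transparent."""
    codes = {'bfpv': '1', 'cgjkqsxz': '2', 'dt': '3',
             'l': '4', 'mn': '5', 'r': '6'}

    def code(c):
        for group, digit in codes.items():
            if c in group:
                return digit
        return ''

    word = word.lower()
    out, prev = word[0].upper(), code(word[0])
    for c in word[1:]:
        d = code(c)
        if d and d != prev:
            out += d
        if c not in 'hw':  # h and w do not reset the previous code
            prev = d
    return (out + '000')[:4]

print(soundex('Glomma'), soundex('Glama'))  # G450 G450 -- same code
```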
5. Different measures/conversion issues: such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
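Such a converter can be sketched as follows (the units and factors shown are illustrative; a real normalizer would also cover weight and other measures):

```python
import re

# Conversion factors to meters for a few distance units (illustrative).
UNIT_TO_METERS = {'km': 1000.0, 'kilometers': 1000.0, 'miles': 1609.344,
                  'mile': 1609.344, 'feet': 0.3048, 'm': 1.0}

def normalize_distance(answer):
    """Convert a '<number> <unit>' distance answer to meters,
    or return None if the pattern or unit is not recognized."""
    m = re.match(r'([\d.]+)\s*([a-zA-Z]+)', answer)
    if not m or m.group(2).lower() not in UNIT_TO_METERS:
        return None
    return float(m.group(1)) * UNIT_TO_METERS[m.group(2).lower()]

print(normalize_distance('300 km'))      # 300000.0
print(normalize_distance('1000 miles'))  # 1609344.0
```

With both '300 km' and '1000 miles' mapped to meters, the two Okavango candidates from point 1 become directly comparable.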
6. Misspellings/formatting issues: many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
lsquoKotashaanrsquo vs lsquoKhotashanrsquo
lsquoGroznyrsquo and lsquoGROZNYrsquo
Solution: the formatting issue can be handled by simply normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend abbreviation normalization to cover acronyms as well.
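The converter proposed as the solution to point 5 could be sketched as follows. This is only an illustration, assuming canonical target units of miles, feet, pounds and Fahrenheit; the conversion factors are the standard ones, and the regular expression is a simplification of what a real normalizer would need.

```python
import re

# Hypothetical unit pattern; the real normalizer in JustAsk is broader.
UNIT_RE = re.compile(
    r'([\d.]+)\s*(kilometers?|kilometres?|kilograms?|km|kg|'
    r'meters?|metres?|m|celsius|c)\b', re.IGNORECASE)

def normalize_measure(answer):
    """Convert km to miles, kg to pounds, meters to feet and Celsius
    to Fahrenheit, so equivalent answers can share a cluster."""
    match = UNIT_RE.match(answer.strip())
    if not match:
        return answer  # not a recognized measure: leave untouched
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit.startswith(('km', 'kilomet')):
        return '%.2f miles' % (value * 0.621371)
    if unit.startswith(('kg', 'kilogram')):
        return '%.2f pounds' % (value * 2.20462)
    if unit.startswith('m'):
        return '%.2f feet' % (value * 3.28084)
    return '%.2f fahrenheit' % (value * 9.0 / 5.0 + 32)
```

With such a normalizer, '100 km' and '62.14 miles' collapse to the same string before clustering.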
Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over those of inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis: a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
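As an illustration only (the actual clauses are those of [26] and are category-specific), one such inclusion clause could be expressed like this; the tokenization and the relation labels are assumptions of this sketch:

```python
def tokens(answer):
    """Lower-cased word tokens of a candidate answer."""
    return set(answer.lower().replace(',', ' ').split())

def inclusion_relation(a, b):
    """Toy clause: if every token of one answer occurs in the other,
    the former is assumed to be included in the latter. The real rules
    restrict this per category (Location, Numeric, ...)."""
    ta, tb = tokens(a), tokens(b)
    if ta < tb:
        return 'A_INCLUDED_IN_B'
    if tb < ta:
        return 'B_INCLUDED_IN_A'
    return None
```

Under this toy clause, '2009' is included in 'April 3, 2009', while 'Ferrara' vs 'Italy' would still need gazetteer knowledge to be related.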
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
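The first-experiment weighting can be sketched as a simple additive boost over the base score; the label names are illustrative, not the actual identifiers used in the Scorer interface:

```python
# Weights from the first experiment of Section 4.1:
# 1.0 for frequency and equivalence, 0.5 per inclusion relation.
WEIGHTS = {'FREQUENCY': 1.0, 'EQUIVALENCE': 1.0,
           'INCLUDES': 0.5, 'INCLUDED_BY': 0.5}

def boosted_score(base_score, relations):
    """relations: list of relation-type labels attached to one answer."""
    return base_score + sum(WEIGHTS[r] for r in relations)
```

For example, an answer with one equivalence and one inclusion gains 1.5 points over its base score.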
Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious observation that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it allows room for further extensions.
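At its core, such an acronym normalizer is a lookup table consulted before clustering; the entries below are illustrative only, and the real normalizer would also gate the expansion on the question category:

```python
# Illustrative entries; the real table and the category gating
# (Location:Country, Human:Group) live in the JustAsk normalizer.
ACRONYMS = {
    'US': 'United States',
    'U.S.': 'United States',
    'UN': 'United Nations',
}

def normalize_acronym(answer):
    """Expand a known acronym; leave any other answer untouched."""
    return ACRONYMS.get(answer.strip(), answer)
```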
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords within the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
4.3 Results
                   200 questions              1210 questions
Before features    P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular.
The variation in results is minimal: only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', a new application was made in order to create this corpus containing the relationships between the answers more efficiently. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create the relations' corpus in the command line.
Over the next chapter we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library): 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters occurring in either of them.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient will basically give double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five shared bigrams out of seven distinct ones).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i−1, j−1) + c(i, j), D(i−1, j) + G, D(i, j−1) + G }    (19)

where the gap cost G penalizes insertions and deletions, and the substitution cost c(i, j) is 0 if the characters match and G otherwise.
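The metrics above can be sketched as follows. These are illustrative re-implementations only (the thesis uses the SimMetrics library versions); Jaccard and Dice are given here in their set-of-bigrams form, as in the 'Kennedy' example:

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):            # count matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                           # count transpositions
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    l = 0                                 # common prefix, capped at 4
    for a, b in zip(s1, s2):
        if a != b or l == 4:
            break
        l += 1
    return dj + l * p * (1 - dj)

def needleman_wunsch(s1, s2, G=1.0):
    """Edit distance with configurable cost G; G = 1 reduces
    to Levenshtein."""
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * G
    for j in range(1, m + 1):
        D[0][j] = j * G
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + G, D[i][j - 1] + G)
    return D[n][m]
```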
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
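A sketch of the classic Soundex coding follows (the SimMetrics implementation is the one actually used). Note that this standard coding does assign the same code to the misspelling pairs mentioned in Section 3.4, such as 'Glomma'/'Glama' and 'Kotashaan'/'Khotashan':

```python
def soundex(word):
    """Classic (American) Soundex: first letter plus three digits."""
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
             's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6'}
    word = ''.join(c for c in word.lower() if c.isalpha())
    if not word:
        return ''
    digits = []
    prev = codes.get(word[0], '')
    for c in word[1:]:
        d = codes.get(c, '')
        if d and d != prev:
            digits.append(d)
        if c not in 'hw':      # h and w do not reset the previous code
            prev = d
    return (word[0].upper() + ''.join(digits) + '000')[:4]
```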
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, namely 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
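The single-link grouping used in these experiments can be sketched as follows; this is a minimal illustration (the actual JustAsk clusterer is part of the system's Java code), parameterized by the distance function and the threshold being tested:

```python
def single_link(answers, distance, threshold):
    """Greedy single-link clustering: merge two clusters while the
    closest pair of their members is within the distance threshold."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if min(distance(a, b)
                       for a in clusters[i]
                       for b in clusters[j]) <= threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Plugging in any of the metrics above as `distance` reproduces the setup of this section.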
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best system results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.
Metric / Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1 R 87.0 A 44.5     P 43.7 R 87.5 A 38.5     P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0     P 33.0 R 87.0 A 33.0     P 36.4 R 86.5 A 31.5
Jaccard index        P  9.8 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5     P  9.9 R 55.5 A  5.5
Dice coefficient     P  9.8 R 56.0 A  5.5     P  9.9 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5
Jaro                 P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Jaro-Winkler         P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Needleman-Wunch      P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0     P 35.9 R 83.5 A 30.0     P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0     P 13.4 R 59.5 A  8.0     P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally giving extra weight to strings that not only share at least one perfect match, but also have other tokens sharing their first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization (their lengths measured in tokens), and nonCommonTerms being the number of terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
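A sketch of JFKDistance following Equations 20-22; the tokenization and the pairing of first-letter matches are one possible reading of the definition, not the exact JustAsk implementation:

```python
def jfk_distance(a1, a2):
    """Distance of Section 5.2: common tokens plus a half-weighted
    bonus for non-common tokens sharing a first letter."""
    t1 = a1.replace('.', ' ').split()
    t2 = a2.replace('.', ' ').split()
    common = {w.lower() for w in t1} & {w.lower() for w in t2}
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [w for w in t1 if w.lower() not in common]
    rest2 = [w for w in t2 if w.lower() not in common]
    non_common = len(rest1) + len(rest2)
    first_letters = sum(
        1 for w in rest1
        if any(w[0].lower() == v[0].lower() for v in rest2))
    first_letter_score = (
        (first_letters / non_common) / 2 if non_common else 0.0)
    return 1 - common_score - first_letter_score
```

For 'J.F. Kennedy' vs 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 0.25 ≈ 0.42, low enough to allow clustering where Overlap (2/3) would not.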
A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we applied this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions              1210 questions
Before              P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy between this metric and features already present in the rules and in certain normalizers, the new metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gave us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
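The dispatch above can be sketched as a simple lookup table; the metric functions here are stand-ins for the actual implementations of Section 5.1.1:

```python
def levenshtein(a, b):
    return 0  # stand-in for the real metric

def logsim(a, b):
    return 0  # stand-in for the real metric

# Hypothetical category-to-metric mapping, as in the pseudocode above.
METRIC_BY_CATEGORY = {
    'Human': levenshtein,
    'Location': logsim,
}

def pick_metric(category, default=levenshtein):
    """Choose the clustering metric for a question category."""
    return METRIC_BY_CATEGORY.get(category, default)
```

As discussed below, the per-category study ended up not justifying such a table, since the new metric matched or beat the others in every category.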
For that effort we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), Table 10 shows the results for every metric incorporated, over the three main categories that can be easily evaluated (Human, Location and Numeric); categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input.
Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunch      42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written out as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules removing these variations from the final answers.
– Answers in formats different than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to compile a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily corrected error detected lies in the reference file that lists all the possible 'correct' answers, which was potentially missing viable references. After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art study on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
53
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix

Appendix A – Rules for detecting relations between answers
Abstract
In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the web, since they effectively compact the information we are looking for into a single answer. To that effect, a new QA system called JustAsk was designed.

While JustAsk is already fully operable and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and clustering of answers, which can lead to a wrong answer being returned by the system.

This work deals with this problem and is composed of the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers in JustAsk; d) analyze errors and detect error patterns over the answer extraction stage; e) present and implement a solution to the problems detected, which will include normalization/integration of the potential answers, as well as the implementation of new clustering techniques; f) evaluate the system with the new features proposed.
Keywords
JustAsk; Question Answering; Relations between Answers; Paraphrase Identification; Clustering; Normalization; Corpus; Evaluation; Answer Extraction
Index
1 Introduction
2 Related Work
  2.1 Relating Answers
  2.2 Paraphrase Identification
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
    2.2.2 Methodologies purely based on lexical similarity
    2.2.3 Canonicalization as complement to lexical features
    2.2.4 Use of context to extract similar phrases
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures
  2.3 Overview
3 JustAsk architecture and work points
  3.1 JustAsk
    3.1.1 Question Interpretation
    3.1.2 Passage Retrieval
    3.1.3 Answer Extraction
  3.2 Experimental Setup
    3.2.1 Evaluation measures
    3.2.2 Multisix corpus
    3.2.3 TREC corpus
  3.3 Baseline
    3.3.1 Results with Multisix Corpus
    3.3.2 Results with TREC Corpus
  3.4 Error Analysis over the Answer Module
4 Relating Answers
  4.1 Improving the Answer Extraction module: Relations Module Implementation
  4.2 Additional Features
    4.2.1 Normalizers
    4.2.2 Proximity Measure
  4.3 Results
  4.4 Results Analysis
  4.5 Building an Evaluation Corpus
5 Clustering Answers
  5.1 Testing Different Similarity Metrics
    5.1.1 Clusterer Metrics
    5.1.2 Results
    5.1.3 Results Analysis
  5.2 New Distance Metric – 'JFKDistance'
    5.2.1 Description
    5.2.2 Results
    5.2.3 Results Analysis
  5.3 Possible new features – Combination of distance metrics to enhance overall results
  5.4 Posterior analysis
6 Conclusions and Future Work
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system
2 How the software to create the 'answers corpus' works
3 Execution of the application to create answers corpora in the command line
4 Proposed approach to automate the detection of relations between answers
5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
6 Execution of the application to create relations corpora in the command line
List of Tables
1 JustAsk baseline best results for the corpus test of 200 questions
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it; after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
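To make the redundancy principle concrete, it can be sketched in a few lines of Python. This is an illustrative sketch, not JustAsk's actual code; the function name and the assumption that lowercasing is the only normalization applied are both hypothetical:

```python
from collections import Counter

def select_answer(candidates, normalize=str.lower):
    """Redundancy-based selection: normalize the candidate strings and
    return the normalized variant that occurs most often."""
    counts = Counter(normalize(c) for c in candidates)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three surface variants of the same answer outweigh a single distinct one.
candidates = ["John F. Kennedy", "john f. kennedy", "JOHN F. KENNEDY", "Lyndon Johnson"]
```

Without a way of relating the variants (here, a trivial lowercasing), the frequency count would be split among them, which is precisely the problem addressed in this work.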
In summary, the goals of this work are:

– to survey the state of the art on relations between answers and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure, the JustAsk system, and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering
stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent but differ in specificity, one entailing the other. They can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering in JustAsk. For some answers, like
'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com¹, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each fragment entails the other; pairs with a potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus², from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages, we will present an overview of some approaches and techniques for paraphrase identification related to natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, which allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem, the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type, for instance matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many shared sequences of words we can find between two sentences. The following formula gives the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) × Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
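A direct implementation of formula (1) could look like the following sketch, assuming plain whitespace tokenization (function names are ours, not from the cited work):

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1, s2, max_n=4):
    """Word Simple N-gram Overlap (formula 1): average, over n = 1..max_n,
    of the fraction of the shorter sentence's n-grams also found in the
    longer sentence."""
    t1, t2 = s1.split(), s2.split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total = 0.0
    for n in range(1, max_n + 1):
        grams = ngrams(shorter, n)
        if not grams:          # shorter sentence has fewer than n words
            continue
        pool = set(ngrams(longer, n))
        total += sum(1 for g in grams if g in pool) / len(grams)
    return total / max_n
```

Identical sentences of at least max_n words score 1.0, and sentences with no words in common score 0.0.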
A year later, Dolan and his team [7] used the Levenshtein Distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of editions needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining the precision and recall metrics described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
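The Levenshtein metric mentioned above can be sketched in a few lines of Python (a standard dynamic-programming formulation, shown here for reference):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

For example, transforming 'kitten' into 'sitting' requires three edits.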
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein Distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new category, called LogSim, was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = -log2(λ/x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for pairs of sentences that are too similar). The complete formula is written below:

logSim(S1, S2) = L(x, λ),          if L(x, λ) < 1.0
                 e^(-k·L(x, λ)),   otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric that depends only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure; after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1-α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus³, and extra negative pairs (of non-paraphrases), added to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
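Formulas (2) and (3) can be sketched as follows, under the simplifying assumption that a 'link' is a word type shared by both sentences (the original work counts exclusive lexical links; k is set to 2.7 as in their experiments):

```python
import math

def logsim(s1, s2, k=2.7):
    """LogSim sketch (formulas 2 and 3): lam = number of lexical links,
    x = length of the longest sentence.  Near-duplicate pairs (L close
    to 0) and very dissimilar pairs (large L, second branch) both score
    low, by design."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    lam = len(set(t1) & set(t2))
    if lam == 0:
        return 0.0
    x = max(len(t1), len(t2))
    l = -math.log2(lam / x)
    return l if l < 1.0 else math.exp(-k * l)
```

Note how the metric deliberately does not peak at identical sentences: when λ/x = 1, L(x, λ) = 0, reflecting the slow penalty on pairs that are too similar.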
2.2.3 Canonicalization as complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level ('generic' versions of the original text), and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety, which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving Numeric Entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
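As an illustration of what canonicalizing a number entity can look like, here is a hypothetical sketch (not Zhang and Patrick's system, nor JustAsk's normalizer) that maps two surface variants of a date to one canonical form before lexical matching:

```python
import re

# Month-name table for the sketch (full English names only).
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4,
          "may": 5, "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def canonicalize_date(text):
    """Map 'October 24th, 1993' or '24 October 1993' to '1993-10-24';
    return the text unchanged when no date pattern is recognized."""
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)\s+(\d{4})", text)
    if m:
        day, month, year = m.group(1), m.group(2), m.group(3)
    else:
        m = re.search(r"([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})", text)
        if not m:
            return text
        month, day, year = m.group(1), m.group(2), m.group(3)
    mon = MONTHS.get(month.lower())
    if mon is None:
        return text
    return "%s-%02d-%02d" % (year, mon, int(day))
```

After this step, an exact string comparison is enough to recognize the two date variants as equivalent.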
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) - mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence (dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems):

s(S1, S2) = max { s(S1, S2-1) - skip_penalty,
                  s(S1-1, S2) - skip_penalty,
                  s(S1-1, S2-1) + sim(S1, S2),
                  s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
                  s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
                  s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2) }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
Results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections, from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
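Formulas (4) and (5) translate into the following dynamic-programming sketch over two tokenized paragraphs. The penalty values are illustrative only (the cited work does not fix them here), and the function names are ours:

```python
import math
from collections import Counter

def cosine(t1, t2):
    """Bag-of-words cosine similarity between two token lists."""
    c1, c2 = Counter(t1), Counter(t2)
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def align(para1, para2, mismatch_penalty=0.1, skip_penalty=0.05):
    """Weight of the optimal alignment between two paragraphs (lists of
    tokenized sentences), following formulas (4) and (5): cell (i, j)
    holds the best score over the first i and first j sentences."""
    def sim(a, b):                          # formula (4)
        return cosine(a, b) - mismatch_penalty
    n, m = len(para1), len(para2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - skip_penalty,      # skip a sentence of para2
                     s[i - 1][j] - skip_penalty,      # skip a sentence of para1
                     s[i - 1][j - 1] + sim(para1[i - 1], para2[j - 1])]
            if j >= 2:                      # one sentence matches two
                cands.append(s[i - 1][j - 2] + sim(para1[i - 1], para2[j - 1])
                             + sim(para1[i - 1], para2[j - 2]))
            if i >= 2:                      # two sentences match one
                cands.append(s[i - 2][j - 1] + sim(para1[i - 1], para2[j - 1])
                             + sim(para1[i - 2], para2[j - 1]))
            if i >= 2 and j >= 2:           # crossed two-to-two match
                cands.append(s[i - 2][j - 2] + sim(para1[i - 1], para2[j - 2])
                             + sim(para1[i - 2], para2[j - 1]))
            s[i][j] = max(cands)
    return s[n][m]
```

The three extra branches allow one-to-two and crossed matches, so a sentence split across two sentences in the other paragraph can still be aligned.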
2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, thus determining whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus⁴, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = (1/2) × ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these weighted matches, and normalize the result with the length of the text segment in question. The same process is applied starting from T2, and in the end we take the average of the two directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
ndash Pointwise Mutual Information (PMI-IR) ndash first suggested by Turney [42]as a measure of semantic similarity of words is based on the word co-occurrence using counts collected over very large corpora Given twowords their PMI-IR score indicates a degree of dependence between them
ndash Latent Semantic Analysis (LSA) ndash uses a technique in algebra called Sin-gular Value Decomposition (SVD) which in this case decomposes theterm-document matrix into three matrixes (reducing the linear systeminto multidimensional vectors of semantic space) A term-document ma-trix is composed of passages as rows and words as columns with the cellscontaining the number of times a certain word was used in a certain pas-sage
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e. the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.
– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
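The word-to-text matching at the core of equation (6) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `exact_maxsim` is a toy stand-in for the eight corpus- and knowledge-based maxSim metrics listed above, and the helper names are ours.

```python
import math

def idf(word, corpus):
    # IDF as described above: total number of documents over the
    # number of documents that contain the word (log-scaled).
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def sim(t1, t2, corpus, maxsim):
    # Equation (6): best match for each word, weighted by IDF,
    # normalized per segment, averaged over both directions.
    def directed(a, b):
        num = sum(maxsim(w, b) * idf(w, corpus) for w in a)
        den = sum(idf(w, corpus) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

def exact_maxsim(word, segment):
    # Toy stand-in for maxSim: 1 on an exact match, 0 otherwise.
    return 1.0 if word in segment else 0.0
```

Plugging in any of the eight metrics amounts to swapping `exact_maxsim` for a function that returns the best word-to-word similarity score of a word against a segment.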
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e. methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.
– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with an element equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and their similarity is calculated using the following formula:

sim(~a, ~b) = (~a · W · ~b) / (|~a| |~b|)   (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each entry Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned lch, Lesk, wup, Resnik, Lin and jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five also with a better precision rating than those three.
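Equation (7) can be sketched as below; this is an illustration in pure Python, with a pluggable word-to-word similarity in place of the six WordNet metrics (the identity stand-in reduces it to a binary cosine), and all names are ours.

```python
import math

def matrix_similarity(s1_words, s2_words, word_sim):
    # Map both sentences to binary vectors over their joint vocabulary,
    # then compute a . W . b normalized by the vector magnitudes, where
    # W[i][j] = word_sim(vocab[i], vocab[j]).
    vocab = sorted(set(s1_words) | set(s2_words))
    a = [1 if w in s1_words else 0 for w in vocab]
    b = [1 if w in s2_words else 0 for w in vocab]
    awb = sum(a[i] * word_sim(vocab[i], vocab[j]) * b[j]
              for i in range(len(vocab)) for j in range(len(vocab)))
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return awb / norm if norm else 0.0

def identity_sim(w1, w2):
    # Toy stand-in for the WordNet word-pair similarities.
    return 1.0 if w1 == w2 else 0.0
```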
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exact synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information.

To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
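The decision rule of equation (8) can be sketched as follows. Only the ratio, the threshold test and the constant added for a large unpaired-dependency gap come from the description above; the function names, the illustrative threshold value and the way the constant is injected are our assumptions.

```python
def is_paraphrase(sim_score, diss_score, threshold=1.0, extra_dependencies=0):
    # Equation (8): ratio of similarity to dissimilarity, compared against
    # a threshold T (learned on training data; 1.0 here is illustrative).
    if extra_dependencies > 6:
        # Constant the authors add when the longer sentence has more than
        # 6 unpaired dependencies beyond the shorter one's.
        diss_score += 14
    if diss_score == 0:
        return True  # no dissimilarity at all: identical dependency sets
    return (sim_score / diss_score) >= threshold
```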
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees, or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the one we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the formula of paraphrase identification.
For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e. pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs aninterpreted question In this stage the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and to classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e. a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e. extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
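The keyword-based strategy can be sketched as a simple content-word filter; this is an illustration only, and the stopword list below is ours, not JustAsk's actual one.

```python
# Illustrative stopword list; the real system's list is not reproduced here.
STOPWORDS = {"what", "who", "when", "where", "which", "how",
             "is", "are", "was", "were", "did", "do", "does",
             "the", "a", "an", "of", "in"}

def keyword_query(question):
    # Keep only the content words of the question as the query.
    tokens = question.rstrip("?").lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, `keyword_query("When was Mozart born?")` yields the query `"mozart born"`.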
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (recognizing four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish answer variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
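A minimal sketch of such a date normalizer is shown below; the exact patterns JustAsk handles are not listed in the text, so these two regular expressions are illustrative.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    # 'March 21, 1961' and '21 March 1961' both become 'D21 M3 Y1961'.
    m = re.search(r"(\w+)\s+(\d{1,2}),?\s+(\d{4})", answer)
    if m and m.group(1).lower() in MONTHS:
        return "D%d M%d Y%s" % (int(m.group(2)), MONTHS[m.group(1).lower()], m.group(3))
    m = re.search(r"(\d{1,2})\s+(\w+)\s+(\d{4})", answer)
    if m and m.group(2).lower() in MONTHS:
        return "D%d M%d Y%s" % (int(m.group(1)), MONTHS[m.group(2).lower()], m.group(3))
    return answer  # pattern not recognized; leave the answer unchanged
```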
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
3.2 Experimental Setup
In this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = 2 * (precision * recall) / (precision + recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * Σ_{i=1}^{N} 1/rank_i   (14)
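Equation (14) translates directly into code; a small sketch with hypothetical ranked answer lists and reference sets:

```python
def mrr(ranked_answer_lists, references):
    # Average, over all queries, of 1/rank of the first correct
    # answer; a query with no correct answer contributes 0.
    total = 0.0
    for answers, refs in zip(ranked_answer_lists, references):
        for rank, answer in enumerate(answers, start=1):
            if answer in refs:
                total += 1.0 / rank
                break
    return total / len(ranked_answer_lists)
```

For two queries where the correct answer appears at rank 1 and rank 2 respectively, the MRR is (1 + 1/2) / 2 = 0.75.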
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, as opposed to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, for example by using pronouns, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
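This pronoun filter is straightforward to sketch (a standalone illustration; the pronoun list is the one given above):

```python
import re

PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    # Flag questions whose only clue is a third-person pronoun
    # referring back to an earlier question in the topic.
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)
```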
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus. The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
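The paragraph-level alternative can be sketched as a simple filter over a document's paragraphs. In the actual system this happens at Lucene indexing time; the standalone sketch below only illustrates the idea, and all names are ours.

```python
import re

def paragraph_passages(document, keywords):
    # Keep only the paragraphs that contain at least one query keyword,
    # instead of returning the whole article as a passage.
    keywords = {k.lower() for k in keywords}
    passages = []
    for paragraph in document.split("\n\n"):
        words = set(re.findall(r"\w+", paragraph.lower()))
        if words & keywords:
            passages.append(paragraph)
    return passages
```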
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                   Google – 8 passages          Google – 32 passages
Extracted    Questions   CAE       AS           Questions   CAE       AS
Answers      (#)         success   success      (#)         success   success
0            21          0         –            19          0         –
1 to 10      73          62        53           16          11        11
11 to 20     32          29        22           13          9         7
21 to 40     11          10        8            61          57        39
41 to 60     1           1         1            30          25        17
61 to 100    0           –         –            18          14        11
> 100        0           –         –            5           4         3
All          138         102       84           162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
             Google – 64 passages
Extracted    Questions   CAE       AS
Answers      (#)         success   success
0            17          0         0
1 to 10      18          9         9
11 to 20     2           2         2
21 to 40     16          12        8
41 to 60     42          38        25
61 to 100    38          32        18
> 100        35          30        17
All          168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

       Google – 8 passages      Google – 32 passages     Google – 64 passages
     Total    Q   CAE   AS       Q   CAE   AS             Q   CAE   AS
abb     4     3    1     1       3    1     1             4    1     1
des     9     6    4     4       6    4     4             6    4     4
ent    21    15    1     1      16    1     0            17    1     0
hum    64    46   42    34      55   48    36            56   48    31
loc    39    27   24    19      34   27    22            35   29    21
num    63    41   30    25      48   35    25            50   40    22
All   200   138  102    84     162  120    88           168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results at the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. Table 4 also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results for 20, 30, 50 and 100 passages extracted from the NYT corpus, after the implementation of the features described in Sections 4 and 5
Even though 100 passages offer the best results, they also take more than three times as long to run as just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names / proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'
'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words for each answer. So here LogSim gives us approximately 0.014, while Levenshtein actually gives a better score (0.59).

With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
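The proposed token-level metric can be sketched as follows. The honorific list and the tie-breaking choices below are our assumptions on top of the rules just described; this is an illustration, not the final implementation.

```python
# Illustrative list of honorifics/titles to normalize away.
HONORIFICS = {"mr", "mrs", "ms", "dr", "queen", "king", "president"}

def _tokens(name):
    toks = [t.strip(".").lower() for t in name.split()]
    return [t for t in toks if t and t not in HONORIFICS]

def same_entity(name1, name2):
    # Require at least one shared token between the candidate answers.
    t1, t2 = _tokens(name1), _tokens(name2)
    if not t1 or not t2 or not set(t1) & set(t2):
        return False
    if len(t1) != len(t2):
        # One shared token, different lengths: likely the same entity,
        # e.g. 'Broadus' vs. 'Calvin Broadus'.
        return True
    # Same length: tokens must agree at least on their initials, in order,
    # so 'J. F. Kennedy' aligns with 'John Fitzgerald Kennedy'.
    return all(a[0] == b[0] for a, b in zip(t1, t2))
```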
3. Places/dates/numbers that differ in specificity – some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern Upper Bavaria'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. Even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
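Such a converter can be sketched as follows. This is an illustrative Python sketch, not the actual JustAsk (Java) normalizer; the table entries, function names and the 1% tolerance are our own assumptions:

```python
# Sketch of a measure normalizer/converter: map every unit to a canonical
# unit per dimension, then compare answers in that canonical unit.
UNIT_TO_CANONICAL = {
    # distance -> kilometres
    "km": ("distance", 1.0),
    "mi": ("distance", 1.609344),
    "ft": ("distance", 0.0003048),
    # weight -> kilograms
    "kg": ("weight", 1.0),
    "lb": ("weight", 0.45359237),
}

def normalize_measure(value: float, unit: str):
    """Return (dimension, value converted to the canonical unit)."""
    dimension, factor = UNIT_TO_CANONICAL[unit]
    return dimension, value * factor

def same_measure(v1, u1, v2, u2, tolerance=0.01):
    """Two answers denote the same quantity if they agree after conversion."""
    d1, c1 = normalize_measure(v1, u1)
    d2, c2 = normalize_measure(v2, u2)
    return d1 == d2 and abs(c1 - c2) / max(c1, c2) < tolerance

print(same_measure(100, "km", 62.14, "mi"))  # same distance in different units
```

With this in place, '100 km' and '62.14 miles' would fall into the same cluster instead of being treated as unrelated strings.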
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
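A phonetic code such as Soundex would indeed collapse these misspellings. The following is a minimal Python sketch of the classic Soundex algorithm (first letter kept, consonants mapped to digit classes, h/w transparent, adjacent duplicates collapsed); it is ours, not the SimMetrics implementation used later:

```python
def soundex(word: str) -> str:
    """Classic 4-character Soundex code, e.g. soundex('Robert') == 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkq", "2"),
             **dict.fromkeys("sxz", "2"), **dict.fromkeys("dt", "3"),
             "l": "4", "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    first = word[0].upper()
    # Encode every letter; vowels become '0' (run separators), h/w vanish.
    encoded = [codes.get(ch, "" if ch in "hw" else "0") for ch in word]
    result = []
    prev = encoded[0]
    for code in encoded[1:]:
        if code and code != prev:   # collapse runs of the same code
            result.append(code)
        if code:                    # h/w do not break a run
            prev = code
    digits = [c for c in result if c != "0"]
    return (first + "".join(digits) + "000")[:4]

print(soundex("Kotashaan"), soundex("Khotashan"))  # identical codes
```

Both misspellings above map to the same code, as do 'Glomma' and 'Glama' from the language issue in point 4.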
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend the abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module contains the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics into more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can study how relationships between answers affect the outcome of the final answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
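The boosting scheme of this first experiment can be sketched as follows. This is a hedged Python illustration of the idea (the real module is the Java Scorer interface described below); the function and relation names are ours:

```python
# Sketch of relation-based score boosting: frequency counts 1.0 per exact
# repetition; each equivalence adds 1.0 and each inclusion 0.5 to both
# members of the related pair (the first-experiment weights).
from collections import Counter

EQUIVALENCE, INCLUSION = "equivalence", "inclusion"

def boosted_scores(answers, relations):
    """answers: list of answer strings; relations: (type, a, b) tuples."""
    scores = Counter(answers)            # baseline: frequency
    for rel_type, a, b in relations:
        weight = 1.0 if rel_type == EQUIVALENCE else 0.5
        scores[a] += weight              # both answers in the pair get boosted
        scores[b] += weight
    return scores

answers = ["Elizabeth II", "Queen Elizabeth", "Elizabeth II"]
relations = [(EQUIVALENCE, "Elizabeth II", "Queen Elizabeth")]
print(boosted_scores(answers, relations))
```

Here 'Elizabeth II' ends with score 3.0 (frequency 2 plus one equivalence) and 'Queen Elizabeth' with 2.0, so the cluster built around the former wins.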
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but for the moment it also only evaluates very particular cases, allowing room for further extensions.
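The acronym normalizer amounts to a hand-written lookup table. A minimal Python sketch of the idea (the entries and names are illustrative, not the module's full list):

```python
# Sketch of an acronym normalizer: expand known acronyms to their full
# forms so that 'U.S.', 'US' and 'United States' cluster together.
ACRONYMS = {
    "US": "United States",
    "USA": "United States of America",
    "UN": "United Nations",
}

def normalize_acronym(answer: str) -> str:
    """Return the expanded form if the answer is a known acronym."""
    key = answer.strip()
    # Strip periods so 'U.S.' and 'US' hit the same table entry.
    return ACRONYMS.get(key, ACRONYMS.get(key.replace(".", ""), answer))

print(normalize_acronym("U.S."))  # -> United States
```

Unknown strings pass through untouched, so the normalizer is safe to apply to every candidate of the targeted categories.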
4.2.2 Proximity Measure
For problem 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
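The proximity boost can be sketched in a few lines. This is an illustrative simplification (it uses the first occurrence of each string; the thesis does not specify how multiple occurrences are handled), with the stated threshold of 20 characters and boost of 5 points:

```python
# Sketch of the proximity measure: boost an answer whose average character
# distance to the query keywords within the passage is under 20.
def proximity_boost(passage: str, answer: str, keywords, score: float) -> float:
    a_pos = passage.find(answer)
    if a_pos < 0:
        return score
    distances = [abs(passage.find(kw) - a_pos)
                 for kw in keywords if passage.find(kw) >= 0]
    if distances and sum(distances) / len(distances) < 20:
        score += 5  # answer sits close to the query keywords
    return score

passage = "Everest is the highest mountain in the world."
print(proximity_boost(passage, "Everest", ["highest", "mountain"], 10))
```

In the example, 'Everest' averages 19 characters from the keywords, so its score is boosted from 10 to 15; a keyword far away in the passage would leave the score unchanged.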
4.3 Results
                  200 questions          1210 questions
Before features   P50.6 R87.0 A44.0      P28.0 R79.5 A22.2
After features    P51.1 R87.0 A44.5      P28.2 R79.8 A22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) applied during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters in common by the total number of characters of the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union holds seven distinct bigrams.
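The bigram-based computation above can be written compactly. A minimal Python sketch (ours, for illustration; the system uses the SimMetrics Java implementations):

```python
# Bigram-based Dice and Jaccard, as in the 'Kennedy' / 'Keneddy' example.
def bigrams(s: str) -> set:
    """Set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a: str, b: str) -> float:
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

print(dice("Kennedy", "Keneddy"), jaccard("Kennedy", "Keneddy"))
```

For the pair above this yields 5/6 ≈ 0.83 for Dice and 5/7 ≈ 0.71 for Jaccard, matching the hand computation.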
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.
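Equations (17) and (18) can be sketched directly in code. The following is an illustrative Python implementation (the system itself uses the SimMetrics library); the matching window of max(|s1|,|s2|)/2 − 1 follows the standard Jaro definition:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity per equation (17)."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    m1 = []
    for i, ch in enumerate(s1):                 # collect matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                m1.append(ch)
                break
    m2 = [s2[j] for j in range(len(s2)) if used[j]]
    m = len(m1)
    if m == 0:
        return 0.0
    t = sum(a != b for a, b in zip(m1, m2)) / 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler per equation (18), prefix capped at 4 characters."""
    dj = jaro(s1, s2)
    l = 0
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)

print(jaro("MARTHA", "MARHTA"), jaro_winkler("MARTHA", "MARHTA"))
```

For the classic 'MARTHA' vs 'MARHTA' pair, all six characters match with one transposition, giving Jaro 17/18 ≈ 0.944; the shared 3-character prefix lifts Jaro-Winkler to ≈ 0.961.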
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }    (19)
The package contained other metrics of particular interest, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.
Metric / Threshold   0.17                 0.33                 0.5
Levenshtein          P51.1 R87.0 A44.5    P43.7 R87.5 A38.5    P37.3 R84.5 A31.5
Overlap              P37.9 R87.0 A33.0    P33.0 R87.0 A33.0    P36.4 R86.5 A31.5
Jaccard index        P9.8  R56.0 A5.5     P9.9  R55.5 A5.5     P9.9  R55.5 A5.5
Dice coefficient     P9.8  R56.0 A5.5     P9.9  R56.0 A5.5     P9.9  R55.5 A5.5
Jaro                 P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Jaro-Winkler         P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Needleman-Wunsch     P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P16.9 R59.0 A10.0
LogSim               P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P42.0 R87.0 A36.5
Soundex              P35.9 R83.5 A30.0    P35.9 R83.5 A30.0    P28.7 R82.0 A23.5
Smith-Waterman       P17.3 R63.5 A11.0    P13.4 R59.5 A8.0     P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while adding an extra weight for strings that not only share at least a perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
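Equations (20)–(22) can be sketched as follows. This is an illustrative Python rendering of the metric (the thesis implementation is in Java, and details such as token ordering are our assumptions):

```python
# Sketch of 'JFKDistance' (equations 20-22): token overlap plus half-weight
# credit for leftover tokens that share a first letter across the answers.
def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # Leftover token pairs, in order, sharing the same initial letter.
    first_letter_matches = sum(
        1 for x, y in zip(rest1, rest2) if x[0].lower() == y[0].lower())
    match_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - match_score                    # eq. (20)

print(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"))
```

For the running example, the common term 'Kennedy' gives 1/3, and the two initial matches among four leftover tokens give (2/4)/2 = 0.25, so the distance is 1 − 1/3 − 0.25 ≈ 0.42 – low enough to cluster the pair, unlike Levenshtein or Overlap.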
A few flaws of this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows us the results after the implementation of this new distancemetric
                   200 questions          1210 questions
Before             P51.1 R87.0 A44.5      P28.2 R79.8 A22.5
After new metric   P51.7 R87.0 A45.0      P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features that also exist in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
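As a sketch, such a dispatch could look like the following. The mapping shown is hypothetical – as discussed below, the measured results ended up favouring a single metric for all categories:

```python
# Hypothetical per-category metric dispatch (the decision tree idea above).
def pick_metric(category: str) -> str:
    table = {
        "Human": "Levenshtein",
        "Location": "LogSim",
    }
    return table.get(category, "Levenshtein")  # default metric

print(pick_metric("Location"))
```

The clusterer would then look up the metric once per question, based on the category assigned by the question classifier.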
For that effort we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category   Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance         54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the 200-question test corpus, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single immutable answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are error-prone, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we treat the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error is the reference file that presents all the possible 'correct' answers, which may be missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we should try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and open questions of Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Abstract
In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the web, since they effectively compact the information we are looking for into a single answer. To that effect, a new QA system called JustAsk was designed.

While JustAsk is already fully operable and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and clustering of answers, which can lead to a wrong answer being returned by the system.

This work deals with this problem and comprises the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers of JustAsk; d) analyze errors and detect error patterns over the answer extraction stage; e) present and implement a solution to the problems detected, which will include normalization/integration of the potential answers as well as the implementation of new clustering techniques; f) evaluate the system with the new features proposed.
Keywords
JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction
Index
1 Introduction
2 Related Work
   2.1 Relating Answers
   2.2 Paraphrase Identification
      2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
      2.2.2 Methodologies purely based on lexical similarity
      2.2.3 Canonicalization as a complement to lexical features
      2.2.4 Use of context to extract similar phrases
      2.2.5 Use of dissimilarities and semantics to provide better similarity measures
   2.3 Overview
3 JustAsk architecture and work points
   3.1 JustAsk
      3.1.1 Question Interpretation
      3.1.2 Passage Retrieval
      3.1.3 Answer Extraction
   3.2 Experimental Setup
      3.2.1 Evaluation measures
      3.2.2 Multisix corpus
      3.2.3 TREC corpus
   3.3 Baseline
      3.3.1 Results with Multisix Corpus
      3.3.2 Results with TREC Corpus
   3.4 Error Analysis over the Answer Module
4 Relating Answers
   4.1 Improving the Answer Extraction module: Relations Module Implementation
   4.2 Additional Features
      4.2.1 Normalizers
      4.2.2 Proximity Measure
   4.3 Results
   4.4 Results Analysis
   4.5 Building an Evaluation Corpus
5 Clustering Answers
   5.1 Testing Different Similarity Metrics
      5.1.1 Clusterer Metrics
      5.1.2 Results
      5.1.3 Results Analysis
   5.2 New Distance Metric – 'JFKDistance'
      5.2.1 Description
      5.2.2 Results
      5.2.3 Results Analysis
   5.3 Possible new features – Combination of distance metrics to enhance overall results
   5.4 Posterior analysis
6 Conclusions and Future Work
List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system
2 How the software to create the 'answers corpus' works
3 Execution of the application to create answers corpora in the command line
4 Proposed approach to automate the detection of relations between answers
5 Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
6 Execution of the application to create relations corpora in the command line
List of Tables

1 JustAsk baseline best results for the test corpus of 200 questions
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5 Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result
Acronyms
TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the explosion of the world wide web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot: after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
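The risk can be made concrete with a toy redundancy count. The variant table below is hand-made for the example (it is not how JustAsk stores answers); it only shows why unlinked variations distort frequency-based selection:

```python
from collections import Counter

extracted = ["JFK", "John F. Kennedy", "John Kennedy", "Nixon", "Nixon"]

# Naive redundancy: each surface form is counted separately, so 'Nixon' wins.
naive = Counter(extracted)

# With a variant table linking spellings of the same answer, the counts are
# merged and 'John F. Kennedy' (3 occurrences) correctly wins.
canonical = {"JFK": "John F. Kennedy", "John Kennedy": "John F. Kennedy"}
merged = Counter(canonical.get(a, a) for a in extracted)
```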
In sum, the goals of this work are:

– to survey the state of the art on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding and implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure, the JustAsk system, and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works with a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – the answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – the answers are consistent but differ in specificity, one entailing the other. They can be related by hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – the answers do not entail each other at all; they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – the answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
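Notational date variations, one of the equivalence cases above, can be detected by normalization rather than by string similarity. A minimal sketch, assuming a small list of example formats (a real normalizer would cover many more):

```python
import re
from datetime import datetime

# Example notations only; far from exhaustive.
FORMATS = ("%d %B %Y", "%B %d %Y", "%Y-%m-%d")

def normalize_date(s: str):
    # Strip ordinal suffixes ('24th' -> '24') and commas, then try each format.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", s.replace(",", ""))
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            pass
    return None

def dates_equivalent(a: str, b: str) -> bool:
    # Two date answers are equivalent if they normalize to the same value.
    da, db = normalize_date(a), normalize_date(b)
    return da is not None and da == db
```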
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com¹, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each fragment entails the other; a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus², from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we present an overview of some of the approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
¹ http://www.wordreference.com/definition/paraphrase
² The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes attach.

– Composite Features – combinations of pairs of primitive features to improve clustering, such as analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
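A minimal sketch of one primitive feature, word co-occurrence with stem matching. The crude suffix-stripping 'stemmer' below is a stand-in for a real one (e.g. Porter's), kept only to make the example self-contained:

```python
def crude_stem(word: str) -> str:
    # Stand-in stemmer: strips a few common English suffixes.
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def word_cooccurrence(u1: str, u2: str) -> bool:
    # Primitive feature: the two text units share at least one stemmed word.
    s1 = {crude_stem(w) for w in u1.lower().split()}
    s2 = {crude_stem(w) for w in u2.lower().split()}
    return bool(s1 & s2)
```

A composite feature would then combine two such primitives, e.g. requiring them to fire within a fixed distance of each other.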
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification; namely, the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives the measure of the Word Simple N-gram Overlap:

\[ NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(n\text{-}gram)}{Count(n\text{-}gram)} \tag{1} \]

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
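Equation (1) can be implemented directly. In this sketch, Count(n-gram) is read as the number of n-grams of the shorter sentence, and an n-gram 'matches' if it also occurs in the other sentence, which is one simplified reading of the formula:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1: str, s2: str, max_n: int = 4) -> float:
    # Word Simple N-gram Overlap: average, over n = 1..max_n, of the fraction
    # of the shorter sentence's n-grams that also occur in the other sentence.
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, other = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total, used = 0.0, 0
    for n in range(1, max_n + 1):
        grams = ngrams(shorter, n)
        if not grams:          # sentence has fewer than n words
            continue
        other_grams = set(ngrams(other, n))
        total += sum(g in other_grams for g in grams) / len(grams)
        used += 1
    return total / used if used else 0.0
```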
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining the precision and recall metrics described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new metric called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

\[ L(x, \lambda) = -\log_2\left(\frac{\lambda}{x}\right) \tag{2} \]

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:
\[ logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \tag{3} \]
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), since λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure; after all, the two are not completely independent, since λ ≤ y. Therefore an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α) applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus³, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
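A sketch of the basic two-parameter LogSim of equations (2) and (3). As a simplification, λ is approximated by the number of distinct words the two sentences share, which is close to, but not exactly, Cordeiro's exclusive-link counting:

```python
import math

def log_sim(s1: str, s2: str, k: float = 2.7) -> float:
    # lambda_links approximates lambda (links); x is the longer sentence length.
    t1, t2 = s1.lower().split(), s2.lower().split()
    lambda_links = len(set(t1) & set(t2))
    x = max(len(t1), len(t2))
    if lambda_links == 0:
        return 0.0
    l = -math.log2(lambda_links / x)
    # Note: identical pairs give l = 0, reflecting the slow penalty the
    # formula applies to sentence pairs that are too similar.
    return l if l < 1.0 else math.exp(-k * l)
```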
³ This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves in particular number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation may also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kinds of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers: considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
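A tiny illustration of the lexical-level canonicalization idea. The patterns below are invented for the example and far cruder than Zhang and Patrick's actual rules:

```python
import re

def canonicalize(sentence: str) -> str:
    s = sentence.lower()
    # Map future-tense constructions onto 'will'.
    s = re.sub(r"\b(plans to|expects to|is going to)\b", "will", s)
    # Collapse monetary values into a generic token.
    s = re.sub(r"\$\s?\d+(\.\d+)?( million| billion)?", "<money>", s)
    return s
```

Lexical features such as edit distance or n-gram overlap are then computed over the canonical forms instead of the raw sentences.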
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
\[ sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty \tag{4} \]
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two paragraphs is given by the following dynamic programming recurrence – a way of breaking down a complex problem into a series of simpler subproblems:
\[ s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - skip\_penalty \\ s(S_1 - 1, S_2) - skip\_penalty \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \tag{5} \]

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
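A sketch of the recurrence in equation (5), where s(i, j) is the best alignment weight of the first i sentences of one paragraph against the first j of the other. The skip penalty value is a placeholder; Barzilay and Elhadad tuned their parameters empirically:

```python
def align_score(sims, skip_penalty=0.1):
    # sims[i][j] holds the local similarity sim(S1_{i+1}, S2_{j+1}) of
    # equation (4), i.e. cosine minus the mismatch penalty.
    n = len(sims)
    m = len(sims[0]) if n else 0
    sim = lambda i, j: sims[i - 1][j - 1]
    NEG = float("-inf")
    s = [[NEG] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if j >= 1:
                cands.append(s[i][j - 1] - skip_penalty)   # skip a sentence of S2
            if i >= 1:
                cands.append(s[i - 1][j] - skip_penalty)   # skip a sentence of S1
            if i >= 1 and j >= 1:
                cands.append(s[i - 1][j - 1] + sim(i, j))  # 1-to-1 match
            if i >= 1 and j >= 2:                          # 1-to-2 match
                cands.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))
            if i >= 2 and j >= 1:                          # 2-to-1 match
                cands.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))
            if i >= 2 and j >= 2:                          # crossing 2-to-2 match
                cands.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))
            s[i][j] = max(cands)
    return s[n][m]
```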
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections, from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

    sim(T1, T2) = 1/2 * ( sum_{w in T1} maxSim(w, T2) * idf(w) / sum_{w in T1} idf(w)
                        + sum_{w in T2} maxSim(w, T1) * idf(w) / sum_{w in T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that
matches it most closely in T2, given by maxSim(w, T2), then weight that match by specificity via IDF, sum over all words, and normalize by the total IDF of the segment in question. The same process is applied to T2 and, in the end, we average the two directions, T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow, lch [18]; Lesk [1]; Wu and Palmer, wup [46]; Resnik [34]; Lin [22]; and Jiang and Conrath, jcn [14]). Below we offer a brief description of each of these metrics.
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
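The combined measure in Equation (6) can be sketched as below. The exact-match word similarity and the tiny hand-made IDF table are illustrative assumptions standing in for the corpus- and knowledge-based maxSim metrics listed above:

```python
def text_sim(t1, t2, word_sim, idf):
    """Equation (6): average of the two directional, idf-weighted
    best-word-match scores between segments t1 and t2."""
    def directional(src, dst):
        num = sum(max(word_sim(w, w2) for w2 in dst) * idf[w] for w in src)
        den = sum(idf[w] for w in src)
        return num / den
    return 0.5 * (directional(t1, t2) + directional(t2, t1))

# stand-ins (assumptions): exact-match similarity and a hypothetical idf table
word_sim = lambda a, b: 1.0 if a == b else 0.0
idf = {"mogadishu": 2.0, "capital": 1.2, "somalia": 1.8,
       "city": 1.0, "largest": 1.1}

t1 = ["mogadishu", "capital"]
t2 = ["capital", "city"]
print(text_sim(t1, t2, word_sim, idf))
```

With a WordNet-based word_sim, near-synonyms would also contribute partial credit instead of zero.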
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and to verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and the similarity is calculated using the following formula:
    sim(~a, ~b) = (~a W ~b) / (|~a| |~b|)   (7)
with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five with a better precision rating than those three as well.
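A minimal sketch of Equation (7), assuming the similarity matrix W is given (in the original work it is filled with WordNet metric scores; the toy values below are ours):

```python
import math

def matrix_similarity(a, b, W):
    """Equation (7): a and b are binary word vectors; W[i][j] holds the
    WordNet-derived similarity of word i (from S1) and word j (from S2)."""
    num = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / norm

# toy example: two words per sentence, cross-word similarity 0.5
print(round(matrix_similarity([1, 1], [1, 1], [[1, 0.5], [0.5, 1]]), 3))
```

Off-diagonal entries are what let the measure reward pairs like 'say'/'insist' that a plain cosine over word identity would miss.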
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
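The decision rule of Equation (8), together with the length penalty described above, can be sketched as follows (the Sim/Diss values and the threshold in the example are hypothetical; in the original work Sim and Diss come from the weighted dependency sets):

```python
LENGTH_PENALTY = 14   # the constant added by Lintean & Rus for unbalanced pairs

def paraphrase_score(sim_score, diss_score, extra_unpaired):
    """Equation (8), Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2), with the
    length penalty: when the longer sentence has more than 6 extra unpaired
    dependencies, the dissimilarity is inflated so two-way entailment fails."""
    if extra_unpaired > 6:
        diss_score += LENGTH_PENALTY
    return sim_score / diss_score

def is_paraphrase(sim_score, diss_score, extra_unpaired, threshold):
    """Sentences are deemed paraphrases when the score exceeds threshold T."""
    return paraphrase_score(sim_score, diss_score, extra_unpaired) > threshold

# hypothetical Sim/Diss values for illustration; T is learned on training data
print(is_paraphrase(6.0, 2.0, 0, threshold=2.5))   # balanced pair -> True
print(is_paraphrase(6.0, 2.0, 8, threshold=2.5))   # unbalanced pair -> False
```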
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.
For all of these techniques, we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs balancing the number of positive ones. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. In this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. Figure 1 shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identifying the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, learning how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and Gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
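This normalization step can be sketched as below; the function name and the token-based parsing strategy are our assumptions, not JustAsk's actual implementation, but the output matches the canonical form just described:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    """Convert date strings such as 'March 21, 1961' or '21 March 1961'
    to the canonical 'D21 M3 Y1961' form used for Numeric:Date answers."""
    tokens = re.findall(r"[A-Za-z]+|\d+", answer.lower())
    day = month = year = None
    for tok in tokens:
        if tok in MONTHS:
            month = MONTHS[tok]
        elif tok.isdigit() and len(tok) == 4:
            year = int(tok)      # 4-digit number assumed to be the year
        elif tok.isdigit():
            day = int(tok)
    return f"D{day} M{month} Y{year}"

print(normalize_date("March 21, 1961"))   # D21 M3 Y1961
print(normalize_date("21 March 1961"))    # D21 M3 Y1961
```

Both surface variants collapse into one string, so the clustering stage sees them as identical.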
Similar instances are then grouped into a set of clusters; for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
    Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and
min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer) and a score, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
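The overlap distance of Equation (9) and the single-link merging loop can be sketched as follows. This is our illustration of the procedure described above, not JustAsk's actual code, and the whitespace tokenization is an assumption:

```python
def overlap_distance(x, y):
    """Equation (9): returns 0.0 when one candidate is contained in the other."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_cluster(answers, distance, threshold):
    """Repeatedly merge the two closest clusters (minimum pairwise distance)
    until every remaining pair is farther apart than the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = single_link_cluster(
    ["Mogadishu", "city of Mogadishu", "Washington"],
    overlap_distance, threshold=0.2)
print(clusters)
```

'Mogadishu' and 'city of Mogadishu' merge (distance 0.0, containment), while 'Washington' stays in its own cluster.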
3.2 Experimental Setup
In this section we describe the corpora used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

    F = (2 * precision * recall) / (precision + recall)   (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus   (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) * sum_{i=1}^{N} 1/rank_i   (14)
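Equation (14) in code form; this is a generic sketch (queries with no correct answer in the list contribute 0, a common convention we assume here):

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Equation (14): average over queries of 1/rank of the first correct
    answer; a query with no correct answer in its list contributes 0."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, ans in enumerate(answers, start=1):
            if ans in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)

# hypothetical ranked outputs and reference sets for three queries
runs = [["Paris", "Lyon"], ["Berlin", "Bonn", "Hamburg"], ["Oslo"]]
gold = [{"Paris"}, {"Bonn"}, {"Stockholm"}]
print(mean_reciprocal_rank(runs, gold))  # (1 + 1/2 + 0) / 3 = 0.5
```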
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also includes the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person who has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages found on each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu, Title=Mogadishu - Wikipedia, the free encyclopedia, Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an, Rank=1, Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference (since it is included in the content section) and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially constructing new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
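The annotation loop just described can be sketched as below. The function and parameter names are hypothetical; the acceptance rule (any typed string that occurs in the snippet's Content counts as a reference) mirrors the behaviour described above:

```python
def collect_references(questions, snippets, read_input=input):
    """Sketch of the annotation loop: show each snippet's Content and accept
    any typed string that occurs in it as a reference for that question.
    An empty line skips to the next snippet."""
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets.get(q, []):
            print(f"Q: {q}\n{content}")
            entry = read_input("reference> ").strip()
            if entry and entry.lower() in content.lower():
                references[q].append(entry)
    return references

# simulated session: the annotator types 'Mogadishu' for the first snippet
# and skips the second with an empty line
replies = iter(["Mogadishu", ""])
refs = collect_references(
    ["What is the capital of Somalia?"],
    {"What is the capital of Somalia?": [
        "Mogadishu is the largest city in Somalia and the nation's capital.",
        "Somalia lies on the Horn of Africa."]},
    read_input=lambda prompt: next(replies))
print(refs["What is the capital of Somalia?"])   # ['Mogadishu']
```

The substring check is exactly why any word of the content is technically a viable reference, as noted above.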
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions that we filtered to extract those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we knew the New York Times had retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.
The first and simplest solution was to filter out the questions that contained unknown third-party references (pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his'), plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
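This pronoun filter can be sketched as follows; the function name is ours, and the pronoun list is the one given above:

```python
import re

PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def is_self_contained(question):
    """Keep only questions with no unresolved third-party pronoun reference."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return not PRONOUNS.intersection(tokens)

questions = ["When did she die?", "Who was his father?",
             "Where was the festival held?"]
print([q for q in questions if is_self_contained(q)])
```

As the next paragraphs show, this only catches the obvious cases; definite noun phrases like 'the festival' or 'the company' slip through.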
But much less obvious cases were still present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, each topic comprising a set of questions. Noticing that each question from the TREC corpus carries topic information that is often not explicit in the question itself, the best possible solution seemed to be to somehow pass this information to the query in the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, in several variations, as they do in the Google results.
We also wondered whether the big problem lies in the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.
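JustAsk uses Lucene for this indexing; the pure-Python inverted index below is only our illustration of the paragraph-versus-document idea, not the actual implementation:

```python
import re
from collections import defaultdict

def index_paragraphs(articles):
    """Index paragraphs instead of whole documents: each newline-separated
    paragraph becomes its own retrievable passage."""
    index = defaultdict(set)
    passages = []
    for article in articles:
        for paragraph in article.split("\n"):
            pid = len(passages)
            passages.append(paragraph)
            for word in re.findall(r"\w+", paragraph.lower()):
                index[word].add(pid)
    return index, passages

def retrieve(index, passages, keywords):
    """Return only the paragraphs containing every query keyword."""
    hits = set.intersection(*(index.get(k.lower(), set()) for k in keywords))
    return [passages[i] for i in sorted(hits)]

# a whole-document match could wrongly favour 'Washington' here;
# paragraph-level passages keep the answer-bearing context isolated
article = ("Washington criticised the regime repeatedly.\n"
           "Baghdad is the capital of Iraq.")
index, passages = index_paragraphs([article])
print(retrieve(index, passages, ["capital", "Iraq"]))
```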
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results when only the top answer is considered (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
Extracted      Google - 8 passages        Google - 32 passages
answers        Questions  CAE    AS       Questions  CAE    AS
0              21         0      -        19         0      -
1 to 10        73         62     53       16         11     11
11 to 20       32         29     22       13         9      7
21 to 40       11         10     8        61         57     39
41 to 60       1          1      1        30         25     17
61 to 100      0          -      -        18         14     11
> 100          0          -      -        5          4      3
All            138        102    84       162        120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted      Google - 64 passages
answers        Questions  CAE    AS
0              17         0      0
1 to 10        18         9      9
11 to 20       2          2      2
21 to 40       16         12     8
41 to 60       42         38     25
61 to 100      38         32     18
> 100          35         30     17
All            168        123    79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
      Total   Google - 8 passages    Google - 32 passages   Google - 64 passages
              Q    CAE   AS          Q    CAE   AS          Q    CAE   AS
abb   4       3    1     1           3    1     1           4    1     1
des   9       6    4     4           6    4     4           6    4     4
ent   21      15   1     1           16   1     0           17   1     0
hum   64      46   42    34          55   48    36          56   48    31
loc   39      27   24    19          34   27    22          35   29    21
num   63      41   30    25          48   35    25          50   40    22
All   200     138  102   84          162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The 1210 questions proved, at least, to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results, after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers for a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
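As a minimal sketch (in Python, with illustrative names – the actual JustAsk module is a Java Scorer interface), the boosting step can be pictured as follows, using the first experiment's weights (1.0 for frequency and equivalence, 0.5 for inclusion):

```python
# Hypothetical sketch of relations-based answer scoring. Weights follow
# the first experiment described above; names are illustrative only.
WEIGHTS = {"FREQUENCY": 1.0, "EQUIVALENCE": 1.0, "INCLUSION": 0.5}

def score_candidates(candidates, relations):
    """candidates: list of answer strings (duplicates count as frequency);
    relations: list of (answer_a, answer_b, relation_type) triples."""
    scores = {}
    # Baseline scorer: every occurrence contributes the frequency weight.
    for answer in candidates:
        scores[answer] = scores.get(answer, 0.0) + WEIGHTS["FREQUENCY"]
    # Rules scorer: each detected relation boosts both members of the pair.
    for a, b, rel in relations:
        boost = WEIGHTS.get(rel, 0.0)
        scores[a] = scores.get(a, 0.0) + boost
        scores[b] = scores.get(b, 0.0) + boost
    return scores
```

With two occurrences of 'Lee Myung-bak', one of 'Lee', and an equivalence between them, 'Lee Myung-bak' would end up with score 3.0 and 'Lee' with 2.0 under these weights.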
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
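The idea behind the conversion normalizer can be sketched as follows (a Python illustration with assumed factors and function names, covering only the multiplicative cases; the actual class also handles Celsius to Fahrenheit, which is affine rather than a simple factor):

```python
# Sketch of a unit-conversion normalizer: numeric answers are rewritten
# in a single reference unit per measure type, so '300 km' and its
# miles equivalent become directly comparable. Names are illustrative.
import re

CONVERSIONS = {          # unit -> (target unit, multiplicative factor)
    "km": ("miles", 0.621371),
    "meters": ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
}

def normalize_measure(answer):
    match = re.match(r"([\d.]+)\s*(km|meters|kg)\b", answer)
    if not match:
        return answer                       # nothing to convert
    value, unit = float(match.group(1)), match.group(2)
    target, factor = CONVERSIONS[unit]
    return f"{value * factor:.1f} {target}"
```

Under this sketch, the '300 km' candidate from Section 3.4 would be normalized to roughly '186.4 miles', letting it cluster against answers already expressed in miles.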
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
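The position guard and the exception list mentioned above might be combined like this (a hedged sketch; the exception set and function name are assumptions, not JustAsk's actual code):

```python
# Sketch of the title-stripping normalizer: honorifics are removed only
# when they appear as the leading token, and known exceptions (e.g. the
# band 'Queen') are left untouched. Names are illustrative.
import re

TITLES = ("Mrs", "Mr", "Ms", "King", "Queen", "Dame", "Sir", "Dr")

def strip_title(answer, exceptions=("Queen", "Mr. T")):
    if answer in exceptions:
        return answer
    pattern = r"^(?:%s)\.?\s+" % "|".join(TITLES)
    return re.sub(pattern, "", answer, count=1)
```

This keeps 'Queen' (the band) intact while mapping 'Queen Elizabeth' to 'Elizabeth' and 'Mr. King' to 'King'.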
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
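The computation just described can be sketched as follows (a simplified Python illustration using first occurrences and character offsets; the constants match the text, but the function name and the exact distance bookkeeping are assumptions):

```python
# Sketch of the proximity boost: if the average character distance
# between a candidate answer and the query keywords in a passage is
# under 20 characters, the answer's score gets a 5-point boost.
PROXIMITY_THRESHOLD = 20   # characters
PROXIMITY_BOOST = 5.0      # score points

def proximity_boost(passage, answer, keywords):
    answer_pos = passage.find(answer)
    if answer_pos < 0:
        return 0.0
    distances = []
    for keyword in keywords:
        kw_pos = passage.find(keyword)
        if kw_pos >= 0:
            distances.append(abs(kw_pos - answer_pos))
    if not distances:
        return 0.0
    average = sum(distances) / len(distances)
    return PROXIMITY_BOOST if average <= PROXIMITY_THRESHOLD else 0.0
```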
4.3 Results
                            200 questions                          1210 questions
Before features   P: 50.6%   R: 87.0%   A: 44.0%      P: 28.0%   R: 79.5%   A: 22.2%
After features    P: 51.1%   R: 87.0%   A: 44.5%      P: 28.2%   R: 79.8%   A: 22.5%
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
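The judgment-handling core of this tool can be pictured with a small sketch (the mapping from commands to output labels follows the options listed above; the function itself is an illustrative reconstruction, not the tool's actual code):

```python
# Sketch of the annotation tool's input handling: each command maps to
# the relation label written to the output file; invalid input returns
# None so the tool can re-prompt the user.
RELATION_LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",            # first answer includes the second
    "I2": "INCLUSION",            # second answer includes the first
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def parse_judgment(command):
    command = command.strip().upper()
    if command == "SKIP":
        return "SKIP"
    return RELATION_LABELS.get(command)   # None -> repeat the judgment
```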
Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics were added that had been used for a Natural Language project in 2010/2011 at Tagus Park. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of characters in common by the total number of characters of the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five shared bigrams out of seven in the union).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
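To make the character-level metrics above concrete, here is a plain-Python sketch of the bigram-based coefficients and the Jaro family (the four-character prefix cap in Jaro-Winkler is the conventional one, assumed here since the text does not state it):

```python
def bigrams(s):
    # Character bigrams as a set, e.g. 'Kennedy' -> {Ke,en,nn,ne,ed,dy}.
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1   # standard matching window
    flags2 = [False] * len(s2)
    matched1, matches = [], 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == c:
                flags2[j] = True
                matched1.append(c)
                matches += 1
                break
    if matches == 0:
        return 0.0
    matched2 = [c for c, f in zip(s2, flags2) if f]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:   # common prefix, conventionally capped at 4
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)
```

On the examples above, dice('Kennedy', 'Keneddy') gives 5/6 and jaro('MARTHA', 'MARHTA') gives 0.944, with Jaro-Winkler raising the latter to 0.961 thanks to the shared 'MAR' prefix.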
Then the LogSim metric by Cordeiro and his team, described over the related work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.
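The clustering step these thresholds feed into can be sketched as a simplified single-link scheme with a pluggable distance function (an illustration of the technique, not JustAsk's actual clusterer):

```python
# Sketch of threshold-based single-link clustering of candidate answers:
# two clusters are merged while any pair of answers across them lies
# within the distance threshold (e.g. 0.17). The distance function is
# pluggable, which is how the different metrics were compared.
def single_link_clusters(answers, distance, threshold):
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(x, y) <= threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i] += clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With a case-insensitive exact-match distance, for example, 'Grozny' and 'GROZNY' collapse into one cluster while 'Lisbon' stays apart.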
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.
Metric / Threshold          0.17                            0.33                            0.5
Levenshtein        P: 51.1%  R: 87.0%  A: 44.5%    P: 43.7%  R: 87.5%  A: 38.5%    P: 37.3%  R: 84.5%  A: 31.5%
Overlap            P: 37.9%  R: 87.0%  A: 33.0%    P: 33.0%  R: 87.0%  A: 33.0%    P: 36.4%  R: 86.5%  A: 31.5%
Jaccard index      P:  9.8%  R: 56.0%  A:  5.5%    P:  9.9%  R: 55.5%  A:  5.5%    P:  9.9%  R: 55.5%  A:  5.5%
Dice coefficient   P:  9.8%  R: 56.0%  A:  5.5%    P:  9.9%  R: 56.0%  A:  5.5%    P:  9.9%  R: 55.5%  A:  5.5%
Jaro               P: 11.0%  R: 63.5%  A:  7.0%    P: 11.9%  R: 63.0%  A:  7.5%    P:  9.8%  R: 56.0%  A:  5.5%
Jaro-Winkler       P: 11.0%  R: 63.5%  A:  7.0%    P: 11.9%  R: 63.0%  A:  7.5%    P:  9.8%  R: 56.0%  A:  5.5%
Needleman-Wunsch   P: 43.7%  R: 87.0%  A: 38.0%    P: 43.7%  R: 87.0%  A: 38.0%    P: 16.9%  R: 59.0%  A: 10.0%
LogSim             P: 43.7%  R: 87.0%  A: 38.0%    P: 43.7%  R: 87.0%  A: 38.0%    P: 42.0%  R: 87.0%  A: 36.5%
Soundex            P: 35.9%  R: 83.5%  A: 30.0%    P: 35.9%  R: 83.5%  A: 30.0%    P: 28.7%  R: 82.0%  A: 23.5%
Smith-Waterman     P: 17.3%  R: 63.5%  A: 11.0%    P: 13.4%  R: 59.5%  A:  8.0%    P: 10.4%  R: 57.5%  A:  6.0%
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally giving extra weight to strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above: we have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
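A sketch of the metric in Python follows. It is one possible reading of Eqs. (20)-(22): we assume firstLetterMatches counts the unmatched tokens on both sides that pair up by first letter, and nonCommonTerms counts all unmatched tokens, which makes the worked example come out at 1 − 1/3 − 1/2 = 1/6:

```python
# Illustrative sketch of 'JFKDistance' (Eqs. 20-22); the handling of
# nonCommonTerms is an assumption, since the text leaves it ambiguous.
def jfk_distance(a1, a2):
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # Eq. (21)
    # Unmatched tokens on each side, kept in order.
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # Pairs of unmatched tokens sharing a first letter ('J.'/'John', ...).
    paired = sum(1 for x, y in zip(rest1, rest2)
                 if x[:1].lower() == y[:1].lower())
    non_common = len(rest1) + len(rest2)
    # Eq. (22): each pair counts both of its tokens as first-letter matches.
    fl_score = (2 * paired / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score                       # Eq. (20)
```

Under this reading, 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' yields a distance of 1/6, comfortably below the 0.17 clustering threshold, where Levenshtein and Overlap both fail to merge the pair.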
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                            200 questions                          1210 questions
Before            P: 51.1%   R: 87.0%   A: 44.5%      P: 28.2%   R: 79.8%   A: 22.5%
After new metric  P: 51.7%   R: 87.0%   A: 45.0%      P: 29.2%   R: 79.8%   A: 23.3%
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric incorporated, over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category    Human     Location   Numeric
Levenshtein          50.00%    53.85%     4.84%
Overlap              42.42%    41.03%     4.84%
Jaccard index         7.58%     0.00%     0.00%
Dice                  7.58%     0.00%     0.00%
Jaro                  9.09%     0.00%     0.00%
Jaro-Winkler          9.09%     0.00%     0.00%
Needleman-Wunsch     42.42%    46.15%     4.84%
LogSim               42.42%    46.15%     4.84%
Soundex              42.42%    46.15%     3.23%
Smith-Waterman        9.09%     2.56%     0.00%
JFKDistance          54.55%    53.85%     4.84%
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways through each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output regarding the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references
After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
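The rule suggested above for the filtering problem – removing final answers that are mere variations of the question – can be sketched as a small post-processing step. The function below is our own illustrative sketch, not JustAsk's actual code, and the candidate list is invented for the example: it drops any candidate whose content words are all already contained in the question.

```python
import re

def filter_question_variations(question, answers):
    """Drop candidate answers that are variations of entities already
    present in the question (illustrative rule, not JustAsk's code)."""
    q_tokens = set(re.findall(r"[a-z]+", question.lower()))
    kept = []
    for answer in answers:
        a_tokens = set(re.findall(r"[a-z]+", answer.lower()))
        # Reject the candidate if every token of it already occurs in the
        # question, e.g. 'John Famalaro' inside
        # 'Who is John J. Famalaro accused of having killed?'.
        if a_tokens and a_tokens.issubset(q_tokens):
            continue
        kept.append(answer)
    return kept

question = "Who is John J. Famalaro accused of having killed?"
candidates = ["John Famalaro", "Denise Huber"]
print(filter_question_variations(question, candidates))  # ['Denise Huber']
```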
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and paraphrase identification, in order to detect new possible ways to improve JustAsk's results with the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric in their own right;
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to somewhat improve these results.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning, 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations, 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering, 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems, 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML+Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering, 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Abstract
In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the web, since they effectively compact the information we are looking for into a single answer. To that effect, a new QA system called JustAsk was designed.
While JustAsk is already fully operational and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and clustering of answers, which can lead to a wrong answer being returned by the system.
This work deals with this problem and is composed of the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers of JustAsk; d) analyze errors and detect error patterns over the answer extraction stage; e) present and implement a solution to the problems detected, which will include normalization/integration of the potential answers as well as new clustering techniques; f) evaluate the system with the new features proposed.
Keywords
JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction
Index
1 Introduction
2 Related Work
2.1 Relating Answers
2.2 Paraphrase Identification
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
2.2.2 Methodologies purely based on lexical similarity
2.2.3 Canonicalization as complement to lexical features
2.2.4 Use of context to extract similar phrases
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
2.3 Overview
3 JustAsk architecture and work points
3.1 JustAsk
3.1.1 Question Interpretation
3.1.2 Passage Retrieval
3.1.3 Answer Extraction
3.2 Experimental Setup
3.2.1 Evaluation measures
3.2.2 Multisix corpus
3.2.3 TREC corpus
3.3 Baseline
3.3.1 Results with Multisix Corpus
3.3.2 Results with TREC Corpus
3.4 Error Analysis over the Answer Module
4 Relating Answers
4.1 Improving the Answer Extraction module: Relations Module Implementation
4.2 Additional Features
4.2.1 Normalizers
4.2.2 Proximity Measure
4.3 Results
4.4 Results Analysis
4.5 Building an Evaluation Corpus
5 Clustering Answers
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
5.1.2 Results
5.1.3 Results Analysis
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
5.2.2 Results
5.2.3 Results Analysis
5.3 Possible new features – Combination of distance metrics to enhance overall results
5.4 Posterior analysis
6 Conclusions and Future Work
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system
2 How the software to create the 'answers corpus' works
3 Execution of the application to create answers corpora in the command line
4 Proposed approach to automate the detection of relations between answers
5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
6 Execution of the application to create relations corpora in the command line
List of Tables
1 JustAsk baseline best results for the corpus test of 200 questions
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result
Acronyms
TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the explosion of the World Wide Web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer as the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
In sum, the goals of this work are:
– To survey the state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;
– To improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– To test new clustering measures;

– To create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then talk about the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it has a methodology of relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrases, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether from the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
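To make this taxonomy concrete, here is a minimal sketch (the names and rules are ours, not part of JustAsk) that covers only the easy, purely notational cases. The semantic cases – paraphrase, hypernymy, meronymy – would require resources such as WordNet, so inclusion knowledge is passed in explicitly here.

```python
from enum import Enum

class Relation(Enum):
    EQUIVALENCE = "equivalence"
    INCLUSION = "inclusion"
    ALTERNATIVE = "alternative"
    CONTRADICTORY = "contradictory"

def normalize(answer):
    """Crude notational normalization: lowercase, strip punctuation,
    drop single-letter tokens such as middle initials."""
    tokens = [t.strip(".,") for t in answer.lower().split()]
    return [t for t in tokens if len(t) > 1]

def classify(a, b, hypernyms=None):
    """Toy classifier for the easy cases: notational equivalence, plus
    inclusion via an explicitly supplied hypernym table (a real system
    would query WordNet). Everything else defaults to ALTERNATIVE."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return Relation.EQUIVALENCE
    hypernyms = hypernyms or {}
    if nb and " ".join(nb) in hypernyms.get(" ".join(na), set()):
        return Relation.INCLUSION
    return Relation.ALTERNATIVE

print(classify("John F. Kennedy", "John Kennedy"))       # Relation.EQUIVALENCE
print(classify("lion", "mammal", {"lion": {"mammal"}}))  # Relation.INCLUSION
```

Note that contradictions (e.g. the two river lengths above) cannot be detected by string rules alone; they need type-specific logic, such as comparing normalized numeric values.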
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like
'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com¹, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as expressing the same idea in two different text fragments, in a symmetric relationship in which each entails the other – meaning that pairs with potential addition of information, in which only one fragment entails the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus², from now on mentioned as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we will present an overview of some of the approaches and techniques for paraphrase identification related to natural language, specially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, from a team led by Vasileios Hatzivassiloglou, that allowed to cluster highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
¹ http://www.wordreference.com/definition/paraphrase
² The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, from which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
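Equation (1) can be turned into code directly. The sketch below is our reading of the formula, with whitespace tokenization assumed:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1, s2, max_n=4):
    """Word Simple N-gram Overlap (Equation 1): average, over n = 1..N,
    of the fraction of n-grams of the shorter sentence that also appear
    in the longer one."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = sorted((t1, t2), key=len)
    total = 0.0
    for n in range(1, max_n + 1):
        grams = ngrams(shorter, n)
        if not grams:  # shorter sentence has fewer than n words
            continue
        matches = sum(1 for g in grams if g in ngrams(longer, n))
        total += matches / len(grams)
    return total / max_n

print(round(ng_sim("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.608
```

Note that, as written, the formula always divides by N, so even two identical two-word sentences score only 0.5, since they have no trigrams or 4-grams to match.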
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of editions needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed to extract richer, but also less parallel, data.
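The Levenshtein metric mentioned above admits the classic dynamic-programming implementation, sketched here with a rolling row to keep memory linear in the length of one string:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```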
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the actual answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ / x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ)           if L(x, λ) < 1.0
                 { e^(−k·L(x, λ))    otherwise           (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments): as λ/x tends to 0, L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus³, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the test corpus.
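Equations (2) and (3) can be sketched as follows. Note that this is an approximation of the paper's metric: the exclusive lexical links are crudely approximated here by shared unique words (so stopwords count, unlike in the four-link example above), and k defaults to the 2.7 reported by the authors.

```python
import math

def log_sim(s1, s2, k=2.7):
    """LogSim sketch (Equations 2-3): L(x, lam) = -log2(lam / x), where x
    is the length of the longest sentence and lam the number of lexical
    links, approximated here as shared unique words."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))
    x = max(len(w1), len(w2))
    if links == 0:
        return 0.0  # no links at all: treat as fully dissimilar
    l = -math.log2(links / x)
    # first branch of Equation 3 for similar pairs, exponential
    # penalty (second branch) for dissimilar ones
    return l if l < 1.0 else math.exp(-k * l)
```

By design, two identical sentences score 0: the metric deliberately penalizes pairs that are too similar to be interesting paraphrases, which is the 'slow penalty' role of the logarithm mentioned above.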
³ This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of another.
2.2.3 Canonicalization as complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of the speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter to more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
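A couple of the canonicalization rules described (future-tense periphrases and date normalization) can be sketched with simple regular expressions. The patterns below are our own illustrations, not Zhang and Patrick's actual rules, and the output is lowercased for simplicity:

```python
import re

def canonicalize(sentence):
    """Map a few surface variations onto canonical forms, in the spirit
    of Zhang and Patrick's canonicalization (illustrative rules only)."""
    s = sentence.lower()
    # Future-tense periphrases -> 'will' ('plans to buy' -> 'will buy').
    s = re.sub(r"\b(plans|expects|intends) to\b", "will", s)
    # Spelled-out month dates -> a single canonical (ISO-like) form.
    months = {"january": "01", "february": "02", "march": "03", "april": "04",
              "may": "05", "june": "06", "july": "07", "august": "08",
              "september": "09", "october": "10", "november": "11",
              "december": "12"}
    def iso(m):
        return f"{m.group(3)}-{months[m.group(1)]}-{int(m.group(2)):02d}"
    s = re.sub(r"\b(" + "|".join(months) + r") (\d{1,2}),? (\d{4})", iso, s)
    return s

print(canonicalize("IBM plans to buy the firm on October 24, 1993"))
# ibm will buy the firm on 1993-10-24
```

After canonicalization, a plain lexical metric (Edit Distance, N-gram overlap) is applied to the canonical forms instead of the raw sentences.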
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max {
    s(S1, S2−1) − skip penalty,
    s(S1−1, S2) − skip penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
The results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods, such as SimFinder, on two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
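The local similarity (4) and the alignment recurrence (5) can be sketched together as follows. This is an illustrative reading of Barzilay and Elhadad's formulation, not their implementation; the penalty values are assumptions.

```python
import math
from functools import lru_cache

MISMATCH_PENALTY = 0.1  # assumed value
SKIP_PENALTY = 0.3      # assumed value

def cosine(s1, s2):
    """Cosine similarity between bag-of-words count vectors."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    num = sum(w1.count(w) * w2.count(w) for w in set(w1) & set(w2))
    den = math.sqrt(sum(c * c for c in map(w1.count, set(w1)))) * \
          math.sqrt(sum(c * c for c in map(w2.count, set(w2))))
    return num / den if den else 0.0

def align(par1, par2):
    """Weight of the optimal alignment between two lists of sentences."""
    def sim(i, j):
        return cosine(par1[i], par2[j]) - MISMATCH_PENALTY

    @lru_cache(maxsize=None)
    def s(i, j):
        if i < 0 or j < 0:
            return 0.0
        # the six cases of recurrence (5): skips, 1-1, 1-2 and 2-2 matches
        candidates = [s(i, j - 1) - SKIP_PENALTY,
                      s(i - 1, j) - SKIP_PENALTY,
                      s(i - 1, j - 1) + sim(i, j)]
        if j >= 1:
            candidates.append(s(i - 1, j - 2) + sim(i, j) + sim(i, j - 1))
        if i >= 1:
            candidates.append(s(i - 2, j - 1) + sim(i, j) + sim(i - 1, j))
        if i >= 1 and j >= 1:
            candidates.append(s(i - 2, j - 2) + sim(i, j - 1) + sim(i - 1, j))
        return max(candidates)

    return s(len(par1) - 1, len(par2) - 1)
```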
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer was extracted. This work was also included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) × ( (Σ_{w∈T1} maxSim(w, T2) × idf(w)) / (Σ_{w∈T1} idf(w)) + (Σ_{w∈T2} maxSim(w, T1) × idf(w)) / (Σ_{w∈T2} idf(w)) )    (6)

For each word w in segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), then weight it by its specificity via IDF, sum these weighted matches, and normalize by the total IDF mass of the segment. The same process is applied to T2, and in the end we take the average of the two directions. They offer eight different similarity measures to compute maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence counts collected over very large corpora. Given two words, their PMI-IR score indicates their degree of statistical dependence.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix has passages as rows and words as columns, with each cell containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes from the path length to the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk
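The segment similarity of Equation (6) can be sketched as below. This is a hedged illustration: maxSim is pluggable, and any of the eight word-to-word metrics above could be supplied; the trivial exact-match metric used here is purely an assumption for demonstration.

```python
import math

def idf(word, documents):
    """Inverse document frequency over a corpus given as lists of tokens."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing) if containing else 0.0

def exact_match(word, segment):
    """Stand-in for maxSim: 1.0 on exact word match, 0.0 otherwise."""
    return 1.0 if word in segment else 0.0

def segment_similarity(t1, t2, documents, maxsim=exact_match):
    """Equation (6): IDF-weighted, bidirectional average of best matches."""
    def directed(a, b):
        num = sum(maxsim(w, b) * idf(w, documents) for w in a)
        den = sum(idf(w, documents) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```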
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories. Comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and their similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| × |b|)    (7)

with W being a semantic similarity matrix containing information about the similarity level of word pairs – in other words, a number between 0 and 1 for each entry Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three approaches.
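Equation (7) can be sketched as follows, with a hypothetical word-similarity matrix W standing in for the WordNet-derived one used by Fernando and Stevenson.

```python
import math

def matrix_similarity(a, b, W):
    """sim(a, b) = (a . W . b) / (|a| * |b|).

    a, b: binary occurrence vectors; W[i][j]: similarity in [0, 1] between
    word i of the first sentence and word j of the second (assumed given).
    """
    num = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den if den else 0.0
```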
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are computed over weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
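The decision rule of Equation (8) can be sketched as below. The set-overlap scoring over dependency sets and the threshold value are simplifying assumptions; in the original, similarity and dissimilarity are weighted by dependency type, depth and word semantics.

```python
def paraphrase_score(deps1, deps2, long_penalty=14, max_extra=6):
    """Ratio of similarity to dissimilarity over dependency sets."""
    common = deps1 & deps2
    unpaired1, unpaired2 = deps1 - deps2, deps2 - deps1
    sim = len(common)
    diss = len(unpaired1) + len(unpaired2)
    # penalize a large one-way information surplus (no two-way entailment)
    if abs(len(unpaired1) - len(unpaired2)) > max_extra:
        diss += long_penalty
    return sim / diss if diss else float("inf")

def is_paraphrase(deps1, deps2, threshold=1.0):
    return paraphrase_score(deps1, deps2) > threshold
```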
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several other interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the paraphrase identification formula.
For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric-type questions), named entities (recognizing four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
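A date normalization of this kind can be sketched as below. This is an assumption-laden illustration of the behaviour described in the text, not JustAsk's actual normalizer; the recognized patterns are deliberately minimal.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}
MONTH_RE = "|".join(MONTHS)

def normalize_date(text):
    """Map 'March 21, 1961' and '21 March 1961' to 'D21 M3 Y1961'."""
    t = text.lower().strip()
    m = re.fullmatch(rf"({MONTH_RE})\s+(\d{{1,2}}),?\s+(\d{{4}})", t)
    if m:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.fullmatch(rf"(\d{{1,2}})\s+({MONTH_RE})\s+(\d{{4}})", t)
    if m:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return text  # leave unrecognized answers untouched
```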
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are above the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = (2 × Precision × Recall) / (Precision + Recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) × Σ_{i=1}^{N} (1 / rank_i)    (14)
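Equations (10)-(14) can be sketched as below. The encoding of judgements (True/False/None) and of unanswered ranked queries (rank 0) is an assumption for illustration.

```python
def evaluate(judgements):
    """judgements: True (correct), False (wrong) or None (no answer given)."""
    total = len(judgements)
    attempted = sum(1 for j in judgements if j is not None)
    correct = sum(1 for j in judgements if j is True)
    precision = correct / attempted if attempted else 0.0  # Eq. (10)
    recall = attempted / total if total else 0.0           # Eq. (11)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                  # Eq. (12)
    accuracy = correct / total if total else 0.0           # Eq. (13)
    return precision, recall, f1, accuracy

def mrr(first_correct_ranks):
    """Eq. (14); a rank of 0 marks a query with no correct answer returned."""
    return sum(1 / r for r in first_correct_ranks if r) / len(first_correct_ranks)
```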
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found over at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1 Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, compared with the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, which we filtered to keep only those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original referent after this filtering step), thus losing precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost referent: we simply do not know, for now, who 'he' or 'she' is. It makes us question how much we can improve something that has no clear referent, because the question itself references the previous one, and so forth.
The first and simplest solution was to filter out the questions containing unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original referent, 'Sundance Festival', is actually lost, along with vital historical context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each question from the TREC corpus possesses topic information that is often not explicit in the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.
We also questioned whether the big problem lies in the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
The results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
31
Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44%       50.3%      87.5%   0.37  35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages (8, 32 or 64) retrieved from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to select a correct final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
Extracted        Google - 8 passages        Google - 32 passages
answers       Questions  CAE   AS        Questions  CAE   AS
                 (#)    success success     (#)    success success
0                21       0     -           19       0     -
1 to 10          73      62    53           16      11    11
11 to 20         32      29    22           13       9     7
21 to 40         11      10     8           61      57    39
41 to 60          1       1     1           30      25    17
61 to 100         0       -     -           18      14    11
> 100             0       -     -            5       4     3
All             138     102    84          162     120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted      Google – 64 passages
answers (#)    Questions   CAE      AS
                          success  success
0                 17        0        0
1 to 10           18        9        9
11 to 20           2        2        2
21 to 40          16       12        8
41 to 60          42       38       25
61 to 100         38       32       18
> 100             35       30       17
All              168      123       79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.
        Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total    Q   CAE   AS       Q   CAE   AS            Q   CAE   AS
abb     4     3    1     1       3    1     1            4    1     1
des     9     6    4     4       6    4     4            6    4     4
ent    21    15    1     1      16    1     0           17    1     0
hum    64    46   42    34      55   48    36           56   48    31
loc    39    27   24    19      34   27    22           35   29    21
num    63    41   30    25      48   35    25           50   40    22
All   200   138  102    84     162  120    88          168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
             20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time or accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'
'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy') out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'Yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Overbayern Upper Bavaria'
'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but most generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module – Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
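The scoring scheme just described can be sketched as follows (a minimal illustration in Python, not the system's actual Java code; the weight values are the ones from our first experiment):

```python
# Sketch of the relation-based scoring described above (illustrative only).
# Each candidate answer carries a list of (relation_type, other_answer) pairs;
# frequency and equivalence count 1.0, inclusion counts 0.5.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def relation_score(relations):
    """Sum the weights of all relations a candidate answer participates in."""
    return sum(WEIGHTS.get(rtype, 0.0) for rtype, _ in relations)
```

An answer found twice and included in another answer would thus score 1.0 + 1.0 + 0.5.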
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
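A minimal sketch of such a conversion normalizer (in Python, for illustration; the conversion factors are standard, but the function name and the choice of imperial units as the canonical form are our own assumptions):

```python
# Illustrative unit-conversion normalizer: maps supported metric units to
# imperial ones, used here as the canonical form. Temperature needs an
# offset, so it is handled separately from the multiplicative conversions.
FACTORS = {
    "km": ("miles", 0.621371),
    "m": ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
}

def normalize_measure(value, unit):
    if unit == "celsius":
        return (value * 9.0 / 5.0 + 32.0, "fahrenheit")
    if unit in FACTORS:
        target, factor = FACTORS[unit]
        return (value * factor, target)
    return (value, unit)  # unknown units pass through unchanged
```

With this in place, '300 km' and '186 miles' can be compared on a common scale before clustering.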
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to those occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
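The title-stripping logic, with the 'beginning of the answer' restriction discussed above, might look like this (a sketch under our own naming; the actual normalizer lives in JustAsk's Java code):

```python
import re

# Titles are stripped only at the start of the answer, and only when
# something is left afterwards (so 'Queen' alone, e.g. the band, survives).
TITLES = ("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr")
_TITLE_RE = re.compile(r"^(%s)\.?\s+" % "|".join(TITLES))

def strip_title(answer):
    stripped = _TITLE_RE.sub("", answer, count=1)
    return stripped if stripped else answer
```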
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
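An equally small sketch of the acronym normalizer (the acronym table here is an illustrative assumption, not the system's full list; only the 'US'/'UN' expansions and the two categories come from the text):

```python
# Illustrative acronym expansion, restricted to the categories where it is
# considered safe to do so, as described above.
ACRONYMS = {
    "US": "United States",
    "U.S.": "United States",
    "UN": "United Nations",
}
SAFE_CATEGORIES = {"Location:Country", "Human:Group"}

def expand_acronym(answer, category):
    if category in SAFE_CATEGORIES:
        return ACRONYMS.get(answer, answer)
    return answer
```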
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
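The boost can be sketched as follows (character offsets via `str.find`; the 20-character threshold and 5-point boost are the values given above, the function name is ours):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Return the score boost for an answer close to the query keywords."""
    a_pos = passage.find(answer)
    if a_pos < 0:
        return 0
    # average character distance between the answer and each keyword found
    distances = [abs(passage.find(k) - a_pos) for k in keywords if k in passage]
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0
```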
4.3 Results
                   200 questions                 1210 questions
Before features    P 50.6%   R 87.0%   A 44.0%   P 28.0%   R 79.5%   A 22.2%
After features     P 51.1%   R 87.0%   A 44.5%   P 28.2%   R 79.8%   A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union holds seven distinct bigrams.
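Both coefficients over character bigrams can be written compactly (an illustrative sketch; note that the distinct-bigram sets of 'Kennedy' and 'Keneddy' share the five bigrams Ke, en, ne, ed and dy):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)      # eq. (15), over bigram sets

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))   # eq. (16)
```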
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.
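A compact implementation of both metrics (an illustrative sketch, reproducing Winkler's classic 'MARTHA'/'MARHTA' example; the matching window of max(|s1|,|s2|)/2 − 1 follows the standard definition):

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                      # find matching characters
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k, mismatched = 0, 0                            # count out-of-order matches
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                mismatched += 1
            k += 1
    t = mismatched / 2                              # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):                # common prefix, capped at 4
        if a != b:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)
```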
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
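A simplified Soundex sketch (it omits the official algorithm's special treatment of 'H' and 'W', but is enough to show the behaviour we are after; note it maps the misspelling pair from Section 3.4, 'Kotashaan'/'Khotashan', to the same code):

```python
# Letter-to-digit table of the classic Soundex code.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Simplified Soundex: keep first letter, encode the rest, pad to 4."""
    word = word.upper()
    out, prev = [], _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")   # unlisted letters (vowels, H, W) reset prev
        if digit and digit != prev:
            out.append(digit)
        prev = digit
    return (word[0] + "".join(out) + "000")[:4]
```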
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.
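All these metrics plug into the same single-link clustering step, which can be sketched as follows (a minimal sketch; a normalized Levenshtein distance stands in for whichever metric is being tested, and metric and threshold are parameters):

```python
def norm_levenshtein(a, b):
    """Levenshtein edit distance normalized by the longer string's length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def single_link(answers, dist, threshold):
    """Greedy single-link clustering: an answer joins (and merges) every
    cluster containing at least one member within the threshold."""
    clusters = []
    for a in answers:
        linked = [c for c in clusters if any(dist(a, b) <= threshold for b in c)]
        merged = [a]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters
```

With the 0.17 threshold, 'Glomma' and 'Gloma' (distance 1/6) end up in the same cluster, while an unrelated answer stays apart.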
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.
Metric / Threshold        0.17                      0.33                      0.5
Levenshtein        P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice coefficient   P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch   P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers. Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
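Our reading of equations (20)–(22) can be sketched as follows (the greedy pairing of leftover tokens by first letter is our interpretation of how firstLetterMatches is counted, not a transcription of the actual Java code):

```python
def jfk_distance(a1, a2):
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))            # eq. (21)
    rest1 = [w for w in t1 if w not in common]
    rest2 = [w for w in t2 if w not in common]
    used, matches = [False] * len(rest2), 0
    for w in rest1:                       # greedily pair leftover tokens
        for j, v in enumerate(rest2):     # whose first letters agree
            if not used[j] and v[0] == w[0]:
                used[j] = True
                matches += 1
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1.0 - common_score - fl_score                          # eq. (20)
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − (2/4)/2 ≈ 0.42, low enough to cluster the pair, unlike Levenshtein or Overlap.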
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions                 1210 questions
Before             P 51.1%   R 87.0%   A 44.5%   P 28.2%   R 79.8%   A 22.5%
After new metric   P 51.7%   R 87.0%   A 45.0%   P 29.2%   R 79.8%   A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts, which would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric, for the three main categories that can be easily evaluated – Human, Location and Numeric (for categories such as Description, which retrieve the Wikipedia abstract, answers are just marked as wrong/nil given the reference input) – are shown in Table 10.
Metric / Category    Human    Location   Numeric
Levenshtein          50.00     53.85      4.84
Overlap              42.42     41.03      4.84
Jaccard index         7.58      0.00      0.00
Dice coefficient      7.58      0.00      0.00
Jaro                  9.09      0.00      0.00
Jaro-Winkler          9.09      0.00      0.00
Needleman-Wunsch     42.42     46.15      4.84
LogSim               42.42     46.15      4.84
Soundex              42.42     46.15      3.23
Smith-Waterman        9.09      2.56      0.00
JFKDistance          54.55     53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category bolded, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single immutable answer; the correct answer changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when the system expects a number it can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only candidates between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as this, while also removing a contradictory reference or two, the results improved greatly: the system now answers 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
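Fixes a) and c) for numeric answers, and the filtering-stage fix, could be prototyped roughly as follows; all names and rules here are hypothetical sketches, not the JustAsk implementation:

```python
import re

# Hypothetical post-filters for two of the error patterns above:
# spelled-out small numbers polluting Numeric answers, and answers
# that merely echo the question.
SMALL_NUMBER_WORDS = {"one", "two", "three"}

def keep_numeric_candidate(candidate):
    token = candidate.strip().lower()
    if token in SMALL_NUMBER_WORDS:   # fix a): drop 'one'..'three' as words
        return False
    return re.search(r"\d", token) is not None  # fix c): require digits

def not_question_echo(answer, question):
    # filtering fix: reject answers whose words all occur in the question
    q_words = set(re.findall(r"\w+", question.lower()))
    a_words = set(re.findall(r"\w+", answer.lower()))
    return bool(a_words) and not a_words <= q_words
```

A real rule set would also need to handle inflections and partial-name variations, but the subset test already catches the 'John Famalaro' case above.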
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and paraphrase identification, in order to detect possible new ways of improving JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we had hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Abstract
In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the web, since they effectively compact the information we are looking for into a single answer. For that effect, a new QA system called JustAsk was designed.
While JustAsk is already fully operable and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and the clustering of answers, which can lead to a wrong answer being returned by the system.
This work deals with this problem and is composed of the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers in JustAsk; d) analyze errors and detect error patterns in the answer extraction stage; e) present and implement a solution to the problems detected, including normalization/integration of the potential answers as well as new clustering techniques; f) evaluate the system with the new features proposed.
Keywords
JustAsk; Question Answering; Relations between Answers; Paraphrase Identification; Clustering; Normalization; Corpus; Evaluation; Answer Extraction
Index
1 Introduction 12

2 Related Work 14
  2.1 Relating Answers 14
  2.2 Paraphrase Identification 15
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
    2.2.2 Methodologies purely based on lexical similarity 16
    2.2.3 Canonicalization as complement to lexical features 18
    2.2.4 Use of context to extract similar phrases 18
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
  2.3 Overview 22

3 JustAsk architecture and work points 24
  3.1 JustAsk 24
    3.1.1 Question Interpretation 24
    3.1.2 Passage Retrieval 25
    3.1.3 Answer Extraction 25
  3.2 Experimental Setup 26
    3.2.1 Evaluation measures 26
    3.2.2 Multisix corpus 27
    3.2.3 TREC corpus 30
  3.3 Baseline 31
    3.3.1 Results with Multisix Corpus 31
    3.3.2 Results with TREC Corpus 33
  3.4 Error Analysis over the Answer Module 34

4 Relating Answers 38
  4.1 Improving the Answer Extraction module: Relations Module Implementation 38
  4.2 Additional Features 41
    4.2.1 Normalizers 41
    4.2.2 Proximity Measure 41
  4.3 Results 42
  4.4 Results Analysis 42
  4.5 Building an Evaluation Corpus 42

5 Clustering Answers 44
  5.1 Testing Different Similarity Metrics 44
    5.1.1 Clusterer Metrics 44
    5.1.2 Results 46
    5.1.3 Results Analysis 47
  5.2 New Distance Metric – 'JFKDistance' 48
    5.2.1 Description 48
    5.2.2 Results 50
    5.2.3 Results Analysis 50
  5.3 Possible new features – Combination of distance metrics to enhance overall results 50
  5.4 Posterior analysis 51

6 Conclusions and Future Work 53
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations corpora in the command line 43
List of Tables
1 JustAsk baseline best results for the test corpus of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51
Acronyms
TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the explosion of the world wide web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should, after all, be the main purpose of a QA system.
This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often, in the most documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
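This redundancy-plus-normalization idea can be made concrete with a minimal sketch; the canonicalization rule below (lowercasing and dropping single-letter initials) is a toy assumption, not JustAsk's normalizer:

```python
from collections import Counter

def canonical(answer):
    # toy rule: lowercase and drop initials, so 'John F. Kennedy'
    # and 'John Kennedy' map to the same key
    words = [w.strip(".").lower() for w in answer.split()]
    return " ".join(w for w in words if len(w) > 1)

def pick_answer(candidates):
    # redundancy principle: the most frequent canonical form wins
    counts = Counter(canonical(c) for c in candidates)
    return counts.most_common(1)[0][0]
```

Without the canonicalization step, the two Kennedy variants would split their votes and a minority answer could win.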
In summary, the goals of this work are:
– to survey the state of the art on answer relations and paraphrase identification, taking into special account the approaches that can be most easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding/implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to capture relations of equivalence, from notational variants to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence, if we stretch the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – the answers are consistent and entail each other, whether as notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or as paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – the answers are consistent but differ in specificity, one entailing the other. They can be related by hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – the answers do not entail each other at all; they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – the answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' to the question 'How long is the Okavango River?' are contradictory.
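Given an external entailment test, this typology reduces to checking entailment in both directions; the `entails` and `consistent` predicates below are assumed to be supplied externally (e.g. by a textual-entailment system) and are not defined in this work:

```python
def relation(a, b, entails, consistent):
    """Classify the relation between two candidate answers following
    the Webber/Moriceau typology, given assumed external predicates."""
    if not consistent(a, b):
        return "contradictory"
    a_to_b, b_to_a = entails(a, b), entails(b, a)
    if a_to_b and b_to_a:
        return "equivalence"
    if a_to_b or b_to_a:
        return "inclusion"   # one answer is more specific than the other
    return "alternative"
```

The hard part, of course, is the entailment test itself; the function above only organizes its outcomes.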
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] is a tool for summarization, built by a team led by Vasileios Hatzivassiloglou, that allows the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguity: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
Already in its first version, the system showed better results than TF-IDF [37], the classic information retrieval method. Some improvements were made later on, by a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams two sentences share (usually with N ≤ 4) – in other words, how many different sequences of words can be found in both sentences. The following formula gives the measure of the simple word N-gram overlap:
NGsim(S1, S2) = (1/N) · Σ_{n=1}^{N} Count_match(n-gram) / Count(n-gram)    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) is the maximum number of n-grams in the shorter sentence.
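A direct reading of Equation 1 can be implemented as follows, under the assumptions of whitespace tokenization and of capping N at the shorter sentence's length (so that identical sentences score 1.0):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1, s2, max_n=4):
    t1, t2 = s1.lower().split(), s2.lower().split()
    short, long_ = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    n_max = min(max_n, len(short))
    if n_max == 0:
        return 0.0
    total = 0.0
    for n in range(1, n_max + 1):
        grams_short = ngrams(short, n)
        grams_long = set(ngrams(long_, n))
        # fraction of the shorter sentence's n-grams found in the longer one
        total += sum(g in grams_long for g in grams_short) / len(grams_short)
    return total / n_max
```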
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function combining precision and recall to measure how well an algorithm can align words in a corpus, described in [31] – showed that the two strategies performed similarly, each with its own strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk currently possesses, so this is a good starting point to explore other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ/x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:
logSim(S1, S2) = L(x, λ)           if L(x, λ) < 1.0
                 e^(−k·L(x, λ))    otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), since as λ/x tends to 0, L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
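Equations 2 and 3 can be sketched as below; as a simplifying assumption, links are counted as shared lowercase word types, which is looser than the exclusive-link counting used by Cordeiro's team (on the example pair above it finds six links, not four):

```python
import math

def logsim(s1, s2, k=2.7):
    # k defaults to the 2.7 reported in the authors' experiments
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))   # lambda in Equation 2 (simplified)
    x = max(len(w1), len(w2))        # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)
    # Equation 3: use L directly when small, decay exponentially otherwise
    return l if l < 1.0 else math.exp(-k * l)
```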
2.2.3 Canonicalization as complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., nouns produced from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
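The future-tense part of this canonicalization can be sketched with a few rewrite rules. This is only an illustration in the spirit of Zhang and Patrick's approach, not their implementation; the pattern list and function name are assumptions.

```python
import re

# Hypothetical rewrite rules: future-tense constructions map to 'will'.
FUTURE_PATTERNS = [r"\bplans to\b", r"\bexpects to\b",
                   r"\bis going to\b", r"\bintends to\b"]

def canonicalize_future(sentence: str) -> str:
    """Map future-tense constructions to the canonical word 'will'."""
    for pattern in FUTURE_PATTERNS:
        sentence = re.sub(pattern, "will", sentence)
    return sentence

print(canonicalize_future("The company plans to open a factory"))
# -> "The company will open a factory"
```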
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty   (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max { s(S1, S2−1) − skip_penalty,
                  s(S1−1, S2) − skip_penalty,
                  s(S1−1, S2−1) + sim(S1, S2),
                  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
                  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
                  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2) }   (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
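The recurrence in Equation 5 can be sketched over two paragraphs (lists of sentences) as follows. This is an illustration only: sim() is reduced to a toy word-overlap cosine, and the penalty values are assumed, not taken from the paper.

```python
import math

MISMATCH_PENALTY = 0.1   # assumed value
SKIP_PENALTY = 0.05      # assumed value

def sim(s1: str, s2: str) -> float:
    """Toy local similarity (Equation 4): word-overlap cosine minus penalty."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    cos = len(w1 & w2) / math.sqrt(len(w1) * len(w2))
    return cos - MISMATCH_PENALTY

def align(par1, par2) -> float:
    """Weight of the optimal alignment between two paragraphs (Equation 5)."""
    n, m = len(par1), len(par2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - SKIP_PENALTY,
                     s[i - 1][j] - SKIP_PENALTY,
                     s[i - 1][j - 1] + sim(par1[i - 1], par2[j - 1])]
            if j >= 2:
                cands.append(s[i - 1][j - 2] + sim(par1[i - 1], par2[j - 1])
                             + sim(par1[i - 1], par2[j - 2]))
            if i >= 2:
                cands.append(s[i - 2][j - 1] + sim(par1[i - 1], par2[j - 1])
                             + sim(par1[i - 2], par2[j - 1]))
            if i >= 2 and j >= 2:
                cands.append(s[i - 2][j - 2] + sim(par1[i - 1], par2[j - 2])
                             + sim(par1[i - 2], par2[j - 1]))
            s[i][j] = max(cands)
    return s[n][m]
```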
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w)
              + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that
matches it the most in T2, given by maxSim(w, T2), weight it by its specificity via IDF, sum these weighted values, and normalize by the total IDF of the segment in question. The same process is applied to T2, and in the end we take the average between the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin – takes the Resnik measure and normalizes it using the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
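The segment similarity of Equation 6 can be sketched as follows, with maxSim left pluggable. The stand-in exact-match function and the idf values below are illustrative only; in the paper, maxSim would be one of the eight metrics just described and idf would come from the British National Corpus.

```python
def text_similarity(t1: str, t2: str, idf: dict, maxsim) -> float:
    """Mihalcea et al.'s segment similarity (Equation 6).
    idf: word -> IDF weight (default 1.0 for unseen words);
    maxsim: function (word, list_of_words) -> best word-to-word similarity."""
    def directed(src, dst):
        num = sum(maxsim(w, dst) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0
    w1, w2 = t1.lower().split(), t2.lower().split()
    return 0.5 * (directed(w1, w2) + directed(w2, w1))

# Toy maxsim: exact match only (a stand-in for PMI-IR, LSA or WordNet metrics).
exact = lambda w, segment: 1.0 if w in segment else 0.0
idf = {"mogadishu": 3.0, "capital": 1.5}   # made-up weights
score = text_similarity("Mogadishu is the capital",
                        "the capital is Mogadishu", idf, exact)
```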
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and the similarity is calculated using the following formula:

sim(~a, ~b) = (~a · W · ~b) / (|~a| |~b|)   (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five also with a better precision rating than those three approaches.
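Equation 7 can be computed with a few lines over a small vocabulary. The W entries below are made up for illustration; in the paper they would come from one of the six WordNet metrics.

```python
import math

def matrix_similarity(a, b, W) -> float:
    """Fernando and Stevenson's measure (Equation 7): (a · W · b) / (|a||b|),
    with a, b binary occurrence vectors and W a word similarity matrix."""
    n = len(a)
    num = sum(a[i] * W[i][j] * b[j] for i in range(n) for j in range(n))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / norm

# Toy vocabulary of 3 words; off-diagonal values are made up.
W = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
a = [1.0, 0.0, 1.0]   # sentence 1 uses words 0 and 2
b = [0.0, 1.0, 1.0]   # sentence 2 uses words 1 and 2
score = matrix_similarity(a, b, W)
```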
More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exact synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
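A rough sketch of this ratio over dependency sets follows. It is a simplification under stated assumptions: dependencies are unweighted triples (the authors weight by dependency type and tree depth), the dependency triples in the test are hypothetical, and the constants mirror the ones mentioned above.

```python
def paraphrase_score(deps1: set, deps2: set,
                     extra_penalty: float = 14, extra_threshold: int = 6) -> float:
    """Sketch of Lintean and Rus's ratio: Sim / Diss over dependency sets
    (Equation 8), with uniform weights instead of type/depth weighting."""
    common = deps1 & deps2
    unpaired1, unpaired2 = deps1 - deps2, deps2 - deps1
    sim = len(common) + 1e-9            # epsilon avoids division by zero
    diss = len(unpaired1) + len(unpaired2) + 1e-9
    longer = max(unpaired1, unpaired2, key=len)
    shorter = min(unpaired1, unpaired2, key=len)
    if len(longer) - len(shorter) > extra_threshold:
        diss += extra_penalty           # the constant the authors add
    return sim / diss

def is_paraphrase(deps1: set, deps2: set, threshold: float = 1.0) -> bool:
    return paraphrase_score(deps1, deps2) > threshold
```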
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that represents the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, with Wikipedia's search facilities locating the title of the article where the answer might be found, and DBpedia retrieving the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
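This kind of date normalization can be sketched as follows. This is a minimal illustration of the idea, not JustAsk's actual normalizer: the two patterns below cover only the two formats in the example.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer: str) -> str:
    """Map 'March 21, 1961' and '21 March 1961' to the same canonical form."""
    m = re.match(r"(\w+) (\d{1,2}),? (\d{4})$", answer)   # 'March 21, 1961'
    if m and m.group(1).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(2), MONTHS[m.group(1).lower()], m.group(3))
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", answer)     # '21 March 1961'
    if m and m.group(2).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(1), MONTHS[m.group(2).lower()], m.group(3))
    return answer   # leave non-matching answers untouched
```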
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
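The overlap distance of Equation 9 and the single-link merging loop can be sketched together as follows. This is a naive illustration of the algorithm described above, not JustAsk's implementation; the answer strings are made up.

```python
def overlap_distance(x: str, y: str) -> float:
    """Equation 9: 1 - |X ∩ Y| / min(|X|, |Y|) over answer tokens."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clusters(answers, dist, threshold: float = 0.33):
    """Naive single-link agglomerative clustering: start with one cluster per
    answer; repeatedly merge the two closest clusters (minimum pairwise
    distance) while that distance is below the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

answers = ["Mogadishu", "the city of Mogadishu", "Washington"]
clusters = single_link_clusters(answers, overlap_distance)
```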
3.2 Experimental Setup
In this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

    F = (2 × Precision × Recall) / (Precision + Recall)   (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus   (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) × Σ_{i=1}^{N} (1/rank_i)   (14)
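Equations 10 through 14 can be computed together as sketched below. The representation of the results is an assumption made for this illustration: one entry per attempted question, holding the rank of the first correct answer or None when every returned answer was wrong.

```python
def evaluate(results, total_questions):
    """Sketch of Equations 10-14. 'results' has one entry per *attempted*
    question: the 1-based rank of the first correct answer, or None."""
    attempted = len(results)
    correct = sum(1 for r in results if r is not None)
    precision = correct / attempted                       # Eq. 10
    recall = attempted / total_questions                  # Eq. 11
    f1 = 2 * precision * recall / (precision + recall)    # Eq. 12
    accuracy = correct / total_questions                  # Eq. 13
    # Eq. 14: a question with no correct answer contributes 0 to the sum.
    mrr = sum(1.0 / r for r in results if r is not None) / total_questions
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mrr": mrr}
```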
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo) – passages which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that a Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu
Title=Mogadishu - Wikipedia, the free encyclopedia
Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...
Rank=1
Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding one (or more) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest city in Somalia and the nation's capital...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, compared to the completely manual approach we used before writing the application. In its current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
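The interactive loop behind Figures 2 and 3 can be sketched as follows. This is a simplification under assumed names (function, prompt): the actual application also strips the other attributes and writes status output to a file.

```python
def collect_references(questions, snippets_per_question):
    """Interactive sketch: for each question, show each snippet's 'Content'
    field and accept typed input as a reference, but only if it actually
    occurs in the snippet (an empty line skips the snippet)."""
    references = []                     # one list of references per question
    for question, snippets in zip(questions, snippets_per_question):
        refs = []
        for content in snippets:
            print("Question:", question)
            print("Content:", content)
            candidate = input("reference> ").strip()
            if candidate and candidate in content:
                refs.append(candidate)
        references.append(refs)
    return references
```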
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
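This pronoun filter can be sketched in a few lines. The pronoun list is the one above; the function name and tokenization are assumptions made for this illustration.

```python
import re

# Pronouns that signal an unresolved reference to a previous question.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_dangling_reference(question: str) -> bool:
    """True if the question contains a third-party pronoun reference."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who wrote Hamlet?"]
kept = [q for q in questions if not has_dangling_reference(q)]
```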
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, doing so correctly 50.3% of the time. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
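The relationship between the figures in Table 1 can be reproduced from the raw counts; a minimal sketch in Python (the mapping of recall to attempted/total is inferred from the reported numbers, not stated explicitly in the text):

```python
# Hypothetical helper reproducing the Table 1 figures from raw counts.
# Assumption: precision = correct/attempted, recall = attempted/total,
# accuracy = correct/total, which matches the reported numbers.
def qa_metrics(total, attempted, correct):
    precision = correct / attempted   # of the questions attempted, how many correct
    recall = attempted / total        # how many questions were attempted at all
    accuracy = correct / total        # overall fraction answered correctly
    return precision, recall, accuracy

p, r, a = qa_metrics(total=200, attempted=175, correct=88)
# p ≈ 0.503, r = 0.875, a = 0.44 – matching Table 1
```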
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance, with a threshold of 0.33, to calculate the distance between candidate answers. For the different amounts of retrieved passages from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one final correct answer is extracted from the list of candidate answers. The tables also present the distribution of the number of questions by amount of extracted answers.
                    Google – 8 passages         Google – 32 passages
Extracted answers   Questions   CAE    AS       Questions   CAE    AS
0                   21          0      –        19          0      –
1 to 10             73          62     53       16          11     11
11 to 20            32          29     22       13          9      7
21 to 40            11          10     8        61          57     39
41 to 60            1           1      1        30          25     17
61 to 100           0           –      –        18          14     11
> 100               0           –      –        5           4      3
All                 138         102    84       162         120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages (CAE and AS columns give the number of successes at each stage).
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted; when that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no
                    Google – 64 passages
Extracted answers   Questions   CAE    AS
0                   17          0      0
1 to 10             18          9      9
11 to 20            2           2      2
21 to 40            16          12     8
41 to 60            42          38     25
61 to 100           38          32     18
> 100               35          30     17
All                 168         123    79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
            Google – 8 passages    Google – 32 passages    Google – 64 passages
      Total   Q     CAE   AS        Q     CAE   AS          Q     CAE   AS
abb   4       3     1     1         3     1     1           4     1     1
des   9       6     4     4         6     4     4           6     4     4
ent   21      15    1     1         16    1     0           17    1     0
hum   64      46    42    34        55    48    36          56    48    31
loc   39      27    24    19        34    27    22          35    29    21
num   63      41    30    25        48    35    25          50    40    22
All   200     138   102   84        162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times corpus. This represents an increase of approximately 32.2% (57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions created from the TREC questions and the New York Times corpus.
Later we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct one. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer, within a certain predefined threshold, to the keywords used in the search.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. A few examples:

– 'Queen Elizabeth' vs 'Elizabeth II'
– 'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
– 'River Glomma' vs 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
– 'Calvin Broadus' vs 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between the two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link ('Kennedy') out of three words per answer that we have here: LogSim gives approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens differ, should we reject the pair? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, the answers share a relation of inclusion. Examples:

– 'Mt Kenya' vs 'Mount Kenya National Park'
– 'Bavaria' vs 'Overbayern Upper Bavaria'
– '2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either answer correct, with Ferrara included in Italy (being the more specific answer); but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, or pounds vs kilograms, etc.

Solution: create a normalizer that acts as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, a few remain unresolved, such as:

– 'Kotashaan' vs 'Khotashan'
– 'Grozny' vs 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such
as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?
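As an illustration, a standard American Soundex (a generic implementation, not JustAsk's actual code) does map the misspelled pair above to the same key:

```python
# Generic American Soundex sketch: keep the first letter, encode the rest
# into digit classes, drop vowels, collapse runs, pad to four characters.
def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    first, digits = name[0].upper(), []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w do not break a run of equal codes
            continue
        code = codes.get(ch, "")       # vowels get "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

With this, 'Kotashaan' and 'Khotashan' both map to the key K325 and can be matched even though their edit distance is above the clustering threshold.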
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction Module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1 between candidate answers.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and uses them for a (potentially) better clustering, by boosting the score of candidates sharing certain relations. This also addresses point 3 of Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: a Baseline Scorer, to count the frequency, and a Rules Scorer, to count relations of equivalence and inclusion.
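The scoring idea in Figure 5 can be sketched as follows; this is an illustrative Python rendering of the concept (the actual module is a Java interface), using the weights from the first experiment described above (frequency and equivalence 1.0, inclusion 0.5):

```python
# Illustrative sketch (not the actual Scorer interface): boost a candidate's
# score from the relations it shares with other candidates.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def score_candidate(relations):
    """relations: list of (relation_type, other_answer) pairs for one candidate."""
    score = WEIGHTS["frequency"]          # the candidate's own occurrence
    for rel_type, _other in relations:
        score += WEIGHTS.get(rel_type, 0.0)
    return score

# An answer with one equivalent and one included answer scores 1.0 + 1.0 + 0.5
s = score_candidate([("equivalence", "Ferrara, Italy"), ("inclusion", "Italy")])
```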
4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created that deals with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, and is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
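A sketch of such a conversion normalizer (the conversion factors are standard; the table layout and function names here are illustrative, not the actual class):

```python
# Illustrative conversion table for the measure types mentioned above.
CONVERSIONS = {
    ("km", "miles"): lambda v: v * 0.621371,               # Numeric:Distance
    ("m", "feet"): lambda v: v * 3.28084,                  # Numeric:Size
    ("kg", "pounds"): lambda v: v * 2.20462,               # Numeric:Weight
    ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,   # Numeric:Temperature
}

def normalize_measure(value, from_unit, to_unit):
    """Convert when a rule exists; otherwise return the value unchanged."""
    rule = CONVERSIONS.get((from_unit, to_unit))
    return rule(value) if rule else value
```

With this in place, the '300 km' candidate from Section 3.4 can be compared directly against answers expressed in miles.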
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it allows room for further extensions.
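Taken together, the three normalizers above can be sketched as one function (a hypothetical Python rendering; the title list and the two acronym expansions come from the text, everything else is illustrative):

```python
import re

TITLES = ["Mrs", "Mr", "Ms", "King", "Queen", "Dame", "Sir", "Dr"]
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_answer(answer):
    answer = ACRONYMS.get(answer.strip(), answer)      # expand known acronyms
    # Strip a leading title only when more text follows, so 'Queen'
    # (the band) survives while 'Queen Elizabeth' loses its title.
    pattern = r"^(?:" + "|".join(TITLES) + r")\.?\s+"
    answer = re.sub(pattern, "", answer)
    return answer.lower()                              # 'Grozny' == 'GROZNY'
```

Note that the 'Mr T' pitfall from the text still applies: a real implementation would need an exception list on top of this sketch.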
4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
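The idea can be sketched as follows (a hypothetical simplification using first occurrences and character offsets; the threshold and boost values are the ones from the text):

```python
def proximity_boost(passage, answer, keywords, max_avg_dist=20, boost=5):
    """Boost an answer whose average character distance to the query
    keywords within the passage falls under the threshold."""
    a_pos = passage.find(answer)
    if a_pos < 0:
        return 0                               # answer not in this passage
    dists = [abs(passage.find(k) - a_pos) for k in keywords if k in passage]
    if not dists:
        return 0                               # no keyword present
    return boost if sum(dists) / len(dists) <= max_avg_dist else 0
```

For the passage 'The capital of Iraq is Baghdad', the answer 'Baghdad' sits 19 and 8 characters from the keywords 'capital' and 'Iraq' (average 13.5), so its score is boosted.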
4.3 Results

                  200 questions                 1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%     P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear; for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, iterates over the references in pairs, and asks the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (later expanded into 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.

If the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora on the command line.
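The annotation loop just described can be sketched as below (a hypothetical Python rendering of the application; the real tool's I/O details may differ):

```python
from itertools import combinations

# Option codes and output labels taken from the text; 'skip' skips a pair.
OPTIONS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
           "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
           "D": "IN DOUBT"}

def annotate(questions, ask=input, out=print):
    """questions: list of (question_id, [references]) pairs."""
    for qid, refs in questions:
        if len(refs) < 2:
            continue                      # single reference: advance to next question
        for a, b in combinations(refs, 2):
            choice = None
            while choice not in OPTIONS and choice != "skip":
                choice = ask(f"Q{qid}: '{a}' vs '{b}' (E/I1/I2/A/C/D/skip)? ")
            if choice != "skip":
                out(f"QUESTION {qid} {OPTIONS[choice]} -{a} AND {b}")
```

Feeding it question 5 with the references 'Lee Myung-bak' and 'Lee', and answering 'E', emits exactly the example line shown above.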
In the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]: {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result is therefore (2 × 5)/(6 + 6) = 10/12 = 5/6; with Jaccard we would have 5/7 (five shared bigrams out of seven distinct ones).
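Both coefficients are easy to check on the bigram example with a few lines of generic code:

```python
# Generic bigram-based Dice and Jaccard similarities (illustrative code).
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```

For 'Kennedy' vs 'Keneddy' this gives Dice = 5/6 and Jaccard = 5/7, confirming the worked example.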
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance does. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
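Of the dynamic-programming metrics above, Needleman-Wunsch is the simplest to sketch. In this generic rendering the diagonal step is charged G only on a mismatch (as the 'variable cost instead of Levenshtein's constant 1' description suggests), so G = 1 reduces exactly to the Levenshtein distance:

```python
# Generic edit distance with a configurable gap/mismatch cost G.
def needleman_wunsch(s, t, G=1.0):
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * G                            # gap column
    for j in range(1, n + 1):
        D[0][j] = j * G                            # gap row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,   # substitution / match
                          D[i - 1][j] + G,         # deletion
                          D[i][j - 1] + G)         # insertion
    return D[m][n]
```

With G = 1 this returns the familiar Levenshtein value of 3 for 'kitten' vs 'sitting'; halving G halves every cost along the optimal path.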
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
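Single-link clustering at a fixed threshold is equivalent to taking the connected components of the graph whose edges are the answer pairs within the threshold; a small generic sketch of that idea (not JustAsk's clusterer):

```python
# Union-find over all pairs with dist <= threshold: single-link at a cut.
def single_link_clusters(answers, dist, threshold):
    parent = list(range(len(answers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if dist(answers[i], answers[j]) <= threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    groups = {}
    for i, a in enumerate(answers):
        groups.setdefault(find(i), []).append(a)
    return list(groups.values())

# Toy metric: case-insensitive exact match has distance 0, anything else 1.
clusters = single_link_clusters(
    ["Grozny", "GROZNY", "Paris"],
    lambda a, b: 0.0 if a.lower() == b.lower() else 1.0,
    threshold=0.17)
```

Plugging in any of the metrics above (and any of the three thresholds) changes only the `dist` argument, which is what makes this comparison straightforward to run.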
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                       0.33                       0.5
Levenshtein          P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index        P 9.8   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5      P 9.9   R 55.5  A 5.5
Dice Coefficient     P 9.8   R 56.0  A 5.5      P 9.9   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5
Jaro                 P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Jaro-Winkler         P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Needleman-Wunsch     P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A 8.0      P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, additionally giving extra weight to strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
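Equations (20)–(22) can be read almost directly as code; the sketch below is an illustrative Python rendering (the actual metric lives in JustAsk's Java code, and token normalization is assumed to have already stripped punctuation):

```python
# Illustrative rendering of equations (20)-(22); not the actual implementation.
def jfk_distance(a1, a2):
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))            # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter_matches = sum(1 for x, y in zip(rest1, rest2)     # leftover tokens,
                               if x[0].lower() == y[0].lower())    # compared in order
    non_common = max(len(rest1), len(rest2))
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_score - fl_score                             # eq. (20)
```

On the running example, jfk_distance('J F Kennedy', 'John Fitzgerald Kennedy') evaluates to 1 − 1/3 − 1/2 = 1/6, low enough to finally cluster the pair.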
A few flaws in this measure were identified later on, namely its scope within our typology of answers: if, for example, we applied this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.
                   200 questions                 1210 questions
Before             P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%     P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features shared with the rules and with certain normalizers, this metric actually improves our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
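These decision rules amount to a dispatch table from question category to metric. A hedged sketch follows, with stand-in metric implementations (the `overlap_sim` function here is a toy simplification, not the thesis's LogSim):

```python
# Sketch of per-category metric selection; metric bodies are illustrative
# stand-ins, not JustAsk's actual implementations.

def levenshtein_sim(a: str, b: str) -> float:
    """Classic edit distance, turned into a 0..1 similarity."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b), 1)

def overlap_sim(a: str, b: str) -> float:
    """Toy word-overlap similarity, standing in for LogSim here."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(min(len(wa), len(wb)), 1)

# the decision rules from the text, as a dispatch table
METRIC_BY_CATEGORY = {
    "Human": levenshtein_sim,
    "Location": overlap_sim,
}

def similarity(category: str, a: str, b: str) -> float:
    metric = METRIC_BY_CATEGORY.get(category, levenshtein_sim)
    return metric(a, b)
```

A real version would plug in the actual metric objects and thresholds tuned per category.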
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric, broken down by the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric             Human    Location   Numeric
Levenshtein        50.00    53.85      4.84
Overlap            42.42    41.03      4.84
Jaccard index       7.58     0.00      0.00
Dice                7.58     0.00      0.00
Jaro                9.09     0.00      0.00
Jaro-Winkler        9.09     0.00      0.00
Needleman-Wunsch   42.42    46.15      4.84
LogSim             42.42    46.15      4.84
Soundex            42.42    46.15      3.23
Smith-Waterman      9.09     2.56      0.00
JFKDistance        54.55    53.85      4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we separate the multiple answers as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
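The third pattern above (final answers that merely echo part of the question) lends itself to a simple rule. A hypothetical sketch, with illustrative names and a deliberately naive tokenizer:

```python
# Hypothetical filter for the 'answer inside the question' error pattern:
# drop candidate answers whose terms all already appear in the question.

def filter_question_echoes(question: str, candidates: list[str]) -> list[str]:
    q_terms = {t.strip("?.,").lower() for t in question.split()}
    kept = []
    for answer in candidates:
        a_terms = {t.strip("?.,").lower() for t in answer.split()}
        if not a_terms <= q_terms:  # keep only answers adding new terms
            kept.append(answer)
    return kept

print(filter_question_echoes(
    "Who is John J. Famalaro accused of having killed?",
    ["John Famalaro", "Denise Huber"]))
```

Here 'John Famalaro' is discarded because every one of its terms already occurs in the question, while the other candidate survives.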
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system. Here are the main contributions achieved:
– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico (ISSN 1870-4069), 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Keywords
JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction
Index
1 Introduction
2 Related Work
  2.1 Relating Answers
  2.2 Paraphrase Identification
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
    2.2.2 Methodologies purely based on lexical similarity
    2.2.3 Canonicalization as complement to lexical features
    2.2.4 Use of context to extract similar phrases
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures
  2.3 Overview
3 JustAsk architecture and work points
  3.1 JustAsk
    3.1.1 Question Interpretation
    3.1.2 Passage Retrieval
    3.1.3 Answer Extraction
  3.2 Experimental Setup
    3.2.1 Evaluation measures
    3.2.2 Multisix corpus
    3.2.3 TREC corpus
  3.3 Baseline
    3.3.1 Results with Multisix Corpus
    3.3.2 Results with TREC Corpus
  3.4 Error Analysis over the Answer Module
4 Relating Answers
  4.1 Improving the Answer Extraction module: Relations Module Implementation
  4.2 Additional Features
    4.2.1 Normalizers
    4.2.2 Proximity Measure
  4.3 Results
  4.4 Results Analysis
  4.5 Building an Evaluation Corpus
5 Clustering Answers
  5.1 Testing Different Similarity Metrics
    5.1.1 Clusterer Metrics
    5.1.2 Results
    5.1.3 Results Analysis
  5.2 New Distance Metric – 'JFKDistance'
    5.2.1 Description
    5.2.2 Results
    5.2.3 Results Analysis
  5.3 Possible new features – Combination of distance metrics to enhance overall results
  5.4 Posterior analysis
6 Conclusions and Future Work
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system
2 How the software to create the 'answers corpus' works
3 Execution of the application to create answers corpora in the command line
4 Proposed approach to automate the detection of relations between answers
5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
6 Execution of the application to create relations corpora in the command line
List of Tables
1 JustAsk baseline best results for the corpus test of 200 questions
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5 Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
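The redundancy principle amounts to frequency counting over normalized answer strings. A minimal illustration (not JustAsk's code, and with a deliberately toy normalizer):

```python
# Minimal sketch of redundancy-based answer selection: variants of the
# same answer are normalized so that they count as one. The normalizer
# here is a toy; JustAsk's real normalizers are type-specific.
from collections import Counter

def normalize(answer: str) -> str:
    # lowercase, drop punctuation, collapse whitespace
    cleaned = "".join(c for c in answer.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def most_frequent_answer(candidates):
    groups = Counter(normalize(a) for a in candidates)
    return groups.most_common(1)[0][0]

print(most_frequent_answer(
    ["John F. Kennedy", "john f kennedy", "Lee Harvey Oswald"]))
```

The two surface variants of 'John F. Kennedy' collapse into one group, which then outvotes the competing candidate.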
In summary, the goals of this work are:
– to perform a state of the art review on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;
– to improve the Answer Extraction module, by finding/implementing proper methodologies and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– to test new clustering measures;
– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it has a methodology of relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether from the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
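For illustration, the four relation types could be represented as follows. Only the most trivial equivalence check (case-insensitive equality) is sketched; real classification, which needs normalizers, taxonomies and inference, is stubbed:

```python
# Sketch of the four answer-relation types from Webber et al. [44];
# relate() covers only the most trivial equivalence, the rest is stubbed.
from enum import Enum

class AnswerRelation(Enum):
    EQUIVALENCE = "equivalence"      # mutual entailment
    INCLUSION = "inclusion"          # one answer entails the other
    ALTERNATIVE = "alternative"      # complementary valid answers
    CONTRADICTORY = "contradictory"  # conjunction is invalid

def relate(a1: str, a2: str):
    if a1.casefold() == a2.casefold():
        return AnswerRelation.EQUIVALENCE
    return None  # unknown: would need normalizers, ontologies, inference
```

A full relations module would return INCLUSION for pairs like 'Milan'/'Italy' by consulting a gazetteer or WordNet-style hierarchy.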
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:
ndash (noun) rewording for the purpose of clarification
ndash (verb) express the same message in different words
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea in a symmetric relationship, in which each entails the other – meaning that potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we present an overview of some approaches and techniques to paraphrase identification related to natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many shared sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1..N} [ Count_match(n-gram) / Count(n-gram) ]    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed extracting richer but also less parallel data.
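Formula (1) can be sketched as follows, assuming whitespace tokenization and N = 4; as a small deviation, we average only over the n-gram sizes that actually apply, so sentences shorter than N words are not unduly punished:

```python
# Sketch of the Word Simple N-gram Overlap measure (formula (1)),
# assuming whitespace tokenization and N = 4.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1: str, s2: str, max_n: int = 4) -> float:
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter = t1 if len(t1) <= len(t2) else t2
    total, used = 0.0, 0
    for n in range(1, max_n + 1):
        denom = len(ngrams(shorter, n))  # max n-grams in the shorter sentence
        if denom == 0:
            continue  # sentence has fewer than n words
        matches = len(set(ngrams(t1, n)) & set(ngrams(t2, n)))
        total += matches / denom
        used += 1
    return total / used if used else 0.0
```

Identical sentences score 1.0 and fully disjoint ones score 0.0, matching the intended range of the measure.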
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module at JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ / x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for pairs of sentences that are way too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ)           if L(x, λ) < 1.0
                 { e^(−k·L(x, λ))    otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
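A minimal sketch of two of these canonicalizations: date normalization into a 'D21 M3 Y1961'-style form like JustAsk's, and a Zhang & Patrick-style future-tense mapping. The function names and the pattern lists are illustrative assumptions, not the actual implementations:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(text: str) -> str:
    """Map 'March 21, 1961' and '21 March 1961' to one canonical form."""
    text = text.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})", text)      # March 21 1961
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})", text)      # 21 March 1961
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return text

def canonicalize_future(sentence: str) -> str:
    """Map future-tense constructs like 'plans to' into 'will'
    (illustrative pattern list)."""
    return re.sub(r"\b(plans to|expects to|intends to)\b", "will",
                  sentence, flags=re.IGNORECASE)

print(normalize_date("March 21, 1961"))   # same canonical form as below
print(normalize_date("21 March 1961"))
```

Once both surface forms collapse to the same string, a plain equality test suffices to detect the equivalence.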
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the easy paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch penalty  (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two paragraphs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max of:

  s(S1, S2−1) − skip penalty
  s(S1−1, S2) − skip penalty
  s(S1−1, S2−1) + sim(S1, S2)
  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1)
  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2)
  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)  (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
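A minimal sketch of this local alignment (equations 4 and 5). The names `cosine` and `align` and the penalty values are our own illustrative assumptions (Barzilay and Elhadad tune theirs on data), and the table borders are simplified to zero:

```python
import math

def cosine(s1: str, s2: str) -> float:
    """Plain bag-of-words cosine similarity between two sentences."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    vocab = set(w1) | set(w2)
    v1 = [w1.count(w) for w in vocab]
    v2 = [w2.count(w) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def align(par1, par2, mismatch=0.15, skip=0.1):
    """Weight of the optimal alignment of two paragraphs (lists of
    sentences), following the recurrence in equation (5)."""
    sim = lambda i, j: cosine(par1[i], par2[j]) - mismatch   # equation (4)
    n, m = len(par1), len(par2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cand = [s[i][j - 1] - skip,                      # skip one sentence
                    s[i - 1][j] - skip,
                    s[i - 1][j - 1] + sim(i - 1, j - 1)]     # one-to-one match
            if j >= 2:   # one sentence of par1 matches two of par2
                cand.append(s[i - 1][j - 2] + sim(i - 1, j - 1) + sim(i - 1, j - 2))
            if i >= 2:   # two sentences of par1 match one of par2
                cand.append(s[i - 2][j - 1] + sim(i - 1, j - 1) + sim(i - 2, j - 1))
            if i >= 2 and j >= 2:                            # crossed two-to-two match
                cand.append(s[i - 2][j - 2] + sim(i - 1, j - 2) + sim(i - 2, j - 1))
            s[i][j] = max(cand)
    return s[n][m]
```

For two identical one-sentence paragraphs the score is simply cos = 1 minus the mismatch penalty.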
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) were computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. As corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )  (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF; these weighted scores are summed and normalized by the total IDF of the segment in question. The same process is applied to T2, and in the end we average the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.
– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
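The aggregate formula (6) works with any of these word-to-word metrics plugged in as maxSim. A minimal sketch, with toy stand-ins (exact-match word similarity and a flat IDF table) replacing the real metrics and corpus statistics:

```python
def text_similarity(t1, t2, word_sim, idf):
    """Mihalcea-style segment similarity (equation 6): each word in one
    segment is matched with its closest word in the other, weighted by
    IDF; both directions are averaged. `word_sim` and `idf` stand in
    for any of the eight word metrics and an IDF table from a corpus."""
    def directed(a, b):
        num = sum(max(word_sim(w, v) for v in b) * idf(w) for w in a)
        den = sum(idf(w) for w in a)
        return num / den if den else 0.0
    w1, w2 = t1.lower().split(), t2.lower().split()
    return 0.5 * (directed(w1, w2) + directed(w2, w1))

# Toy stand-ins for illustration only.
sim = lambda w, v: 1.0 if w == v else 0.0
idf = lambda w: 1.0
print(text_similarity("the capital of Somalia", "Somalia capital", sim, idf))
```

With a WordNet-based `word_sim`, near-synonyms would also contribute to the score instead of only exact matches.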
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.
– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.
– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

sim(a, b) = a W bᵀ / (|a| |b|)  (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three.
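A minimal sketch of equation (7). Since the vectors are binary, a W bᵀ reduces to summing W entries over all word pairs, and |a| is the square root of the number of distinct words; the sparse dictionary W with a hypothetical WordNet-derived score is our own illustrative stand-in:

```python
import math

def matrix_similarity(s1: str, s2: str, W: dict) -> float:
    """Fernando & Stevenson similarity (equation 7): binary word
    vectors a, b and a word-pair similarity matrix W, here a sparse
    dict mapping word pairs to a 0..1 score."""
    v1 = sorted(set(s1.lower().split()))
    v2 = sorted(set(s2.lower().split()))
    def w(x, y):
        # W has 1.0 on the diagonal (identical words).
        return 1.0 if x == y else W.get((x, y), W.get((y, x), 0.0))
    awb = sum(w(x, y) for x in v1 for y in v2)       # a W b^T
    return awb / (math.sqrt(len(v1)) * math.sqrt(len(v2)))

W = {("shot", "killed"): 0.8}    # hypothetical WordNet-derived score
print(matrix_similarity("he was shot", "someone killed him", W))
```

Note how the 'shot'/'killed' entry lets the two sentences score above zero even though they share no words, which is exactly what the purely lexical approaches miss.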
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)  (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential of being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, we saw several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and the identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, learning how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with the normalization, clustering and filtering of the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (being able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)  (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
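A minimal sketch of this clustering step (equation 9 plus single-link merging); the function names and the quadratic merge loop are our own simplifications of JustAsk's actual implementation:

```python
def overlap_distance(x: str, y: str) -> float:
    """Equation (9): 0.0 when one candidate answer is contained in the
    other, 1.0 when they share no tokens."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_cluster(answers, dist, threshold=0.33):
    """Single-link agglomerative clustering: start with one cluster per
    candidate answer, repeatedly merge the two closest clusters while
    their distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: minimum distance between any two members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break                      # no pair under the threshold left
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

answers = ["March 21 1961", "21 March 1961", "Mozart", "Wolfgang Mozart"]
clusters = single_link_cluster(answers, overlap_distance)
# Each cluster's representative is its longest answer.
reps = [max(c, key=len) for c in clusters]
```

With these four candidates the two date variants collapse into one cluster and the two name variants into another, with 'Wolfgang Mozart' representing the latter.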
3.2 Experimental Setup
Over this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved  (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus  (11)

F-measure (F1):

F = 2 · (precision · recall) / (precision + recall)  (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus  (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i  (14)
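The measures above can be transcribed directly from the counts of correctly judged, wrongly judged and unanswered questions; a minimal sketch (function names are ours), using JustAsk's baseline figures from Table 1 as a usage example:

```python
def evaluate(correct, wrong, no_answer):
    """QA evaluation measures (equations 10-13)."""
    total = correct + wrong + no_answer
    answered = correct + wrong            # non-null answers retrieved
    precision = correct / answered
    recall = answered / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    return precision, recall, f1, accuracy

def mrr(ranks):
    """Equation (14): ranks[i] is the rank of the first correct answer
    for query i, or None if no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Baseline figures from Table 1: 88 correct, 87 wrong, 25 unanswered.
p, r, f1, acc = evaluate(88, 87, 25)
```

These counts reproduce the baseline numbers: precision 50.3%, recall 87.5%, accuracy 44%.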
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo) – passages which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content which has no relation whatsoever to the actual answer. But it helps finding and writing down correct answers, as opposed to the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
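The core loop of the tool can be sketched as follows; all names are illustrative (not the actual implementation), and the input function is injectable so the accept-if-substring rule described above is easy to exercise:

```python
def collect_references(questions, snippets, read_input=input):
    """For each question, show each snippet's content and accept any
    typed string that is a substring of that content as a reference."""
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets.get(q, []):
            ref = read_input(f"{q}\n{content}\nreference> ").strip()
            if ref and ref in content:     # the accept-if-contained rule
                references[q].append(ref)
    return references

snips = {"What is the capital of Somalia?":
         ["Mogadishu is the largest city in Somalia and the nation's capital."]}
refs = collect_references(["What is the capital of Somalia?"], snips,
                          read_input=lambda prompt: "Mogadishu")
```

As noted in the text, the containment check accepts any word of the snippet, related to the answer or not; it merely speeds up the transcription of correct answers.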
29
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing their original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
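This pronoun filter amounts to a single word-boundary regular expression over the question text; a minimal sketch (the function name is ours, and the hand-picked dead ends mentioned above would be excluded by a separate list):

```python
import re

# Pronouns signalling an unresolved third-party reference.
PRONOUNS = re.compile(r"\b(he|she|it|her|him|hers|his)\b", re.IGNORECASE)

def has_dangling_reference(question: str) -> bool:
    """True when the question leans on a pronoun whose referent was
    lost during filtering."""
    return bool(PRONOUNS.search(question))

questions = ["When did she die?",
             "Who was his father?",
             "What is the capital of Somalia?"]
kept = [q for q in questions if not has_dangling_reference(q)]
```

The `\b` boundaries matter: without them, 'his' would match inside 'history' and discard perfectly self-contained questions.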
But much less obvious cases were still present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene as the search tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph containing the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of those correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
  200         88        87       25

Accuracy   Precision   Recall   MRR    Precision@1
 44.0%       50.3%     87.5%    0.37      35.4%
Table 1 JustAsk baseline best results for the corpus test of 200 questions
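The reported figures can be reproduced from the raw counts in Table 1. The following is a small sketch, assuming the definitions that the numbers imply (precision over attempted questions, recall as the attempt rate, accuracy over all questions):

```python
# Counts from Table 1 (200-question test corpus).
total, correct, wrong, no_answer = 200, 88, 87, 25

answered = correct + wrong      # questions JustAsk attempted to answer: 175
precision = correct / answered  # fraction of attempted questions answered correctly
recall = answered / total       # fraction of questions the system attempted
accuracy = correct / total      # fraction of all questions answered correctly

print(f"P={precision:.1%} R={recall:.1%} A={accuracy:.1%}")
# P=50.3% R=87.5% A=44.0%
```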
In Tables 2 and 3 we show the results of the Answer Extraction module, using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions for which one final answer was extracted from the list of candidate answers. The tables also show the distribution of the number of questions by the amount of extracted answers.
                 Google – 8 passages          Google – 32 passages
Extracted       Questions   CAE     AS       Questions   CAE     AS
Answers (#)               success  success              success  success
0                  21        0      –           19         0      –
1 to 10            73       62     53           16        11     11
11 to 20           32       29     22           13         9      7
21 to 40           11       10      8           61        57     39
41 to 60            1        1      1           30        25     17
61 to 100           0        –      –           18        14     11
> 100               0        –      –            5         4      3
All               138      102     84          162       120     88
Table 2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves its best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                 Google – 64 passages
Extracted       Questions   CAE      AS
Answers (#)               success  success
0                  17        0        0
1 to 10            18        9        9
11 to 20            2        2        2
21 to 40           16       12        8
41 to 60           42       38       25
61 to 100          38       32       18
> 100              35       30       17
All               168      123       79
Table 3 Answer extraction intrinsic evaluation using 64 passages from thesearch engine Google
         Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total     Q   CAE   AS        Q   CAE   AS           Q   CAE   AS
abb      4     3     1    1        3     1    1           4     1    1
des      9     6     4    4        6     4    4           6     4    4
ent     21    15     1    1       16     1    0          17     1    0
hum     64    46    42   34       55    48   36          56    48   31
loc     39    27    24   19       34    27   22          35    29   21
num     63    41    30   25       48    35   25          50    40   22
All    200   138   102   84      162   120   88         168   123   79
Table 4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times. This represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph-indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.
These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%
Table 5 Baseline results for the three question sets that were created using TREC questions and the New York Times corpus
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages     20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%
Table 6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'

– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words in each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – the same entity may appear under completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' unclustered, but most generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend the abbreviation normalization to also cover acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module – Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4 Proposed approach to automate the detection of relations between answers
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
Figure 5 Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
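The scoring scheme above can be sketched as follows. This is an illustrative sketch only: the actual module is Java code inside JustAsk, and the function names and aggregation details here are our assumptions, with the weights taken from the first experiment described (1.0 for frequency and equivalence, 0.5 for inclusion).

```python
# Hypothetical sketch: each candidate starts with its frequency (1.0 per
# occurrence, the Baseline Scorer) and is then boosted by the weighted
# relations it participates in (the Rules Scorer).
from collections import Counter

EQUIVALENCE, INCLUDES, INCLUDED_BY = "EQ", "INC1", "INC2"
WEIGHTS = {EQUIVALENCE: 1.0, INCLUDES: 0.5, INCLUDED_BY: 0.5}

def score_candidates(answers, relations):
    """answers: list of answer strings; relations: (a, b, type) tuples."""
    scores = Counter(answers)        # Baseline Scorer: frequency
    for a, b, rel in relations:      # Rules Scorer: equivalence/inclusion
        scores[a] += WEIGHTS[rel]
        scores[b] += WEIGHTS[rel]
    return scores

answers = ["Lee Myung-bak", "Lee Myung-bak", "Lee"]
relations = [("Lee Myung-bak", "Lee", EQUIVALENCE)]
print(score_candidates(answers, relations).most_common(1))
# [('Lee Myung-bak', 3.0)]
```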
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
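The conversion idea can be sketched as below. The class and function names are illustrative, not JustAsk's actual API; the factors are the standard conversion constants.

```python
# Sketch of a unit-conversion normalizer for Numeric answers.
import re

CONVERSIONS = {
    "km": ("miles", 0.621371),
    "kilometres": ("miles", 0.621371),
    "meters": ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
    "kilograms": ("pounds", 2.20462),
}

def normalize_measure(answer):
    """Rewrite '300 km' into its imperial equivalent so that equivalent
    candidates can fall into the same cluster. (Celsius-to-Fahrenheit needs
    an offset as well as a factor, so it is omitted from this sketch.)"""
    m = re.match(r"([\d.]+)\s*([a-zA-Z]+)$", answer)
    if not m:
        return answer
    value, unit = float(m.group(1)), m.group(2).lower()
    if unit not in CONVERSIONS:
        return answer
    target, factor = CONVERSIONS[unit]
    return f"{value * factor:.0f} {target}"

print(normalize_measure("300 km"))   # '186 miles'
```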
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
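The exception-list idea can be sketched as follows (names and the exact exception handling are illustrative; the real normalizer is part of JustAsk's Java code):

```python
# Sketch of the abbreviation normalizer with an exception list for answers
# like 'Mr T' or the band 'Queen'.
ABBREVIATIONS = {"mr", "mrs", "ms", "king", "queen", "dame", "sir", "dr"}
EXCEPTIONS = {"mr t", "queen"}      # full answers that must stay untouched

def strip_abbreviations(answer):
    if answer.lower() in EXCEPTIONS:
        return answer
    tokens = answer.split()
    # only drop a title when it starts the answer and something follows it
    if len(tokens) > 1 and tokens[0].lower().rstrip(".") in ABBREVIATIONS:
        return " ".join(tokens[1:])
    return answer

print(strip_abbreviations("Mr Clinton"))   # 'Clinton'
print(strip_abbreviations("Queen"))        # 'Queen' (left as-is)
```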
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
For the problem in 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
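The idea can be sketched as below. The function name and the exact definition of "distance" (character gap between spans) are our assumptions; only the 20-character threshold and the 5-point boost come from the text.

```python
# Sketch of the proximity boost: if the average character distance between
# the candidate answer and the query keywords in a passage is below the
# threshold, the answer's score is raised by the boost value.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    a_pos = passage.find(answer)
    if a_pos < 0:
        return 0
    a_end = a_pos + len(answer)
    gaps = []
    for kw in keywords:
        k_pos = passage.find(kw)
        if k_pos >= 0:
            # character gap between the two spans (0 when they touch)
            gaps.append(max(k_pos - a_end, a_pos - (k_pos + len(kw)), 0))
    if gaps and sum(gaps) / len(gaps) < threshold:
        return boost
    return 0

passage = "The Okavango River rises in Angola and flows over 1000 miles"
print(proximity_boost(passage, "1000 miles", ["flows", "over"]))   # 5
print(proximity_boost(passage, "1000 miles", ["Okavango"]))        # 0
```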
4.3 Results
                      200 questions              1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%   P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%

Table 7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they can share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6 Execution of the application to create relations corpora, in the command line
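The annotation loop can be sketched as below. Prompting and file I/O are simplified, and `ask` stands in for reading the judgment from the user; the label strings follow the options listed above.

```python
# Minimal sketch of the pair-annotation loop for building the relations corpus.
from itertools import combinations

LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(question_id, references, ask):
    """ask(pair) returns one of LABELS' keys, or 'skip' to skip the pair."""
    lines = []
    for pair in combinations(references, 2):
        judgment = ask(pair)
        while judgment not in LABELS and judgment != "skip":
            judgment = ask(pair)     # repeat until a valid option is given
        if judgment != "skip":
            lines.append(f"QUESTION {question_id} {LABELS[judgment]} "
                         f"- {pair[0]} AND {pair[1]}")
    return lines

print(annotate(5, ["Lee Myung-bak", "Lee"], lambda pair: "E"))
# ['QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee']
```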
In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets, A and B:

J(A,B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the size of the union of the two sets.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of a union of seven).
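The bigram-based computation can be reproduced with a short sketch (set semantics: each distinct bigram counted once; function names are ours, not the SimMetrics API):

```python
# Bigram-based Dice and Jaccard, reproducing the 'Kennedy'/'Keneddy' example.
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

print(round(dice("Kennedy", "Keneddy"), 2),
      round(jaccard("Kennedy", "Keneddy"), 2))   # 0.83 0.71
```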
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
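A textbook implementation of the two metrics, as a sketch (it adds Winkler's standard four-character prefix cap, which equation (18) leaves implicit):

```python
# Jaro similarity and its Jaro-Winkler extension (p = 0.1).
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                   # find matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    k = t = 0
    for i in range(len(s1)):                     # count half-transpositions
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:                # common prefix capped at 4
            break
        prefix += 1
    dj = jaro(s1, s2)
    return dj + prefix * p * (1 - dj)

print(round(jaro("MARTHA", "MARHTA"), 4),
      round(jaro_winkler("MARTHA", "MARHTA"), 4))   # 0.9444 0.9611
```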
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min { D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G }   (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
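Single-link clustering here means two clusters are merged whenever any pair of answers across them is within the distance threshold. A naive sketch of that behaviour (the toy character-overlap metric below is illustrative only, not JustAsk's actual Overlap implementation):

```python
# Single-link clustering over candidate answers with a pluggable metric.
def single_link(answers, distance, threshold):
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(a, b) <= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)   # merge the two clusters
                    merged = True
                    break
            if merged:
                break
    return clusters

# toy metric: normalized character-overlap distance (illustrative only)
def overlap(a, b):
    A, B = set(a.lower()), set(b.lower())
    return 1 - len(A & B) / min(len(A), len(B))

print(single_link(["Grozny", "GROZNY", "Washington"], overlap, 1 / 3))
```

With the 1/3 threshold the two spellings of 'Grozny' end up in one cluster while 'Washington' stays alone.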
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between these two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold          0.17                      0.33                      0.5
Levenshtein        P 51.1  R 87.0  A 44.5   P 43.7  R 87.5  A 38.5   P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0   P 33.0  R 87.0  A 33.0   P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5   P  9.9  R 55.5  A  5.5
Dice coefficient   P  9.8  R 56.0  A  5.5   P  9.9  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Needleman-Wunsch   P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0   P 35.9  R 83.5  A 30.0   P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0   P 13.4  R 59.5  A  8.0   P 10.4  R 57.5  A  6.0

Table 8 Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally gives extra weight to strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
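Our reading of equations (20)-(22) can be sketched as follows. The tokenization and the side from which first-letter matches are counted are assumptions where the text is ambiguous, so this is an illustrative sketch rather than the thesis's actual Java implementation.

```python
# Sketch of the JFKDistance: common tokens score fully, first-letter
# matches between the remaining tokens score at half weight.
import re

def jfk_distance(a1, a2):
    t1 = re.findall(r"[a-z0-9]+", a1.lower())
    t2 = re.findall(r"[a-z0-9]+", a2.lower())
    common = set(t1) & set(t2)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    common_score = len(common) / max(len(t1), len(t2))
    non_common = rest1 + rest2
    first_letter = sum(1 for x in rest1 if any(y[0] == x[0] for y in rest2))
    # half the weight of a full token match, as described in the text
    fl_score = (first_letter / len(non_common)) / 2 if non_common else 0.0
    return 1 - common_score - fl_score

print(round(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"), 3))
# 0.417
```

Under this reading, the two Kennedy variants end up at distance ~0.42 instead of the ~0.67 Overlap gives, low enough to be clustered with a suitable threshold.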
5.2.2 Results
Table 9 shows us the results after the implementation of this new distance metric.
                      200 questions              1210 questions
Before            P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%
After new metric  P 51.7%  R 87.0%  A 45.0%   P 29.2%  R 79.8%  A 23.3%

Table 9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
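Such a decision table could be sketched as follows. The metric implementations here are simple stand-ins (difflib ratios), and the function and category names are assumptions for illustration only, not the actual JustAsk code.

```python
from difflib import SequenceMatcher

def levenshtein_sim(a, b):
    # stand-in for the real Levenshtein-based similarity
    return SequenceMatcher(None, a, b).ratio()

def log_sim(a, b):
    # stand-in for the LogSim metric
    return SequenceMatcher(None, a, b).ratio()

# hypothetical decision table: one metric per question category
METRIC_BY_CATEGORY = {
    "Human": levenshtein_sim,   # if (Category == Human) use Levenshtein
    "Location": log_sim,        # if (Category == Location) use LogSim
}

def cluster_similarity(category, answer1, answer2):
    """Pick the clustering metric according to the question category."""
    metric = METRIC_BY_CATEGORY.get(category, levenshtein_sim)
    return metric(answer1, answer2)

score = cluster_similarity("Human", "John Kennedy", "John F. Kennedy")
```

A dictionary dispatch keeps the per-category choice declarative, so new category-to-metric pairs can be added without touching the clustering code itself.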
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input) – are shown in Table 10.
Metric / Category     Human     Location    Numeric
Levenshtein           50.00     53.85       4.84
Overlap               42.42     41.03       4.84
Jaccard index          7.58      0.00       0.00
Dice                   7.58      0.00       0.00
Jaro                   9.09      0.00       0.00
Jaro-Winkler           9.09      0.00       0.00
Needleman-Wunsch      42.42     46.15       4.84
LogSim                42.42     46.15       4.84
Soundex               42.42     46.15       3.23
Smith-Waterman         9.09      2.56       0.00
JFKDistance           54.55     53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers are error-prone, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected lies in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
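Two of the fixes proposed above – the numeric-candidate filtering of points a) and c), and the removal of question variations – could be prototyped along these lines. The function names, the regular expression and the token handling are illustrative assumptions, not the actual JustAsk filters.

```python
import re

SPELLED_OUT = {"one", "two", "three"}  # point a): too noisy when written in words

def filter_numeric_candidates(candidates):
    """Keep only plausible numeric answer candidates (sketch of a) and c))."""
    kept = []
    for cand in candidates:
        token = cand.strip().lower()
        if token in SPELLED_OUT:
            continue  # drop 'one'..'three' spelled out
        # point c): require an actual digit sequence, optionally with
        # thousands separators or a decimal part
        if re.fullmatch(r"\d{1,3}(,\d{3})*(\.\d+)?|\d+(\.\d+)?", token):
            kept.append(cand)
    return kept

def remove_question_echoes(question, answers):
    """Drop answers that are mere variations of the question's own words."""
    q_tokens = {t.lower().strip(".,?'\"") for t in question.split()}
    return [a for a in answers
            if not all(t.lower().strip(".,?'\"") in q_tokens for t in a.split())]

print(filter_numeric_candidates(["one", "1609", "12,500", "views"]))
# ['1609', '12,500']
print(remove_question_echoes(
    "Who is John J. Famalaro accused of having killed?",
    ["John Famalaro", "Denise Huber"]))
# ['Denise Huber']
```

In the second example, 'John Famalaro' is discarded because every one of its tokens already occurs in the question, while the candidate naming the victim survives the filter.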
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision, by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact over this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Index
1 Introduction 12

2 Related Work 14
   2.1 Relating Answers 14
   2.2 Paraphrase Identification 15
      2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
      2.2.2 Methodologies purely based on lexical similarity 16
      2.2.3 Canonicalization as complement to lexical features 18
      2.2.4 Use of context to extract similar phrases 18
      2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
   2.3 Overview 22

3 JustAsk architecture and work points 24
   3.1 JustAsk 24
      3.1.1 Question Interpretation 24
      3.1.2 Passage Retrieval 25
      3.1.3 Answer Extraction 25
   3.2 Experimental Setup 26
      3.2.1 Evaluation measures 26
      3.2.2 Multisix corpus 27
      3.2.3 TREC corpus 30
   3.3 Baseline 31
      3.3.1 Results with Multisix Corpus 31
      3.3.2 Results with TREC Corpus 33
   3.4 Error Analysis over the Answer Module 34

4 Relating Answers 38
   4.1 Improving the Answer Extraction module: Relations Module Implementation 38
   4.2 Additional Features 41
      4.2.1 Normalizers 41
      4.2.2 Proximity Measure 41
   4.3 Results 42
   4.4 Results Analysis 42
   4.5 Building an Evaluation Corpus 42

5 Clustering Answers 44
   5.1 Testing Different Similarity Metrics 44
      5.1.1 Clusterer Metrics 44
      5.1.2 Results 46
      5.1.3 Results Analysis 47
   5.2 New Distance Metric – 'JFKDistance' 48
      5.2.1 Description 48
      5.2.2 Results 50
      5.2.3 Results Analysis 50
   5.3 Possible new features – Combination of distance metrics to enhance overall results 50
   5.4 Posterior analysis 51

6 Conclusions and Future Work 53
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations' corpora in the command line 43
List of Tables
1 JustAsk baseline best results for the test corpus of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the explosion of the World Wide Web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often in most documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
In summary, the goals of this work are:
– To perform a state-of-the-art survey on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– To improve the Answer Extraction module, by finding/implementing proper methodologies/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– To test new clustering measures;

– To create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other mutually, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent but differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) to express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each one entails the other – meaning that pairs with a potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on mentioned as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we will present an overview of some approaches and techniques for paraphrase identification related to natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]        (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
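A simplified implementation of equation (1) can be sketched as follows, under the assumptions that n-grams are counted as sets and that only the n-gram orders that fit in the shorter sentence are averaged (the exact counting details of the original are not specified here).

```python
def ngram_overlap(s1, s2, max_n=4):
    """Word Simple N-gram Overlap (equation (1)), simplified sketch."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    shorter, other = (w1, w2) if len(w1) <= len(w2) else (w2, w1)
    score, levels = 0.0, 0
    for n in range(1, max_n + 1):
        grams_s = {tuple(shorter[i:i + n]) for i in range(len(shorter) - n + 1)}
        if not grams_s:
            break  # the shorter sentence has fewer than n words
        grams_o = {tuple(other[i:i + n]) for i in range(len(other) - n + 1)}
        # fraction of the shorter sentence's n-grams found in the other
        score += len(grams_s & grams_o) / len(grams_s)
        levels += 1
    return score / levels if levels else 0.0

print(ngram_overlap("he was shot and died", "he was murdered"))
```

Identical sentences score 1.0 and fully disjoint ones score 0.0, with partial word-sequence sharing falling in between.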
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of editions needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
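The Levenshtein distance itself is straightforward to compute with dynamic programming; a compact, self-contained version:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only the previous row of the edit matrix reduces memory from O(|a|·|b|) to O(|b|) without changing the result.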
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = -log2(λ/x)        (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a soft penalty for pairs of sentences that are way too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ)            if L(x, λ) < 1.0
                 { e^(-k*L(x, λ))     otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level ('generic' versions of the original text) and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited: by counting the number of nominalizations (i.e., nouns produced from another part of speech, like 'combination' from 'combine') through a semi-automatic method, they concluded that the MSR has many nominalizations and not enough variety, which discards a certain group of paraphrases, namely the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of a lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers to certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
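To make the idea concrete, here is a minimal sketch of rule-based canonicalization in the spirit of Zhang and Patrick. The rule set is our own illustration, not theirs: it only maps a few future-tense constructions into 'will', and percentage entities into a hypothetical canonical token 'P<number>'.

```python
import re

# Illustrative rules only: future-tense constructions become 'will', and
# percentages become a hypothetical canonical token 'P<number>'.
FUTURE = re.compile(r'\b(?:plans to|expects to|intends to|is going to)\b')
PERCENT = re.compile(r'(\d+(?:\.\d+)?)\s*(?:%|percent|per cent)')

def canonicalize(sentence: str) -> str:
    s = FUTURE.sub('will', sentence)
    s = PERCENT.sub(lambda m: 'P' + m.group(1), s)
    return s
```

After canonicalization, plain lexical features (Edit Distance, N-gram overlap) are applied to the transformed sentences.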
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) − mismatch_penalty  (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula (dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems):

s(S1, S2) = max {
    s(S1, S2−1) − skip_penalty,
    s(S1−1, S2) − skip_penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}  (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
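A minimal sketch of this dynamic programming alignment, under our own simplifying assumptions (paragraphs as lists of bag-of-words dictionaries, boundary cells initialized to zero, and illustrative penalty values), could look like this:

```python
import math

def cos_sim(a: dict, b: dict) -> float:
    """Cosine similarity between two bag-of-words dictionaries."""
    num = sum(c * b.get(w, 0) for w, c in a.items())
    den = math.sqrt(sum(c * c for c in a.values())) * \
          math.sqrt(sum(c * c for c in b.values()))
    return num / den if den else 0.0

def align_score(p1, p2, skip_penalty=0.3, mismatch_penalty=0.1):
    """Weight of the optimal alignment between two paragraphs (lists of
    bag-of-words dicts), following the recurrence in equation (5)."""
    def sim(i, j):                           # 1-based indices, equation (4)
        return cos_sim(p1[i - 1], p2[j - 1]) - mismatch_penalty
    n, m = len(p1), len(p2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]   # boundary simplified to 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - skip_penalty,
                     s[i - 1][j] - skip_penalty,
                     s[i - 1][j - 1] + sim(i, j)]
            if j >= 2:
                cands.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))
            if i >= 2:
                cands.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))
            if i >= 2 and j >= 2:
                cands.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))
            s[i][j] = max(cands)
    return s[n][m]
```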
The results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w)  +  Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )  (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these weighted similarities over all words, and normalize the result by the sum of the idf values of the segment in question. The same process is applied to T2 and, in the end, we compute the average between the two directions. The authors offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow, lch [18]; Lesk [1]; Wu and Palmer, wup [46]; Resnik [34]; Lin [22]; and Jiang and Conrath, jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk
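Formula (6) itself is straightforward to sketch once maxSim and idf are given; in the snippet below both are assumed to be supplied externally (e.g. one of the eight metrics above and BNC-derived document counts), and the function names are our own.

```python
def text_similarity(t1, t2, max_sim, idf):
    """Formula (6): idf-weighted directional similarity, averaged over both
    directions. t1 and t2 are lists of words; max_sim(w, segment) and idf(w)
    are assumed to be supplied externally."""
    def directed(src, tgt):
        num = sum(max_sim(w, tgt) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```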
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa; in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.

– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and their similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| |b|)  (7)

with W being a semantic similarity matrix containing the information about the similarity level of each word pair; in other words, each element Wij holds a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five with a better precision rating than those three approaches as well.
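A sketch of the matrix similarity of formula (7), assuming some external word-to-word WordNet metric fills the off-diagonal entries of W (vocabulary handling and naming are our own):

```python
def matrix_similarity(s1_words, s2_words, vocab, word_sim):
    """Formula (7): sim(a, b) = a.W.b / (|a||b|). The binary vectors a and b
    mark which vocabulary words occur in each sentence; word_sim is an
    assumed external WordNet-based metric filling W's off-diagonal cells."""
    a = [1.0 if w in s1_words else 0.0 for w in vocab]
    b = [1.0 if w in s2_words else 0.0 for w in vocab]
    n = len(vocab)
    W = [[1.0 if i == j else word_sim(vocab[i], vocab[j])
          for j in range(n)] for i in range(n)]
    awb = sum(a[i] * W[i][j] * b[j] for i in range(n) for j in range(n))
    norm = sum(a) ** 0.5 * sum(b) ** 0.5   # binary vectors: |a| = sqrt(sum)
    return awb / norm if norm else 0.0
```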
More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences S1 and S2, and then using them to compute a final paraphrase score as the ratio between the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)  (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are computed over weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
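A rough sketch of this final decision step, with our own simplifications: dependencies as (head, label, modifier) triples, a caller-supplied weight function standing in for the type/depth weighting, the constant 14 applied on an approximated 'extra dependencies' count, and an illustrative threshold.

```python
def paraphrase_decision(deps1, deps2, weight, threshold=1.0):
    """Sketch of formula (8) over dependency sets. deps are sets of
    (head, label, modifier) triples; weight(dep) stands in for the authors'
    type/depth weighting; the threshold value is illustrative."""
    sim = sum(weight(d) for d in deps1 & deps2)
    diss = sum(weight(d) for d in deps1 ^ deps2)
    # constant penalty (14) when the longer sentence has more than 6 extra
    # unpaired dependencies (approximated here by the difference in counts)
    if abs(len(deps1 - deps2) - len(deps2 - deps1)) > 6:
        diss += 14
    score = sim / diss if diss else float('inf')
    return score > threshold
```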
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about the question (question analysis) and classify it under a category that represents the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identifying the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of the given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
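A minimal sketch of this kind of date normalization (the rule set in JustAsk is certainly richer; the two patterns below only cover the example formats):

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'])}

def normalize_date(answer: str) -> str:
    """Map 'March 21, 1961' or '21 March 1961' to 'D21 M3 Y1961';
    unrecognized strings are returned unchanged."""
    s = answer.lower().replace(',', '')
    m = re.match(r'(\w+) (\d{1,2}) (\d{4})$', s)    # March 21 1961
    if m and m.group(1) in MONTHS:
        return f'D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}'
    m = re.match(r'(\d{1,2}) (\w+) (\d{4})$', s)    # 21 March 1961
    if m and m.group(2) in MONTHS:
        return f'D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}'
    return answer
```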
Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)  (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As previously mentioned, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer), and a score is assigned to the cluster, which is simply the sum of the scores of each of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will, in many cases, be directly influenced by its size.
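The clustering step described above can be sketched as follows: the overlap distance follows formula (9), while the single-link loop and the 0.33 threshold mirror the description in the text (this is our own compact rendition, not JustAsk's actual code).

```python
def overlap_distance(x_tokens, y_tokens):
    """Formula (9): 1 - |X intersect Y| / min(|X|, |Y|)."""
    x, y = set(x_tokens), set(y_tokens)
    return 1.0 - len(x & y) / min(len(x), len(y))

def single_link_clusters(answers, dist, threshold=0.33):
    """Single-link agglomerative clustering: repeatedly merge the two
    closest clusters while their minimum pairwise distance stays under
    the threshold."""
    clusters = [[a] for a in answers]
    def cluster_dist(c1, c2):
        return min(dist(a, b) for a in c1 for b in c2)
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        if cluster_dist(clusters[i], clusters[j]) >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

With token-level overlap as the distance, 'Imola Italy' and 'Imola' fall into one cluster (distance 0.0, one answer contained in the other), while 'Paris' stays apart.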
3.2 Experimental Setup
In this section, we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved  (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus  (11)

F-measure (F1):

F = 2 · (precision · recall) / (precision + recall)  (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus  (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) · Σ_{i=1..N} 1/rank_i  (14)
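These measures are simple to compute; the sketch below implements formulas (10) to (14) under the simplifying assumption that every attempted question returns a non-null answer (function and argument names are our own).

```python
def evaluate(judgements, ranks, total):
    """Formulas (10)-(14). judgements: True/False per answered (non-null)
    question; ranks: rank of the first correct answer per answered question
    (None when no correct answer appears); total: questions in the corpus."""
    correct = sum(judgements)
    attempted = len(judgements)
    precision = correct / attempted
    recall = attempted / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    mrr = sum(1.0 / rk for rk in ranks if rk) / total
    return precision, recall, f1, accuracy, mrr
```

Plugging in the baseline numbers reported below (175 questions attempted out of 200, 88 of them correct) reproduces the precision, recall and accuracy figures of Table 1.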
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and checking whether the answers the system outputs match the ones we previously found for ourselves.

We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the 1994 editions of the Los Angeles Times. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that a Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content': our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question, creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one at a time, removing 'unnecessary' content like 'URL' or 'Title', and requests an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference (since it is included in the content section) and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word of the content with no relation whatsoever to the actual answer; but it helps one to find and write down correct answers, as opposed to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
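The core loop of the application can be sketched as follows; prompts, data structures and the substring check are illustrative, not the actual implementation.

```python
def collect_references(questions, snippets, read_input=input):
    """For each question, show each snippet's content and accept a typed
    reference, keeping it only if it occurs verbatim in that snippet."""
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets.get(q, []):
            print('Question:', q)
            print('Content:', content)
            ref = read_input('reference (empty to skip): ').strip()
            if ref and ref in content and ref not in references[q]:
                references[q].append(ref)
    return references
```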
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, which we filtered to extract the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions), i.e., answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original reference after this filtering step), therefore losing precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder: how much can we improve something that has no clear references, when the question itself references the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references: pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his', plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historical context (which Sundance Festival? The 21st? The 25th?). Another blatant example is the use of more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, being composed of a set of questions for each topic. Noticing that each question from the TREC corpus carried topic information that was often not explicit in the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a question-length limitation in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of using sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as those obtained with the Google corpus.

The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped, as we will see over the next section.
3.3 Baseline

3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%
Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions for which one final answer was selected from the list of candidate answers. The tables also present the distribution of questions by number of extracted answers.
Extracted      Google – 8 passages                  Google – 32 passages
Answers        Questions  CAE succ.  AS succ.       Questions  CAE succ.  AS succ.
0              21         0          –              19         0          –
1 to 10        73         62         53             16         11         11
11 to 20       32         29         22             13         9          7
21 to 40       11         10         8              61         57         39
41 to 60       1          1          1              30         25         17
61 to 100      0          –          –              18         14         11
> 100          0          –          –              5          4          3
All            138        102        84             162        120        88
Table 2: Answer extraction intrinsic evaluation using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful when 11 to 20 answers are extracted. When that number increases, the accuracies show no concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
Extracted      Google – 64 passages
Answers        Questions  CAE succ.  AS succ.
0              17         0          0
1 to 10        18         9          9
11 to 20       2          2          2
21 to 40       16         12         8
41 to 60       42         38         25
61 to 100      38         32         18
> 100          35         30         17
All            168        123        79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.

            Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total  Q    CAE   AS          Q    CAE   AS          Q    CAE   AS
abb  4      3    1     1           3    1     1           4    1     1
des  9      6    4     4           6    4     4           6    4     4
ent  21     15   1     1           16   1     0           17   1     0
hum  64     46   42    34          55   48    36          56   48    31
loc  39     27   24    19          34   27    22          35   29    21
num  63     41   30    25          48   35    25          50   40    22
All  200    138  102   84          162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also shows the discrepancies between the different stages for all of these categories, suggesting that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with the TREC Corpus
For the set of 1134 questions, the system correctly answered 234 questions (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.
These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%
Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.
Later we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, that run also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these settings is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis of the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates (failing to cluster equal or inclusive answers into the same cluster), but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs 'Elizabeth II'
– 'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
– 'River Glomma' vs 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs 'Utiel-Requena' vs 'Requena'
– 'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers: if there is at least one match between tokens of two candidate answers, it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match.

To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Is there a better metric for this particular case? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link; here we have only one ('Kennedy') out of three words in each answer, so LogSim comes out at approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens differ, should we reject the match? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs 'Mount Kenya National Park'
– 'Bavaria' vs 'Oberbayern, Upper Bavaria'
– '2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (being the more specific answer); but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
– 'Kotashaan' vs 'Khotashan'
– 'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want our system to do: detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over those of inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can study how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we tested the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all, for any of the corpora.
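As an illustration, the scoring scheme just described might be sketched as follows (class, method and field names are our own, not the actual JustAsk code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scoring scheme described above: each candidate
// answer accumulates 1.0 for every exact repetition (frequency) and every
// relation of equivalence, and 0.5 for every relation of inclusion.
public class RelationScorer {
    enum Relation { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Edge {
        final Relation type;
        Edge(Relation type) { this.type = type; }
    }

    static double score(int frequency, List<Edge> relations) {
        double score = frequency * 1.0;              // frequency counts fully
        for (Edge e : relations) {
            switch (e.type) {
                case EQUIVALENCE: score += 1.0; break;   // equivalences count fully
                case INCLUDES:
                case INCLUDED_BY: score += 0.5; break;   // inclusions count half
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<Edge> rels = new ArrayList<>();
        rels.add(new Edge(Relation.EQUIVALENCE));    // e.g. 'Queen Elizabeth' ~ 'Elizabeth II'
        rels.add(new Edge(Relation.INCLUDED_BY));    // e.g. 'Ferrara' included in 'Ferrara, Italy'
        System.out.println(score(2, rels));          // 2 occurrences + 1.0 + 0.5
    }
}
```

In the actual module, the Baseline Scorer and Rules Scorer components play the roles of the frequency branch and the relation branches, respectively.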
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
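A minimal sketch of such a converter follows (the conversion factors are the standard ones; the class and method names are illustrative, not JustAsk's actual code):

```java
// Sketch of the conversion normalizer described above. Each converter maps
// a metric value to the corresponding imperial unit, so that answers such
// as '1000 miles' and '1609 km' can be compared on a common scale.
public class MeasureNormalizer {
    static double kmToMiles(double km)          { return km * 0.621371; }
    static double metersToFeet(double m)        { return m * 3.28084; }
    static double kgToPounds(double kg)         { return kg * 2.20462; }
    static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    public static void main(String[] args) {
        // The '300 km' answer from the Okavango example becomes ~186.4 miles:
        System.out.printf("%.1f miles%n", kmToMiles(300));
    }
}
```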
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a telling observation that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it too only covers very particular cases for the moment, leaving room for further extensions.
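A sketch of this kind of normalizer, with an illustrative dictionary and category guard (the real class may differ):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the acronym normalization described above: expand a known
// acronym for the relevant categories, leave anything else untouched.
// The dictionary entries and the category names are illustrative only.
public class AcronymNormalizer {
    private static final Map<String, String> ACRONYMS = new HashMap<>();
    static {
        ACRONYMS.put("US", "United States");
        ACRONYMS.put("U.S.", "United States");
        ACRONYMS.put("UN", "United Nations");
    }

    static String normalize(String answer, String category) {
        if (category.equals("Location:Country") || category.equals("Human:Group")) {
            String expanded = ACRONYMS.get(answer.trim());
            if (expanded != null) return expanded;
        }
        return answer;
    }

    public static void main(String[] args) {
        System.out.println(normalize("U.S.", "Location:Country")); // expanded
        System.out.println(normalize("US", "Numeric:Count"));      // unchanged
    }
}
```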
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer, within the passages, to the query keywords. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
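The procedure can be sketched as follows, measuring distances in character positions (class and constant names are our own illustration):

```java
// Sketch of the proximity boost described above: if the average character
// distance between the answer and the query keywords in a passage is below
// 20 characters, the answer's score is raised by 5 points.
public class ProximityBooster {
    static final int THRESHOLD = 20;   // characters
    static final double BOOST = 5.0;   // score points

    static double boostedScore(double score, String passage, String answer, String[] keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;          // answer not in this passage
        double total = 0;
        int found = 0;
        for (String kw : keywords) {
            int kwPos = passage.indexOf(kw);
            if (kwPos >= 0) { total += Math.abs(answerPos - kwPos); found++; }
        }
        if (found > 0 && total / found < THRESHOLD) return score + BOOST;
        return score;
    }

    public static void main(String[] args) {
        String passage = "Okavango River is 1000 miles";
        String[] keywords = {"Okavango", "River"};
        System.out.println(boostedScore(1.0, passage, "1000 miles", keywords));
    }
}
```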
4.3 Results
                   200 questions             1210 questions
Before features    P 50.6  R 87.0  A 44.0    P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, especially considering the effort and hope that were put into the relations module (Section 4.1).
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he or she can type 'skip'. At the end, all the relations are printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5: EQUIVALENCE – 'Lee Myung-bak' AND 'Lee'
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|  (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)  (16)

Dice's coefficient basically gives double weight to the intersecting elements, roughly doubling the final score. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]: {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has exactly five: {Ke, en, ne, ed, dy}. The result is therefore (2 · 5)/(6 + 6) = 10/12 = 5/6. With Jaccard we would have 5/7, since the union of the two sets has seven elements.
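The bigram computation above can be sketched as follows (a minimal illustration, not the SimMetrics implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Bigram-based Dice and Jaccard coefficients, as in the 'Kennedy' vs
// 'Keneddy' example above.
public class BigramSimilarity {
    static Set<String> bigrams(String s) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++)
            set.add(s.substring(i, i + 2));
        return set;
    }

    static double dice(String a, String b) {
        Set<String> A = bigrams(a), B = bigrams(b);
        Set<String> inter = new HashSet<>(A);
        inter.retainAll(B);                       // shared bigrams
        return 2.0 * inter.size() / (A.size() + B.size());
    }

    static double jaccard(String a, String b) {
        Set<String> A = bigrams(a), B = bigrams(b);
        Set<String> inter = new HashSet<>(A);
        inter.retainAll(B);
        Set<String> union = new HashSet<>(A);
        union.addAll(B);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 10/12
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7
    }
}
```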
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)  (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)  (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
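A reference implementation of the two formulas above (our own sketch, not the SimMetrics code; the matching window is the usual max(|s1|,|s2|)/2 − 1, and the prefix length l is capped at 4 as in Winkler's work):

```java
// Jaro and Jaro-Winkler similarities, following Equations (17) and (18).
public class JaroWinkler {
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        int window = Math.max(Math.max(len1, len2) / 2 - 1, 0);
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {               // find matching characters
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true; m2[j] = true; matches++; break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int half = 0, k = 0;                           // count transpositions
        for (int i = 0; i < len1; i++) {
            if (m1[i]) {
                while (!m2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) half++;
                k++;
            }
        }
        double m = matches, t = half / 2.0;
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    static double jaroWinkler(String s1, String s2, double p) {
        double dj = jaro(s1, s2);
        int l = 0;                                     // common prefix, capped at 4
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (l < max && s1.charAt(l) == s2.charAt(l)) l++;
        return dj + l * p * (1 - dj);
    }

    public static void main(String[] args) {
        System.out.println(jaro("MARTHA", "MARHTA"));         // ~0.944
        System.out.println(jaroWinkler("MARTHA", "MARHTA", 0.1)); // ~0.961
    }
}
```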
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }  (19)
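The recurrence can be implemented as a standard dynamic programming table. The sketch below (our own) treats G as both the gap and mismatch cost, with matching characters costing nothing, so that G = 1 recovers the Levenshtein distance:

```java
// Needleman-Wunsch-style edit distance with a configurable cost G,
// following Equation (19); matches cost 0, so distance("kitten",
// "sitting", 1) equals the classic Levenshtein distance of 3.
public class NeedlemanWunsch {
    static int distance(String a, String b, int g) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i * g;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j * g;   // insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : g);
                d[i][j] = Math.min(diag, Math.min(d[i - 1][j] + g, d[i][j - 1] + g));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 1)); // 3
    }
}
```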
The package contained other metrics of particular interest, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most of them.
Metric             Threshold 0.17            Threshold 0.33            Threshold 0.5
Levenshtein        P 51.1 R 87.0 A 44.5      P 43.7 R 87.5 A 38.5      P 37.3 R 84.5 A 31.5
Overlap            P 37.9 R 87.0 A 33.0      P 33.0 R 87.0 A 33.0      P 36.4 R 86.5 A 31.5
Jaccard index      P  9.8 R 56.0 A  5.5      P  9.9 R 55.5 A  5.5      P  9.9 R 55.5 A  5.5
Dice coefficient   P  9.8 R 56.0 A  5.5      P  9.9 R 56.0 A  5.5      P  9.9 R 55.5 A  5.5
Jaro               P 11.0 R 63.5 A  7.0      P 11.9 R 63.0 A  7.5      P  9.8 R 56.0 A  5.5
Jaro-Winkler       P 11.0 R 63.5 A  7.0      P 11.9 R 63.0 A  7.5      P  9.8 R 56.0 A  5.5
Needleman-Wunsch   P 43.7 R 87.0 A 38.0      P 43.7 R 87.0 A 38.0      P 16.9 R 59.0 A 10.0
LogSim             P 43.7 R 87.0 A 38.0      P 43.7 R 87.0 A 38.0      P 42.0 R 87.0 A 36.5
Soundex            P 35.9 R 83.5 A 30.0      P 35.9 R 83.5 A 30.0      P 28.7 R 82.0 A 23.5
Smith-Waterman     P 17.3 R 63.5 A 11.0      P 13.4 R 59.5 A  8.0      P 10.4 R 57.5 A  6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric: 'JFKDistance'
5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these, and a new measure was created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally gives an extra weight when, besides at least one perfect match, another token pair across the two answers shares the same first letter. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein is closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore  (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))  (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2  (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the example above.
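The formulas can be turned into code as follows. This is our own reading of Equations (20)-(22), not necessarily the exact JustAsk implementation: we assume length() counts tokens and take nonCommonTerms as the larger number of unmatched tokens on either side, under which the example above comes out at 1 − 1/3 − (2/2)/2 ≈ 0.17:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the JFKDistance of Equations (20)-(22).
public class JFKDistance {
    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLen = Math.max(t1.size(), t2.size());

        // Count and remove perfect token matches (the common terms).
        int commonTerms = 0;
        for (int i = t1.size() - 1; i >= 0; i--) {
            int j = t2.indexOf(t1.get(i));
            if (j >= 0) { t1.remove(i); t2.remove(j); commonTerms++; }
        }
        double commonTermsScore = (double) commonTerms / maxLen;

        // Among the leftover tokens, count pairs sharing the same first letter.
        int nonCommonTerms = Math.max(t1.size(), t2.size());
        int firstLetterMatches = 0;
        for (String tok : t1) {
            for (int j = 0; j < t2.size(); j++) {
                if (!tok.isEmpty() && !t2.get(j).isEmpty()
                        && tok.charAt(0) == t2.get(j).charAt(0)) {
                    t2.remove(j); firstLetterMatches++; break;
                }
            }
        }
        double firstLetterScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterScore;
    }

    public static void main(String[] args) {
        // 1 - 1/3 - (2/2)/2 = 1/6
        System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
    }
}
```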
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of the Numeric type, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions             1210 questions
Before new metric   P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As these results show, even with the redundancy of features it shares with the rules and with certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features: Combining Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim is the metric offering the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
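The per-category decision described above amounts to a lookup table from question category to similarity metric. Below is a hedged sketch; the category-to-metric pairings shown are illustrative only (the evaluation in Table 10 is what was meant to decide the actual ones), using simple versions of two metrics the system already includes:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def overlap(a: str, b: str) -> float:
    """Word-overlap ratio relative to the shorter answer."""
    w1, w2 = set(a.split()), set(b.split())
    return len(w1 & w2) / min(len(w1), len(w2))

# Hypothetical category -> metric table; the pairings are illustrative,
# not the ones the evaluation ends up selecting.
METRIC_BY_CATEGORY = {
    "Human": levenshtein,
    "Location": overlap,
}

def pick_metric(category: str):
    """Fall back to a default metric for categories not in the table."""
    return METRIC_BY_CATEGORY.get(category, levenshtein)
```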
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for that much), Table 10 shows the results for every metric incorporated, by the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85        4.84
Overlap              42.42    41.03        4.84
Jaccard index         7.58     0.00        0.00
Dice                  7.58     0.00        0.00
Jaro                  9.09     0.00        0.00
Jaro-Winkler          9.09     0.00        0.00
Needleman-Wunch      42.42    46.15        4.84
LogSim               42.42    46.15        4.84
Soundex              42.42    46.15        3.23
Smith-Waterman        9.09     2.56        0.00
JFKDistance          54.55    53.85        4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it could still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written in words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – for some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references. After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
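Two of the fixes proposed in the list above – dropping candidates that merely repeat question terms, and ignoring small spelled-out numbers for Numeric questions – could be prototyped along these lines. This is a hedged sketch with hypothetical function and variable names, not JustAsk's actual filtering code:

```python
import re

SMALL_NUMBER_WORDS = {"one", "two", "three"}  # point (a) above

def filter_candidates(question, category, candidates):
    """Drop candidates that only repeat question terms, and spelled-out
    small numbers when a Numeric answer is expected (illustrative names)."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    kept = []
    for cand in candidates:
        c_tokens = set(re.findall(r"\w+", cand.lower()))
        if c_tokens and c_tokens <= q_tokens:
            # e.g. 'John Famalaro' inside the Famalaro question above
            continue
        if category == "Numeric" and cand.lower() in SMALL_NUMBER_WORDS:
            continue
        kept.append(cand)
    return kept
```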
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect possible new ways of improving JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision to as high as 91% using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric in their own right.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between rules
5.2.1 Description 48
5.2.2 Results 50
5.2.3 Results Analysis 50
5.3 Possible new features – Combination of distance metrics to enhance overall results 50
5.4 Posterior analysis 51
6 Conclusions and Future Work 53
List of Figures
1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations corpora in the command line 43
List of Tables
1 JustAsk baseline best results for the corpus test of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the explosion of the World Wide Web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem we need ways to connect all of these variations.
In summary, the goals of this work are:
– to perform a state-of-the-art survey on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;
– to improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– to test new clustering measures;
– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers, and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 23rd, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers do not entail each other at all, they constitute valid complementary answers, whether for the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com1, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;
– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on mentioned as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we will present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N <= 4) – in other words, how many identical sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1..N} Count_match(n-gram) / Count(n-gram)  (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
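Equation 1 can be sketched directly in Python. This is a hedged sketch, assuming whitespace tokenization and n-grams counted against the shorter sentence, not Barzilay and Lee's exact implementation:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1: str, s2: str, n_max: int = 4) -> float:
    """Word Simple N-gram Overlap (Equation 1), averaging over n = 1..n_max."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total, used = 0.0, 0
    for n in range(1, n_max + 1):
        short_grams = ngrams(shorter, n)
        if not short_grams:
            continue  # shorter sentence has no n-grams of this size
        long_grams = set(ngrams(longer, n))
        matches = sum(1 for g in short_grams if g in long_grams)
        total += matches / len(short_grams)
        used += 1
    return total / used if used else 0.0
```

Identical sentences score 1.0; sentences with no word in common score 0.0.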
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function measuring how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = -log2(λ / x)  (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:
logSim(S1, S2) = L(x, λ),          if L(x, λ) < 1.0
                 e^(-k * L(x, λ)), otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ <= y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1 - α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
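The base (two-parameter) LogSim of Equations 2 and 3 can be sketched as follows. This implements the formulas exactly as stated in the text and is only a sketch of Cordeiro et al.'s measure: link counting here is plain common-word intersection, with no stopword handling, and k defaults to the 2.7 used in their experiments.

```python
import math

def logsim(s1: str, s2: str, k: float = 2.7) -> float:
    """LogSim per Equations 2-3: lambda = number of exclusive word links,
    x = length of the longer sentence. Sketch of the base variant only;
    note that, as literally defined, identical sentences yield L = 0."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))   # simplistic link counting
    if links == 0:
        return 0.0
    x = max(len(w1), len(w2))
    l = -math.log2(links / x)        # Equation 2
    return l if l < 1.0 else math.exp(-k * l)  # Equation 3
```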
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited: by counting the number of nominalizations (i.e., nouns produced from another part of speech, like 'combination' from 'combine') through a semi-automatic method, they concluded that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases, the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kind of equivalences involving Numeric Entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
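A toy canonicalizer in the spirit of Zhang and Patrick's approach might look like the sketch below. The specific rules shown (a few future-tense patterns and one date format) are illustrative assumptions, not their actual rule set, which also covers voice transformation via dependency parses:

```python
import re

# Hypothetical future-tense constructions to map onto 'will'.
FUTURE_PATTERNS = [r"\bplans to\b", r"\bexpects to\b", r"\bis going to\b"]

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4,
          "may": 5, "june": 6, "july": 7, "august": 8,
          "september": 9, "october": 10, "november": 11, "december": 12}

def canonicalize(sentence: str) -> str:
    """Map a sentence to a 'generic' form before lexical matching."""
    s = sentence.lower()
    for pat in FUTURE_PATTERNS:
        s = re.sub(pat, "will", s)
    # Normalize dates like '24 october 1993' to an ISO-style token.
    def iso(m):
        return f"{m.group(3)}-{MONTHS[m.group(2)]:02d}-{int(m.group(1)):02d}"
    s = re.sub(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b", iso, s)
    return s
```

Two surface-different sentences then become comparable by plain lexical overlap once both are canonicalized.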
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max {
    s(S1, S2−1) − skip penalty,
    s(S1−1, S2) − skip penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}    (5)

with the skip penalty ensuring that only sentence pairs with a high similarity get aligned.
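The recurrence above can be sketched as follows, assuming sentences are compared with a plain bag-of-words cosine; the penalty values are illustrative placeholders, not the ones tuned by Barzilay and Elhadad:

```python
import math
from collections import Counter

def cosine(s1, s2):
    """Plain bag-of-words cosine between two sentences (strings)."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def align(sents1, sents2, mismatch_penalty=0.1, skip_penalty=0.2):
    """Weight of the optimal alignment between two lists of sentences,
    following recurrence (5); sim is formula (4)."""
    def sim(i, j):
        return cosine(sents1[i], sents2[j]) - mismatch_penalty
    n, m = len(sents1), len(sents2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - skip_penalty,
                     s[i - 1][j] - skip_penalty,
                     s[i - 1][j - 1] + sim(i - 1, j - 1)]
            if j >= 2:   # one sentence of S1 aligned with two of S2
                cands.append(s[i - 1][j - 2] + sim(i - 1, j - 1)
                             + sim(i - 1, j - 2))
            if i >= 2:   # one sentence of S2 aligned with two of S1
                cands.append(s[i - 2][j - 1] + sim(i - 1, j - 1)
                             + sim(i - 2, j - 1))
            if i >= 2 and j >= 2:  # crossed two-by-two alignment
                cands.append(s[i - 2][j - 2] + sim(i - 1, j - 2)
                             + sim(i - 2, j - 1))
            s[i][j] = max(cands)
    return s[n][m]
```

With identical one-sentence lists the result is simply cos − mismatch penalty, as expected from formulas (4) and (5).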
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods, such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w)
                    + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that
matches it best in T2, given by maxSim(w, T2), and then apply the specificity via IDF; these weighted scores are summed and normalized by the total IDF mass of the segment in question. The same process is applied to T2, and in the end we take the average of the T1-to-T2 direction and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.
– Lin – takes the Resnik measure and normalizes it, also using the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
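The interplay of maxSim and IDF in formula (6) can be sketched as follows. The word-to-word similarity is left pluggable: the trivial exact-match placeholder below stands in for the eight corpus- and knowledge-based metrics just listed (in practice one would plug in, e.g., NLTK's WordNet `wup_similarity`):

```python
def exact_match(w1, w2):
    """Placeholder word-to-word similarity; one of the eight metrics
    above would be plugged in here in a real implementation."""
    return 1.0 if w1 == w2 else 0.0

def text_similarity(t1, t2, idf, word_sim=exact_match):
    """Formula (6): IDF-weighted average of maxSim in both directions.
    t1, t2: lists of words; idf: dict of IDF weights (1.0 by default)."""
    def directed(a, b):
        den = sum(idf.get(w, 1.0) for w in a)
        num = sum(max((word_sim(w, o) for o in b), default=0.0)
                  * idf.get(w, 1.0) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

With a uniform IDF, two segments sharing half of their words score 0.5, and identical segments score 1.0; raising the IDF of a word makes a match (or mismatch) on that word weigh more.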
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide whether one text can be entailed from another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:
sim(a, b) = (a · W · b) / (|a| |b|)    (7)
with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them achieving a higher accuracy score than the approaches of Mihalcea, Qiu, or Zhang and Patrick, and five also achieving a better precision rating than those three.
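Formula (7) is straightforward to sketch with NumPy, assuming W has already been filled with word-pair similarities from one of the WordNet metrics:

```python
import numpy as np

def matrix_similarity(a, b, W):
    """Formula (7): sim(a, b) = (a . W . b) / (|a| |b|).
    a, b: binary occurrence vectors over the pair's joint vocabulary;
    W: similarity matrix, W[i][j] = similarity of word i and word j."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

With W as the identity matrix the measure degenerates into a plain cosine over binary vectors; the off-diagonal WordNet scores are what let two sentences match through synonyms rather than shared tokens.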
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are computed over weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
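The decision rule of formula (8) can be sketched as follows, with dependencies reduced to plain (head, modifier) pairs. Only the constant 14 and the 6-dependency margin come from the paper; the add-one smoothing and the threshold default are illustrative simplifications of the weighted scheme:

```python
def paraphrase_score(deps1, deps2, threshold=0.5):
    """Formula (8): ratio of similarity to dissimilarity over two sets
    of (head, modifier) dependency pairs. Add-one smoothing keeps the
    ratio finite; Lintean and Rus weight each dependency instead."""
    common = deps1 & deps2
    unpaired1, unpaired2 = deps1 - deps2, deps2 - deps1
    sim = 1.0 + len(common)
    diss = 1.0 + len(unpaired1) + len(unpaired2)
    # Penalize one-sided extra information: the paper adds a constant
    # (14) when the longer sentence has more than 6 extra unpaired
    # dependencies compared with the shorter one.
    if abs(len(unpaired1) - len(unpaired2)) > 6:
        diss += 14
    score = sim / diss
    return score, score > threshold
```

Identical dependency sets maximize the ratio, while a long one-sided remainder drives the score well below any reasonable threshold.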
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting ideas, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.
For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1,087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. In this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives a natural language question as input and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists of generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup
In this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 · Precision · Recall / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks over a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i    (14)
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched for in the Los Angeles Times collection of 1994. We call 'references' the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content that has no relation whatsoever to the actual answer; but it helps finding and writing down correct answers, compared with the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of enabling the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
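The core loop of the application reduces to a substring check per snippet. A sketch with the file handling omitted and the prompt function injected for testability; the names are ours, not those of the actual script:

```python
def collect_references(questions, snippets_per_question, ask=input):
    """Interactively build the references corpus: for every snippet of
    every question, an input is accepted as a reference only if it is
    a substring of the snippet's Content field."""
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets_per_question.get(q, []):
            candidate = ask("%s\n%s\nreference> " % (q, content)).strip()
            if candidate and candidate in content:
                references[q].append(candidate)
    return references
```

This mirrors the behaviour described above: any word occurring in the content is accepted, related to the answer or not, so the human in the loop remains responsible for picking sensible references.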
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1,768 questions, from which we kept only the questions for which we knew the New York Times had a correct answer. 1,134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1,768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1,134 questions, we noticed that many of them are actually interconnected (some even losing their original referent after this filtering step) and thereby lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost referent: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referents, because the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
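The pronoun filter itself reduces to a per-question membership test; a sketch covering only the pronoun list above, not the hand-picked dead-end questions:

```python
import re

# Third-party pronouns that signal a referent living in an earlier question.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def is_self_contained(question):
    """Keep only questions that do not lean on another question's
    referent through a third-party pronoun."""
    return not any(tok in PRONOUNS
                   for tok in re.findall(r"[a-z]+", question.lower()))

questions = ["When did she die?",
             "Where was the festival held?",
             "Who was his father?"]
kept = [q for q in questions if is_self_contained(q)]
```

Tokenizing with a word-level regular expression avoids false hits on substrings ('it' inside 'capital', for instance), which a naive substring test would produce.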
But much less obvious cases were still present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original referent 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carries topic information that is often not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1,309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1,210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.
We also wondered whether the big problem lay in the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of those correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
Extracted      Google – 8 passages           Google – 32 passages
Answers        Questions  CAE      AS        Questions  CAE      AS
                          success  success              success  success
0              21         0        –         19         0        –
1 to 10        73         62       53        16         11       11
11 to 20       32         29       22        13         9        7
21 to 40       11         10       8         61         57       39
41 to 60       1          1        1         30         25       17
61 to 100      0          –        –         18         14       11
> 100          0          –        –         5          4        3
All            138        102      84        162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves its best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted; when that number increases, the accuracies show no concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity the system only extracts one final answer no
                        Google – 64 passages
Extracted Answers (#)   Questions   CAE success   AS success
0                        17           0             0
1 to 10                  18           9             9
11 to 20                  2           2             2
21 to 40                 16          12             8
41 to 60                 42          38            25
61 to 100                38          32            18
> 100                    35          30            17
All                     168         123            79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.
             Google – 8 passages    Google – 32 passages   Google – 64 passages
      Total    Q   CAE   AS           Q   CAE   AS           Q   CAE   AS
abb       4    3     1    1           3     1    1           4     1    1
des       9    6     4    4           6     4    4           6     4    4
ent      21   15     1    1          16     1    0          17     1    0
hum      64   46    42   34          55    48   36          56    48   31
loc      39   27    24   19          34    27   22          35    29   21
num      63   41    30   25          48    35   25          50    40   22
All     200  138   102   84         162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages across these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and
was therefore used permanently while searching for even better results.
The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%           28.2%           27.4%
Recall          77.2%           67.2%           76.9%
Accuracy        20.6%           19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.
Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages     20       30       50       100
Accuracy   21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long to run as 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: give more weight to a candidate answer if that answer is within a certain predefined distance threshold of the keywords used in the search.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs 'Elizabeth II'
– 'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
– 'River Glomma' vs 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
– 'Calvin Broadus' vs 'Broadus'
Solution: we might write a small set of hand-crafted rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for this particular case? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer: LogSim gives approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs 'Mount Kenya National Park'
– 'Bavaria' vs 'Overbayern Upper Bavaria'
– '2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either answer as correct, with 'Ferrara' included in Italy (and being the more specific answer), but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers?' and 'Which answer should we consider as representative if we group these answers in a single cluster?'
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. Even similar names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, a few remain unresolved, such as:
– 'Kotashaan' vs 'Khotashan'
– 'Grozny' vs 'GROZNY'
Solution: normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such
as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want our system to do: detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics into more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
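The architecture in Figure 5 can be sketched as follows. The Scorer interface and the two scorer names come from the caption; the method signatures, the Relation type and the exact weights are illustrative assumptions, not JustAsk's actual code.

```java
import java.util.List;

// Sketch of the relations module in Figure 5 (names from the caption; the
// Relation class and all signatures are illustrative assumptions).
interface Scorer {
    // returns the boost to add to the answer's current score
    double score(Answer answer);
}

class Answer {
    String text;
    List<Relation> relations;   // relations this answer shares with other answers
    Answer(String text, List<Relation> relations) { this.text = text; this.relations = relations; }
}

class Relation {
    enum Type { FREQUENCY, EQUIVALENCE, INCLUDES, INCLUDED_BY }
    Type type;
    Relation(Type type) { this.type = type; }
}

// counts exact repetitions of the same answer
class BaselineScorer implements Scorer {
    public double score(Answer a) {
        return a.relations.stream().filter(r -> r.type == Relation.Type.FREQUENCY).count();
    }
}

// boosts answers for each equivalence (1.0) and inclusion (0.5) they take part in,
// as in the first weighting experiment described in the text
class RulesScorer implements Scorer {
    public double score(Answer a) {
        double s = 0;
        for (Relation r : a.relations) {
            if (r.type == Relation.Type.EQUIVALENCE) s += 1.0;
            else if (r.type == Relation.Type.INCLUDES || r.type == Relation.Type.INCLUDED_BY) s += 0.5;
        }
        return s;
    }
}
```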
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
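Such a conversion normalizer can be sketched as follows. For brevity, only the distance case (kilometres to miles) is shown; the class name and regular expression are illustrative assumptions, not the actual JustAsk code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of a measure normalizer: rewrites metric distances in an
// answer string into imperial units, so that e.g. '300 km' and '186.4 miles'
// become directly comparable at clustering time.
public class MeasureNormalizer {
    private static final Pattern KM =
        Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:km|kilometres|kilometers)");

    public static String normalize(String answer) {
        Matcher m = KM.matcher(answer);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            double km = Double.parseDouble(m.group(1));
            double miles = km * 0.621371;   // 1 km = 0.621371 miles
            m.appendReplacement(sb, String.format(Locale.US, "%.1f miles", miles));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Analogous patterns and factors would handle meters to feet, kilograms to pounds and Celsius to Fahrenheit.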
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T', or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
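Together, the abbreviation and acronym normalizers amount to something like the following sketch. The word lists are the samples mentioned in the text; the class itself, the exception list and the lookup strategy are illustrative assumptions.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the abbreviation and acronym normalizers: strips
// leading titles (with exceptions such as 'Queen' the band and 'Mr. T') and
// expands a few acronyms to their normalized forms.
public class AbbreviationNormalizer {
    private static final Set<String> TITLES =
        Set.of("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.");
    // answers where stripping would destroy the entity itself
    private static final Set<String> EXCEPTIONS = Set.of("Queen", "Mr. T");
    private static final Map<String, String> ACRONYMS =
        Map.of("US", "United States", "UN", "United Nations");

    public static String normalize(String answer) {
        if (EXCEPTIONS.contains(answer)) return answer;
        String result = answer;
        for (String t : TITLES) {
            // only strip titles at the very beginning of the answer
            if (result.startsWith(t + " ")) result = result.substring(t.length() + 1);
        }
        return ACRONYMS.getOrDefault(result, result);
    }
}
```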
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
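The steps above can be sketched as follows; the threshold (20 characters) and boost (5 points) come from the text, while the class, signature and the use of first occurrences are illustrative assumptions.

```java
// Illustrative sketch of the proximity feature: boost a candidate answer's
// score when it appears, on average, within 20 characters of the query
// keywords in the passage.
public class ProximityBooster {
    static final int THRESHOLD = 20;   // average character distance
    static final double BOOST = 5.0;   // points added to the answer's score

    public static double boost(String passage, String answer, String[] keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return 0;
        double total = 0;
        int found = 0;
        for (String k : keywords) {
            int kPos = passage.indexOf(k);   // first occurrence of the keyword
            if (kPos >= 0) { total += Math.abs(kPos - answerPos); found++; }
        }
        if (found == 0) return 0;
        return (total / found) <= THRESHOLD ? BOOST : 0;
    }
}
```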
4.3 Results
                  200 questions                1210 questions
Before features   P: 50.6%  R: 87.0%  A: 44.0%   P: 28.0%  R: 79.5%  A: 22.2%
After features    P: 51.1%  R: 87.0%  A: 44.5%   P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus of relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for doubt about the relation type.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
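The mapping from the annotator's command to the label written in the output file can be sketched as follows; the method is hypothetical (the actual script is not shown in the text), and the labels follow the list above.

```java
// Illustrative sketch of the judgment-command-to-label mapping used by the
// annotation application; a null return signals an invalid command, prompting
// the user to repeat the judgment.
public class RelationJudgment {
    public static String label(String command) {
        switch (command.toUpperCase()) {
            case "E":  return "EQUIVALENCE";
            case "I1": return "INCLUSION";   // first answer includes the second
            case "I2": return "INCLUSION";   // second answer includes the first
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADICTORY ANSWERS";
            case "D":  return "IN DOUBT";
            default:   return null;          // unrecognized: ask again
        }
    }
}
```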
Figure 6: Execution of the application to create relations corpora on the command line.
Over the next chapter we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics were added that had been used in a Natural Language project in 2010/2011 at Tagus Park. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|     (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the number of distinct elements appearing in either of them.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)     (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five common bigrams out of seven distinct ones).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)     (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l·p·(1 − dj)     (18)

with dj being the Jaro distance between the two sequences, l the length of their common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min { D(i − 1, j − 1) + G,  D(i − 1, j) + G,  D(i, j − 1) + G }     (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, aimed at matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers sharing common affixes, such as 'acting' and 'enact'.
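Several of the metrics above can be sketched compactly. The following is illustrative code (not the SimMetrics implementations actually plugged into the distances package), covering bigram-based Jaccard and Dice (Equations 15 and 16) and Jaro/Jaro-Winkler (Equations 17 and 18); the prefix cap of 4 in Jaro-Winkler follows Winkler's standard formulation.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketches of four of the similarity metrics described above.
public class SimilarityMetrics {

    static Set<String> bigrams(String s) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) set.add(s.substring(i, i + 2));
        return set;
    }

    // Equation 15 over bigram sets: |A ∩ B| / |A ∪ B|
    public static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x); inter.retainAll(y);
        Set<String> union = new HashSet<>(x); union.addAll(y);
        return (double) inter.size() / union.size();
    }

    // Equation 16 over bigram sets: 2|A ∩ B| / (|A| + |B|)
    public static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x); inter.retainAll(y);
        return 2.0 * inter.size() / (x.size() + y.size());
    }

    // Equation 17: characters match within a sliding window; t is half the
    // number of matched characters that are out of order
    public static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int window = Math.max(s1.length(), s2.length()) / 2 - 1;
        boolean[] m1 = new boolean[s1.length()], m2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window), hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++)
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) { m1[i] = m2[j] = true; matches++; break; }
        }
        if (matches == 0) return 0.0;
        int transposed = 0, k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k++)) transposed++;
        }
        double m = matches, t = transposed / 2.0;
        return (m / s1.length() + m / s2.length() + (m - t) / m) / 3.0;
    }

    // Equation 18, with the common prefix length l capped at 4 and p = 0.1
    public static double jaroWinkler(String s1, String s2) {
        double dj = jaro(s1, s2);
        int l = 0;
        while (l < 4 && l < Math.min(s1.length(), s2.length())
               && s1.charAt(l) == s2.charAt(l)) l++;
        return dj + l * 0.1 * (1 - dj);
    }
}
```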
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold        0.17                       0.33                       0.5
Levenshtein         P: 51.1  R: 87.0  A: 44.5  P: 43.7  R: 87.5  A: 38.5  P: 37.3  R: 84.5  A: 31.5
Overlap             P: 37.9  R: 87.0  A: 33.0  P: 33.0  R: 87.0  A: 33.0  P: 36.4  R: 86.5  A: 31.5
Jaccard index       P:  9.8  R: 56.0  A:  5.5  P:  9.9  R: 55.5  A:  5.5  P:  9.9  R: 55.5  A:  5.5
Dice Coefficient    P:  9.8  R: 56.0  A:  5.5  P:  9.9  R: 56.0  A:  5.5  P:  9.9  R: 55.5  A:  5.5
Jaro                P: 11.0  R: 63.5  A:  7.0  P: 11.9  R: 63.0  A:  7.5  P:  9.8  R: 56.0  A:  5.5
Jaro-Winkler        P: 11.0  R: 63.5  A:  7.0  P: 11.9  R: 63.0  A:  7.5  P:  9.8  R: 56.0  A:  5.5
Needleman-Wunsch    P: 43.7  R: 87.0  A: 38.0  P: 43.7  R: 87.0  A: 38.0  P: 16.9  R: 59.0  A: 10.0
LogSim              P: 43.7  R: 87.0  A: 38.0  P: 43.7  R: 87.0  A: 38.0  P: 42.0  R: 87.0  A: 36.5
Soundex             P: 35.9  R: 83.5  A: 30.0  P: 35.9  R: 83.5  A: 30.0  P: 28.7  R: 82.0  A: 23.5
Smith-Waterman      P: 17.3  R: 63.5  A: 11.0  P: 13.4  R: 59.5  A:  8.0  P: 10.4  R: 57.5  A:  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore     (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))     (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2     (22)
with a1 and a2 being the two answers converted to strings after normalization, length() counting their tokens, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
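Equations 20–22 translate almost directly into code. The following sketch makes explicit two details the text leaves open, and which are therefore assumptions: whitespace tokenization, and first-occurrence matching of common terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the JFKDistance measure (Equations 20-22).
public class JFKDistance {
    public static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.split("\\s+")));
        // count and consume perfect token matches (the "common terms")
        int common = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { it.remove(); common++; }
        }
        int total1 = common + t1.size(), total2 = common + t2.size();
        double commonTermsScore = (double) common / Math.max(total1, total2);   // Eq. 21
        // first-letter matches among the remaining, non-common tokens
        int firstLetterMatches = 0;
        int nonCommonTerms = t1.size() + t2.size();
        for (String x : t1)
            for (String y : t2)
                if (!x.isEmpty() && !y.isEmpty() && x.charAt(0) == y.charAt(0)) {
                    firstLetterMatches++;
                    break;
                }
        double firstLetterMatchesScore = nonCommonTerms == 0
            ? 0 : ((double) firstLetterMatches / nonCommonTerms) / 2;           // Eq. 22
        return 1 - commonTermsScore - firstLetterMatchesScore;                  // Eq. 20
    }
}
```

For the running example, 'Kennedy' is the single common term (score 1/3) and 'J.'/'John' and 'F.'/'Fitzgerald' give two first-letter matches over four non-common terms (score 0.25), for a distance of 5/12 ≈ 0.42 – low enough to cluster the pair.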
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of the Numeric type, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions                1210 questions
Before new metric   P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
After new metric    P: 51.7  R: 87.0  A: 45.0   P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
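Such a dispatch could be as simple as the following sketch; the category and metric names merely illustrate the hypothesis above, and the method itself is an assumption, not part of JustAsk.

```java
// Illustrative sketch of a per-category metric dispatch: pick the clustering
// metric according to the question category, falling back to the baseline.
public class MetricDispatcher {
    public static String metricFor(String category) {
        switch (category) {
            case "Human":    return "Levenshtein";
            case "Location": return "LogSim";
            default:         return "Levenshtein";   // baseline metric
        }
    }
}
```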
For that effort we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are
looking for), here are the results for every metric over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results per category are always 'proportional' to how each metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output over the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.
– Time-changing answers – Some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime minister of Japan?'. We observed that former prime minister Yukio Hatoyama actually earns a higher score than then-current prime minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers are prone to failure, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we treat the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error is in the reference file that presents us with all the possible 'correct' answers, which is potentially missing viable references.
After inserting such missing references, while also removing a contradictory reference or two, the results improved greatly: the system now answers correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
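Two of the fixes proposed above – numeric-candidate filtering and removal of question variations – can be sketched as simple post-processing helpers. The function names, the regular expression and the token-overlap heuristic are ours, offered only as an illustration of the idea, not as JustAsk's actual code:

```python
import re

# (a)/(b)/(c) from the numeric-answers point: drop small spelled-out numbers,
# drop the publishing year, keep only candidates containing digit sequences.
SMALL_NUMBER_WORDS = {"one", "two", "three"}
NUMERIC_RE = re.compile(r"\d[\d,.]*")

def filter_numeric_candidates(candidates, publishing_year=None):
    kept = []
    for cand in candidates:
        if cand.lower() in SMALL_NUMBER_WORDS:
            continue                               # (a) 'one'..'three' as words
        if publishing_year is not None and cand == str(publishing_year):
            continue                               # (b) likely the article's date
        if NUMERIC_RE.search(cand):                # (c) stricter candidate filter
            kept.append(cand)
    return kept

def overlaps_question(answer, question):
    """Discard an answer whose tokens all appear in the question, as with
    'John Famalaro' for the Famalaro question above."""
    q_tokens = {t.lower().strip("?.,!") for t in question.split()}
    return all(t.lower() in q_tokens for t in answer.split())
```

A candidate flagged by `overlaps_question` would be removed before the final answer is selected.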
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we had hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995
[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal
[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009
[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
Appendix
Appendix A – Rules for detecting relations between answers
List of Figures
1  JustAsk architecture, mirroring the architecture of a general QA system . . . 24
2  How the software to create the 'answers corpus' works . . . 28
3  Execution of the application to create answers corpora in the command line . . . 29
4  Proposed approach to automate the detection of relations between answers . . . 38
5  Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion . . . 40
6  Execution of the application to create relations' corpora in the command line . . . 43
List of Tables
1  JustAsk baseline best results for the test corpus of 200 questions . . . 32
2  Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages . . . 32
3  Answer extraction intrinsic evaluation, using 64 passages from the search engine Google . . . 33
4  Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category . . . 33
5  Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus . . . 34
6  Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus . . . 34
7  Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC . . . 42
8  Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded . . . 47
9  Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC . . . 50
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except when all the metrics give the same result . . . 51
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the effort – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as effective as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often in most documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
In summary, the goals of this work are:
– to survey the state of the art on relations between answers and paraphrase identification, taking into special account the approaches that can be most easily adopted by a QA system such as JustAsk;
– to improve the Answer Extraction module by finding and implementing proper methodologies and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– to test new clustering measures;
– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to capture relations of equivalence, from notational variances to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence, if we extend the concept enough), we then discuss the different techniques for detecting paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent but differ in specificity, one entailing the other. They can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers do not entail each other at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
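The four relation types above can be captured as a simple enumeration. This is only a data-structure sketch for illustration, not JustAsk's internal representation:

```python
from enum import Enum

# The four answer-relation types from Webber et al. [44];
# comments summarize the definitions given in the text.
class Relation(Enum):
    EQUIVALENCE = "equivalence"      # consistent, mutual entailment
    INCLUSION = "inclusion"          # consistent, one entails the other
    ALTERNATIVE = "alternative"      # complementary, no entailment
    CONTRADICTORY = "contradictory"  # inconsistent, conjunction invalid
```

A relations module could then tag each answer pair with one of these members before scoring.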
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com1, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;
– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that pairs with a potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1http://www.wordreference.com/definition/paraphrase
2The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words can be found in both sentences. The following formula gives the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) · Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]   (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
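Equation (1) can be sketched directly in a few lines, assuming whitespace tokenization (the function names are ours, and n values with no n-grams in the shorter sentence are skipped):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    """Word Simple N-gram Overlap: average, over n = 1..max_n, of the
    fraction of the shorter sentence's n-grams found in the longer one."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = sorted((t1, t2), key=len)
    total = 0.0
    for n in range(1, max_n + 1):
        short_grams = ngrams(shorter, n)
        if not short_grams:          # shorter sentence has no n-grams of size n
            continue
        long_grams = set(ngrams(longer, n))
        matches = sum(1 for g in short_grams if g in long_grams)
        total += matches / len(short_grams)
    return total / max_n
```

Identical sentences score 1.0 when max_n does not exceed their length; disjoint sentences score 0.0.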
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new metric called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, λ) = −log₂(λ/x)   (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for pairs of sentences that are way too similar). The complete formula is written below:

logSim(S1, S2) = { L(x, λ),          if L(x, λ) < 1.0
                 { e^(−k·L(x, λ)),   otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR Corpus has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of a lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kinds of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
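The future-tense canonicalization described above can be illustrated with a toy regular-expression rewrite. The pattern list is ours ('intends to' is an added hypothetical example), not Zhang and Patrick's actual rule set, which works over dependency trees:

```python
import re

# Toy future-tense canonicalizer: map 'plans to' / 'expects to' style
# constructions onto 'will', as described in the text. Pattern list is
# illustrative only.
FUTURE_RE = re.compile(r"\b(plans to|expects to|intends to)\b", re.IGNORECASE)

def canonicalize_future(sentence):
    """Return the sentence with future-intent constructions mapped to 'will'."""
    return FUTURE_RE.sub("will", sentence)
```

After canonicalization, plain lexical features (edit distance, N-gram overlap) are applied to the rewritten pair.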
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. Many paraphrases convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch penalty   (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max {
    s(S1, S2−1) − skip penalty,
    s(S1−1, S2) − skip penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}   (5)

with the skip penalty ensuring that only sentence pairs with sufficiently high similarity get aligned.
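The recurrence above can be sketched in a few lines. The cosine here is a simple set-based one over word lists, and the penalty values are illustrative placeholders rather than the values used by Barzilay and Elhadad:

```python
from math import sqrt

def cosine(s1, s2):
    """Set-based cosine similarity between two sentences (lists of words)."""
    w1, w2 = set(s1), set(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / sqrt(len(w1) * len(w2))

def align_score(sents1, sents2, mismatch_penalty=0.1, skip_penalty=0.1):
    """Weight of the optimal alignment between two sentence sequences,
    following the dynamic-programming recurrence (5)."""
    def sim(i, j):
        # formula (4): cosine minus mismatch penalty
        return cosine(sents1[i], sents2[j]) - mismatch_penalty

    n, m = len(sents1), len(sents2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,
                s[i - 1][j] - skip_penalty,
                s[i - 1][j - 1] + sim(i - 1, j - 1),
            ]
            if j >= 2:  # one sentence of S1 aligned to two of S2
                candidates.append(s[i - 1][j - 2] + sim(i - 1, j - 1) + sim(i - 1, j - 2))
            if i >= 2:  # two sentences of S1 aligned to one of S2
                candidates.append(s[i - 2][j - 1] + sim(i - 1, j - 1) + sim(i - 2, j - 1))
            if i >= 2 and j >= 2:  # crossed two-to-two alignment
                candidates.append(s[i - 2][j - 2] + sim(i - 1, j - 2) + sim(i - 2, j - 1))
            s[i][j] = max(candidates)
    return s[n][m]
```

For two identical one-sentence texts, the score is simply the cosine (1.0) minus the mismatch penalty.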
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods, such as SimFinder, on two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer was extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word in T2 that matches it the most, given by maxSim(w, T2), and to weight that match by the word's specificity via idf(w); these weighted matches are summed and normalized by the total IDF of the segment in question. The same process is applied to T2, and in the end we take the average of the two directions, from T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
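Formula (6) can be sketched as follows; exact word match stands in for the eight maxSim metrics above, and the IDF is computed over a toy document collection rather than the British National Corpus:

```python
import math

def idf(word, corpus):
    """Inverse document frequency over a document collection
    (a stand-in for the British National Corpus)."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else math.log(len(corpus) + 1)

def maxsim(word, segment):
    """Highest word-to-word similarity between `word` and any word of
    `segment`. Exact match stands in here for the eight corpus- and
    knowledge-based metrics (PMI-IR, LSA, lch, lesk, wup, res, lin, jcn)."""
    return max((1.0 if word == w else 0.0) for w in segment)

def text_similarity(t1, t2, corpus):
    """Formula (6): IDF-weighted, bidirectional average of best word matches."""
    def directed(src, dst):
        num = sum(maxsim(w, dst) * idf(w, corpus) for w in src)
        den = sum(idf(w, corpus) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

Identical segments score 1.0 and fully unrelated ones 0.0; with a real maxSim metric, near-synonyms would contribute partial scores in between.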
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories. Comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and to verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:
sim(a, b) = (a W b) / (|a| |b|)   (7)
with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them achieving an accuracy score higher than either Mihalcea, Qiu or Zhang, and five also achieving a better precision rating than those three approaches.
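A minimal sketch of formula (7), with a trivial exact-match word similarity standing in for the six WordNet metrics:

```python
def matrix_similarity(s1_words, s2_words, word_sim):
    """Formula (7): sim(a, b) = a W b / (|a||b|), where a and b are binary
    word vectors over the joint vocabulary and W holds word-to-word
    similarities. `word_sim` is a placeholder for the WordNet-based metrics
    used by Fernando and Stevenson."""
    vocab = sorted(set(s1_words) | set(s2_words))
    a = [1.0 if w in s1_words else 0.0 for w in vocab]
    b = [1.0 if w in s2_words else 0.0 for w in vocab]
    # a W b: sum over all word pairs, weighted by their similarity
    awb = sum(a[i] * word_sim(vocab[i], vocab[j]) * b[j]
              for i in range(len(vocab)) for j in range(len(vocab)))
    norm = sum(x * x for x in a) ** 0.5 * sum(x * x for x in b) ** 0.5
    return awb / norm if norm else 0.0

# Exact match: with this W (the identity matrix), (7) reduces to the cosine.
identity = lambda w1, w2: 1.0 if w1 == w2 else 0.0
```

The interest of the approach lies precisely in replacing `identity` by a WordNet metric, so that non-identical but semantically related words contribute off-diagonal weight.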
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are applied to weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract the dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are non-paraphrases; Lintean and Rus therefore add a constant value (14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and positions in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
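The scoring scheme can be sketched over plain sets of dependency pairs. The dependency weighting and the WordNet word-to-word similarity of the actual approach are omitted here, the division-by-zero floor is a sketch simplification, and only the constant (14) and the 6-dependency limit are taken from the description above:

```python
def paraphrase_score(deps1, deps2, threshold=1.0, penalty=14, extra_limit=6):
    """Sketch of the Lintean and Rus decision (formula 8) over sets of
    (head, modifier) dependency pairs."""
    common = deps1 & deps2
    unpaired1, unpaired2 = deps1 - deps2, deps2 - deps1
    sim = len(common)
    diss = len(unpaired1) + len(unpaired2)
    longer = max(unpaired1, unpaired2, key=len)
    shorter = min(unpaired1, unpaired2, key=len)
    if len(longer) - len(shorter) > extra_limit:
        diss += penalty  # one sentence adds much more info: no two-way entailment
    diss = max(diss, 1)  # sketch simplification: avoid division by zero
    score = sim / diss
    return score, score >= threshold
```

An identical pair of dependency sets yields a high ratio (paraphrase), while disjoint sets yield 0.0 (non-paraphrase).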
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential of being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several other interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two with negative pairs, in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and the identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo! search engines, while definition questions, like 'What is a parrot?', use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us the correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
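A sketch of such a normalizer, covering only the two date formats of the example (JustAsk's actual normalizer is more extensive):

```python
import re

MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5,
          'june': 6, 'july': 7, 'august': 8, 'september': 9, 'october': 10,
          'november': 11, 'december': 12}

def normalize_date(answer):
    """Convert dates like 'March 21, 1961' or '21 March 1961' to 'D21 M3 Y1961'."""
    text = answer.strip().lower().replace(',', '')
    m = re.match(r'(\w+) (\d{1,2}) (\d{4})$', text)   # 'march 21 1961'
    if m and m.group(1) in MONTHS:
        return 'D%s M%d Y%s' % (m.group(2), MONTHS[m.group(1)], m.group(3))
    m = re.match(r'(\d{1,2}) (\w+) (\d{4})$', text)   # '21 march 1961'
    if m and m.group(2) in MONTHS:
        return 'D%s M%d Y%s' % (m.group(1), MONTHS[m.group(2)], m.group(3))
    return answer  # not a recognized date: leave unchanged
```

With both surface forms mapped to the same canonical string, the clustering step described next can group them with a trivial string comparison.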
Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)
where |X ∩ Y| is the number of tokens shared by the candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer is a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
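The overlap distance (9) and the single-link clustering procedure can be sketched as follows; the 0.5 threshold is an illustrative value, not the tuned one:

```python
def overlap_distance(x, y):
    """Formula (9): token-overlap distance between two candidate answers."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clusters(answers, threshold=0.5, dist=overlap_distance):
    """Single-link clustering: start with singleton clusters and repeatedly
    merge the two closest ones while their distance (the minimum distance
    between any of their members) stays under the threshold."""
    clusters = [[a] for a in answers]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)
```

Each resulting cluster's representative is then its longest answer, e.g. `max(cluster, key=len)`.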
3.2 Experimental Setup
Over this section, we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = (2 × Precision × Recall) / (Precision + Recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i   (14)
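The five measures can be computed directly from the raw counts; `ranks` is assumed to hold, for each question, the rank of the first correct answer (0 when none was returned):

```python
def evaluate(correct, answered, total, ranks):
    """Measures (10)-(14) from raw counts.

    correct  - correctly judged items
    answered - relevant (non-null) items retrieved
    total    - items in the test corpus
    ranks    - per question, rank of the first correct answer (0 if none)
    """
    precision = correct / answered
    recall = answered / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    mrr = sum(1.0 / r for r in ranks if r) / len(ranks)
    return precision, recall, f1, accuracy, mrr
```

With the baseline numbers of Table 1 (88 correct answers, 175 questions answered, 200 questions in total), this yields a precision of about 50.3%, a recall of 87.5% and an accuracy of 44%.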
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo!) – passages which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo! queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding one (or several) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word of the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, compared with the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
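The core validation step of the application – accept a reference only if it occurs in some snippet's content – can be sketched as follows; the names are illustrative, and the real application reads the candidate interactively from the command line:

```python
def add_reference(snippets, references, candidate):
    """Accept `candidate` as a reference if it occurs in the content of any
    of the question's snippets, mirroring the validation step of the
    reference-collection application."""
    if any(candidate in content for content in snippets):
        references.append(candidate)
        return True
    return False

refs = []
snippets = ["Mogadishu is the largest city in Somalia and the nation's capital."]
accepted = add_reference(snippets, refs, 'Mogadishu')  # accepted
rejected = add_reference(snippets, refs, 'Nairobi')    # rejected: not in any snippet
```

As noted above, this check only guarantees that the reference occurs somewhere in the content, not that it actually answers the question; that judgement remains with the human annotator.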
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo! passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
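A minimal sketch of this filtering step (the pronoun list is the one given above):

```python
PRONOUNS = {'he', 'she', 'it', 'her', 'him', 'hers', 'his'}

def has_unresolved_reference(question):
    """Flag questions containing third-party pronouns whose antecedent
    was lost when the questions were detached from their topic."""
    tokens = question.lower().rstrip('?').split()
    return any(t in PRONOUNS for t in tokens)

questions = ['When did she die?', 'Who was his father?',
             'What is the capital of Somalia?']
kept = [q for q in questions if not has_unresolved_reference(q)]
```

Questions flagged by a filter of this kind are the ones discarded to reach the 232-question set.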
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after correcting a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo! search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.
We also questioned whether the big problem was within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
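The paragraph-level strategy can be sketched without Lucene: split each document into paragraphs and keep only those containing the query keywords (here requiring all keywords, which may be stricter than the actual Lucene queries):

```python
def paragraph_passages(document, keywords):
    """Return only the paragraphs of `document` that contain every query
    keyword, instead of the whole document (a simplification of the
    paragraph-level Lucene indexing described above)."""
    keywords = [k.lower() for k in keywords]
    passages = []
    for paragraph in document.split('\n\n'):
        text = paragraph.lower()
        if all(k in text for k in keywords):
            passages.append(paragraph)
    return passages

doc = ("Washington pressed Iraq on sanctions this week.\n\n"
       "Baghdad, the capital of Iraq, saw large protests.")
passages = paragraph_passages(doc, ['capital', 'Iraq'])
```

In the hypothetical 'capital of Iraq' example above, only the paragraph actually mentioning the capital is returned, instead of the whole politics article.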
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3, we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present the distribution of the questions by the number of extracted answers.
Extracted     Google – 8 passages          Google – 32 passages
answers       Questions   CAE     AS       Questions   CAE     AS
              ()          success success  ()          success success
0             21          0       –        19          0       –
1 to 10       73          62      53       16          11      11
11 to 20      32          29      22       13          9       7
21 to 40      11          10      8        61          57      39
41 to 60      1           1       1        30          25      17
61 to 100     0           –       –        18          14      11
> 100         0           –       –        5           4       3
All           138         102     84       162         120     88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4, we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity the system only extracts one final answer no
32
Google ndash 64 passages Extracted Questions CAE AS
Answers () success success0 17 0 01 to 10 18 9 911 to 20 2 2 221 to 40 16 12 841 to 60 42 38 2561 to 100 38 32 18gt 100 35 30 17All 168 123 79
Table 3 Answer extraction intrinsic evaluation using 64 passages from thesearch engine Google
          Google – 8 passages   Google – 32 passages   Google – 64 passages
   Total  Q     CAE   AS        Q     CAE   AS         Q     CAE   AS
abb    4  3     1     1         3     1     1          4     1     1
des    9  6     4     4         6     4     4          6     4     4
ent   21  15    1     1         16    1     0          17    1     0
hum   64  46    42    34        55    48    36         56    48    31
loc   39  27    24    19        34    27    22         35    29    21
num   63  41    30    25        48    35    25         50    40    22
All  200  138   102   84        162   120   88         168   123   79
Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
This category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents a relative increase of approximately 32.2% (57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.
The set of 1210 questions achieved higher precision and accuracy scores than the previous sets of questions, as we can see in Table 5 below.
           1134 questions  232 questions  1210 questions
Precision  26.7%           28.2%          27.4%
Recall     77.2%           67.2%          76.9%
Accuracy   20.6%           19.0%          21.1%
Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'
'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
'River Glomma' vs 'Glomma'
'Consejo Regulador de Utiel-Requena' vs 'Utiel-Requena' vs 'Requena'
'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy'), out of three words for each answer: LogSim is approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59).
With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern Upper Bavaria'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
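Such a converter normalizer can be sketched as follows. This is only an illustration: the function name, the regular expression and the choice of metric units as canonical are our assumptions (the module described later in Section 4.2.1 converts in the opposite direction, e.g. kilometres to miles, but the idea is the same).

```python
import re

# Illustrative measure normalizer: converts common units to a canonical one
# so that answers such as '1000 miles' and '1609 km' become comparable.
# Canonical units chosen here: km for distance, m for height, kg for weight.
CONVERSIONS = {
    "miles":  ("km", 1.609344),
    "feet":   ("m", 0.3048),
    "pounds": ("kg", 0.45359237),
}

def normalize_measure(answer: str):
    """Return (value, canonical_unit) for a numeric answer, or None."""
    match = re.match(r"([\d.,]+)\s*([a-zA-Z]+)", answer)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit in CONVERSIONS:
        target, factor = CONVERSIONS[unit]
        return (round(value * factor, 2), target)
    return (value, unit)
```

With this, the two candidate answers of point 1 map to comparable values: `normalize_measure("1000 miles")` yields `(1609.34, 'km')`, directly comparable with `normalize_measure("300 km")`.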
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) on the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
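A simplified Soundex sketch (ignoring the h/w transparency rule of strict Soundex, which these examples do not need) is enough to confirm that the misspelled pair above, and the 'Glomma'/'Glama' pair from point 4, collapse to the same code:

```python
# Simplified Soundex: first letter kept, remaining consonants mapped to
# digits, adjacent duplicate codes collapsed, vowels reset the previous
# code, result padded/truncated to 4 characters.
CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word: str) -> str:
    word = word.lower()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels/h/w reset the previous code (simplification)
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

Both 'Kotashaan' and 'Khotashan' map to K325, and 'Glomma' and 'Glama' both map to G450, so a Soundex pass at clustering time would merge exactly the pairs that Levenshtein leaves apart.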
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
37
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and we have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of each relation and the answer it is related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
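The scoring scheme just described can be sketched as follows. The weight table and function names are illustrative, not the actual JustAsk code; the weights shown follow the first experiment (1.0 for frequency and equivalence, 0.5 for both directions of inclusion).

```python
# Illustrative relation-based scorer: a candidate's score is boosted by
# weighted counts of the relations attached to it.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0,
           "includes": 0.5, "included_by": 0.5}

def boost_score(base_score, relations):
    """relations: list of (relation_type, related_answer) pairs."""
    return base_score + sum(WEIGHTS.get(rtype, 0.0) for rtype, _ in relations)
```

Testing the Entity hypothesis from [26] then amounts to swapping the equivalence and inclusion weights in `WEIGHTS` for that category only.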
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a useful reminder that we must not, under any circumstances, underestimate the richness of a natural language.
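A sketch of such a title-stripping normalizer, with the exception list the text suggests, could look like the following (names and structure are illustrative, not the actual implementation):

```python
# Illustrative title/abbreviation normalizer with an exception list for
# answers that must stay untouched ('Mr. T', 'Queen' the band, ...).
TITLES = {"Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"}
EXCEPTIONS = {"Mr T", "Queen"}

def strip_titles(answer: str) -> str:
    tokens = answer.replace(".", "").split()
    if " ".join(tokens) in EXCEPTIONS:
        return answer
    # only strip a title when it is the leading token and something follows
    if len(tokens) > 1 and tokens[0] in TITLES:
        tokens = tokens[1:]
    return " ".join(tokens)
```

This clusters 'Queen Elizabeth' with 'Elizabeth II' after stripping, while leaving 'Queen' and 'Mr. T' alone.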
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States', or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
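A minimal sketch of this acronym normalizer; the expansion table shown is illustrative and would be populated per category:

```python
# Illustrative acronym normalizer: maps known acronyms (with or without
# periods) to their expanded forms before clustering.
ACRONYMS = {"US": "United States", "U.S.": "United States",
            "USA": "United States of America",
            "UN": "United Nations", "U.N.": "United Nations"}

def expand_acronym(answer: str) -> str:
    return ACRONYMS.get(answer.strip(), answer)
```

After this pass, 'U.S.' and 'US' land in the same cluster, solving the flagrant example of point 7.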
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
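This proximity boost can be sketched as follows, using the threshold (20 characters) and boost (5 points) from the text; the way mentions are located in the passage (first occurrence only) is our simplification:

```python
# Illustrative proximity boost: average character distance between the
# answer's position and each query keyword's position in the passage.
def proximity_boost(passage: str, answer: str, keywords,
                    threshold: int = 20, boost: float = 5.0) -> float:
    pos = passage.find(answer)
    if pos < 0:
        return 0.0
    distances = [abs(passage.find(kw) - pos)
                 for kw in keywords if passage.find(kw) >= 0]
    if distances and sum(distances) / len(distances) <= threshold:
        return boost
    return 0.0
```

For the Okavango example, a candidate sitting right next to the keywords gets the boost, while one buried far from them in the passage does not.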
4.3 Results
                  200 questions               1210 questions
Before features   P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features    P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If a question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create the relations corpus in the command line.
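The annotation loop just described can be sketched as follows. The label names follow the output format above, while the function structure (injectable `ask`/`out` callbacks) is our own illustrative choice:

```python
from itertools import combinations

# Labels follow the output format described in the text.
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(questions, ask=input, out=print):
    """questions: list of (question_id, [references]) pairs."""
    for qid, refs in questions:
        if len(refs) < 2:
            continue  # single reference: advance to the next question
        for a, b in combinations(refs, 2):
            cmd = ask(f"Q{qid}: '{a}' vs '{b}' [E/I1/I2/A/C/D/skip] ")
            while cmd not in LABELS and cmd != "skip":
                cmd = ask("Please repeat the judgment: ")
            if cmd != "skip":
                out(f"QUESTION {qid}: {LABELS[cmd]} - {a} AND {b}")
```

Running it on question 5 with the judgment 'E' reproduces the example line shown above.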
In the next chapter, we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics which could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements (e.g., character bigrams) the two sequences share by the total number of distinct elements across both sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be Dice = (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):

D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
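To make the formulas above concrete, here are illustrative Python reimplementations of the bigram-based Jaccard and Dice coefficients and of Jaro/Jaro-Winkler (the actual experiments used the SimMetrics Java library; these sketches only mirror the definitions given above):

```python
def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(s1: str, s2: str) -> float:
    a, b = bigrams(s1), bigrams(s2)
    return len(a & b) / len(a | b)          # equation (15)

def dice(s1: str, s2: str) -> float:
    a, b = bigrams(s1), bigrams(s2)
    return 2 * len(a & b) / (len(a) + len(b))  # equation (16)

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):               # count matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                              # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3  # equation (17)

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):         # common prefix, capped at 4
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)             # equation (18)
```

A quick check reproduces the worked 'Kennedy'/'Keneddy' example (Dice 5/6, Jaccard 5/7) and Winkler's classic 'MARTHA'/'MARHTA' pair.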
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures, using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best system results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                   0.33                   0.5
Levenshtein          P51.1* R87.0 A44.5*    P43.7 R87.5 A38.5      P37.3 R84.5 A31.5
Overlap              P37.9 R87.0 A33.0      P33.0 R87.0 A33.0      P36.4 R86.5 A31.5
Jaccard index        P9.8  R56.0 A5.5       P9.9  R55.5 A5.5       P9.9  R55.5 A5.5
Dice Coefficient     P9.8  R56.0 A5.5       P9.9  R56.0 A5.5       P9.9  R55.5 A5.5
Jaro                 P11.0 R63.5 A7.0       P11.9 R63.0 A7.5       P9.8  R56.0 A5.5
Jaro-Winkler         P11.0 R63.5 A7.0       P11.9 R63.0 A7.5       P9.8  R56.0 A5.5
Needleman-Wunch      P43.7 R87.0 A38.0      P43.7 R87.0 A38.0      P16.9 R59.0 A10.0
LogSim               P43.7 R87.0 A38.0      P43.7 R87.0 A38.0      P42.0 R87.0 A36.5
Soundex              P35.9 R83.5 A30.0      P35.9 R83.5 A30.0      P28.7 R82.0 A23.5
Smith-Waterman       P17.3 R63.5 A11.0      P13.4 R59.5 A8.0       P10.4 R57.5 A6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system marked with *.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have other tokens that share their first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score given to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization (length counted in tokens), and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
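A minimal sketch of the metric defined by equations (20)-(22); the tokenization details and the pairing of non-common tokens in order are our assumptions, not necessarily the exact JustAsk implementation:

```python
# Illustrative 'JFKDistance': common-token score plus a half-weighted
# first-letter score for the remaining tokens, per equations (20)-(22).
def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # first-letter matches between the remaining tokens, taken in order
    matches = sum(1 for x, y in zip(rest1, rest2) if x[0] == y[0])
    non_common = max(len(rest1), len(rest2))
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_score - first_letter_score             # eq. (20)
```

For 'J F Kennedy' vs 'John Fitzgerald Kennedy', this gives 1 − 1/3 − 1/2 ≈ 0.17, low enough to cluster the pair that Levenshtein and Overlap keep apart.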
A few flaws in this measure were identified later on, namely the scope of the measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions              1210 questions
Before new metric  P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts, which would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric that offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
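Such a decision rule amounts to a simple category-to-metric dispatch. A minimal sketch, with a purely illustrative mapping (the analysis that follows ended up favouring a single metric across categories):

```python
# Hypothetical category-to-metric dispatch table; names are illustrative.
METRIC_BY_CATEGORY = {"Human": "Levenshtein", "Location": "LogSim"}

def pick_metric(category: str, default: str = "Levenshtein") -> str:
    """Choose the clustering metric for a question category."""
    return METRIC_BY_CATEGORY.get(category, default)
```

The table would be filled in from per-category evaluation results, falling back to the overall best metric for unlisted categories.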
For that effort, we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), the results for every metric incorporated, for the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.
Metric / Category    Human     Location  Numeric
Levenshtein          50.00     53.85*    4.84
Overlap              42.42     41.03     4.84
Jaccard index        7.58      0.00      0.00
Dice                 7.58      0.00      0.00
Jaro                 9.09      0.00      0.00
Jaro-Winkler         9.09      0.00      0.00
Needleman-Wunch      42.42     46.15     4.84
LogSim               42.42     46.15     4.84
Soundex              42.42     46.15     3.23
Smith-Waterman       9.09      2.56      0.00
JFKDistance          54.55*    53.85*    4.84
Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category marked with *; in the Numeric category, most metrics tie at 4.84%.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single immutable answer, one that never changes with time. A clear example is a question such as 'Who is the prime minister of Japan?'. We observed that former prime minister Yukio Hatoyama actually earns a higher score than then-current prime minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily – when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.
After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
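The filtering fix suggested above, discarding candidates that are mere variations of the question, can be sketched as follows (a minimal illustration; the function name and stopword list are ours, not JustAsk's):

```python
import re

# Hypothetical helper (not part of JustAsk): reject candidate answers
# that merely echo content words of the question.
STOPWORDS = {"who", "what", "when", "where", "why", "how", "is", "the",
             "of", "a", "an", "did", "does"}

def is_question_echo(answer, question):
    """Return True if every content token of the answer already occurs
    in the question, i.e. the answer adds no information."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    q_tokens = tokenize(question) - STOPWORDS
    a_tokens = tokenize(answer) - STOPWORDS
    if not a_tokens:
        return True
    return a_tokens <= q_tokens

print(is_question_echo("John Famalaro",
                       "Who is John J. Famalaro accused of having killed?"))  # True
print(is_question_echo("a nurse",
                       "Who is John J. Famalaro accused of having killed?"))  # False
```

A real rule set would also need to handle reorderings and inflected forms, but a subset test over content tokens already catches the Famalaro case.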
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could tackle, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology / North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
List of Tables
1  JustAsk baseline best results for the corpus test of 200 questions
2  Answer extraction intrinsic evaluation using Google and two different amounts of retrieved passages
3  Answer extraction intrinsic evaluation using 64 passages from the search engine Google
4  Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5  Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus
6  Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7  Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
8  Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google; best precision and accuracy of the system in bold
9  Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google; best precision and accuracy of the system in bold, except when all metrics give the same result
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century, but it was with the explosion of the world wide web that they truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], ie they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
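The redundancy principle can be sketched minimally as follows (illustrative only; the lowercasing normalizer is a crude stand-in for the normalizers and clustering discussed later):

```python
from collections import Counter

def pick_answer(candidates, normalize=str.lower):
    """Redundancy principle: after normalizing surface variations,
    return the most frequent candidate and its frequency."""
    counts = Counter(normalize(c) for c in candidates)
    return counts.most_common(1)[0]

# 'JFK' and 'jfk' collapse into one answer once normalized;
# without normalization, the counts would be split
print(pick_answer(["JFK", "jfk", "John Kennedy", "JFK"]))  # ('jfk', 3)
```

The whole point of the work below is that `str.lower` is nowhere near enough: 'John F. Kennedy' and 'JFK' would still be counted as different answers.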
In summary, the goals of this work are:
– To survey the state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be most easily adopted by a QA system such as JustAsk;
– To improve the Answer Extraction module, by finding and implementing proper methodologies and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– To test new clustering measures;

– To create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. Then, since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence, if we extend the concept enough), we discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, ie every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be cases of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
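Assuming the entailment and consistency judgements come from elsewhere in the pipeline, the four relation types above reduce to simple bookkeeping, which can be sketched as (the names are ours, not the terminology module's):

```python
from enum import Enum

class Relation(Enum):
    EQUIVALENCE = "equivalence"      # answers entail each other
    INCLUSION = "inclusion"          # one entails the other (hypernymy/meronymy)
    ALTERNATIVE = "alternative"      # consistent, but no entailment
    CONTRADICTORY = "contradictory"  # conjunction is invalid

def classify(a_entails_b, b_entails_a, consistent):
    """Map a pair of entailment/consistency judgements onto the four
    relation types of Webber et al."""
    if not consistent:
        return Relation.CONTRADICTORY
    if a_entails_b and b_entails_a:
        return Relation.EQUIVALENCE
    if a_entails_b or b_entails_a:
        return Relation.INCLUSION
    return Relation.ALTERNATIVE

# 'lion' entails 'mammal' but not vice versa
print(classify(True, False, True))  # Relation.INCLUSION
```

The hard part, of course, is producing the three boolean judgements in the first place; that is exactly what the paraphrase identification techniques below address.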
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com¹, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification

– (verb) express the same message in different words
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments that express the same idea, in a symmetric relationship in which each entails the other – meaning that pairs with a potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences that should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus², from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages, we present an overview of some of the approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a summarization tool, from a team led by Vasileios Hatzivassiloglou, that clustered highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
¹ http://www.wordreference.com/definition/paraphrase
² The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.
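A primitive feature of the kind SimFinder uses, word co-occurrence with stem-level matching, might be sketched as follows (SimFinder used proper stemming plus WordNet; the crude suffix stripping here is only a stand-in):

```python
def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_cooccurrence(unit1, unit2):
    """Primitive feature: do the two text units share any word,
    comparing stems rather than exact surface forms?"""
    stems1 = {crude_stem(w.lower()) for w in unit1.split()}
    stems2 = {crude_stem(w.lower()) for w in unit2.split()}
    return len(stems1 & stems2) > 0

# 'accused' and 'accusing' both stem to 'accus', so the units match
print(word_cooccurrence("He accused his brother",
                        "The accusing was deliberate"))  # True
```

Composite features would then combine booleans like this one, checking for example that two primitives fire in the same order within a window.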
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words the two sentences share. The following formula gives the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} Count_match(n-gram) / Count(n-gram)    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
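A direct reading of Equation 1 can be sketched as follows (the whitespace tokenization is our simplification):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    """Word Simple N-gram Overlap (Equation 1): for each n up to max_n,
    the fraction of n-grams of the shorter sentence also found in the
    longer one, averaged over all n."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = sorted((t1, t2), key=len)
    total = 0.0
    for n in range(1, max_n + 1):
        grams_short = ngrams(shorter, n)
        if not grams_short:
            continue  # shorter sentence has no n-grams of this size
        grams_long = set(ngrams(longer, n))
        matches = sum(1 for g in grams_short if g in grams_long)
        total += matches / len(grams_short)
    return total / max_n

print(round(ngram_overlap("the cat sat", "the cat sat down"), 2))  # 0.75
```

In the example, the three-word sentence matches fully for n = 1, 2, 3 but contributes nothing for n = 4, hence 3/4 = 0.75.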
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each with its strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer but also less parallel data.
In 2007, Cordeiro and his team [5] analyzed the previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), ie four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ / x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ)            if L(x, λ) < 1.0
                 { e^(−k · L(x, λ))   otherwise            (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, a metric depending only on x and not on y, ie the number of words in the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, combining the MSR Corpus, the Knight and Marcu corpus³, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
³ A corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as edit distance and N-gram overlap. This canonicalization involves in particular number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited: by counting the number of nominalizations (ie producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, they concluded that the MSR has many nominalizations and not enough variety – which excludes a certain group of paraphrases, the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering for instance the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
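A toy canonicalizer in the spirit of Zhang and Patrick's approach might look like this (the mappings shown are illustrative, not their actual rule set, which also covers passive voice and richer number entities):

```python
import re

# Illustrative rules only: number words to digits, and the future-tense
# constructs mentioned above ('plans to' / 'expects to') to 'will'.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def canonicalize(sentence):
    """Rewrite a sentence into a crude canonical form, so that lexical
    matching features compare 'generic' versions of the text."""
    tokens = [NUMBER_WORDS.get(t, t) for t in sentence.lower().split()]
    text = " ".join(tokens)
    return re.sub(r"\b(plans|expects) to\b", "will", text)

print(canonicalize("The company plans to hire three workers"))
# the company will hire 3 workers
```

After canonicalization, the pair ('plans to hire three', 'will hire 3') shares far more n-grams than the surface forms do, which is the whole point of the technique.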
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence – a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max { s(S1, S2−1) − skip_penalty,
                  s(S1−1, S2) − skip_penalty,
                  s(S1−1, S2−1) + sim(S1, S2),
                  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
                  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
                  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2) }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, ie the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the number of documents containing the word. For the corpus they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

    sim(T1, T2) = 1/2 × ( Σ_{w∈T1} (maxSim(w, T2) × idf(w)) / Σ_{w∈T1} idf(w)
                        + Σ_{w∈T2} (maxSim(w, T1) × idf(w)) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), weight that match by the word's specificity via IDF, sum these products, and normalize by the total IDF weight of the segment in question. The same process is applied in the T2 direction, and in the end the two directional scores are averaged. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow (lch) [18], Lesk [1], Wu and Palmer (wup) [46], Resnik [34], Lin [22], and Jiang and Conrath (jcn) [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix has passages as rows and words as columns, with each cell containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, including concepts directly related through hypernymy and meronymy relations.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes from the path length to the least common subsumer (LCS) of the nodes, i.e., the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
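The combination in Equation (6) can be sketched as follows, with maxSim and idf passed in as functions (any of the metrics above could stand behind word_sim; the function names here are ours, not from the original work):

```python
def text_similarity(t1, t2, word_sim, idf):
    """Semantic similarity of two token lists, following Equation (6):
    the average of the two directional, IDF-weighted best-match scores."""
    def directional(src, dst):
        # for each word, take its best match in the other segment,
        # weighted by specificity (IDF), then normalize by total IDF
        num = sum(max(word_sim(w, w2) for w2 in dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directional(t1, t2) + directional(t2, t1))
```

With an exact-match word similarity and uniform IDF this degenerates into simple word overlap, which is a quick way to sanity-check the formula.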
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide whether one text can be entailed from another and vice versa; in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories. Comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of a similarity matrix first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

    sim(a, b) = (a · W · b) / (|a| |b|)    (7)
with W being a semantic similarity matrix containing information about the similarity level of each word pair; in other words, each Wij is a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with a better precision rating than those three as well.
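A minimal sketch of Equation (7), assuming the two binary vectors are built over a shared vocabulary and W is supplied externally (here filled with an identity matrix for testing, where the formula collapses to plain cosine similarity):

```python
import numpy as np

def matrix_similarity(a, b, W):
    """Equation (7): similarity of two binary word vectors a and b,
    mediated by a word-pair semantic similarity matrix W (|V| x |V|),
    and normalized by the vector magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

In the real approach, each off-diagonal Wij would hold a WordNet-derived similarity (Lch, Lesk, Wup, Resnik, Lin or Jcn) between word i and word j.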
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences S1 and S2, and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even when they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are computed over weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (14) to the dissimilarity when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and positions in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
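The scoring scheme of Equation (8) can be sketched over plain sets of dependency triples; note that this is a simplification, since the original weights each dependency by its type and tree depth, which we omit here:

```python
def paraphrase_score(deps1, deps2, extra_penalty=14.0, extra_limit=6):
    """Equation (8) over sets of dependencies: similarity from the
    common dependencies, dissimilarity from the unpaired ones
    (unweighted sketch of the Lintean and Rus score)."""
    common = deps1 & deps2
    un1, un2 = deps1 - deps2, deps2 - deps1
    sim = float(len(common))
    diss = float(len(un1) + len(un2))
    # the constant added when one sentence carries much more information
    # than the other (more than extra_limit extra unpaired dependencies)
    if abs(len(un1) - len(un2)) > extra_limit:
        diss += extra_penalty
    return sim / diss if diss else float("inf")

def is_paraphrase(deps1, deps2, threshold=1.0):
    """Deem the pair a paraphrase when the score exceeds threshold T
    (the threshold value here is illustrative, not tuned on data)."""
    return paraphrase_score(deps1, deps2) > threshold
```

Identical dependency sets produce an infinite score (no dissimilarity at all), while a half-overlapping pair falls below the threshold.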
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics that deal with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.
For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that represents the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identifying the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, learning how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric-type questions), named entities (recognizing four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of types Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish answer variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
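The date normalization step can be sketched as below; this is an illustrative reimplementation covering only the two surface patterns from the example, not the system's actual normalizer:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    """Map 'March 21, 1961' and '21 March 1961' to the canonical form
    'D21 M3 Y1961'; anything unrecognized is returned unchanged."""
    m = re.match(r"(\w+) (\d{1,2}),? (\d{4})", answer)   # Month Day, Year
    if m and m.group(1).lower() in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1).lower()]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+),? (\d{4})", answer)   # Day Month Year
    if m and m.group(2).lower() in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2).lower()]} Y{m.group(3)}"
    return answer
```

Both surface forms of the example collapse into the same canonical string, which is exactly what lets the clustering stage treat them as one answer.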
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer), and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
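The overlap distance of Equation (9) and the single-link merging loop can be sketched as follows; the threshold value is illustrative (the experiments below use 0.33 for Levenshtein, which we reuse here):

```python
def overlap_distance(x, y):
    """Equation (9): token-overlap distance; returns 0.0 when one
    candidate answer equals or is contained in the other."""
    tx, ty = x.lower().split(), y.lower().split()
    shared = len(set(tx) & set(ty))
    return 1.0 - shared / min(len(tx), len(ty))

def single_link_cluster(answers, dist=overlap_distance, threshold=0.33):
    """Single-link agglomerative clustering: start from singleton
    clusters and repeatedly merge the closest pair while the minimum
    inter-cluster distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        link = lambda i, j: min(dist(a, b)
                                for a in clusters[i] for b in clusters[j])
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(*ij))
        if link(i, j) >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Running it on the Mogadishu example groups the contained answer with the longer one while leaving an unrelated candidate in its own cluster.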
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:
Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = (2 × Precision × Recall) / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks over a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) × Σ_{i=1}^{N} (1/rank_i)    (14)
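Equation (14) can be sketched in a few lines, with the convention (ours, for illustration) that a query with no correct answer contributes a reciprocal rank of zero:

```python
def mean_reciprocal_rank(ranks):
    """Equation (14): ranks[i] is the rank of the first correct answer
    for query i, or None when no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)
```

For example, four queries answered correctly at ranks 1, 2, never, and 4 yield (1 + 1/2 + 0 + 1/4) / 4 = 0.4375.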
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call 'references' the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate) since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
   URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content', i.e., our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question, creating the corpus that contains every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference, since it is included in the content section, and adds it to the references list, which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write down any word of the content with no relation whatsoever to the actual answer; but it helps one to find and write down correct answers, compared with the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of allowing us to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
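The validation step at the heart of this tool can be sketched as a simple substring check against the snippet's 'Content' field; the function name and signature below are illustrative, not taken from the actual application:

```python
def add_reference(snippet_content, candidate, references):
    """Accept the typed-in candidate only when it occurs verbatim in
    the snippet's 'Content' field, then record it in the references
    list (sketch of the corpus-building tool's validation step)."""
    if candidate and candidate in snippet_content:
        references.append(candidate)
        return True
    return False
```

This also makes the limitation described above explicit: any substring of the content passes the check, whether or not it is a genuine answer.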
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions), i.e., answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original referent after this filtering step), and thus lose precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost referent: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referent, when the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions containing unknown third-party references, i.e., pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his', plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
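This filtering step amounts to a token-level check against the pronoun list above; a minimal sketch, assuming simple whitespace tokenization:

```python
# the third-party references listed above, whose referent lives
# in a previous question of the same topic
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """Flag questions that carry a pronoun with no in-question
    referent (sketch of the first TREC filtering step)."""
    tokens = question.lower().rstrip("?").split()
    return any(t in PRONOUNS for t in tokens)
```

Anything flagged by this check was dropped from the 1134-question set, leaving the 232 questions mentioned above.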
But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case the original referent, 'Sundance Festival', is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each question from the TREC corpus carried topic information that was often not explicit in the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, in several variations, as they do in Google.
We also questioned whether the big problem lay in the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase as with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance, with a threshold of 0.33, to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidate answers. The tables also present the distribution of the questions by the number of extracted answers.
            Google – 8 passages       Google – 32 passages
Extracted   Questions   CAE    AS     Questions   CAE    AS
answers     ()          succ.  succ.  ()          succ.  succ.
0           21          0      –      19          0      –
1 to 10     73          62     53     16          11     11
11 to 20    32          29     22     13          9      7
21 to 40    11          10     8      61          57     39
41 to 60    1           1      1      30          25     17
61 to 100   0           –      –      18          14     11
> 100       0           –      –      5           4      3
All         138         102    84     162         120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves its best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we risk losing precision, having too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies show no clear tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
            Google – 64 passages
Extracted   Questions   CAE     AS
answers     ()          succ.   succ.
0           17          0       0
1 to 10     18          9       9
11 to 20    2           2       2
21 to 40    16          12      8
41 to 60    42          38      25
61 to 100   38          32      18
> 100       35          30      17
All         168         123     79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

            Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total  Q     CAE   AS        Q     CAE   AS         Q     CAE   AS
abb  4      3     1     1         3     1     1          4     1     1
des  9      6     4     4         6     4     4          6     4     4
ent  21     15    1     1         16    1     0          17    1     0
hum  64     46    42    34        55    48    36         56    48    31
loc  39     27    24    19        34    27    22         35    29    21
num  63     41    30    25        48    35    25         50    40    22
All  200    138   102   84        162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category, such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. The 1210-question set proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
           1134 questions   232 questions   1210 questions
Precision  26.7%            28.2%           27.4%
Recall     77.2%            67.2%           76.9%
Accuracy   20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages retrieved. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module, mainly while clustering the answer candidates (failing to cluster equal or inclusive answers into the same cluster), but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we currently have (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal measuring about 300 km long' (Siyabona Africa Travel (Pty) Ltd, "Popa Falls / Okavango River / Botswana" webpage)
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'
'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
'River Glomma' vs 'Glomma'
'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for this particular case? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim gives approximately 0.014, while Levenshtein actually gives a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens differ, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these mistakes are similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern Upper Bavaria'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the Web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' vs 'GROZNY'
Solution: the solution should simply be to normalize all the answers into a standard form. As for the misspelling issue, a similarity measure such
as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module
4.1.1 Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics built from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers, namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of information we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and another for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
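As a rough sketch, the weighting just described (1.0 for frequency and equivalence, 0.5 per inclusion) amounts to something like the following; the function name, relation labels and flat-list representation are our assumptions, not JustAsk's actual Scorer interface:

```python
# Sketch of the rule scorer's weighting; relation labels and the flat
# list representation are assumptions, not JustAsk's actual API.
WEIGHTS = {'frequency': 1.0, 'equivalence': 1.0, 'inclusion': 0.5}

def boost(base_score, relations):
    """Add the weight of every relation a candidate answer participates in.

    relations: list of relation-type labels attached to one candidate.
    """
    return base_score + sum(WEIGHTS.get(r, 0.0) for r in relations)
```

Under this scheme, a candidate that is equivalent to one answer and included in two others would gain 1.0 + 0.5 + 0.5 on top of its base score.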
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the abbreviation, conversion and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created to deal with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
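A minimal sketch of such a converter, mapping each supported unit to the target unit mentioned above; the function and unit names are illustrative, not the actual normalizer class:

```python
# Sketch of the conversion normalizer: kilometres to miles, meters to feet,
# kilograms to pounds, Celsius to Fahrenheit. Names are illustrative.
FACTORS = {
    'km': ('miles', 0.621371),
    'meters': ('feet', 3.28084),
    'kg': ('pounds', 2.20462),
}

def normalize_measure(value, unit):
    """Return (value, unit) converted to the target unit used for clustering."""
    if unit == 'celsius':                  # temperature is affine, not a plain factor
        return value * 9.0 / 5.0 + 32.0, 'fahrenheit'
    if unit in FACTORS:
        target, factor = FACTORS[unit]
        return value * factor, target
    return value, unit                     # unknown units pass through unchanged
```

With such a converter, '300 km' and '186 miles' can be compared in the same unit before clustering.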
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not remove. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, the extended corpus of 1134 questions provided yet another example we would probably be missing: 'Mr. King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also built to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but for the moment it also only covers very particular cases, leaving room for further extensions.
4.2.2 Proximity Measure
For the problem in point 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
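A sketch of this heuristic using character offsets within the passage; the 20-character threshold and 5-point bonus come from the text, while the offset bookkeeping and function name are our assumptions:

```python
# Sketch of the proximity heuristic: boost an answer whose average character
# distance to the query keywords in the passage is under the threshold.
def proximity_boost(passage, answer, keywords, threshold=20, bonus=5):
    a = passage.find(answer)
    if a < 0:
        return 0                                    # answer not in this passage
    offsets = [passage.find(k) for k in keywords if passage.find(k) >= 0]
    if not offsets:
        return 0                                    # no keyword present
    avg = sum(abs(o - a) for o in offsets) / len(offsets)
    return bonus if avg < threshold else 0
```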
4.3 Results
                  200 questions                1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%    P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%    P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, considering the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.
As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
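The core of that annotation loop might look roughly like this; the label table follows the options and output format shown above, but the function signature and the `read` injection are hypothetical (the real script also handles files and single-reference questions):

```python
# Hypothetical sketch of the relation-annotation loop described above.
LABELS = {'E': 'EQUIVALENCE', 'I1': 'INCLUSION', 'I2': 'INCLUSION',
          'A': 'COMPLEMENTARY ANSWERS', 'C': 'CONTRADICTORY ANSWERS',
          'D': 'IN DOUBT'}

def annotate(qid, pair, read=input):
    """Ask for a judgment on one pair of references; repeat until valid."""
    while True:
        cmd = read()
        if cmd == 'skip':
            return None          # user skipped this pair
        if cmd in LABELS:
            return 'QUESTION %d %s - %s AND %s' % (qid, LABELS[cmd],
                                                   pair[0], pair[1])
        # anything else: ask again
```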
Figure 6: Execution of the application to create the relations corpus in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics were added that had been used for a Natural Language project in 2010/2011 at Tagus Park. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements occurring in either sequence.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)
Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of seven distinct ones).
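The bigram computation in this example is easy to reproduce; a plain sketch (not the SimMetrics implementation itself):

```python
def bigrams(s):
    """Set of overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    # Dice's coefficient over bigram sets (Equation 16)
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    # Jaccard index over bigram sets (Equation 15)
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```

For 'Kennedy' vs 'Keneddy', `dice` gives 10/12 and `jaccard` gives 5/7, matching the worked example above.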
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l·p·(1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Finally, the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
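For illustration, the phonetic coding behind Soundex can be sketched as follows; this is the textbook algorithm (first letter plus up to three digits, with the h/w rule), not necessarily the exact SimMetrics variant:

```python
def soundex(word):
    """Classic 4-character Soundex code: first letter plus up to three digits."""
    codes = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
             **dict.fromkeys('dt', '3'), 'l': '4',
             **dict.fromkeys('mn', '5'), 'r': '6'}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:    # skip repeats of the same code
            out += code
        if ch not in 'hw':           # h and w do not reset the previous code
            prev = code
    return (out + '000')[:4]
```

It maps misspelling pairs such as 'Kotashaan'/'Khotashan' and case variants such as 'Grozny'/'GROZNY' to the same code.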
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5     P 9.9   R 55.5  A 5.5
Dice coefficient      P 9.8   R 56.0  A 5.5     P 9.9   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A 8.0     P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric: 'JFKDistance'
5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but then adds an extra weight for strings that not only share at least one perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens in each answer. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein is closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the example above.
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
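A direct reading of Equations 20-22 can be sketched as follows; whitespace tokenization after normalization is assumed, and the helper names are ours:

```python
def jfk_distance(a1, a2):
    """Sketch of the proposed metric: common tokens, plus half-weight for
    remaining tokens that match on their first letter (Equations 20-22)."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    common_score = len(common) / max(len(t1), len(t2))
    non_common = len(rest1) + len(rest2)
    first_letter = sum(1 for x, y in zip(rest1, rest2) if x[0] == y[0])
    letter_score = (first_letter / non_common) / 2 if non_common else 0.0
    return 1 - common_score - letter_score
```

For 'J F Kennedy' vs 'John Fitzgerald Kennedy', this yields 1 − 1/3 − 0.25 ≈ 0.42, low enough to cluster under a 0.5 threshold, where Levenshtein stays near 0.5.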
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions                1210 questions
Before             P 51.1%  R 87.0%  A 44.5%    P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%    P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features: Combining Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
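Such a decision tree reduces to a category-to-metric map; a sketch with placeholder metric functions (the real ones would be the implementations discussed in Section 5.1):

```python
def levenshtein(a, b):  # placeholder for the real implementation
    return 0.0

def logsim(a, b):       # placeholder for the real implementation
    return 0.0

# Hypothetical per-category dispatch; categories and choices are illustrative.
METRIC_BY_CATEGORY = {'Human': levenshtein, 'Location': logsim}

def metric_for(category, default=levenshtein):
    """Pick the clustering metric to use for a question category."""
    return METRIC_BY_CATEGORY.get(category, default)
```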
For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are
looking for), Table 10 shows the results for every metric incorporated, by the three main categories that can be easily evaluated, Human, Location and Numeric (answers of categories such as Description, obtained by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human     Location   Numeric
Levenshtein          50.00%    53.85%     4.84%
Overlap              42.42%    41.03%     4.84%
Jaccard index        7.58%     0.00%      0.00%
Dice                 7.58%     0.00%      0.00%
Jaro                 9.09%     0.00%      0.00%
Jaro-Winkler         9.09%     0.00%      0.00%
Needleman-Wunsch     42.42%    46.15%     4.84%
LogSim               42.42%    46.15%     4.84%
Soundex              42.42%    46.15%     3.23%
Smith-Waterman       9.09%     2.56%      0.00%
JFKDistance          54.55%    53.85%     4.84%

Table 10: Accuracy results for every similarity metric, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. This stops us from picking best metrics per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, but one that changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as
a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in formats different than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answers (category confusion) – for the question 'What position does Lothar Matthäus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps consider only candidates between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt here, these are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003, companion volume (short papers), NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. (Doklady Akademii Nauk SSSR, 163(4):845–848, 1965.)
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computational Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Acronyms
TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech
1 Introduction
QA systems have been with us since the second half of the 20th century, but it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.
With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.
This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.
QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often, in the most documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
In summary, the goals of this work are:
– To survey the state of the art on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;
– To improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;
– To test new clustering measures;
– To create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to capture relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.
Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on the frequency count and the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. These can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
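The four-way taxonomy above could be operationalised roughly as follows; the normalisation step and the toy hypernym/meronym table are placeholder assumptions for illustration, not JustAsk's implementation:

```python
# Minimal sketch of the four-way taxonomy. A real system would use a
# normalizer plus semantic resources (e.g. WordNet) instead of this table.
HYPERNYMS = {"lion": {"mammal", "animal"}, "Milan": {"Italy", "Europe"}}

def relate(a: str, b: str) -> str:
    if a.strip().lower() == b.strip().lower():
        return "equivalence"
    if b in HYPERNYMS.get(a, set()) or a in HYPERNYMS.get(b, set()):
        return "inclusion"
    # splitting alternative from contradictory needs deeper checks
    return "alternative-or-contradictory"
```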
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;
– (verb) to express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages, we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, by a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S1, S2) = (1/N) · Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining the precision and recall metrics described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
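For reference, the Levenshtein metric discussed here is the classic dynamic-programming edit distance:

```python
# Textbook dynamic-programming Levenshtein distance [19]: minimum number
# of insertions, deletions and substitutions to turn one string into another.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]
```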
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new category, called LogSim, was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, λ) = −log2(λ/x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S1, S2) = L(x, λ),          if L(x, λ) < 1.0
                 e^(−k·L(x, λ)),   otherwise            (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and, therefore, L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
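Equations (2) and (3) can be sketched as follows, with links counted as shared lowercase tokens (a simplification of the paper's exclusive lexical links) and k = 2.7 as reported:

```python
import math

# Sketch of LogSim (Equations 2 and 3). Links are approximated as the set
# of tokens shared by the two sentences.
def log_sim(s1: str, s2: str, k: float = 2.7) -> float:
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))
    x = max(len(w1), len(w2))        # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)        # Equation (2)
    return l if l < 1.0 else math.exp(-k * l)   # Equation (3)
```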
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and their respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
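In the same spirit, a toy canonicalizer might rewrite a couple of such patterns before lexical matching; the patterns below are illustrative assumptions, not Zhang and Patrick's actual rules:

```python
import re

# Toy canonicalization step: future-tense constructions and slash dates
# are mapped to canonical forms before edit distance / n-gram matching.
CANON = [
    (re.compile(r"\b(plans|expects|intends) to\b"), "will"),
    (re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"), r"\3-\1-\2"),  # to Y-M-D
]

def canonicalize(sentence: str) -> str:
    for pattern, repl in CANON:
        sentence = pattern.sub(repl, sentence)
    return sentence
```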
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max { s(S1, S2−1) − skip_penalty,
                  s(S1−1, S2) − skip_penalty,
                  s(S1−1, S2−1) + sim(S1, S2),
                  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
                  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
                  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2) }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
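A sketch of the recurrence in Equation (5), using a plain bag-of-words cosine for Equation (4); the penalty constants, and the single guard for the two-back cases, are simplifying assumptions:

```python
import math

# Sketch of the alignment recurrence of Equation (5) over two paragraphs
# (lists of sentences). Penalty values are illustrative assumptions.
MISMATCH, SKIP = 0.1, 0.2

def cos(s1, s2):
    a, b = s1.split(), s2.split()
    return len(set(a) & set(b)) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def sim(s1, s2):                     # Equation (4)
    return cos(s1, s2) - MISMATCH

def align_score(p1, p2):
    n, m = len(p1), len(p2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - SKIP,
                     s[i - 1][j] - SKIP,
                     s[i - 1][j - 1] + sim(p1[i - 1], p2[j - 1])]
            if i > 1 and j > 1:      # simplified guard for the 2-back cases
                cands += [s[i - 1][j - 2] + sim(p1[i - 1], p2[j - 1]) + sim(p1[i - 1], p2[j - 2]),
                          s[i - 2][j - 1] + sim(p1[i - 1], p2[j - 1]) + sim(p1[i - 2], p2[j - 1]),
                          s[i - 2][j - 2] + sim(p1[i - 1], p2[j - 2]) + sim(p1[i - 2], p2[j - 1])]
            s[i][j] = max(cands)
    return s[n][m]
```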
The results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity, but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) · ( Σ_{w∈T1} maxSim(w, T2)·idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1)·idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in segment T1, the goal is to identify the word that matches it best in T2, given by maxSim(w, T2), then weight that match by the word's specificity via IDF, sum these products, and normalize by the total IDF weight of the segment in question. The same process is applied to T2, and in the end we compute the average between the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below, we offer a brief description of each of these metrics:
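Equation (6) reads directly as code once maxSim and the IDF table are supplied; here both are toy stand-ins (an exact-match similarity and a hand-made IDF table), not the BNC-derived values used by the authors:

```python
# Sketch of Equation (6). word_sim stands in for maxSim(w, T); any of the
# eight measures could be plugged in. The IDF table is a toy assumption.
IDF = {"president": 2.0, "kennedy": 5.0, "shot": 3.0, "was": 0.1}

def word_sim(w, segment):            # exact-match stand-in for maxSim
    return 1.0 if w in segment else 0.0

def text_sim(t1, t2):
    def directed(a, b):
        num = sum(word_sim(w, b) * IDF.get(w, 1.0) for w in a)
        den = sum(IDF.get(w, 1.0) for w in a)
        return num / den
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```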
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
ndash Lesk ndash measures the number of overlaps (shared words) in the definitions(glosses) of two concepts and concepts directly related through relationsof hypernymy and meronymy
ndash Lch ndash computes the similarity between two nodes by finding the pathlength between two nodes in the is minus a hierarchy An is minus a hierarchydefines specializations (if we travel down the tree) and generalizationsbetween concepts (if we go up)
ndash Wup ndash computes the similarity between two nodes by calculating the pathlength from the least common subsumer (LCS) of the nodes ie the most
4httpwwwnatcorpoxacuk
20
specific node two nodes share in common as an ancestor If one node islsquocatrsquo and the other is lsquodogrsquo the LCS would be lsquomammalrsquo
ndash Resnik ndash takes the LCS of two concepts and computes an estimation of howlsquoinformativersquo the concept really is (its Information Content) A conceptthat occurs frequently will have low Information Content while a conceptthat is quite rare will have a high Information Content
ndash Lin ndash takes the Resnik measure and normalizes it by using also the Infor-mation Content of the two nodes received as input
ndash Jcn ndash uses the same information as Lin but combined in a different way
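As a concrete illustration, the combined measure of Equation (6) can be sketched as follows. This is a minimal sketch, assuming `maxsim` is passed in as a function (backed by any of the eight word-to-word metrics above) and that IDF is estimated from a small tokenized document collection:

```python
import math

def idf(word, docs):
    """Inverse document frequency over a list of tokenized documents
    (add-one smoothing to avoid division by zero)."""
    containing = sum(1 for d in docs if word in d)
    return math.log((1 + len(docs)) / (1 + containing))

def text_similarity(t1, t2, maxsim, docs):
    """Mihalcea et al.'s text-to-text similarity (Equation 6).
    maxsim(w, t) returns the best word-to-word similarity between
    w and any word of segment t."""
    def directed(a, b):
        num = sum(maxsim(w, b) * idf(w, docs) for w in a)
        den = sum(idf(w, docs) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

With a toy `maxsim` that returns 1.0 for exact matches and 0.0 otherwise, the measure reduces to an IDF-weighted word overlap.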
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e. methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and the similarity is calculated using the following formula:
sim(~a, ~b) = (~a W ~b) / (|~a| · |~b|)   (7)
with W being a semantic similarity matrix containing information about the similarity level of a word pair – in other words, a number between 0 and 1 for each W_ij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five also with a better precision rating than those three.
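A sketch of Equation (7), assuming the semantic similarity matrix W is built on the fly from a word-to-word metric (`word_sim`, e.g. one of the six WordNet metrics) over a fixed vocabulary:

```python
import math

def matrix_similarity(words1, words2, vocab, word_sim):
    """Fernando and Stevenson's sentence similarity (Equation 7):
    sim(a, b) = (a . W . b) / (|a| |b|), where a and b are binary
    occurrence vectors over vocab and W[i][j] = word_sim(vocab[i], vocab[j])."""
    a = [1.0 if w in words1 else 0.0 for w in vocab]
    b = [1.0 if w in words2 else 0.0 for w in vocab]
    awb = sum(a[i] * word_sim(u, v) * b[j]
              for i, u in enumerate(vocab)
              for j, v in enumerate(vocab))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return awb / norm
```

When `word_sim` is the identity metric (1.0 only for equal words), W is the identity matrix and the formula collapses to the plain cosine similarity of the two binary vectors.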
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
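The decision procedure just described can be sketched as below. This is a simplification that represents each sentence as a set of hashable dependencies and omits Lintean and Rus's extra constant for strongly unbalanced sentence pairs; `weight` stands in for their type- and depth-based dependency weighting:

```python
def paraphrase_score(sim, diss):
    """Lintean and Rus's paraphrase score (Equation 8)."""
    return sim / diss if diss else float("inf")

def is_paraphrase(s1_deps, s2_deps, weight, threshold):
    """Sketch of the decision step: dependencies common to both sentences
    contribute to Sim, unpaired ones to Diss; the pair is deemed a
    paraphrase when Sim/Diss exceeds the threshold T (learned on
    training data)."""
    common = s1_deps & s2_deps
    unpaired = (s1_deps | s2_deps) - common
    sim = sum(weight(d) for d in common)
    diss = sum(weight(d) for d in unpaired)
    return paraphrase_score(sim, diss) > threshold
```

In the real system the matching of dependencies is soft (via the WordNet word-to-word metrics), not the exact set intersection used in this sketch.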
2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e. pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. In this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e. a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

5 http://www.inesc-id.pt
3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e. extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with the normalization, clustering and filtering of the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
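A hypothetical sketch of this normalization step for Numeric:Date answers (the two input formats and the 'D21 M3 Y1961' output follow the example above; the regular expressions are illustrative, not JustAsk's actual ones):

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    """Map 'March 21, 1961' and '21 March 1961' to 'D21 M3 Y1961'."""
    m = re.match(r"(\w+) (\d{1,2}),? (\d{4})$", answer)   # 'March 21, 1961'
    if m and m.group(1).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(2), MONTHS[m.group(1).lower()], m.group(3))
    m = re.match(r"(\d{1,2}) (\w+),? (\d{4})$", answer)   # '21 March 1961'
    if m and m.group(2).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(1), MONTHS[m.group(2).lower()], m.group(3))
    return answer  # leave unrecognized formats untouched
```

Unrecognized strings are returned unchanged, so the normalizer can be applied safely to every candidate answer.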
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations required to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the two clusters. Initially every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
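The overlap distance (Equation 9) and the single-link clustering loop described above can be sketched as follows (candidate answers represented as token lists; scores omitted):

```python
def overlap_distance(x, y):
    """Overlap distance (Equation 9) over token lists: 0.0 when one
    answer is equal to, or contained in, the other."""
    shared = len(set(x) & set(y))
    return 1.0 - shared / min(len(set(x)), len(set(y)))

def single_link_clustering(answers, dist, threshold):
    """Single-link clustering: start with singleton clusters and repeatedly
    merge the two closest clusters (minimum pairwise member distance)
    while that distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: min(dist(a, b)
                                     for a in clusters[p[0]]
                                     for b in clusters[p[1]]))
        d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
        if d >= threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

With a threshold of 0.5, 'Mount Kilimanjaro' and 'Kilimanjaro' end up in the same cluster (distance 0.0), while 'Everest' stays on its own.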
3.2 Experimental Setup

In this section we describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = 2 × (precision × recall) / (precision + recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) · Σ_{i=1..N} 1/rank_i   (14)
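The measures above can be sketched in a few lines, under the assumption that 'correctly judged items' are the correctly answered questions and 'relevant (non-null) items retrieved' are the questions the system attempted to answer:

```python
def evaluate(correct, wrong, no_answer, first_correct_ranks):
    """Equations 10-14: precision, recall, F1, accuracy and MRR.
    first_correct_ranks holds, per evaluated query, the rank of the
    first correct answer in the returned list."""
    attempted = correct + wrong          # non-null answers returned
    total = attempted + no_answer        # items in the test corpus
    precision = correct / attempted
    recall = attempted / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    mrr = sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)
    return precision, recall, f1, accuracy, mrr
```

Plugging in the baseline counts of Table 1 (88 correct, 87 wrong, 25 unanswered) reproduces the 50.3% precision, 87.5% recall and 44% accuracy reported below.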
3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more precise), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 such questions were gathered from the Multieight corpus [25] and used.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single reference (or multiple references). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content such as 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
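The core loop of the reference-collection tool can be sketched as below. This is a hypothetical reconstruction: `read_input` stands in for the command-line prompt, and a typed reference is only accepted if it actually occurs in the snippet's content:

```python
def collect_references(questions, snippets, read_input):
    """Sketch of the reference-collection tool: for each question, show
    the 'Content' of each snippet and accept a typed reference, storing
    it only if it occurs verbatim in the content."""
    references = []                       # an array of arrays, one per question
    for q in questions:
        refs_for_q = []
        for content in snippets.get(q, []):
            answer = read_input(content)  # e.g. input() in the real tool
            if answer and answer in content:
                refs_for_q.append(answer)
        references.append(refs_for_q)
    return references
```

Passing `input` as `read_input` gives the interactive behavior described above; passing a function makes the loop testable.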
3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, which we filtered in order to keep only the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
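This filtering step can be sketched as a simple token test (the pronoun list is the one above; real questions may of course need more robust tokenization):

```python
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """True when the question contains one of the third-party pronouns
    above, i.e. it likely depends on a previous question for context."""
    tokens = question.lower().rstrip("?").split()
    return any(t in PRONOUNS for t in tokens)
```

Questions flagged by this test were discarded, yielding the 232-question subset.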
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carries topic information that is often not explicit within the question itself, the best possible solution should be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.
3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
            Google – 8 passages       Google – 32 passages
Extracted   Questions   CAE    AS     Questions   CAE    AS
Answers                 succ.  succ.              succ.  succ.
0           21          0      –      19          0      –
1 to 10     73          62     53     16          11     11
11 to 20    32          29     22     13          9      7
21 to 40    11          10     8      61          57     39
41 to 60    1           1      1      30          25     17
61 to 100   0           –      –      18          14     11
> 100       0           –      –      5           4      3
All         138         102    84     162         120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
32
            Google – 64 passages
Extracted   Questions   CAE       AS
Answers                 success   success
0           17          0         0
1 to 10     18          9         9
11 to 20    2           2         2
21 to 40    16          12        8
41 to 60    42          38        25
61 to 100   38          32        18
> 100       35          30        17
All         168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
            Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total    Q    CAE    AS        Q    CAE    AS         Q    CAE    AS
abb      4    3      1     1        3      1     1         4      1     1
des      9    6      4     4        6      4     4         6      4     4
ent     21   15      1     1       16      1     0        17      1     0
hum     64   46     42    34       55     48    36        56     48    31
loc     39   27     24    19       34     27    22        35     29    21
num     63   41     30    25       48     35    25        50     40    22
All    200  138    102    84      162    120    88       168    123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
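One possible sketch of this idea, with hypothetical parameters (`max_window` and `boost` would have to be tuned): the candidate earns a score boost when its token span lies close to a query keyword in the passage.

```python
def proximity_boost(passage_tokens, answer_span, keywords,
                    max_window=10, boost=0.5):
    """Hypothetical proximity weighting: answer_span is the (start, end)
    token interval of the candidate in the passage; the boost is granted
    only when a keyword occurs within max_window tokens of the span."""
    start, end = answer_span
    positions = [i for i, t in enumerate(passage_tokens)
                 if t.lower() in keywords]
    if not positions:
        return 0.0
    dist = min(min(abs(p - start), abs(p - end)) for p in positions)
    return boost if dist <= max_window else 0.0
```

In the Okavango example, both passages contain the headword, but the correct candidate ('1000 miles') sits much closer to it, so a distance-based boost could break the tie.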
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

– 'Queen Elizabeth' vs. 'Elizabeth II'
– 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
– 'River Glomma' vs. 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
– 'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer: LogSim is approximately equal to 0.014, while Levenshtein actually gives a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
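The proposed token-level metric can be sketched for the equal-length case discussed above ('J. F. Kennedy' vs. 'John Fitzgerald Kennedy'); the title list is illustrative, and the rule-based normalization of 'Mr. Clinton'-style variants is left out:

```python
TITLES = {"mr", "mrs", "ms", "queen", "king", "president", "dr"}

def same_entity(name1, name2):
    """Sketch of the proposed metric: strip titles, then require at least
    one exact token match plus compatible initials for the remaining
    tokens, aligned in order."""
    t1 = [t.strip(".").lower() for t in name1.split()
          if t.strip(".").lower() not in TITLES]
    t2 = [t.strip(".").lower() for t in name2.split()
          if t.strip(".").lower() not in TITLES]
    if len(t1) != len(t2) or not any(a == b for a, b in zip(t1, t2)):
        return False
    # every aligned pair must be a full match or share the same initial
    return all(a == b or a[0] == b[0] for a, b in zip(t1, t2))
```

This handles abbreviation-style variants ('Mt Kilimanjaro' vs. 'Mount Kilimanjaro') that defeat both Levenshtein and LogSim.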
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern Upper Bavaria'
'2009' vs 'April 3 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
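One flavour such an inclusion rule can take is token containment, which covers pairs like '2009' vs 'April 3 2009'. The sketch below is an assumption about the shape of these rules, not the actual implementation from [26]:

```python
# Minimal sketch of one possible inclusion rule (illustrative assumption):
# an answer 'includes' another when its token set strictly contains the other's.

def tokens(answer):
    """Lower-cased whitespace tokens of an answer string."""
    return set(answer.lower().split())

def includes(a1, a2):
    """True if a1 includes a2, e.g. 'April 3 2009' includes '2009'."""
    return tokens(a2) < tokens(a1)
```

Note that geographic inclusion such as 'Ferrara' vs 'Italy' cannot be caught this way; it requires external gazetteer knowledge.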
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names, such as 'Glomma' and 'Glama', remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, or pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' and 'GROZNY'
Solution: normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend the abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we solved these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers, namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
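The scoring scheme of this first experiment can be sketched as follows; the names are illustrative assumptions, not the actual JustAsk code:

```python
# Hypothetical sketch of the score boosting performed by the Scorer interface.
# Weights match the first experiment described above: 1.0 for frequency and
# equivalence, 0.5 for each relation of inclusion.

FREQUENCY, EQUIVALENCE, INCLUDES, INCLUDED_BY = "freq", "equiv", "incl", "incl_by"

WEIGHTS = {FREQUENCY: 1.0, EQUIVALENCE: 1.0, INCLUDES: 0.5, INCLUDED_BY: 0.5}

def boost_score(base_score, relations):
    """relations: (relation_type, related_answer) pairs attached to a candidate."""
    return base_score + sum(WEIGHTS[rel] for rel, _ in relations)

# A candidate that appears twice and is included by a more specific answer:
score = boost_score(1.0, [(FREQUENCY, "Ferrara"), (INCLUDED_BY, "Ferrara, Italy")])
```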
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.
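A toy sketch combining the conversion, abbreviation, case and acronym normalizers described above (the tables and function names are illustrative assumptions, not JustAsk's actual code):

```python
# Illustrative normalizer sketch. Only a leading title is stripped, so
# 'Queen' (the band) survives intact, as discussed above.

ABBREVIATIONS = ("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.")
ACRONYMS = {"US": "United States", "U.S.": "United States", "UN": "United Nations"}

def normalize(answer):
    answer = answer.strip()
    for abbr in ABBREVIATIONS:
        if answer.startswith(abbr + " "):   # strip only a leading title
            answer = answer[len(abbr) + 1:]
    answer = ACRONYMS.get(answer, answer)   # expand known acronyms
    return answer.lower()                   # solves the GROZNY vs Grozny case

def km_to_miles(km):
    """One of the unit conversions handled by the conversion normalizer."""
    return km * 0.621371
```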
4.2.2 Proximity Measure
For the problem in 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
4.3 Results
                   200 questions            1210 questions
Before features    P 50.6  R 87.0  A 44.0   P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2. Using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements occurring in either of them.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)
Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams of each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71.
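A set-based sketch of this bigram computation (note that the two bigram sets above share five elements, {Ke, en, ne, ed, dy}):

```python
# Bigram-set versions of Dice and Jaccard, as in the 'Kennedy' example above.

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```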
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l·p·(1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, namely 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
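The single-link criterion that these thresholds parameterize can be sketched as a greedy merge loop (a minimal sketch assuming a generic distance function, not JustAsk's actual clusterer):

```python
# Minimal greedy single-link clustering with a distance threshold; the
# stand-in metric below would be replaced by Levenshtein, LogSim, etc.

def single_link_clusters(answers, distance, threshold):
    """Merge two clusters whenever any cross-cluster pair of members
    is closer than the threshold (single-link criterion)."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(x, y) < threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

def toy_distance(x, y):                 # stand-in metric: case-insensitive match
    return 0.0 if x.lower() == y.lower() else 1.0

clusters = single_link_clusters(["Grozny", "GROZNY", "Ferrara"], toy_distance, 0.5)
```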
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1 R 87.0 A 44.5     P 43.7 R 87.5 A 38.5     P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0     P 33.0 R 87.0 A 33.0     P 36.4 R 86.5 A 31.5
Jaccard index        P  9.8 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5     P  9.9 R 55.5 A  5.5
Dice Coefficient     P  9.8 R 56.0 A  5.5     P  9.9 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5
Jaro                 P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Jaro-Winkler         P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0     P 35.9 R 83.5 A 30.0     P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0     P 13.4 R 59.5 A  8.0     P 10.4 R 57.5 A  6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two tokens left unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score given to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches on the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
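Following Eqs. (20) to (22), a sketch of the metric might look like the code below. The tokenization and the exact definition of nonCommonTerms are assumptions: answers are lower-cased with periods dropped, and nonCommonTerms counts the unmatched tokens on both sides.

```python
# Illustrative sketch of the proposed 'JFKDistance', per Eqs. (20)-(22).

def jfk_distance(a1, a2):
    t1 = a1.lower().replace(".", " ").split()
    t2 = a2.lower().replace(".", " ").split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))            # Eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter_matches = sum(1 for x, y in zip(rest1, rest2) if x[0] == y[0])
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0  # Eq. (22)
    return 1 - common_score - fl_score                            # Eq. (20)
```

Under these assumptions, 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' yields a distance of roughly 0.42, low enough to cluster the pair where Overlap and Levenshtein fail.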
5.2.2 Results
Table 9 shows us the results after the implementation of this new distance metric.
                   200 questions            1210 questions
Before             P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0   P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric. Using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features already present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
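The decision tree sketched above could be realized as a simple dispatch table; 'levenshtein' and 'logsim' below are stand-in stubs for the real metric implementations:

```python
# Hypothetical per-category metric selection (illustrative sketch only).

def levenshtein(a, b): ...   # stub for the real Levenshtein distance
def logsim(a, b): ...        # stub for the real LogSim distance

METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim}

def metric_for(category, default=levenshtein):
    return METRIC_BY_CATEGORY.get(category, default)
```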
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric over the three main categories that can be easily evaluated, Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input), are shown in Table 10.
Metric / Category   Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance         54.55    53.85      4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across the categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, but one that changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling in such missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision, as high as 91, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we should try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
1 Introduction
QA systems have been with us since the second half of the 20th century, but it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the effort – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a generic QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.
In summary, the goals of this work are:

– to perform a state of the art survey on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.
This document is structured in the following way: Section 2 describes related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering
stage and the results achieved; and, finally, in Section 6 we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques described in the literature to detect paraphrases.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether from the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like
'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages, we will present an overview of some approaches and techniques for paraphrase identification in natural language, specially after the release of the MSR Corpus.
2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, by a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap between two sentences (usually with N ≤ 4); in other words, how many common sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} Count_match(n-gram) / Count(n-gram)    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
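As an illustration, Equation 1 can be sketched as follows. This is a minimal sketch, assuming whitespace tokenization and set-based n-gram matching (function names are ours, not from [3]):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    """Average, over n = 1..max_n, of the fraction of overlapping n-grams,
    normalized by the n-gram count of the shorter sentence (Equation 1)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        g1, g2 = set(ngrams(t1, n)), set(ngrams(t2, n))
        shorter = min(len(ngrams(t1, n)), len(ngrams(t2, n)))
        if shorter == 0:
            continue  # one sentence has fewer than n tokens
        total += len(g1 & g2) / shorter
    return total / max_n
```

Identical sentences score 1.0 and fully disjoint ones score 0.0, matching the intended range of the measure.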
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
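Since the Levenshtein distance also underlies one of JustAsk's two current clustering metrics, its classic dynamic-programming recurrence is worth sketching; this is a generic illustration, not the system's actual implementation:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b (classic DP,
    keeping only the previous row of the edit-distance table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[len(b)]
```

For example, 'kitten' and 'sitting' are at distance 3 (two substitutions and one insertion).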
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance, measuring how many words overlap (see Section 3.1.3), are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new category, called LogSim, was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ/x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ),          if L(x, λ) < 1.0
                 { e^(−k·L(x, λ)),   otherwise    (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and, therefore, L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
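The basic (two-parameter) LogSim of Equations 2 and 3 can be sketched as below. This is a hedged illustration: tokenization is naive (so function words like 'a' and 'for' also count as links, whereas the example above counts only four content-word links), and k = 2.7 follows the value reported in the text:

```python
import math

def logsim(s1, s2, k=2.7):
    """LogSim sketch (Equations 2-3): count exclusive lexical links between
    the sentences, apply L(x, lambda) = -log2(lambda/x) with x the length
    of the longer sentence, then fold large penalties with exp(-k * L)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    links = len(set(t1) & set(t2))   # shared word types (naive link count)
    x = max(len(t1), len(t2))
    if links == 0:
        return 0.0                   # no links, no similarity
    penalty = -math.log2(links / x)  # L(x, lambda)
    return penalty if penalty < 1.0 else math.exp(-k * penalty)
```

Note how identical sentences get a score of 0: the metric was designed to penalize quasi-exact duplicates, since Cordeiro's goal was to mine non-trivial paraphrase pairs.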
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features, such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which, again, supports the evidence of a lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
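To make the number-entity idea concrete, here is a toy canonicalizer in the spirit of Zhang and Patrick's normalization (and of JustAsk's Numeric:Date normalizer). The patterns and the ISO-like target form 'YYYY-MM-DD' are our own assumptions, not the thesis' implementation:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def canonicalize_date(text):
    """Map notational variants such as '24 October 1993' and
    'October 24th, 1993' to a single canonical form, '1993-10-24'.
    Unrecognized strings are returned unchanged."""
    t = text.lower().replace(",", "")
    m = re.match(r"(\d{1,2})\s+([a-z]+)\s+(\d{4})$", t)   # day month year
    if m:
        day, month, year = m.group(1), m.group(2), m.group(3)
    else:                                                  # month day year
        m = re.match(r"([a-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?\s+(\d{4})$", t)
        if not m:
            return text
        month, day, year = m.group(1), m.group(2), m.group(3)
    if month not in MONTHS:
        return text
    return "%s-%02d-%02d" % (year, MONTHS[month], int(day))
```

After such a step, a plain string comparison suffices to merge the two date variants into one candidate answer.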
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max {
    s(S1, S2−1) − skip_penalty,
    s(S1−1, S2) − skip_penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
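The recurrence above can be sketched over two lists of sentences as follows. This is a hedged sketch: cosine is computed over bags of words, and the penalty values (0.1 and 0.2) are illustrative choices of ours, not the values used in [2]:

```python
import math

def cosine(s1, s2):
    """Cosine similarity between two sentences as bags of words."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    dot = sum(t1.count(w) * t2.count(w) for w in set(t1) & set(t2))
    norm = math.sqrt(sum(t1.count(w) ** 2 for w in set(t1))) * \
           math.sqrt(sum(t2.count(w) ** 2 for w in set(t2)))
    return dot / norm if norm else 0.0

def sim(s1, s2, mismatch_penalty=0.1):
    """Local similarity of Equation 4."""
    return cosine(s1, s2) - mismatch_penalty

def align(A, B, skip_penalty=0.2):
    """Weight of the optimal alignment between sentence lists A and B,
    filling the table of Equation 5 bottom-up."""
    n, m = len(A), len(B)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [s[i][j - 1] - skip_penalty,
                     s[i - 1][j] - skip_penalty,
                     s[i - 1][j - 1] + sim(A[i - 1], B[j - 1])]
            if j >= 2:   # one sentence of A matches two of B
                cands.append(s[i - 1][j - 2] + sim(A[i - 1], B[j - 1])
                             + sim(A[i - 1], B[j - 2]))
            if i >= 2:   # two sentences of A match one of B
                cands.append(s[i - 2][j - 1] + sim(A[i - 1], B[j - 1])
                             + sim(A[i - 2], B[j - 1]))
            if i >= 2 and j >= 2:  # crossed two-to-two match
                cands.append(s[i - 2][j - 2] + sim(A[i - 1], B[j - 2])
                             + sim(A[i - 2], B[j - 1]))
            s[i][j] = max(cands)
    return s[n][m]
```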
The results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity, but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = (1/2) * ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w)
            + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these products, and normalize by the total IDF weight of the segment in question. The same process is applied to T2, and, in the end, we take the average of the two directions, from T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below, we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which, in this case, decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
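The plumbing of Equation 6 can be sketched independently of the word-to-word metric chosen. In the sketch below, maxsim and idf are pluggable stand-ins of ours (a real run would use PMI-IR, LSA or one of the six WordNet metrics, and corpus-derived IDF); the toy versions reduce the measure to a word-overlap ratio, just to show the structure:

```python
def segment_similarity(t1, t2, maxsim, idf):
    """Equation 6: t1, t2 are token lists; maxsim(w, seg) gives the best
    word-to-word score of w against segment seg; idf(w) gives the
    specificity weight of w. Averages the two directed scores."""
    def directed(src, tgt):
        num = sum(maxsim(w, tgt) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

# Toy stand-ins: exact-match maxsim and uniform idf (our assumptions).
toy_maxsim = lambda w, seg: 1.0 if w in seg else 0.0
toy_idf = lambda w: 1.0
```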
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination'), or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach, based on similarity information derived from WordNet and using the concept of a similarity matrix, first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2, we have therefore two vectors, ~a and ~b, and the similarity is calculated using the following formula:
sim(~a, ~b) = (~a W ~b) / (|~a| |~b|)    (7)
with W being a semantic similarity matrix containing information about the similarity level of a word pair – in other words, a number between 0 and 1 for each W_ij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with a better precision rating than those three metrics.
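Equation 7 can be sketched without any external dependencies. In this illustration, W is a toy matrix over the joint vocabulary (in the original method, each W_ij would come from a WordNet metric):

```python
import math

def matrix_similarity(a, b, W):
    """Equation 7: a, b are binary vectors (lists of 0/1) over the same
    vocabulary; W is a square matrix with W[i][j] in [0, 1] giving the
    similarity of vocabulary words i and j."""
    aWb = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return aWb / norm if norm else 0.0
```

With W set to the identity matrix, the measure falls back to a plain cosine over the binary vectors; the off-diagonal entries are what let semantically related, non-identical words contribute to the score.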
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences, S1 and S2, and then using them to compute a final paraphrase score, as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10], built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach, by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
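The decision rule of Equation 8 is simple enough to sketch on its own. Here the similarity and dissimilarity scores are just numbers handed in by the caller (in the original method they would come from weighted dependency overlap), and the threshold stands in for the value learned on training data:

```python
def is_paraphrase(sim_score, diss_score, threshold):
    """Deem a sentence pair a paraphrase when Sim/Diss exceeds a learned
    threshold; diss_score is assumed strictly positive here."""
    return (sim_score / diss_score) > threshold
```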
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
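The keyword-based strategy can be sketched in a few lines. This is a simplified illustration, not JustAsk's actual formulator: the tokenizer and the stopword list are assumptions.

```python
# Minimal sketch of keyword-based query formulation for factoid questions:
# the query is simply the question's words, minus function words.
# The stopword list here is a small illustrative subset.

STOPWORDS = {"when", "was", "is", "the", "a", "an", "of", "did", "what", "who"}

def formulate_keyword_query(question):
    tokens = question.lower().rstrip("?").split()
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, 'When was Mozart born?' would be reduced to the query 'mozart born'.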
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, the latter with normalization, clustering and filtering of the candidate answers, in order to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
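The date normalization step can be sketched as follows. This is a hedged reconstruction covering only the two surface patterns from the example above; the real normalizer handles more formats.

```python
# Sketch of date normalization: both 'March 21, 1961' and '21 March 1961'
# map to the same canonical form 'D21 M3 Y1961'. Only two surface patterns
# are handled here, for illustration.
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(text):
    t = text.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", t)      # 'March 21 1961'
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", t)      # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return text                                       # leave unknown formats as-is
```

With this canonical form, equivalent dates compare equal and fall into the same cluster for free.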
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|) (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As previously mentioned, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer), and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup
Over this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus (11)

F-measure (F1):

F = (2 * Precision * Recall) / (Precision + Recall) (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * Σ_{i=1}^{N} 1 / rank_i (14)
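Formula (14) translates directly into code. The convention that a query with no correct answer contributes 0 is a common one, assumed here.

```python
# Direct transcription of formula (14): average of 1/rank_i over the queries,
# where rank_i is the rank of the first correct answer in each ranked list.
# Queries with no correct answer in the list contribute 0 (assumed convention).

def mean_reciprocal_rank(ranked_results, gold_answers):
    total = 0.0
    for results, gold in zip(ranked_results, gold_answers):
        for rank, answer in enumerate(results, start=1):
            if answer == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

For instance, a correct answer at rank 2 in one query and rank 1 in another yields (0.5 + 1.0) / 2 = 0.75.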
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references to the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhausting in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu
Title=Mogadishu - Wikipedia, the free encyclopedia
Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...
Rank=1
Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content': our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question, creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference, since it is included in the content section, and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content which has no relation whatsoever to the actual answer. But it helps one to find and write down correct answers, compared to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially supporting the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
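The acceptance check at the heart of this tool can be sketched as follows. The function and data-structure names are ours, chosen for illustration, not the actual implementation's.

```python
# Sketch of the core check of the reference-collection tool: a typed-in
# candidate is accepted as a reference only if it occurs verbatim in the
# snippet's 'Content' field, and is then appended to that question's list.

def add_reference(references, question_id, content, candidate):
    # references: dict mapping question id -> list of accepted references
    if candidate in content:
        references.setdefault(question_id, []).append(candidate)
        return True
    return False
```

This mirrors the behavior described above: any substring of the content is accepted, which is permissive, but makes collecting references far faster than the fully manual approach.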
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we kept only the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions), i.e., answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost referent: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referent, because the question itself references the previous one, and so forth?
The first and most simple solution was to filter out the questions that contained unknown third-party references (pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his'), plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
But there were still many less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original referent 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy, with candidate answers not appearing as often, in several variations, as in Google.
We also wondered whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
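The paragraph-level indexing idea can be sketched as follows. A toy inverted index stands in for Lucene here, purely for illustration; the paragraph-splitting heuristic is an assumption.

```python
# Sketch of paragraph-level indexing: instead of indexing whole articles,
# split each document into paragraphs and index every paragraph as its own
# retrievable passage. A minimal inverted index stands in for Lucene.
from collections import defaultdict

def build_paragraph_index(documents):
    index = defaultdict(set)              # token -> set of paragraph ids
    paragraphs = []
    for doc in documents:
        for para in doc.split("\n\n"):    # assume blank lines separate paragraphs
            pid = len(paragraphs)
            paragraphs.append(para)
            for token in para.lower().split():
                index[token.strip(".,;:!?")].add(pid)
    return index, paragraphs

def retrieve(index, paragraphs, keywords):
    # Return only the paragraphs that contain every query keyword.
    ids = set.intersection(*(index.get(k.lower(), set()) for k in keywords))
    return [paragraphs[i] for i in sorted(ids)]
```

In the 'capital of Iraq' scenario above, only the paragraph actually naming the capital is returned, rather than the whole article dominated by mentions of Washington.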
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200 | Correct: 88 | Wrong: 87 | No Answer: 25

Accuracy: 44% | Precision: 50.3% | Recall: 87.5% | MRR: 0.37 | Precision@1: 35.4%

Table 1: JustAsk baseline, best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions for which one final answer was successfully selected from the list of candidate answers. The tables also present the distribution of the number of questions over the amounts of extracted answers.
Extracted      Google - 8 passages            Google - 32 passages
Answers      Questions  CAE      AS         Questions  CAE      AS
                 (#)    success  success        (#)    success  success
0                21       0       -             19       0       -
1 to 10          73      62      53             16      11      11
11 to 20         32      29      22             13       9       7
21 to 40         11      10       8             61      57      39
41 to 60          1       1       1             30      25      17
61 to 100         0       -       -             18      14      11
> 100             0       -       -              5       4       3
All             138     102      84            162     120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system extracts only one final answer, no
             Google - 64 passages
Extracted    Questions  CAE      AS
Answers         (#)     success  success
0               17        0       0
1 to 10         18        9       9
11 to 20         2        2       2
21 to 40        16       12       8
41 to 60        42       38      25
61 to 100       38       32      18
> 100           35       30      17
All            168      123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
             Google - 8 passages   Google - 32 passages   Google - 64 passages
      Total    Q   CAE   AS          Q   CAE   AS            Q   CAE   AS
abb       4    3     1    1          3     1    1            4     1    1
des       9    6     4    4          6     4    4            6     4    4
ent      21   15     1    1         16     1    0           17     1    0
hum      64   46    42   34         55    48   36           56    48   31
loc      39   27    24   19         34    27   22           35    29   21
num      63   41    30   25         48    35   25           50    40   22
All     200  138   102   84        162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages across all of these categories, showing that more effort should be put into a certain stage for a certain category, such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than before) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
              20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes over its Answer Module: mainly failing to cluster equal or inclusive answers into the same cluster while clustering the answer candidates, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted. For the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:
- 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'
- 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls | Okavango River | Botswana" webpage ...'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
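This proposed fix can be sketched as follows. The window size and boost factor are arbitrary placeholders, not tuned values.

```python
# Hedged sketch of the proposed fix: boost a candidate answer's score when
# it appears close to the query keywords in its supporting passage.
# Window size and boost factor are illustrative placeholders.

def proximity_boost(passage, answer, keywords, window=10, boost=0.5):
    tokens = passage.lower().split()
    try:
        pos = tokens.index(answer.lower().split()[0])
    except ValueError:
        return 0.0                       # answer not found in this passage
    nearby = tokens[max(0, pos - window): pos + window]
    hits = sum(1 for k in keywords if k.lower() in nearby)
    return boost * hits
```

In the Okavango example, '1000 miles' sits in the same clause as both query keywords, so it would receive a larger boost than '300 km', whose passage mentions the river only in a trailing page title.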
2. Parts of person names/proper nouns. Several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
- 'Queen Elizabeth' vs. 'Elizabeth II'
- 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
- 'River Glomma' vs. 'Glomma'
- 'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
- 'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer: LogSim gives us approximately 0.014, while Levenshtein actually gives a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
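The proposed token-level metric can be sketched as follows. This is a strict reading of the rule above (same token count, one exact token match, matching initials elsewhere); the abbreviation-normalization rules would run before it.

```python
# Sketch of the proposed token-level name matcher: two names align if they
# have the same number of tokens, share at least one full token, and agree
# on the initials of the remaining tokens, position by position.

def tokens(name):
    return [t.strip(".") for t in name.split()]

def same_entity(a, b):
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    full_match = False
    for x, y in zip(ta, tb):
        if x.lower() == y.lower():
            full_match = True            # at least one exact token match
        elif x[0].lower() != y[0].lower():
            return False                 # initials in this position disagree
    return full_match
```

This aligns 'J. F. Kennedy' with 'John Fitzgerald Kennedy', while rejecting pairs that merely share initials but no full token.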
3. Places/dates/numbers that differ in specificity. Some candidates over the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
- 'Mt Kenya' vs. 'Mount Kenya National Park'
- 'Bavaria' vs. 'Overbayern, Upper Bavaria'
- '2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues. Many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even similar names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
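To illustrate why Soundex helps in exactly these cases, here is a simplified sketch of the algorithm (it omits the standard h/w separator rule, which does not affect these examples).

```python
# Simplified Soundex sketch: keep the first letter, encode the rest as
# digits by sound class, drop vowels, collapse adjacent repeats, pad to 4.
# Spelling variants like 'Glomma'/'Glama' receive the same code.

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"),
         "r": "6"}

def soundex(name):
    name = name.lower()
    digits = [CODES.get(c, "") for c in name]
    out = name[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]
```

Both 'Glomma' and 'Glama' encode to G450, and 'Kotashaan'/'Khotashan' also share a code, so a Soundex equality test would merge exactly the clusters that Levenshtein leaves apart.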
5. Different measures/conversion issues. Such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
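Such a normalizer could be sketched as follows. The conversion table is deliberately tiny and the canonical units (km, kg) are our choice for the example.

```python
# Sketch of the proposed measure normalizer: convert common distance and
# weight units to a canonical one (km / kg) before comparing candidates.
# The conversion table is a small illustrative subset.
import re

TO_CANONICAL = {"miles": ("km", 1.609344), "mile": ("km", 1.609344),
                "km": ("km", 1.0), "feet": ("km", 0.0003048),
                "pounds": ("kg", 0.45359237), "kg": ("kg", 1.0)}

def normalize_measure(answer):
    m = re.match(r"([\d.]+)\s*(\w+)", answer)
    if not m or m.group(2).lower() not in TO_CANONICAL:
        return answer                    # leave unrecognized answers untouched
    unit, factor = TO_CANONICAL[m.group(2).lower()]
    return f"{float(m.group(1)) * factor:.1f} {unit}"
```

After normalization, '1000 miles' and '1609 km' land in the same neighborhood and can be clustered (or flagged as contradictory with '300 km') on the numeric value alone.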
6. Misspellings/formatting issues. Many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved, such as:
- 'Kotashaan' vs. 'Khotashan'
- 'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such
as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms. Seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to cover these acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module

Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over those of inclusion, alternative and especially contradiction, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] has taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses a set of relations it shares with other answers – namely, the type of relation, the answer it relates to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears, without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all, for all the corpora.
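This boosting step can be sketched as follows. The sketch is in Python rather than JustAsk's actual Java code; the function name and data layout are illustrative, and the 1.0/0.5 weights are the ones from the first experiment described above:

```python
def boosted_scores(candidates, relations, weights=None):
    """Score candidate answers by frequency, then boost pairs sharing relations.

    candidates: list of answer strings (repetitions count as frequency).
    relations: list of (i, j, relation_type) tuples over candidate indices.
    """
    weights = weights or {"equivalence": 1.0, "inclusion": 0.5}
    scores = {}
    for i, ans in enumerate(candidates):
        # Frequency baseline: 1.0 for each exact occurrence of the same string.
        scores[i] = sum(1.0 for other in candidates if other == ans)
    for i, j, rel in relations:
        # Both members of a related pair get the relation's weight added.
        w = weights.get(rel, 0.0)
        scores[i] += w
        scores[j] += w
    return scores
```

For instance, with candidates `["US", "U.S.", "US"]` and equivalence relations between the distinct surface forms, all three candidates end up with the same boosted score, which is exactly what lets them fall into one cluster.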
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
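As an illustration, the conversion logic can be sketched as below. This is a hypothetical Python helper, not the Java normalizer class inside JustAsk; the rounded conversion factors are assumptions for the sketch:

```python
def to_imperial(value, unit):
    """Convert a metric quantity to the corresponding US/imperial unit.

    Covers the four cases handled by the conversion normalizer:
    distance (km), size (m), weight (kg) and temperature (c).
    """
    conversions = {
        "km": (0.621371, "miles"),  # Numeric:Distance
        "m": (3.28084, "ft"),       # Numeric:Size
        "kg": (2.20462, "lb"),      # Numeric:Weight
    }
    if unit == "c":                 # Numeric:Temperature is affine, not linear
        return value * 9.0 / 5.0 + 32.0, "F"
    factor, target = conversions[unit]
    return value * factor, target
```

With both forms of an answer normalized to the same unit, two strings such as '100 km' and '62.14 miles' become directly comparable during clustering.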
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious lesson that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, as it also only covers very particular cases for the moment, but it allows room for further extensions.
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
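In outline, the boost can be sketched as follows (a Python sketch with illustrative names; offsets are character positions within the passage, and the 20/5 values are the thresholds stated above):

```python
def proximity_bonus(answer_offset, keyword_offsets, max_avg_dist=20, bonus=5):
    """Boost an answer whose average character distance to the query
    keywords found in the passage is below the threshold."""
    if not keyword_offsets:
        return 0
    avg = sum(abs(k - answer_offset) for k in keyword_offsets) / len(keyword_offsets)
    return bonus if avg < max_avg_dist else 0
```

An answer extracted 10 and 5 characters away from two query keywords (average 7.5) gets the 5-point boost, while one hundreds of characters away gets nothing.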
4.3 Results
                   200 questions               1210 questions
Before features    P: 50.6  R: 87.0  A: 44.0   P: 28.0  R: 79.5  A: 22.2
After features     P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear; for those cases, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
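The validation and output-formatting logic of the annotation script can be sketched as follows (a Python approximation with hypothetical names; the exact output layout of the original tool may differ slightly):

```python
# Mapping from the single-letter judgments typed by the annotator
# to the relation labels written in the output file.
LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",            # first answer includes the second
    "I2": "INCLUSION",            # second answer includes the first
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def format_relation(question_id, judgment, answer1, answer2):
    """Validate an annotator judgment and format the output line."""
    if judgment not in LABELS:
        # The real script re-prompts the user; here we just signal the error.
        raise ValueError("unknown judgment: %s" % judgment)
    return "QUESTION %d: %s - %s AND %s" % (
        question_id, LABELS[judgment], answer1, answer2)
```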
Figure 6: Execution of the application to create relations' corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. To that end, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of the two sets (A and B):
J(A, B) = |A ∩ B| / |A ∪ B|    (15)
In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)
Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five common bigrams out of seven distinct ones).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant cost 1 of Levenshtein):

D(i, j) = min { D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G }    (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
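Some of the metrics above fit in a few lines each. The following is a simplified Python sketch (the thesis used the Java simmetrics library instead), with Jaccard and Dice computed over character-bigram sets as in the 'Kennedy'/'Keneddy' example, and a basic Soundex variant that omits the special H/W rule of the full algorithm:

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard index over bigram sets: |A ∩ B| / |A ∪ B|."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    """Dice coefficient over bigram sets: 2|A ∩ B| / (|A| + |B|)."""
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def soundex(word):
    """Simplified Soundex code (initial letter + three digits)."""
    table = {}
    for chars, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            table[c] = digit
    word = word.lower()
    code, prev = word[0].upper(), table.get(word[0], "")
    for c in word[1:]:
        digit = table.get(c, "")
        if digit and digit != prev:  # skip vowels and adjacent duplicates
            code += digit
        prev = digit
    return (code + "000")[:4]
```

On the running example, dice("Kennedy", "Keneddy") yields 10/12 and jaccard gives 5/7, while soundex maps sound-alike spellings such as 'Robert' and 'Rupert' to the same code.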
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much it affects the performance of each metric.
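The single-link clustering used in these tests can be sketched as follows. This is a simplified Python version (JustAsk's clusterer is in Java), and normalizing Levenshtein by the longer string is an assumption made here so the distance falls in [0, 1] and can be compared against the 0.17/0.33/0.5 thresholds:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def norm_lev(a, b):
    """Levenshtein distance normalized by the longer string."""
    return levenshtein(a, b) / max(len(a), len(b))

def single_link(answers, dist=norm_lev, threshold=0.33):
    """Single-link clustering: merge two clusters whenever any
    cross-cluster pair of answers is within the distance threshold."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(dist(x.lower(), y.lower()) <= threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With the 0.33 threshold, 'Kennedy' and 'Keneddy' (normalized distance 2/7) end up in the same cluster, while 'Lisbon' stays apart.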
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold    0.17                      0.33                      0.5
Levenshtein           P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap               P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index         P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Dice coefficient      P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Jaro                  P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler          P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Needleman-Wunsch      P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim                P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex               P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman        P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A: 8.0    P: 10.4 R: 57.5 A: 6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and, therefore, will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
A few flaws in this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions               1210 questions
Before             P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
After new metric   P: 51.7  R: 87.0  A: 45.0   P: 29.2  R: 79.8  A: 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
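Such a decision tree amounts to a dispatch table from categories to metrics; a hypothetical Python sketch of the idea (the category-to-metric pairs are purely illustrative):

```python
# Hypothetical per-category metric choices for illustration only.
METRIC_BY_CATEGORY = {
    "Human": "Levenshtein",
    "Location": "LogSim",
}

def pick_metric(category, default="Levenshtein"):
    """Choose the clustering metric for a given question category,
    falling back to a default metric for unlisted categories."""
    return METRIC_BY_CATEGORY.get(category, default)
```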
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunsch     42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance          54.55    53.85       4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best result per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And, in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable and correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results with the Answer Extraction module;
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computational Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
stage and the results achieved; finally, in Section 6, we present the main conclusions and future work.
2 Related Work
As previously stated, the focus of this work is to find a better formula to capture relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. Then, since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we discuss the different techniques to detect paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it is backed by a methodology for relating answers, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and mutually entailing, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent but differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1.3 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that pairs with potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we will present an overview of some of the approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, onto which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many equal sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S1, S2) = (1/N) · Σ_{n=1..N} Count_match(n-gram) / Count(n-gram)   (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
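To make the computation concrete, here is a minimal Python sketch of Eq. (1), assuming simple whitespace tokenization and lower-casing (details the original work does not necessarily share):

```python
def ngrams(tokens, n):
    # All contiguous word n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    # Word Simple N-gram Overlap (Eq. 1): for n = 1..max_n, the fraction of
    # n-grams of the shorter sentence also found in the longer one, averaged.
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total = 0.0
    for n in range(1, max_n + 1):
        short_grams = ngrams(shorter, n)
        if not short_grams:
            continue  # sentence too short to produce any n-gram
        long_grams = set(ngrams(longer, n))
        total += sum(1 for g in short_grams if g in long_grams) / len(short_grams)
    return total / max_n
```

Identical sentences of at least max_n words score 1.0, and sentence pairs with no words in common score 0.0.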
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function measuring how well an algorithm can align words in a corpus, combining the precision and recall metrics described in [31] – showed that the two strategies performed similarly, each one with its strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
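A minimal Python sketch of the Levenshtein metric just described (the classic dynamic-programming formulation, not the implementation used in the cited work):

```python
def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to transform string a into string b.
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]
```

For example, levenshtein('kitten', 'sitting') is 3: two substitutions and one insertion.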
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category, called LogSim, was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = −log2(λ/x)   (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:

logSim(S1, S2) = L(x, λ)          if L(x, λ) < 1.0
                 e^(−k·L(x, λ))   otherwise   (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and, therefore, L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (with β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases), added to compensate for the lack of negative pairs in the two original corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
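A sketch of the basic two-parameter LogSim of Eqs. (2)–(3). The link count here is a naive approximation (distinct shared words), whereas the original work defines exclusive lexical links more carefully, e.g. the example above counts only four links:

```python
import math

def logsim(s1, s2, k=2.7):
    # Basic LogSim (Eqs. 2-3): lambda_ is the number of word links between
    # the sentences, x the length (in words) of the longest one.
    w1, w2 = s1.lower().split(), s2.lower().split()
    lambda_ = len(set(w1) & set(w2))   # naive approximation of lexical links
    x = max(len(w1), len(w2))
    if lambda_ == 0:
        return 0.0                     # no links at all: no similarity
    l = -math.log2(lambda_ / x)
    return l if l < 1.0 else math.exp(-k * l)
```

As the text notes, the output always stays between 0 and 1.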
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and their respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty   (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic-programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max { s(S1, S2−1) − skip_penalty,
                  s(S1−1, S2) − skip_penalty,
                  s(S1−1, S2−1) + sim(S1, S2),
                  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
                  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
                  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2) }   (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
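The recurrence of Eq. (5) can be sketched as follows, with sim received as a 1-based callback giving the local similarity of the i-th sentence of the first text and the j-th sentence of the second (function and parameter names, and the default penalty value, are illustrative, not from the original work):

```python
def align_score(sim, n, m, skip_penalty=0.1):
    # s[i][j] holds the best alignment weight of the first i sentences of
    # one text against the first j of the other; s[0][*] = s[*][0] = 0.
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,      # skip a sentence of text 2
                s[i - 1][j] - skip_penalty,      # skip a sentence of text 1
                s[i - 1][j - 1] + sim(i, j),     # one-to-one alignment
            ]
            if j >= 2:  # one sentence of text 1 against two of text 2
                candidates.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))
            if i >= 2:  # two sentences of text 1 against one of text 2
                candidates.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))
            if i >= 2 and j >= 2:  # two-to-two crossed alignment
                candidates.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))
            s[i][j] = max(candidates)
    return s[n][m]
```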
The results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 · ( Σ_{w∈T1} (maxsim(w, T2) · idf(w)) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} (maxsim(w, T1) · idf(w)) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxsim(w, T2), and then apply the specificity via IDF, sum these products, and normalize by the total IDF weight of the text segment in question. The same process is applied to T2, and in the end we take the average between the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxsim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
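The segment similarity of Eq. (6) can be sketched as follows, with maxsim left as a pluggable word-to-word metric; word_sim and idf are illustrative callbacks, standing in for one of the eight metrics above and for an IDF table:

```python
def text_similarity(t1, t2, word_sim, idf):
    # Eq. (6): directed, idf-weighted best-match similarity, averaged
    # over both directions. t1 and t2 are lists of words.
    def directed(src, dst):
        num = sum(max(word_sim(w, w2) for w2 in dst) * idf(w) for w in src)
        return num / sum(idf(w) for w in src)
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```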
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination'), or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach, based on similarity information derived from WordNet and using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:
sim(a, b) = (a W b) / (|a| |b|)   (7)
with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five with a better precision rating than those three metrics.
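A sketch of Eq. (7) in plain Python (no numerical libraries), where a and b are the binary occurrence vectors and W the word-pair similarity matrix:

```python
def matrix_similarity(a, b, W):
    # Eq. (7): generalized cosine a.W.b / (|a||b|). W[i][j] holds the
    # WordNet-based similarity of word i and word j (W[i][i] = 1.0).
    n = len(a)
    num = sum(a[i] * W[i][j] * b[j] for i in range(n) for j in range(n))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return num / (norm_a * norm_b)
```

With W as the identity matrix, this reduces to the ordinary cosine similarity of the two binary vectors.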
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score, as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10], built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies relative to that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (the Levenshtein distance and the overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, we saw several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all these techniques, we also saw different types of corpora for evaluation. The most prominent, and most widely available, is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine-learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general, and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and the identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set ofrelevant passages as input and outputs the final answer There are two mainstages here Candidate Answer Extraction and Answer Selection The firstdeals with extracting the candidate answers and the latter with normalizationclustering and filtering the candidate answers to give us a correct final answer
JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to reduce answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
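The date normalization step can be sketched like this (a minimal illustration of the canonical format above; the function and month table are assumptions, not the actual JustAsk normalizer):

```python
import re

# Month-name lookup used to build the canonical 'D<day> M<month> Y<year>' form.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer: str):
    """Map 'March 21, 1961' or '21 March 1961' to the canonical 'D21 M3 Y1961'."""
    text = answer.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", text)      # 'march 21 1961'
    if m:
        month, day, year = MONTHS.get(m.group(1)), m.group(2), m.group(3)
    else:
        m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", text)  # '21 march 1961'
        if not m:
            return None
        day, month, year = m.group(1), MONTHS.get(m.group(2)), m.group(3)
    if month is None:
        return None
    return f"D{int(day)} M{month} Y{year}"

print(normalize_date("March 21, 1961"))   # D21 M3 Y1961
print(normalize_date("21 March 1961"))    # D21 M3 Y1961
```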
Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
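A minimal sketch of the overlap distance and the single-link merge loop described above (names and the greedy loop are illustrative; JustAsk's actual implementation may differ):

```python
def overlap_distance(x: str, y: str) -> float:
    """Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|), on token sets (Equation 9)."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xs & ys) / min(len(xs), len(ys))

def single_link_clusters(answers, threshold=0.33, dist=overlap_distance):
    """Greedy single-link clustering: repeatedly merge the two closest clusters
    while their minimum pairwise distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(dist(a, b) for a in clusters[p[0]] for b in clusters[p[1]]))
        if min(dist(a, b) for a in clusters[i] for b in clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# 'Mogadishu' is contained in 'Mogadishu Somalia' (distance 0.0), so they merge.
print(single_link_clusters(["Mogadishu", "Mogadishu Somalia", "Nairobi"]))
```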
3.2 Experimental Setup
In this section, we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = (2 × Precision × Recall) / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) × Σ_{i=1..N} 1/rank_i    (14)
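The measure can be sketched in a few lines (illustrative code, not part of JustAsk; questions with no correct answer contribute a reciprocal rank of zero):

```python
def mean_reciprocal_rank(ranks):
    """MRR over a sample of queries; each entry is the rank of the first
    correct answer, or None when no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Three queries: correct answer first, correct answer third, no correct answer.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```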
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began with the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the 1994 editions of the Los Angeles Times. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the correct answer varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved through each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
   URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia,
   the free encyclopedia Content=Mogadishu is the largest city in Somalia
   and the nation's capital. Located in the coastal Benadir region on the
   Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, after the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
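The annotation loop described above can be sketched as follows (function names and I/O handling are simplified assumptions, not the actual application; the key behavior is that an input is accepted as a reference only if it occurs inside the snippet's content):

```python
# Minimal sketch of the reference-annotation loop. For each question, every
# snippet's 'Content' is shown; whatever the annotator types is accepted as a
# reference only if it is contained in that content.
def collect_references(questions, snippets_per_question, read_input=input):
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets_per_question.get(q, []):
            answer = read_input(f"{q}\n{content}\n> ").strip()
            if answer and answer in content:
                references[q].append(answer)
    return references

# Simulated annotator session: one question, one snippet, one typed answer.
fake_inputs = iter(["Mogadishu"])
refs = collect_references(
    ["What is the capital of Somalia?"],
    {"What is the capital of Somalia?":
     ["Mogadishu is the largest city in Somalia and the nation's capital."]},
    read_input=lambda prompt: next(fake_inputs))
print(refs)
```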
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we knew the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'
The first and simplest solution was to filter out the questions containing unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
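The pronoun filter can be sketched as follows (illustrative code; the pronoun list follows the text above):

```python
import re

# Discard TREC questions containing unresolved third-person pronouns.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question: str) -> bool:
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who was his father?", "Where is Mogadishu?"]
kept = [q for q in questions if not has_unresolved_reference(q)]
print(kept)  # ['Where is Mogadishu?']
```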
But far less obvious cases were still present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original referent 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see in the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
   200        88       87        25

Accuracy   Precision   Recall   MRR    Precision@1
 44.0%       50.3%     87.5%    0.37     35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
             Google – 8 passages        Google – 32 passages
Extracted    Questions   CAE     AS     Questions   CAE     AS
Answers         (#)    success success     (#)    success success
0               21       0       –        19        0       –
1 to 10         73      62      53        16       11      11
11 to 20        32      29      22        13        9       7
21 to 40        11      10       8        61       57      39
41 to 60         1       1       1        30       25      17
61 to 100        0       –       –        18       14      11
> 100            0       –       –         5        4       3
All            138     102      84       162      120      88

Table 2: Answer extraction intrinsic evaluation using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
             Google – 64 passages
Extracted    Questions   CAE       AS
Answers         (#)    success  success
0               17       0        0
1 to 10         18       9        9
11 to 20         2       2        2
21 to 40        16      12        8
41 to 60        42      38       25
61 to 100       38      32       18
> 100           35      30       17
All            168     123       79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.

        Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total    Q   CAE   AS       Q   CAE   AS           Q   CAE   AS
abb      4    3    1    1        3    1    1            4    1    1
des      9    6    4    4        6    4    4            6    4    4
ent     21   15    1    1       16    1    0           17    1    0
hum     64   46   42   34       55   48   36           56   48   31
loc     39   27   24   19       34   27   22           35   29   21
num     63   41   30   25       48   35   25           50   40   22
All    200  138  102   84      162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision        26.7%           28.2%           27.4%
Recall           77.2%           67.2%           76.9%
Accuracy         20.6%           19.0%           21.1%

Table 5: Baseline results for the three sets of questions created from the TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages     20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage ...'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs 'Elizabeth II'

– 'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

– 'River Glomma' vs 'Glomma'

– 'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

– 'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Is there a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer; as it is, LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
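A sketch of this proposed token-level metric (an illustrative implementation of the idea above, not the final JustAsk code):

```python
# Two names are considered a match when they share at least one full token and
# the remaining tokens agree on their initials, in order
# (e.g. 'J. F. Kennedy' vs 'John Fitzgerald Kennedy').
def same_entity(a: str, b: str) -> bool:
    ta = [t.strip(".") for t in a.split()]
    tb = [t.strip(".") for t in b.split()]
    if len(ta) != len(tb) or not any(x.lower() == y.lower() for x, y in zip(ta, tb)):
        return False
    return all(x and y and x[0].lower() == y[0].lower() for x, y in zip(ta, tb))

print(same_entity("J. F. Kennedy", "John Fitzgerald Kennedy"))  # True
print(same_entity("J. F. Kennedy", "Lyndon B. Johnson"))        # False
```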
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs 'Mount Kenya National Park'

– 'Bavaria' vs 'Overbayern, Upper Bavaria'

– '2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved, such as:
– 'Kotashaan' vs 'Khotashan'

– 'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers
In this section, we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive the several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
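The scoring idea can be sketched as follows (the weights follow the first experiment above; the data layout and names are illustrative assumptions, not the actual Scorer interface):

```python
# Each candidate carries the relations it shares with other candidates; its
# contribution to the cluster score sums a weight per relation:
# 1.0 for frequency/equivalence, 0.5 for inclusion (first experiment).
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "includes": 0.5, "included_by": 0.5}

def relation_score(relations):
    """relations: list of (relation_type, other_answer) pairs for a candidate."""
    return sum(WEIGHTS.get(rel, 0.0) for rel, _ in relations)

# 'Ferrara' appears once more verbatim and is included by 'Ferrara, Italy'.
candidate = [("frequency", "Ferrara"), ("included_by", "Ferrara, Italy")]
print(relation_score(candidate))  # 1.5
```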
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
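A sketch of such a conversion normalizer (the conversion factors are standard; the function shape and names are assumptions, not the actual class):

```python
# Convert common units to a canonical one (miles, feet, pounds, Fahrenheit)
# so that equivalent measure answers can cluster together.
def normalize_measure(value: float, unit: str):
    conversions = {
        "km": ("miles", value * 0.621371),              # Numeric:Distance
        "m": ("feet", value * 3.28084),                 # Numeric:Size
        "kg": ("pounds", value * 2.20462),              # Numeric:Weight
        "celsius": ("fahrenheit", value * 9 / 5 + 32),  # Numeric:Temperature
    }
    return conversions.get(unit, (unit, value))  # unknown units pass through

unit, amount = normalize_measure(300, "km")
print(f"{amount:.0f} {unit}")  # 186 miles
```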
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
For problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
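This proximity heuristic can be sketched as follows (the 20-character threshold and 5-point boost follow the text; the character-position arithmetic and names are illustrative assumptions):

```python
# Boost a candidate's score when its average character distance to the query
# keywords found in the passage is under the threshold.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    pos = passage.find(answer)
    if pos < 0:
        return 0
    distances = [abs(passage.find(k) - pos) for k in keywords if k in passage]
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0

passage = "Mozart was born in 1756"
print(proximity_boost(passage, "1756", ["Mozart", "born"]))  # 5
```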
4.3 Results
                     200 questions                1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%   P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal: only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relation module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus of relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between pairs;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
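The annotation loop can be sketched as below; the labels follow the text, but the exact input/output formats of the real application are assumptions:

```python
# Sketch of the pairwise annotation loop described above.
LABELS = {'E': 'EQUIVALENCE', 'I1': 'INCLUSION', 'I2': 'INCLUSION',
          'A': 'COMPLEMENTARY ANSWERS', 'C': 'CONTRADICTORY ANSWERS',
          'D': 'IN DOUBT'}

def annotate(question_id, references, ask=input):
    """Ask the user to label every pair of references; returns output lines."""
    lines = []
    for i in range(len(references)):          # a single reference yields no
        for j in range(i + 1, len(references)):  # pairs, so we just move on
            while True:
                cmd = ask(f"{references[i]} AND {references[j]}? ").strip()
                if cmd == 'skip' or cmd in LABELS:
                    break  # anything else: repeat the judgment
            if cmd != 'skip':
                lines.append(f"QUESTION {question_id} {LABELS[cmd]} "
                             f"-{references[i]} AND {references[j]}")
    return lines
```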
Figure 6: Execution of the application to create relations' corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements occurring in either of them.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven distinct elements.
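The bigram-based computation in the example above can be sketched as:

```python
# Bigram-set Dice and Jaccard coefficients over two strings.
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)
```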
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
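Equations (17) and (18) can be implemented directly; this is a standard-formulation sketch (the matching window and the four-character prefix cap are Winkler's usual choices, not stated in the text):

```python
# Jaro and Jaro-Winkler similarities, following equations (17) and (18).
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched2 = [False] * len(s2)
    m1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len(s2))
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched2[j] = True
                m1.append(c)
                break
    m = len(m1)
    if m == 0:
        return 0.0
    m2 = [s2[j] for j in range(len(s2)) if matched2[j]]
    t = sum(a != b for a, b in zip(m1, m2)) / 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_prefix:
            break
        l += 1
    return dj + l * p * (1 - dj)
```

For the classic pair 'MARTHA' vs. 'MARHTA', this yields a Jaro score of 17/18 and a higher Jaro-Winkler score thanks to the shared 'MAR' prefix.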
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 of Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)
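A minimal sketch of the distance variant, assuming (as in the standard formulation) that exact character matches cost 0; with G = 1 this reduces to the Levenshtein distance:

```python
# Needleman-Wunsch distance with a configurable gap/mismatch cost G.
def needleman_wunsch(s1: str, s2: str, G: int = 1) -> int:
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * G
    for j in range(n + 1):
        D[0][j] = j * G
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else G  # matches cost nothing
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution / match
                          D[i - 1][j] + G,        # deletion
                          D[i][j - 1] + G)        # insertion
    return D[m][n]
```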
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best system results at this point, and we would expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric              Threshold 0.17          Threshold 0.33          Threshold 0.5
Levenshtein         P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap             P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index       P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice Coefficient    P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler        P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch    P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim              P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex             P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman      P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric - 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched (such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example).
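A possible reading of equations (20), (21) and (22) in Python; the greedy first-letter matching and the choice of denominator for nonCommonTerms are assumptions where the text leaves details open:

```python
# Sketch of the JFKDistance described above: common tokens score fully,
# first-letter matches among the remaining tokens score half.
def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # greedy first-letter matching between the remaining tokens
    used = [False] * len(rest2)
    matches = 0
    for t in rest1:
        for j, u in enumerate(rest2):
            if not used[j] and t[0] == u[0]:
                used[j] = True
                matches += 1
                break
    non_common = max(len(rest1), len(rest2))
    fl_score = (matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score
```

Under this sketch, 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' scores 1 − 1/3 − 1/2 = 1/6, low enough to be clustered, in contrast with the values Overlap and Levenshtein give.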
A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows us the results after the implementation of this new distancemetric
                   200 questions              1210 questions
Before             P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features - combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
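The decision tree sketched above amounts to a per-category dispatch table; a hypothetical sketch follows, where the metric implementations are simplified placeholders standing in for the real ones:

```python
# Hypothetical per-category metric dispatch for the clusterer.
def levenshtein(a: str, b: str) -> float:
    # classic dynamic-programming edit distance, normalized to [0, 1]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b), 1)

def logsim(a: str, b: str) -> float:
    # placeholder: the real LogSim metric would go here
    return levenshtein(a, b)

METRIC_BY_CATEGORY = {'Human': levenshtein, 'Location': logsim}

def distance_for(category: str, a1: str, a2: str) -> float:
    metric = METRIC_BY_CATEGORY.get(category, levenshtein)
    return metric(a1, a2)
```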
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), Table 10 shows the results for every metric over the three main categories, Human, Location and Numeric, that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric              Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance         54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best result of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords and, in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers given in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1% (in relative terms, from 50.6% to 55.2%), using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN pages 1870–4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Languages 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. US patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
2 Related Work
As previously stated, the focus of this work is to find a better formula for catching relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques for detecting paraphrases described in the literature.
2.1 Relating Answers
As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between the pairs present in the relevant passages.
From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:
Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').
Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with both 'lion' and 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).
Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').
Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1,430 km' and '1,609 km' for the question 'How long is the Okavango River?' are contradictory.
The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com¹, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other; a potential addition of information, with only one fragment entailing the other, should be discarded as such.
Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization by a team led by Vasileios Hatzivassiloglou that clusters highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5,800 pairs of sentences, extracted from news sources over the web, with manual (human) annotations indicating whether each pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification – namely, the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap between two sentences (usually with N ≤ 4) – in other words, how many common sequences of words can be found between the two sentences. The following formula gives the measure of the Word Simple N-gram Overlap:

NGsim(S1, S2) = (1/N) × Σ_{n=1}^{N} Count_match(n-gram) / Count(n-gram)    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
A year later, Dolan and his team [7] used the Levenshtein Distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric computes the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
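The Levenshtein metric mentioned above can be computed with the classic dynamic programming recurrence; a minimal sketch:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]
```

For example, `levenshtein('kitten', 'sitting')` returns 3 (two substitutions and one insertion).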
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein Distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk currently possesses, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in the two sentences
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, λ) = −log2(λ/x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are way too similar). The complete formula is written below:

logSim(S1, S2) = { L(x, λ)            if L(x, λ) < 1.0
                 { e^(−k·L(x, λ))     otherwise
    (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), since as λ/x tends to 0, L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3 and extra negative pairs (of non-paraphrases), the latter added to compensate for the lack of negative pairs in those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was part of the test corpus.
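A minimal sketch of the basic (two-parameter) LogSim of Equations 2 and 3, with links counted as distinct common words (the original counts exclusive links and may filter function words, which we omit here):

```python
import math

def logsim(s1, s2, k=2.7):
    """LogSim sketch: lambda is the number of common words, x the length
    of the longer sentence; score with L(x, lambda) = -log2(lambda/x),
    falling back to exp(-k * L) for very dissimilar pairs."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))   # lambda: common (linked) words
    x = max(len(w1), len(w2))        # words in the longer sentence
    if links == 0:
        return 0.0                   # no links at all: score 0
    l = -math.log2(links / x)
    return l if l < 1.0 else math.exp(-k * l)
```

Note that identical sentences score 0.0 under this formulation: the logarithm deliberately penalizes quasi-exact pairs, which are duplicates rather than interesting paraphrases.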
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1,087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., nouns produced from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kinds of equivalences involving Numeric Entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
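The future-tense part of this canonicalization can be sketched with a simple pattern substitution (the pattern list below is our own illustrative assumption, not Zhang and Patrick's actual rule set):

```python
import re

# Map future-tense constructions such as 'plans to' or 'expects to'
# onto the word 'will' before lexical matching is applied.
FUTURE = re.compile(r'\b(?:plans|plan|expects|expect|intends|intend)\s+to\b',
                    re.IGNORECASE)

def canonicalize_future(sentence):
    """Rewrite assumed future-tense markers into 'will'."""
    return FUTURE.sub('will', sentence)
```

With this sketch, 'The company plans to open a new plant.' becomes 'The company will open a new plant.', so that it matches 'The company will open a new plant.' exactly.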
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max { s(S1, S2−1) − skip_penalty,
                  s(S1−1, S2) − skip_penalty,
                  s(S1−1, S2−1) + sim(S1, S2),
                  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
                  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
                  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2) }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
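The recurrence in Equation 5 can be sketched directly with memoization; here sim is a pluggable sentence similarity such as Equation 4, and the skip penalty value is an assumption:

```python
from functools import lru_cache

def align_score(sents1, sents2, sim, skip_penalty=0.3):
    """Weight of the optimal local alignment between two paragraphs
    (Equation 5). sents1 and sents2 are lists of sentences; sim(a, b)
    returns a sentence-level similarity."""
    @lru_cache(maxsize=None)
    def s(i, j):  # best score aligning the first i and j sentences
        if i == 0 or j == 0:
            return 0.0
        options = [s(i, j - 1) - skip_penalty,   # skip a sentence of S2
                   s(i - 1, j) - skip_penalty,   # skip a sentence of S1
                   s(i - 1, j - 1) + sim(sents1[i - 1], sents2[j - 1])]
        if j >= 2:  # one sentence of S1 aligned with two of S2
            options.append(s(i - 1, j - 2) + sim(sents1[i - 1], sents2[j - 1])
                           + sim(sents1[i - 1], sents2[j - 2]))
        if i >= 2:  # two sentences of S1 aligned with one of S2
            options.append(s(i - 2, j - 1) + sim(sents1[i - 1], sents2[j - 1])
                           + sim(sents1[i - 2], sents2[j - 1]))
        if i >= 2 and j >= 2:  # crossed two-by-two alignment
            options.append(s(i - 2, j - 2) + sim(sents1[i - 1], sents2[j - 2])
                           + sim(sents1[i - 2], sents2[j - 1]))
        return max(options)
    return s(len(sents1), len(sents2))
```

With an exact-match similarity, two identical two-sentence paragraphs obtain a score of 2.0, one unit per aligned pair.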
The results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) × ( Σ_{w∈T1} (maxSim(w, T2) × idf(w)) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} (maxSim(w, T1) × idf(w)) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that best matches it in T2, given by maxSim(w, T2), weight that similarity by the word's specificity via IDF, sum these weighted similarities, and normalize the result by the summed IDF values of the segment in question. The same process is applied to T2, and in the end we average the two directions, from T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics.
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node that the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS is 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.
– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
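Equation 6 can be sketched as follows, with the word-to-word similarity (any of the eight metrics above) and the IDF function passed in as pluggable parameters:

```python
def text_similarity(t1, t2, max_sim, idf):
    """Mihalcea et al.'s text-to-text similarity (Equation 6).
    max_sim(w, other_words) returns the best word-to-word score of w
    against a list of words (e.g. a WordNet or LSA metric);
    idf(w) returns the word's inverse document frequency.
    Both functions are assumptions supplied by the caller."""
    def directional(src, dst):
        num = sum(max_sim(w, dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    w1, w2 = t1.lower().split(), t2.lower().split()
    # average of the T1 -> T2 and T2 -> T1 directions
    return 0.5 * (directional(w1, w2) + directional(w2, w1))
```

With an exact-match word similarity and uniform IDF, identical segments score 1.0 and disjoint segments score 0.0.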
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach, based on similarity information derived from WordNet and using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and the similarity is calculated using the following formula:

sim(~a, ~b) = (~a W ~b) / (‖~a‖ ‖~b‖)    (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each entry Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them achieving an accuracy score higher than those of Mihalcea, Qiu, or Zhang, and five also achieving a better precision rating than those three approaches.
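A minimal sketch of Equation 7, with the semantic similarity matrix W built from a pluggable word-to-word metric (a metric returning 1.0 only for identical words reduces the formula to plain cosine similarity):

```python
def matrix_similarity(s1_words, s2_words, word_sim):
    """Fernando and Stevenson's similarity (Equation 7): binary word
    vectors over the pair's joint vocabulary, combined through a
    semantic similarity matrix W[i][j] = word_sim(vocab[i], vocab[j]).
    word_sim is an assumed WordNet-style metric in [0, 1]."""
    vocab = sorted(set(s1_words) | set(s2_words))
    a = [1.0 if w in s1_words else 0.0 for w in vocab]
    b = [1.0 if w in s2_words else 0.0 for w in vocab]
    # numerator: a . W . b
    num = sum(a[i] * word_sim(vocab[i], vocab[j]) * b[j]
              for i in range(len(vocab)) for j in range(len(vocab)))
    # denominator: product of the Euclidean norms of a and b
    norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
    return num / norm if norm else 0.0
```

Unlike plain cosine similarity, a word-similarity metric that assigns nonzero scores to synonym pairs lets two sentences with no words in common still obtain a nonzero score.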
More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences S1 and S2, and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and positions in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (the Levenshtein distance and the overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found with the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, we saw several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1,087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine-learning-based ones. Over this chapter, we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and the identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric-type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers into a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
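A sketch of this kind of date normalization (the exact patterns handled by JustAsk's normalizer are an assumption on our part; only the output format is taken from the text):

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'])}

def normalize_date(answer):
    """Map date strings such as 'March 21, 1961' or '21 March 1961'
    onto the canonical 'D21 M3 Y1961' form; other strings are
    returned unchanged."""
    month_re = '|'.join(MONTHS)
    m = (re.match(r'(?i)\s*(?P<m>%s)\s+(?P<d>\d{1,2}),?\s+(?P<y>\d{4})' % month_re, answer)
         or re.match(r'(?i)\s*(?P<d>\d{1,2})\s+(?P<m>%s),?\s+(?P<y>\d{4})' % month_re, answer))
    if not m:
        return answer  # not a recognized date pattern
    return 'D%d M%d Y%s' % (int(m.group('d')), MONTHS[m.group('m').lower()], m.group('y'))
```

Both surface forms of the example above collapse into the same canonical string, so they end up in the same cluster regardless of the lexical distance between them.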
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 2.2.2, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup
In this section, we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to the evaluation of QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = (2 × precision × recall) / (precision + recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) × Σ_{i=1}^{N} (1/rank_i)    (14)
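These measures can be sketched as follows, with one judgement per question in the test corpus ('correct', 'wrong', or None when no non-null answer was returned):

```python
def evaluate(judgements):
    """Precision, recall, F1 and accuracy (Equations 10-13) over one
    judgement per question: 'correct', 'wrong' (a non-null answer was
    returned but is incorrect), or None (no answer returned)."""
    total = len(judgements)
    retrieved = [j for j in judgements if j is not None]
    correct = sum(1 for j in retrieved if j == 'correct')
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = len(retrieved) / total
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / total
    return precision, recall, f1, accuracy

def mrr(ranks):
    """Mean reciprocal rank (Equation 14): ranks[i] is the rank of the
    first correct answer for query i, or None if none was correct."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)
```

For instance, with judgements ['correct', 'wrong', None, 'correct'], precision is 2/3, recall 3/4 and accuracy 1/2 under these definitions.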
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call 'references' the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the correct answer varied over time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions of these types were gathered from the Multieight corpus [25] and used.
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved by each of the two main search engines (Google and Yahoo) – passages which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu – Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, with 'unnecessary' content like 'URL' or 'Title' removed, requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a valid reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, as opposed to the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping construct new corpora in the future.
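The selection loop of this application can be sketched as follows. This is a minimal Python sketch with hypothetical names (`is_valid_reference`, `collect_references`); the actual tool's file handling is omitted:

```python
def is_valid_reference(candidate, content):
    """A typed reference is accepted only if it occurs verbatim in the snippet content."""
    return bool(candidate) and candidate in content

def collect_references(question, snippets, read_input=input):
    """Show each snippet of a question and gather the accepted references."""
    refs = []
    for content in snippets:
        candidate = read_input(f"{question}\n{content}\nreference> ").strip()
        if is_valid_reference(candidate, content):
            refs.append(candidate)
    return refs
```

The `read_input` parameter simply makes the command-line prompt replaceable, which is also how the sketch can be exercised without a terminal.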
Figure 3: Execution of the application to create answers corpora in the command line.
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered and extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
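The filtering step can be sketched as follows – a minimal sketch assuming a simplified in-memory representation of the question and judgement files, not the actual TREC formats:

```python
def filter_answerable(questions, judgements, source="NYT"):
    """Keep only the questions for which the given source has at least one
    answer judged as correct.

    questions:  dict mapping question id -> question text
    judgements: iterable of (question_id, source, is_correct) tuples
    """
    answerable = {qid for qid, src, correct in judgements
                  if src == source and correct}
    return {qid: text for qid, text in questions.items() if qid in answerable}
```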
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information – by using pronouns, for example – which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: 'How much can we improve something that has no clear references, because the question itself references the previous one, and so forth?'
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
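This pronoun filter can be sketched as a simple token-level test (an illustrative sketch; `has_unresolved_reference` is a hypothetical name):

```python
# third-party pronouns that signal a lost antecedent
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """True if the question contains a third-party pronoun with no antecedent."""
    tokens = [t.strip("?.,!\"'") for t in question.lower().split()]
    return any(token in PRONOUNS for token in tokens)
```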
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors in the question, such as 'the company' instead of simply 'it'.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene as the search tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%
Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final correct answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
Extracted       Google – 8 passages           Google – 32 passages
Answers (#)     Questions  CAE      AS        Questions  CAE      AS
                           success  success              success  success
0               21         0        –         19         0        –
1 to 10         73         62       53        16         11       11
11 to 20        32         29       22        13         9        7
21 to 40        11         10       8         61         57       39
41 to 60        1          1        1         30         25       17
61 to 100       0          –        –         18         14       11
> 100           0          –        –         5          4        3
All             138        102      84        162        120      88
Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted       Google – 64 passages
Answers (#)     Questions  CAE success  AS success
0               17         0            0
1 to 10         18         9            9
11 to 20        2          2            2
21 to 40        16         12           8
41 to 60        42         38           25
61 to 100       38         32           18
> 100           35         30           17
All             168        123          79
Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
           Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total  Q    CAE   AS         Q    CAE   AS          Q    CAE   AS
abb     4   3    1     1          3    1     1           4    1     1
des     9   6    4     4          6    4     4           6    4     4
ent    21  15    1     1         16    1     0          17    1     0
hum    64  46   42    34         55   48    36          56   48    31
loc    39  27   24    19         34   27    22          35   29    21
num    63  41   30    25         48   35    25          50   40    22
All   200 138  102    84        162  120    88         168  123    79
Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved, at least, to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions   232 questions   1210 questions
Precision  26.7%            28.2%           27.4%
Recall     77.2%            67.2%           76.9%
Accuracy   20.6%            19.0%           21.1%
Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus, increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though the 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage ...'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between the two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer; as it is, LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern Upper Bavaria'

'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issue – many entities are denoted by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'

'Grozny' and 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but most generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module

Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have automated this process and extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
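The scoring scheme just described can be sketched as follows (an illustrative Python sketch; the actual module is the Java Scorer interface, and `score_candidates` is a hypothetical name):

```python
from collections import Counter

# assumed relation weights, taken from the first experiment described above
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def score_candidates(answers, relations):
    """answers:   list of extracted answer strings (repeats count as frequency).
    relations: list of (answer_a, answer_b, relation_type) tuples.
    Returns a boosted score per distinct answer."""
    scores = Counter()
    # baseline scorer: frequency of the exact same answer
    for answer, freq in Counter(answers).items():
        scores[answer] += WEIGHTS["frequency"] * freq
    # rules scorer: boost both members of each detected relation
    for a, b, rel in relations:
        scores[a] += WEIGHTS[rel]
        scores[b] += WEIGHTS[rel]
    return dict(scores)
```

Swapping the weights of equivalence and inclusion for a given category, as in the second experiment, amounts to changing the `WEIGHTS` table per category.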
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, to count the frequency, and the Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals only to those occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
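A minimal sketch of two of these normalizers, with an assumed conversion factor and an assumed (tiny) acronym table; the real classes cover more units and categories:

```python
# assumptions for illustration: one conversion factor, two acronym entries
KM_TO_MILES = 0.621371
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_distance(value_km):
    """Convert a Numeric:Distance answer from kilometres to miles."""
    return value_km * KM_TO_MILES

def normalize_acronym(answer):
    """Expand a known acronym; periods are stripped first, so 'U.S.' matches 'US'.
    Unknown strings are returned unchanged."""
    key = answer.replace(".", "").strip()
    return ACRONYMS.get(key, answer)
```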
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
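This proximity heuristic can be sketched as follows. The 20-character threshold and 5-point boost come from the text; the character-level distance computation (first occurrences only) is an assumption of this sketch:

```python
# constants taken from the text above
PROXIMITY_THRESHOLD = 20   # average character distance
PROXIMITY_BOOST = 5        # points added to the answer's score

def proximity_boost(passage, answer, keywords, score):
    """Boost the answer's score when it lies, on average, close to the
    query keywords inside the passage (character-level distance)."""
    pos = passage.find(answer)
    if pos < 0:
        return score
    distances = [abs(passage.find(k) - pos) for k in keywords if k in passage]
    if distances and sum(distances) / len(distances) <= PROXIMITY_THRESHOLD:
        score += PROXIMITY_BOOST
    return score
```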
4.3 Results
                  200 questions                    1210 questions
Before features   P: 50.6%  R: 87.0%  A: 44.0%    P: 28.0%  R: 79.5%  A: 22.2%
After features    P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
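The judgment loop of this application can be sketched as follows (an illustrative sketch; `judge_pair` and the label table are hypothetical names):

```python
RELATION_LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",      # the first answer includes the second
    "I2": "INCLUSION",      # the second answer includes the first
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def judge_pair(qid, ans_a, ans_b, read_input=input):
    """Ask for a judgment until a valid option (or 'skip') is given;
    return an output line in the format shown above, or None when skipped."""
    while True:
        cmd = read_input(f"{ans_a} vs {ans_b}> ").strip()
        if cmd == "skip":
            return None
        if cmd in RELATION_LABELS:
            return f"QUESTION {qid}: {RELATION_LABELS[cmd]} - {ans_a} AND {ans_b}"
```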
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of the two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements (characters, in our case) the two sequences have in common by the total number of distinct elements across both of them.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven elements.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m - t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 - dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):
D(i, j) = min { D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G } (19)
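A minimal sketch of this recurrence, assuming (as in Levenshtein) that the diagonal step is free when the characters match and costs G otherwise:

```python
def needleman_wunsch_distance(s1, s2, G=1.0):
    """Edit distance with a variable gap/substitution cost G, eq. (19).
    With G = 1 this reduces to the plain Levenshtein distance."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * G          # deletions
    for j in range(1, n + 1):
        D[0][j] = j * G          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution / match
                          D[i - 1][j] + G,        # deletion
                          D[i][j - 1] + G)        # insertion
    return D[m][n]

print(needleman_wunsch_distance("kitten", "sitting"))  # 3.0 with G = 1
```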
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words that sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
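The encoding can be sketched as follows; this implements the classic American Soundex rules, and whether the package used exactly this variant is an assumption:

```python
def soundex(word):
    """Encode a word as a letter plus three digits; words that sound alike
    (e.g. 'Robert' / 'Rupert') map to the same code."""
    word = word.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue                 # H/W do not break a run of equal codes
        code = codes.get(ch, "")     # vowels map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0] + "".join(digits) + "000")[:4]

print(soundex("Massachussets") == soundex("Machassucets"))  # True
```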
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that gives the system its best results at this point, and we would expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold      0.17                      0.33                      0.5
Levenshtein             P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap                 P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index           P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient        P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                    P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler            P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunch         P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                  P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex                 P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman          P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is in a way a progression from metrics such as the Overlap Distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adding an extra weight for strings that not only share at least a perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we halve what would otherwise be a value identical to a match between two tokens for that extra weight.
The new distance is computed in the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
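One possible reading of equations (20)-(22) as code is sketched below. Whitespace tokenization and the pairwise first-letter matching of unmatched tokens (one from each answer) are assumptions, as the text leaves these details open:

```python
def jfk_distance(a1, a2):
    """A hypothetical reconstruction of eqs. (20)-(22)."""
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))        # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    used = [False] * len(rest2)
    matches = 0
    for tok in rest1:                 # greedily pair unmatched tokens
        for j, other in enumerate(rest2):
            if not used[j] and tok[0] == other[0]:
                used[j] = True
                matches += 1
                break
    non_common = len(rest1) + len(rest2)
    first_letter_score = matches / non_common / 2 if non_common else 0.0  # eq. (22)
    return 1.0 - common_score - first_letter_score            # eq. (20)

print(round(jfk_distance("J F Kennedy", "John Fitzgerald Kennedy"), 3))  # 0.417
```

On the running example, the common term 'Kennedy' contributes 1/3 and the two first-letter pairs contribute (2/4)/2 = 0.25, giving a distance of roughly 0.42 instead of Levenshtein's ~0.5.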
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                        200 questions             1210 questions
Before                  P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric        P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric that offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
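Such a decision tree could be realized as a simple per-category dispatch table. The sketch below is a hypothetical illustration: `difflib`'s `SequenceMatcher.ratio` from the standard library stands in for the actual metric implementations:

```python
from difflib import SequenceMatcher

def _ratio_distance(a, b):
    # stand-in metric: 1 - similarity ratio from Python's standard library
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# hypothetical mapping from question category to distance metric
METRIC_BY_CATEGORY = {
    "Human": _ratio_distance,     # e.g. Levenshtein in the example above
    "Location": _ratio_distance,  # e.g. LogSim in the example above
}

def distance(category, a1, a2):
    """Pick the metric configured for this question category."""
    metric = METRIC_BY_CATEGORY.get(category, _ratio_distance)
    return metric(a1, a2)

print(distance("Human", "JFK", "JFK"))  # 0.0 for identical answers
```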
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, over the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input), are shown in Table 10.
Metric \ Category     Human    Location    Numeric
Levenshtein           50.00     53.85       4.84
Overlap               42.42     41.03       4.84
Jaccard index          7.58      0.00       0.00
Dice                   7.58      0.00       0.00
Jaro                   9.09      0.00       0.00
Jaro-Winkler           9.09      0.00       0.00
Needleman-Wunch       42.42     46.15       4.84
LogSim                42.42     46.15       4.84
Soundex               42.42     46.15       3.23
Smith-Waterman         9.09      2.56       0.00
JFKDistance           54.55     53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best result of the system per category in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results within each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are error-prone, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we separate the multiple answers into single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results within the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could pursue, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 could further improve these results.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845–848, 1965.
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918); 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
Appendix

Appendix A – Rules for detecting relations between answers
'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.
2.2 Paraphrase Identification
According to WordReference.com 1, a paraphrase can be defined as:
– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.
There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea in a symmetric relationship, in which one entails the other – meaning that pairs with a potential addition of information, where only one fragment entails the other, should be discarded as such.
Here is an example of two sentences that should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:
Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.
In the following pages we will present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.
2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora
We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed clustering highly related textual units. In 1999 they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:
1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes attach.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) × Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ] (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
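Formula (1) can be sketched as follows, assuming word-level n-grams and set-based match counting (the original paper does not fix these details):

```python
def ngram_overlap(s1, s2, N=4):
    """Word Simple N-gram Overlap, formula (1): average over n = 1..N of the
    fraction of the shorter sentence's n-grams also found in the other."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, N + 1):
        g1 = {tuple(w1[i:i + n]) for i in range(len(w1) - n + 1)}
        g2 = {tuple(w2[i:i + n]) for i in range(len(w2) - n + 1)}
        denom = min(len(g1), len(g2))  # n-gram count of the shorter sentence
        if denom:
            total += len(g1 & g2) / denom
    return total / N

print(ngram_overlap("a b c d", "a b c d"))  # 1.0 for identical sentences
```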
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function measuring how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed extracting richer, but also less parallel, data.
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
L(x, λ) = − log2(λ/x) (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:
logSim(S1, S2) =
    L(x, λ)              if L(x, λ) < 1.0
    e^(−k · L(x, λ))     otherwise          (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
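Equations (2) and (3) can be sketched as below, assuming that links are counted as word-level set intersection (the paper's exact link counting may differ):

```python
import math

def logsim(s1, s2, k=2.7):
    """LogSim, eqs. (2)-(3): similarity in [0, 1] from exclusive lexical links."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))    # λ: common words between S1 and S2
    x = max(len(w1), len(w2))         # words in the longest sentence
    if links == 0:
        return 0.0
    L = -math.log2(links / x)
    return L if L < 1.0 else math.exp(-k * L)
```

Note that identical sentences score 0 by design: exact duplicates are not interesting paraphrase pairs, so the metric peaks at intermediate overlap.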
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kinds of equivalences involving Numeric entities that can be considered answers for certain types of questions are already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
    sim(S1, S2) = cos(S1, S2) - mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence, a way of breaking down a complex problem into a series of simpler subproblems:
    s(S1, S2) = max {
        s(S1, S2-1) - skip_penalty,
        s(S1-1, S2) - skip_penalty,
        s(S1-1, S2-1) + sim(S1, S2),
        s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
        s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
        s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2)
    }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents that include the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
    sim(T1, T2) = 1/2 * ( Σ_{w in T1} maxSim(w, T2) * idf(w) / Σ_{w in T1} idf(w)
                        + Σ_{w in T2} maxSim(w, T1) * idf(w) / Σ_{w in T2} idf(w) )    (6)

For each word w in the segment T1 the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), weight it by its specificity via IDF, sum these values, and normalize the result by the sum of the IDF values of the segment in question. The same process is applied to T2 and, in the end, we average the directed scores from T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow (lch) [18], Lesk [1], Wu and Palmer (wup) [46], Resnik [34], Lin [22], and Jiang and Conrath (jcn) [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR): first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA): uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.
– Lesk: measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch: computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup: computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik: takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin: takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.
– Jcn: uses the same information as Lin, but combined in a different way.
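Equation 6 can be sketched as follows, with `word_sim` standing in for any of the eight maxSim metrics described above (in the test we plug in a trivial identity-based similarity, only to keep the example runnable):

```python
# Sketch of Equation 6. 'word_sim' stands in for any of the eight maxSim
# metrics described above; any function word x word -> [0, 1] works here.
def text_similarity(t1, t2, idf, word_sim):
    def directed(a, b):
        num = sum(max(word_sim(w, v) for v in b) * idf.get(w, 1.0) for w in a)
        den = sum(idf.get(w, 1.0) for w in a)
        return num / den
    # average of the two directed scores (T1 -> T2 and T2 -> T1)
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

Note how the IDF weighting makes a match on the specific word 'shot' count far more than matches on generic function words.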
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa; in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion: expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.
– Latent Semantic Analysis (LSA): constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure: using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors a and b, and the similarity is calculated using the following formula:
    sim(a, b) = (a · W · b) / (|a| · |b|)    (7)
with W being a semantic similarity matrix containing information about the similarity level of a word pair; in other words, each Wij is a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five with a better precision rating than those three approaches.
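A sketch of Equation 7, with a and b as binary occurrence vectors over the pair's vocabulary and W a toy word-to-word similarity matrix (in the actual work, W is derived from the WordNet metrics above):

```python
import math

# Sketch of Equation 7: a and b are binary occurrence vectors over the
# pair's vocabulary and W is a word-to-word semantic similarity matrix
# with entries in [0, 1] (derived from a WordNet metric in the paper).
def matrix_similarity(a, b, W):
    num = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / norm
```

Unlike plain cosine similarity, two sentences with no word in common can still score well if W says their words are semantically close.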
More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
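A minimal sketch of Equation 8 together with the constant penalty just described (function names and the example values are ours):

```python
# Sketch of Equation 8 plus the length penalty described above: when the
# longer sentence has more than 6 extra unpaired dependencies, a constant
# (14) is added to the dissimilarity score, pushing the ratio below the
# paraphrase threshold T.
def paraphrase_score(sim, diss, extra_unpaired=0):
    if extra_unpaired > 6:
        diss += 14
    return sim / diss if diss else float("inf")

def is_paraphrase(sim, diss, threshold, extra_unpaired=0):
    return paraphrase_score(sim, diss, extra_unpaired) > threshold
```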
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found with the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter deals with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
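A sketch of how such a date normalizer can work, following the canonical 'D21 M3 Y1961' form given above (the patterns and function names are ours, not JustAsk's actual code):

```python
import re

# Sketch of a Numeric:Date normalizer producing the canonical form shown
# above; the patterns and names are ours, not JustAsk's actual code.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def normalize_date(answer: str) -> str:
    m = re.match(r"(\w+) (\d{1,2}),? (\d{4})$", answer)    # 'March 21, 1961'
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+),? (\d{4})$", answer)    # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return answer  # leave unrecognized answers untouched
```

Once both surface forms map to the same canonical string, the clustering step below treats them as the same answer.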
Similar instances are then grouped into a set of clusters; for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
    Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As previously mentioned, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer is a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer) and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
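The overlap distance of Equation 9 and the single-link clustering procedure can be sketched as follows (a simplified illustration, not JustAsk's implementation):

```python
# Sketch of the overlap distance (Equation 9) and of single-link clustering
# with a distance threshold; a simplified illustration, not JustAsk's code.
def overlap_distance(x, y):
    # x and y are token lists; returns 0.0 when one answer is contained in the other
    shared = len(set(x) & set(y))
    return 1.0 - shared / min(len(x), len(y))

def single_link_cluster(answers, dist, threshold):
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        # single link: cluster distance is the minimum member-to-member distance
        pairs = [(min(dist(a, b) for a in clusters[i] for b in clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > threshold:
            break  # halt when even the closest clusters are too far apart
        clusters[i] += clusters.pop(j)
    return clusters
```

For example, 'Imola, Italy' and 'Imola' have overlap distance 0.0 and end up in the same cluster, while 'Paris' stays in a cluster of its own.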
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 * (Precision * Recall) / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, and the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) * Σ_{i=1}^{N} (1 / rank_i)    (14)
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since they varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that a Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries) and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1 Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content', i.e., our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question, creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest city in Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference, since it is included in the content section, and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word of the content that has no relation whatsoever to the actual answer. But it helps one to find and write down correct answers, compared with the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
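The core check of the tool can be sketched as follows (a simplified illustration, with names of our own: a typed candidate is accepted as a reference only if it occurs verbatim in the snippet's content):

```python
# Sketch of the core check in the reference-collection tool: a typed
# candidate is accepted as a reference only if it occurs verbatim in the
# snippet's content (names are ours; the real tool is interactive).
def add_reference(references, question_id, candidate, content):
    if candidate and candidate in content:
        references.setdefault(question_id, []).append(candidate)
        return True
    return False
```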
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions), i.e., answers that TREC judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references: pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his', plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
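This filtering step can be sketched as follows (the pronoun list follows the text; the function names are ours):

```python
# Sketch of the filtering step described above: drop questions containing
# unresolved third-person pronouns (function names are ours).
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question: str) -> bool:
    tokens = question.lower().rstrip("?").split()
    return any(t in PRONOUNS for t in tokens)

def filter_questions(questions):
    return [q for q in questions if not has_unresolved_reference(q)]
```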
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after fixing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, in several variations, as they do in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see over the next section.
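The paragraph-level retrieval idea can be sketched as follows (a toy keyword filter of our own, standing in for the actual Lucene paragraph index):

```python
# Toy sketch of the paragraph-level retrieval change: keep only the
# paragraphs that contain query keywords, instead of returning the whole
# article (the real system does this with a Lucene paragraph index).
def paragraph_passages(document: str, keywords):
    kw = {k.lower() for k in keywords}
    return [p for p in document.split("\n\n")
            if kw & set(p.lower().split())]
```

In the 'capital of Iraq' scenario above, only the paragraph actually mentioning the capital is returned, instead of the whole Washington-heavy article.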
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                     Google – 8 passages          Google – 32 passages
Extracted Answers   Questions  CAE      AS        Questions  CAE      AS
                               success  success              success  success
0                   21         0        –         19         0        –
1 to 10             73         62       53        16         11       11
11 to 20            32         29       22        13         9        7
21 to 40            11         10       8         61         57       39
41 to 60            1          1        1         30         25       17
61 to 100           0          –        –         18         14       11
> 100               0          –        –         5          4        3
All                 138        102      84        162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                 Google – 64 passages
Extracted        Questions  CAE      AS
Answers (#)      (#)        success  success
0                17         0        0
1 to 10          18         9        9
11 to 20         2          2        2
21 to 40         16         12       8
41 to 60         42         38       25
61 to 100        38         32       18
> 100            35         30       17
All              168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
        Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total    Q   CAE   AS       Q   CAE   AS           Q   CAE   AS
abb      4    3     1    1       3     1    1           4     1    1
des      9    6     4    4       6     4    4           6     4    4
ent     21   15     1    1      16     1    0          17     1    0
hum     64   46    42   34      55    48   36          56    48   31
loc     39   27    24   19      34    27   22          35    29   21
num     63   41    30   25      48    35   25          50    40   22
All    200  138   102   84     162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages across categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and
was therefore adopted permanently while searching for even better results.
The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, this configuration also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to group equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'
'Mt. Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words in each answer: LogSim is approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt. Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Oberbayern (Upper Bavaria)'
'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) on the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'

Solution: the solution should simply be to normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such
as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend abbreviation normalization to also cover acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from ideas collected in the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics into more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and another for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all across all the corpora.
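The scoring scheme just described can be sketched as follows; this is a minimal Python illustration (the actual module is written in Java, and the relation labels and data shapes here are assumptions for the example):

```python
# Hypothetical relation labels; the real system derives them from hand-written rules.
REL_WEIGHTS = {
    "FREQUENCY": 1.0,    # exact repetition of the same answer
    "EQUIVALENCE": 1.0,  # e.g. 'Queen Elizabeth' vs 'Elizabeth II'
    "INCLUSION": 0.5,    # e.g. '2009' vs 'April 3, 2009'
}

def boosted_score(base_score, relations):
    """Boost a candidate's score by the weighted relations it shares with other answers."""
    return base_score + sum(REL_WEIGHTS[r] for r in relations)

# A candidate that is equivalent to one answer and included in another:
print(boosted_score(2.0, ["EQUIVALENCE", "INCLUSION"]))  # 3.5
```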
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
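A minimal sketch of such a converter (the real normalizer is a Java class; the unit list, factors and output format below are illustrative assumptions):

```python
import re

# Illustrative conversion table: source unit -> (target unit, factor).
FACTORS = {
    "km": ("miles", 0.621371),
    "m": ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
}

def normalize_measure(answer):
    """Convert a 'number unit' answer into a single reference unit system."""
    match = re.match(r"([\d.]+)\s*(km|kg|m)\b", answer)
    if not match:
        return answer  # leave non-measure answers untouched
    value, unit = float(match.group(1)), match.group(2)
    target, factor = FACTORS[unit]
    return f"{value * factor:.1f} {target}"

print(normalize_measure("300 km"))  # 186.4 miles
print(normalize_measure("100 kg"))  # 220.5 pounds
```

With both '1000 miles' and '300 km' normalized to the same unit, equal quantities can fall into the same cluster regardless of the unit the web passage used.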
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious observation that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
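Combining the three normalizers above, a rough Python sketch could look like this (title and acronym lists are the examples from the text; the function name and structure are our assumptions, not the Java implementation):

```python
TITLES = {"Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"}
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_answer(answer):
    """Strip a leading title, expand known acronyms, and case-fold."""
    tokens = answer.split()
    # Only strip a title if something else follows ('Queen' alone may be the band).
    if len(tokens) > 1 and tokens[0].rstrip(".") in TITLES:
        tokens = tokens[1:]
    tokens = [ACRONYMS.get(t.replace(".", ""), t) for t in tokens]
    return " ".join(tokens).lower()

print(normalize_answer("Queen Elizabeth"))  # elizabeth
print(normalize_answer("GROZNY"))           # grozny
print(normalize_answer("U.S."))             # united states
```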
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
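In sketch form (Python; passage scanning is simplified here to the first occurrence of each string, while keeping the 20-character/5-point values from the text):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Boost an answer whose average character distance to the query keywords is small."""
    answer_pos = passage.find(answer)
    if answer_pos == -1:
        return 0
    distances = [abs(passage.find(kw) - answer_pos)
                 for kw in keywords if kw in passage]
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0

passage = "The Okavango River flows over 1000 miles"
print(proximity_boost(passage, "1000 miles", ["flows", "over"]))  # 5
print(proximity_boost(passage, "1000 miles", ["Okavango"]))       # 0
```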
4.3 Results
                  200 questions               1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%   P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for doubt about the relation type.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
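The core of this annotation loop can be sketched as follows (Python; the label strings follow the output format quoted above, everything else — function names, the input callback — is an assumption for illustration):

```python
LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",   # first answer includes the second
    "I2": "INCLUSION",   # second answer includes the first
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def parse_judgment(command):
    """Map a typed command to a relation label; None means 'ask again'."""
    command = command.strip().upper()
    if command == "SKIP":
        return "SKIP"
    return LABELS.get(command)

def annotate(question_id, answer_a, answer_b, read_input):
    """Keep prompting until a valid judgment (or skip) is given."""
    while True:
        label = parse_judgment(read_input())
        if label == "SKIP":
            return None
        if label is not None:
            return f"QUESTION {question_id}: {label} - {answer_a} AND {answer_b}"

# Simulated user typing an invalid command, then 'e':
responses = iter(["x", "e"])
print(annotate(5, "Lee Myung-bak", "Lee", lambda: next(responses)))
# QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
```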
In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets, A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements in common by the total number of distinct elements across the two sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 of Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)
This package contained other metrics of particular interest, such as the following two, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
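For illustration, a simplified American Soundex (our sketch, not the SimMetrics implementation; it ignores the special H/W separator rule) shows how 'Glomma' and 'Glama' end up with the same code:

```python
def soundex(word):
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in letters:
            codes[c] = digit
    word = word.upper()
    encoded = word[0]
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")  # vowels (and H, W, Y) encode to ""
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]

print(soundex("Glomma"), soundex("Glama"))   # G450 G450
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```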
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                       0.33                       0.5
Levenshtein          P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index        P 9.8   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5      P 9.9   R 55.5  A 5.5
Dice Coefficient     P 9.8   R 56.0  A 5.5      P 9.9   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5
Jaro                 P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Jaro-Winkler         P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Needleman-Wunsch     P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A 8.0      P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are achieved by Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but then adds extra weight when, besides at least one perfect match, another token shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we want to use this measure to its full potential, as we will see.
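Following equations (20)–(22), a Python sketch of the metric might look as follows. The original is a Java class; pairing the leftover tokens in order is our reading of the description, so treat this as an illustration rather than the exact implementation:

```python
def jfk_distance(a1, a2):
    """Token-level distance rewarding exact common terms and first-letter matches."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))  # Eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter_matches = sum(1 for x, y in zip(rest1, rest2)
                               if x[0].lower() == y[0].lower())
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0  # Eq. (22)
    return 1 - common_score - fl_score  # Eq. (20)

d = jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy")
print(round(d, 3))  # 0.417 ('Kennedy' matches; 'J.'/'John' and 'F.'/'Fitzgerald' share initials)
```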
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions               1210 questions
Before             P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%   P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
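Such a dispatch could be as simple as a lookup table; this is a hypothetical Python sketch (the metric functions are placeholders for the implementations discussed above, and this scheme was ultimately not adopted):

```python
# Placeholder metric implementations; the real ones live in the distances package.
def levenshtein(a, b): ...
def logsim(a, b): ...

METRIC_BY_CATEGORY = {
    "HUMAN": levenshtein,
    "LOCATION": logsim,
}

def pick_metric(category, default=levenshtein):
    """Choose the clustering metric based on the question category."""
    return METRIC_BY_CATEGORY.get(category, default)

print(pick_metric("LOCATION").__name__)  # logsim
print(pick_metric("NUMERIC").__name__)   # levenshtein
```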
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are
looking for), Table 10 shows the results for every metric incorporated, for the three main categories, Human, Location and Numeric, that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human     Location   Numeric
Levenshtein          50.00%    53.85%     4.84%
Overlap              42.42%    41.03%     4.84%
Jaccard index        7.58%     0.00%      0.00%
Dice                 7.58%     0.00%      0.00%
Jaro                 9.09%     0.00%      0.00%
Jaro-Winkler         9.09%     0.00%      0.00%
Needleman-Wunsch     42.42%    46.15%     4.84%
LogSim               42.42%    46.15%     4.84%
Soundex              42.42%    46.15%     3.23%
Smith-Waterman       9.09%     2.56%      0.00%
JFKDistance          54.55%    53.85%     4.84%

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy per category is achieved by JFKDistance (tied with Levenshtein for Location, and with several metrics for Numeric).
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are easily failed, since when we expect a number the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by a) ignoring 'one' to 'three' when written out as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
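Two of the fixes proposed above (discarding spelled-out small numbers for Numeric questions, and removing candidates that merely echo the question) can be sketched as simple filters. The function names and token handling here are illustrative, not JustAsk's actual code:

```python
import re

# rule (a) above: small numbers written out as words, to be ignored
SMALL_NUMBER_WORDS = {"one", "two", "three"}

def tokens(text: str) -> list:
    # lowercase word tokens, punctuation stripped
    return re.findall(r"[a-z0-9]+", text.lower())

def keep_numeric_candidate(answer: str) -> bool:
    # drop 'one'..'three' when they appear as word-form candidates
    return answer.lower().strip() not in SMALL_NUMBER_WORDS

def echoes_question(question: str, answer: str) -> bool:
    # drop candidates whose words are all contained in the question itself
    q = set(tokens(question))
    a = tokens(answer)
    return bool(a) and all(t in q for t in a)
```

In a real pipeline these predicates would run over the extracted candidate list before clustering, so that question echoes and spurious number words never reach the final ranking.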
After filling such gaps, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect possible new ways to improve JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use those rules as a distance metric in its own right.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: a formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa, Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa, Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask, a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached.
– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.
The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.
2.2.2 Methodologies purely based on lexical similarity
In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:
NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} Count_match(n-gram) / Count(n-gram)    (1)
in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
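A direct reading of Eq. (1) can be sketched as follows; whitespace tokenization is an assumption, and values of n longer than the shorter sentence are skipped since they produce no n-grams:

```python
def ngrams(tokens, n):
    # all contiguous n-token sequences of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ng_sim(s1: str, s2: str, n_max: int = 4) -> float:
    # Word Simple N-gram Overlap, Eq. (1): for each n = 1..N, the fraction
    # of the shorter sentence's n-grams also found in the longer sentence,
    # averaged over N.
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total = 0.0
    for n in range(1, n_max + 1):
        grams_short = ngrams(shorter, n)
        if not grams_short:   # shorter sentence has fewer than n words
            continue
        grams_long = set(ngrams(longer, n))
        total += sum(g in grams_long for g in grams_short) / len(grams_short)
    return total / n_max
```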
A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed extracting richer but also less parallel data.
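The Levenshtein distance mentioned here has a standard dynamic-programming formulation, sketched below:

```python
def levenshtein(a: str, b: str) -> int:
    # Minimum number of insertions, deletions and substitutions
    # needed to turn string a into string b (two-row DP).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion of ca
                current[j - 1] + 1,            # insertion of cb
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[len(b)]
```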
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new metric called LogSim was proposed by Cordeiro's team. The idea, for a pair of input sentences, is to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e. four common words between S1 and S2. The function in use is:
L(x, λ) = -log2(λ/x)    (2)
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:
logSim(S1, S2) = { L(x, λ),            if L(x, λ) < 1.0
                 { e^(-k * L(x, λ)),   otherwise           (3)
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x and not on y, i.e. the number of words of the shortest sentence, is not a perfect measure; after all, they are not completely independent, since λ ≤ y. Therefore an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
L(x, y, λ) is computed using weighting factors α and β (with β = 1 - α) applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus (see footnote 3) and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the first two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1,087 similar sentence pairs, with one sentence being a compressed version of the other.
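Equations (2) and (3) can be sketched directly; counting links as distinct shared lowercase words is a simplification of the exclusive lexical links used by Cordeiro et al.:

```python
import math

def logsim(s1: str, s2: str, k: float = 2.7) -> float:
    # LogSim, Eqs. (2)-(3): L(x, lambda) = -log2(lambda / x), where x is the
    # length of the longer sentence and lambda the number of lexical links.
    # Pairs that are nearly identical are deliberately penalized (2nd branch).
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))     # simplified link counting
    x = max(len(w1), len(w2))
    if links == 0 or x == 0:
        return 0.0
    l = -math.log2(links / x)
    return l if l < 1.0 else math.exp(-k * l)
```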
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e. producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
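A minimal sketch of this kind of canonicalization, with illustrative mapping tables (Zhang and Patrick's actual rules are richer and syntax-driven, covering voice transformation through dependency trees):

```python
import re

# Illustrative canonicalization tables, not the original rule set.
FUTURE_RE = re.compile(r"\b(?:plans to|expects to|intends to)\b", re.IGNORECASE)
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}

def canonicalize(sentence: str) -> str:
    # future-tense constructions -> 'will'
    s = FUTURE_RE.sub("will", sentence)
    # spelled-out small numbers -> digits
    words = [NUMBER_WORDS.get(w.lower(), w) for w in s.split()]
    return " ".join(words)
```

After both sentences of a pair are canonicalized this way, plain lexical features such as edit distance or N-gram overlap are applied to the canonical forms.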
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) - mismatch_penalty    (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max { s(S1, S2-1) - skip_penalty,
                  s(S1-1, S2) - skip_penalty,
                  s(S1-1, S2-1) + sim(S1, S2),
                  s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
                  s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
                  s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2) }    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections, from Encyclopedia Britannica and Britannica Elementary. Context matters.
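The recurrence in Eq. (5) can be sketched as follows, assuming a simple bag-of-words cosine for sim() and illustrative penalty values:

```python
import math

MISMATCH_PENALTY = 0.1   # illustrative value
SKIP_PENALTY = 0.05      # illustrative value

def sim(s1: str, s2: str) -> float:
    # Eq. (4): cosine over binary bag-of-words vectors, minus the penalty
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return -MISMATCH_PENALTY
    return len(w1 & w2) / math.sqrt(len(w1) * len(w2)) - MISMATCH_PENALTY

def alignment_weight(p1: list, p2: list) -> float:
    # s[i][j] = weight of the best alignment of p1[:i] with p2[:j], Eq. (5);
    # p1 and p2 are paragraphs given as lists of sentences.
    n, m = len(p1), len(p2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cand = [
                s[i][j - 1] - SKIP_PENALTY,
                s[i - 1][j] - SKIP_PENALTY,
                s[i - 1][j - 1] + sim(p1[i - 1], p2[j - 1]),
            ]
            if j >= 2:
                cand.append(s[i - 1][j - 2] + sim(p1[i - 1], p2[j - 1])
                            + sim(p1[i - 1], p2[j - 2]))
            if i >= 2:
                cand.append(s[i - 2][j - 1] + sim(p1[i - 1], p2[j - 1])
                            + sim(p1[i - 2], p2[j - 1]))
            if i >= 2 and j >= 2:
                cand.append(s[i - 2][j - 2] + sim(p1[i - 1], p2[j - 2])
                            + sim(p1[i - 2], p2[j - 1]))
            s[i][j] = max(cand)
    return s[n][m]
```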
This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e. the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus (see footnote 4), a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = (1/2) * ( Σ_{w∈T1} maxsim(w, T2) * idf(w) / Σ_{w∈T1} idf(w)
            + Σ_{w∈T2} maxsim(w, T1) * idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxsim(w, T2), and then apply the specificity via IDF, sum these values and normalize with the length of the text segment in question. The same process is applied to T2, and in the end we take the average between the two directions. They offer eight different similarity measures to calculate maxsim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow (lch) [18], Lesk [1], Wu and Palmer (wup) [46], Resnik [34], Lin [22], and Jiang and Conrath (jcn) [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e. the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk
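The aggregation in Eq. (6) can be sketched as below, assuming a toy idf() and reducing maxsim() to exact word match (in the original, maxsim is one of the eight measures just listed):

```python
import math

def idf(word: str, documents: list) -> float:
    # toy inverse document frequency over a list of word sets (smoothed)
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing)) + 1.0

def maxsim(word: str, segment: list) -> float:
    # stand-in for the corpus/knowledge-based measures: exact match only
    return 1.0 if word in segment else 0.0

def text_similarity(t1: list, t2: list, documents: list) -> float:
    # Eq. (6): average of the two directed, idf-weighted similarities
    def directed(src, dst):
        num = sum(maxsim(w, dst) * idf(w, documents) for w in src)
        den = sum(idf(w, documents) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```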
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e. methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors a and b, and the similarity is calculated using the following formula:
sim(a, b) = (a · W · b) / (|a| |b|)    (7)
with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them achieving an accuracy score higher than the approaches of Mihalcea, Qiu or Zhang, and five also achieving a better precision rating than those three approaches.
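Equation (7) with a hand-built W can be sketched as follows; in the original, W is filled in by the WordNet similarity metrics listed above:

```python
import math

def matrix_similarity(a: list, b: list, w: list) -> float:
    # Eq. (7): sim(a, b) = (a . W . b) / (|a| |b|), with a and b binary
    # occurrence vectors over the pair's vocabulary, and W the word-pair
    # semantic similarity matrix (supplied by hand in this sketch).
    n = len(a)
    awb = sum(a[i] * w[i][j] * b[j] for i in range(n) for j in range(n))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return awb / norm if norm else 0.0
```

With W set to the identity matrix this reduces to plain cosine similarity; off-diagonal entries are what let two different but related words contribute to the score.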
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence, a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (1.4) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies than that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
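The keyword-based strategy can be sketched in a few lines. This is only an illustration, assuming a toy stopword list; JustAsk's actual tokenization and stopword handling are not specified here.

```python
# Minimal sketch of the keyword-based query formulation strategy used for
# factoid questions: the query is built from the content words of the
# question. The stopword list below is an illustrative assumption.

STOPWORDS = {"when", "was", "is", "the", "a", "an", "of",
             "who", "what", "where", "did", "how"}

def keyword_query(question: str) -> str:
    tokens = question.lower().rstrip("?").split()
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, `keyword_query("When was Mozart born?")` reduces the question to the keywords `"mozart born"`, which are then sent to the search engine.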
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.
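The date normalization step can be sketched as follows. This is a simplified illustration that handles only the two surface forms mentioned above; the canonical 'D<day> M<month> Y<year>' format comes from the text, while the parsing details are assumptions.

```python
# Sketch of the Numeric:Date normalization: equivalent surface forms are
# mapped to the canonical 'D<day> M<month> Y<year>' representation.
# Only two common patterns are handled; anything else is returned unchanged.
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer: str) -> str:
    s = answer.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", s)       # 'March 21, 1961'
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", s)       # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return answer
```

Both example dates collapse to the same canonical string, so they end up in the same cluster.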
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|) (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and
min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer), and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will, in many cases, be directly influenced by its length.
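The overlap distance of Equation 9 and the single-link clustering step can be sketched together. This is an illustrative simplification: tokenization, the threshold value and the merge loop are assumptions, not JustAsk's actual implementation.

```python
# Sketch of the clustering step: overlap distance (Equation 9) plus a
# standard single-link agglomerative loop that merges the two closest
# clusters while their distance stays under a threshold.

def overlap_distance(x: str, y: str) -> float:
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clusters(answers, dist=overlap_distance, threshold=0.5):
    clusters = [[a] for a in answers]          # one cluster per answer
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: minimum distance between any two members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:                       # all remaining pairs above threshold
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)
```

Since 'Mogadishu' is contained in 'Mogadishu Somalia', their overlap distance is 0.0 and they are merged, while an unrelated answer stays in its own cluster.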
3.2 Experimental Setup
In this section, we describe the corpora used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus (11)

F-measure (F1):

F = (2 * precision * recall) / (precision + recall) (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * Σ_{i=1}^{N} 1/rank_i (14)
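Equation 14 translates directly into code. In the sketch below, the convention that a rank of 0 marks a query with no correct answer (contributing 0 to the sum) is our own assumption, not spelled out in the text.

```python
# Direct transcription of Equation 14: the mean of the reciprocal ranks of
# the first correct answer over the N queries of a test collection.
# Assumption: rank 0 marks a query with no correct answer, contributing 0.

def mean_reciprocal_rank(first_correct_ranks):
    n = len(first_correct_ranks)
    return sum(1.0 / r for r in first_correct_ranks if r > 0) / n
```

For three queries whose first correct answers appear at ranks 1 and 2 (the third unanswered), the MRR is (1 + 0.5 + 0) / 3 = 0.5.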
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input, and then seeing if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references (possible correct answers), initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64x200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu, Title=Mogadishu - Wikipedia, the free encyclopedia, Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1, Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each one of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?
The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
But there were still far less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
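The switch from document-level to paragraph-level retrieval can be illustrated with a toy inverted index. The actual system used Lucene; the sketch below only shows the idea of making each paragraph its own retrievable passage, with naive tokenization as an assumption.

```python
# Illustrative sketch of paragraph-level retrieval: each paragraph of a
# document becomes its own passage in a toy inverted index (Lucene stands
# in for this in the real system). Tokenization is deliberately naive.
from collections import defaultdict

def _tokens(text):
    return [t.strip(".,;:!?'\"").lower() for t in text.split()]

def build_paragraph_index(documents):
    index = defaultdict(set)              # term -> ids of paragraphs containing it
    paragraphs = []
    for doc in documents:
        for para in doc.split("\n\n"):    # one passage per paragraph, not per article
            pid = len(paragraphs)
            paragraphs.append(para)
            for term in _tokens(para):
                index[term].add(pid)
    return index, paragraphs

def retrieve(index, paragraphs, query):
    # Return only the paragraphs containing all query keywords.
    hits = set.intersection(*(index.get(t, set()) for t in _tokens(query)))
    return [paragraphs[i] for i in sorted(hits)]
```

With this index, a query like 'capital Iraq' returns only the paragraph actually mentioning both keywords, instead of a whole article dominated by unrelated entities.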
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200 | Correct: 88 | Wrong: 87 | No Answer: 25

Accuracy: 44% | Precision: 50.3% | Recall: 87.5% | MRR: 0.37 | Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3, we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
Extracted      Google – 8 passages      Google – 32 passages
Answers      Questions  CAE  AS       Questions  CAE  AS
0                21       0    –          19       0    –
1 to 10          73      62   53          16      11   11
11 to 20         32      29   22          13       9    7
21 to 40         11      10    8          61      57   39
41 to 60          1       1    1          30      25   17
61 to 100         0       –    –          18      14   11
> 100             0       –    –           5       4    3
All             138     102   84         162     120   88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are fully successful with 11 to 20 extracted answers. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4, we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
Extracted      Google – 64 passages
Answers      Questions  CAE  AS
0                17       0    0
1 to 10          18       9    9
11 to 20          2       2    2
21 to 40         16      12    8
41 to 60         42      38   25
61 to 100        38      32   18
> 100            35      30   17
All             168     123   79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

            Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total    Q   CAE   AS          Q   CAE   AS           Q   CAE   AS
abb      4    3     1    1          3     1    1           4     1    1
des      9    6     4    4          6     4    4           6     4    4
ent     21   15     1    1         16     1    0          17     1    0
hum     64   46    42   34         55    48   36          56    48   31
loc     39   27    24   19         34    27   22          35    29   21
num     63   41    30   25         48    35   25          50    40   22
All    200  138   102   84        162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%           28.2%           27.4%
Recall          77.2%           67.2%           76.9%
Accuracy        20.6%           19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
            20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates of categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim gives approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'Yes'.
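The proposed token-level metric can be sketched as follows. This is a minimal illustration of the idea only: the title list, the strip rules and the acceptance conditions are assumptions, not the final rules used in the system.

```python
# Sketch of the proposed token-level metric for proper names: normalize
# common titles, then match answers that share at least one token and whose
# remaining tokens agree on initials ('J. F. Kennedy' vs 'John Fitzgerald
# Kennedy'). The title list and acceptance rules are illustrative.

TITLES = {"mr", "mrs", "ms", "dr", "queen", "king", "president"}

def name_tokens(name: str):
    return [t.strip(".").lower() for t in name.split()
            if t.strip(".").lower() not in TITLES]

def same_entity(a: str, b: str) -> bool:
    ta, tb = name_tokens(a), name_tokens(b)
    if not set(ta) & set(tb):
        return False                 # no common token at all: reject
    if len(ta) != len(tb):
        return True                  # e.g. 'Broadus' vs 'Calvin Broadus'
    # Same length: require matching initials, in the same order.
    return all(x[0] == y[0] for x, y in zip(ta, tb))
```

Note the open question from the text remains: with a single common token and matching lengths but differing initials ('John Kennedy' vs. 'Jane Kennedy'), this sketch would still accept the pair, which is exactly the threshold decision discussed above.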
3. Places/dates/numbers that differ in specificity – some candidates of the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. Even small name variants, such as 'Glomma' and 'Glama', remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
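Such a normalizer can be sketched as a table of conversion factors to a canonical unit. The conversion factors below are standard; the parsing and the choice of canonical units (metres, kilograms) are illustrative assumptions.

```python
# Sketch of the proposed measure normalizer: distance and weight answers are
# converted to a canonical unit (metres, kilograms) so that '1000 miles' and
# a metric equivalent can be compared directly. Parsing is deliberately minimal.
import re

TO_CANONICAL = {
    "km": 1000.0, "kilometers": 1000.0, "m": 1.0, "meters": 1.0,
    "miles": 1609.344, "feet": 0.3048,                   # -> metres
    "kg": 1.0, "kilograms": 1.0, "pounds": 0.45359237,   # -> kilograms
}

def normalize_measure(answer: str):
    m = re.match(r"([\d.,]+)\s*([a-zA-Z]+)", answer.strip())
    if not m or m.group(2).lower() not in TO_CANONICAL:
        return None                  # not a recognized measure
    value = float(m.group(1).replace(",", ""))
    return value * TO_CANONICAL[m.group(2).lower()]
```

After normalization, '300 km' and '1000 miles' become directly comparable numbers in metres, so the clustering step can see that they denote different lengths rather than treating them as unrelated strings.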
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it exactly?
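For reference, the Soundex code mentioned above maps a name to a letter plus three digits, so that many spelling variants collapse to the same key. The sketch below is a simplified version of the classic algorithm (it does not implement the special handling of 'h' and 'w' between same-coded consonants).

```python
# Simplified Soundex sketch: consonants map to digit classes, vowels are
# dropped, adjacent duplicates collapse, and the code is padded to 4 chars.
# Variants like 'Glomma'/'Glama' or 'Kotashaan'/'Khotashan' get equal codes.

SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name: str) -> str:
    s = name.lower()
    first = s[0].upper()
    digits = [SOUNDEX_MAP.get(c, "") for c in s]
    code = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:          # skip vowels and repeated codes
            code.append(d)
        prev = d
    return (first + "".join(code) + "000")[:4]
```

Applying it to the pairs above, both misspelling variants receive the same code, which is exactly what makes Soundex useful as a fallback similarity check after case normalization.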
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section, we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), through the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over those of inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the final answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses a set of relations it shares with other answers – namely, the type of relation and the answer it is related to, plus a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears, without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
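As a sketch, the boosting scheme just described could look like the following (the function name `boost_scores` and the relation labels are our own illustrative names, not JustAsk's actual identifiers):

```python
# Hypothetical sketch of relation-based score boosting. Base scores already
# account for frequency (identical repeats); pairwise relations then add
# 1.0 for equivalence and 0.5 for inclusion, as in the first experiment.

WEIGHTS = {"EQUIVALENCE": 1.0, "INCLUDES": 0.5, "INCLUDED_BY": 0.5}

def boost_scores(base_scores, relations):
    """base_scores: dict answer -> score;
    relations: list of (answer_a, answer_b, relation_type) triples."""
    scores = dict(base_scores)
    for a, b, rel in relations:
        w = WEIGHTS.get(rel, 0.0)       # unknown relations add nothing
        scores[a] = scores.get(a, 0.0) + w
        scores[b] = scores.get(b, 0.0) + w
    return scores

scores = boost_scores(
    {"J.F. Kennedy": 1.0, "John Fitzgerald Kennedy": 1.0, "Nixon": 1.0},
    [("J.F. Kennedy", "John Fitzgerald Kennedy", "EQUIVALENCE")],
)
# The two answers sharing an equivalence relation now outscore the unrelated one.
```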
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
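A minimal sketch of such a conversion normalizer (the conversion constants are the standard ones; the function and category names follow the text, but are otherwise our own):

```python
# Sketch of the conversion normalizer: map metric values to the imperial
# unit chosen as the standard form, per answer category.

CONVERSIONS = {
    "Numeric:Distance": 0.621371,   # kilometres -> miles
    "Numeric:Size":     3.28084,    # meters -> feet
    "Numeric:Weight":   2.20462,    # kilograms -> pounds
}

def normalize(value, category):
    if category == "Numeric:Temperature":       # Celsius -> Fahrenheit
        return value * 9.0 / 5.0 + 32.0
    if category in CONVERSIONS:
        return value * CONVERSIONS[category]
    return value                                 # other categories untouched

fahrenheit = normalize(100.0, "Numeric:Temperature")   # 212.0
miles = normalize(10.0, "Numeric:Distance")
```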
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system's results, but it only evaluates very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
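A sketch of this proximity boost, under the assumption that distances are measured in characters between the first occurrences of the answer and of each keyword (the function name and exact matching strategy are ours):

```python
# Sketch of the proximity measure: boost an answer's score when its average
# character distance to the query keywords in the passage is under a threshold.
# The threshold (20 chars) and bonus (5 points) are the values from the text.

def proximity_boost(passage, answer, keywords, score, threshold=20, bonus=5):
    pos = passage.find(answer)
    if pos < 0 or not keywords:
        return score
    distances = [abs(passage.find(kw) - pos)
                 for kw in keywords if passage.find(kw) >= 0]
    if distances and sum(distances) / len(distances) < threshold:
        return score + bonus
    return score

passage = "Innsbruck hosted the Winter Olympics in 1964 and 1976."
score = proximity_boost(passage, "1964", ["Winter", "Olympics"], 10)
# average distance (19 + 12) / 2 = 15.5 < 20, so the score is boosted to 15
```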
4.3 Results
                   200 questions            1210 questions
Before features    P 50.6  R 87.0  A 44.0   P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, for that case, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
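The annotation loop described above could be sketched as follows (a simplification: the real application reads the questions/references file, handles single-reference questions, and writes to an output file; the names here are ours):

```python
# Sketch of the pairwise relation-annotation loop. `read` and `write` are
# injectable so the loop can be driven by stdin/stdout or by a test harness.
from itertools import combinations

OPTIONS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
           "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
           "D": "IN DOUBT"}

def annotate(question_id, references, read=input, write=print):
    for a, b in combinations(references, 2):
        choice = None
        while choice not in OPTIONS and choice != "skip":
            choice = read(f"Relation between '{a}' and '{b}'? ")
        if choice != "skip":
            write(f"QUESTION {question_id} {OPTIONS[choice]} - {a} AND {b}")

lines = []
annotate(5, ["Lee Myung-bak", "Lee"], read=lambda _: "E", write=lines.append)
```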
Figure 6: Execution of the application to create relations corpora in the command line.
In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets, A and B:

J(A,B) = \frac{|A \cap B|}{|A \cup B|}   (15)

In other words, the idea will be to divide the number of character sequences the two strings have in common by the total number of distinct character sequences across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}   (16)

Dice's coefficient will basically give double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)   (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)   (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i,j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}   (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
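For illustration, here are simplified re-implementations of four of the metrics above (bigram-based Jaccard and Dice, plus Jaro and Jaro-Winkler); these are our own sketches, not the SimMetrics implementations actually used:

```python
# Simplified versions of some of the similarity metrics discussed above.

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                  # find matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                 # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    l = 0                                       # common prefix, capped at 4
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)
```

On the 'Kennedy'/'Keneddy' example above, `dice` yields ≈ 0.83 and `jaccard` ≈ 0.71; on the classic 'MARTHA'/'MARHTA' pair, `jaro` yields ≈ 0.944 and `jaro_winkler` ≈ 0.961.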
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
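Single-link clustering with a distance threshold, as used in these experiments, can be sketched as follows (a naive merge loop; the toy distance function in the example is ours):

```python
# Sketch of threshold-based single-link clustering: repeatedly merge any two
# clusters containing at least one pair of answers whose distance is below
# the threshold, until no more merges are possible.

def single_link(answers, distance, threshold):
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(a, b) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy case-insensitive distance: 0 for equal strings ignoring case, else 1.
clusters = single_link(["Grozny", "GROZNY", "Moscow"],
                       lambda a, b: 0.0 if a.lower() == b.lower() else 1.0,
                       threshold=0.5)
# clusters the two spellings of Grozny together, leaving Moscow alone
```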
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1  R 87.0  A 44.5   P 43.7  R 87.5  A 38.5   P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0   P 33.0  R 87.0  A 33.0   P 36.4  R 86.5  A 31.5
Jaccard index        P 9.8   R 56.0  A 5.5    P 9.9   R 55.5  A 5.5    P 9.9   R 55.5  A 5.5
Dice's coefficient   P 9.8   R 56.0  A 5.5    P 9.9   R 56.0  A 5.5    P 9.9   R 55.5  A 5.5
Jaro                 P 11.0  R 63.5  A 7.0    P 11.9  R 63.0  A 7.5    P 9.8   R 56.0  A 5.5
Jaro-Winkler         P 11.0  R 63.5  A 7.0    P 11.9  R 63.0  A 7.5    P 9.8   R 56.0  A 5.5
Needleman-Wunsch     P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0   P 35.9  R 83.5  A 30.0   P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0   P 13.4  R 59.5  A 8.0    P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while additionally adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares the first letter with a token of the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}   (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}   (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions            1210 questions
Before new metric   P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0   P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features already present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), here are the results for every metric, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, which retrieve the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category   Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index       7.58     0.00       0.00
Dice                7.58     0.00       0.00
Jaro                9.09     0.00       0.00
Jaro-Winkler        9.09     0.00       0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman      9.09     2.56       0.00
JFKDistance         54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we add the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
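The filtering problem from the third point above (answer variations of the question surviving the filter) could be addressed with a token-subset check, sketched here (the function name and normalization are ours; 'Denise Huber' merely stands in for an answer that is not contained in the question):

```python
# Sketch of the suggested filtering rule: drop candidate answers whose tokens
# all occur in the question itself, so variations like 'John Famalaro' inside
# 'Who is John J. Famalaro accused of having killed?' are filtered out.

def filter_question_echoes(question, answers):
    q_tokens = set(question.lower().replace(".", "").rstrip("?").split())
    kept = []
    for a in answers:
        a_tokens = set(a.lower().replace(".", "").split())
        if not a_tokens <= q_tokens:     # keep only non-echoed answers
            kept.append(a)
    return kept

kept = filter_question_echoes(
    "Who is John J. Famalaro accused of having killed?",
    ["John Famalaro", "Denise Huber"],
)
# 'John Famalaro' is dropped as an echo of the question
```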
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (companion volume, short papers), NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. (Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.)
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word n-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure, called LogSim, was proposed by Cordeiro's team. The idea, for a pair of input sentences, is to count the exclusive lexical links between them. For instance, in these two sentences:
S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.
there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:
\[
L(x, \lambda) = -\log_2(\lambda / x) \quad (2)
\]
where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:
\[
\mathrm{logSim}(S_1, S_2) =
\begin{cases}
L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\
e^{-k \cdot L(x, \lambda)} & \text{otherwise}
\end{cases} \quad (3)
\]
This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments): as λ/x tends to 0, L(x, λ) tends to infinity. However, a metric depending only on x and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
The expanded L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in the two original corpora. When compared to the Levenshtein and n-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was part of the test corpus.
3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1,087 similar sentence pairs, with one sentence being a compressed version of the other.
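To make the metric concrete, here is a minimal sketch of basic LogSim, assuming plain whitespace tokenization and k = 2.7 (note that, with no stopword filtering, the link count for the example above differs slightly from the four links given in the text):

```python
import math

def logsim(s1: str, s2: str, k: float = 2.7) -> float:
    """Basic LogSim similarity between two sentences (equations (2)-(3))."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    links = len(set(t1) & set(t2))      # lexical links between the sentences
    x = max(len(t1), len(t2))           # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)           # L(x, lambda)
    return l if l < 1.0 else math.exp(-k * l)

print(round(logsim("Mary had a very nice dinner for her birthday",
                   "It was a successful dinner party for Mary on her birthday"), 3))
# → 0.874
```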
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic levels – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and n-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited: by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, they concluded that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases, the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of a lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
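As an illustration, here is a minimal sketch of this kind of numeric-entity normalization for dates; the patterns and the canonical output mimic the 'D21 M3 Y1961' example used elsewhere in this document, but this is not JustAsk's actual code:

```python
import re
from typing import Optional

# Map several surface date forms to one canonical representation,
# e.g. 'March 21, 1961' and '21 March 1961' -> 'D21 M3 Y1961'.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def canonicalize_date(text: str) -> Optional[str]:
    t = text.lower().replace(",", "").strip()
    m = re.fullmatch(r"(\w+) (\d{1,2}) (\d{4})", t)   # 'March 21, 1961'
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.fullmatch(r"(\d{1,2}) (\w+) (\d{4})", t)   # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return None                                        # unrecognized form

print(canonicalize_date("March 21, 1961"))   # → D21 M3 Y1961
print(canonicalize_date("21 March 1961"))    # → D21 M3 Y1961
```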
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
\[
\mathrm{sim}(S_1, S_2) = \cos(S_1, S_2) - \mathit{mismatch\_penalty} \quad (4)
\]
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
\[
s(S_1, S_2) = \max
\begin{cases}
s(S_1, S_2 - 1) - \mathit{skip\_penalty} \\
s(S_1 - 1, S_2) - \mathit{skip\_penalty} \\
s(S_1 - 1, S_2 - 1) + \mathrm{sim}(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + \mathrm{sim}(S_1, S_2) + \mathrm{sim}(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + \mathrm{sim}(S_1, S_2) + \mathrm{sim}(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + \mathrm{sim}(S_1, S_2 - 1) + \mathrm{sim}(S_1 - 1, S_2)
\end{cases} \quad (5)
\]
with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
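A compact sketch of this recurrence, with a bag-of-words cosine standing in for the similarity function and illustrative penalty values (Barzilay and Elhadad's actual parameters differ):

```python
import math
from functools import lru_cache

def cos_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors."""
    ta, tb = a.lower().split(), b.lower().split()
    num = sum(ta.count(w) * tb.count(w) for w in set(ta) & set(tb))
    den = math.sqrt(sum(ta.count(w) ** 2 for w in set(ta))) * \
          math.sqrt(sum(tb.count(w) ** 2 for w in set(tb)))
    return num / den if den else 0.0

def align_score(p1, p2, mismatch_penalty=0.1, skip_penalty=0.2):
    """Weight of the optimal alignment between two paragraphs,
    given as lists of sentences (the recurrence in equation (5))."""
    def sim(i, j):
        return cos_sim(p1[i], p2[j]) - mismatch_penalty   # equation (4)

    @lru_cache(maxsize=None)
    def s(i, j):
        if i < 0 or j < 0:
            return 0.0
        cands = [s(i, j - 1) - skip_penalty,
                 s(i - 1, j) - skip_penalty,
                 s(i - 1, j - 1) + sim(i, j)]
        if j >= 1:
            cands.append(s(i - 1, j - 2) + sim(i, j) + sim(i, j - 1))
        if i >= 1:
            cands.append(s(i - 2, j - 1) + sim(i, j) + sim(i - 1, j))
        if i >= 1 and j >= 1:
            cands.append(s(i - 2, j - 2) + sim(i, j - 1) + sim(i - 1, j))
        return max(cands)

    return s(len(p1) - 1, len(p2) - 1)
```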
The results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections, from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples for which no relevant match was found (unpaired tuples), which represent the extra information that one sentence contains and the other does not, are passed through a dissimilarity classifier to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity of two segments, combining corpus-based and knowledge-based measures, they were able to obtain much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents containing the word. For the corpus, they used the British National Corpus4, a collection of 100 million words gathering both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
\[
\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(\frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2) \cdot \mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1) \cdot \mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)}\right) \quad (6)
\]
For each word w in segment T1, the goal is to identify the word in T2 that matches it best, given by maxSim(w, T2), weight that match by the word's specificity via IDF, sum the weighted matches, and normalize by the total IDF mass of the segment. The same process is applied in the opposite direction, from T2 to T1, and the two directional scores are averaged. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
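The bidirectional formula (6) can be sketched as follows, with a trivial exact-match stand-in for maxSim and a toy IDF table (a real system would plug in one of the eight measures above and IDF values from the BNC):

```python
def text_sim(t1, t2, idf, max_sim):
    """Mihalcea-style text-to-text similarity (equation (6)): average of
    the two IDF-weighted directional scores."""
    def directional(a, b):
        num = sum(max_sim(w, b) * idf.get(w, 1.0) for w in a)
        den = sum(idf.get(w, 1.0) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directional(t1, t2) + directional(t2, t1))

# Illustrative maxSim: 1.0 on exact match, 0.0 otherwise; a real system
# would use one of the corpus- or knowledge-based measures instead.
exact = lambda w, seg: 1.0 if w in seg else 0.0
idf = {"mogadishu": 3.0, "capital": 1.5}   # toy IDF values
print(round(text_sim("the capital of somalia".split(),
                     "mogadishu is the capital".split(), idf, exact), 3))
```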
Kozareva, Vázquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we have, therefore, two vectors a and b, and the similarity is calculated using the following formula:
\[
\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a}\, W\, \vec{b}^{\,T}}{\lVert \vec{a} \rVert\, \lVert \vec{b} \rVert} \quad (7)
\]
with W being a semantic similarity matrix containing information about the similarity level of a word pair – in other words, each Wij holds a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with a better precision rating than those three approaches.
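A minimal sketch of equation (7), with a hand-filled similarity matrix W for illustration (in the original work, W is derived from the WordNet metrics listed above):

```python
import math

def matrix_sim(a, b, W):
    """Equation (7): sim(a, b) = (a · W · b) / (|a||b|), with a and b binary
    word vectors and W a word-to-word semantic similarity matrix."""
    # aw[j] is the j-th component of the row vector a·W
    aw = [sum(a[i] * W[i][j] for i in range(len(a))) for j in range(len(b))]
    num = sum(aw[j] * b[j] for j in range(len(b)))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den if den else 0.0

# Identical one-word sentences with an identity W give similarity 1.0.
print(matrix_sim([1, 0], [1, 0], [[1.0, 0.0], [0.0, 1.0]]))   # → 1.0
```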
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
\[
\mathrm{Paraphrase}(S_1, S_2) = \mathrm{Sim}(S_1, S_2) \,/\, \mathrm{Diss}(S_1, S_2) \quad (8)
\]
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are non-paraphrases; accordingly, Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several other interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.
For all of these techniques, we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1,087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter, we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that represents the type of answer we look for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., excerpts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types: factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish answer variation, converting equivalent answers into a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
\[
\mathrm{Overlap}(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \quad (9)
\]
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
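The clustering step can be sketched as follows (simplified: overlap distance only, and a naive single-link loop; JustAsk's actual implementation may differ):

```python
def overlap_distance(x: str, y: str) -> float:
    """Equation (9): 0.0 when one answer equals or is contained in the other."""
    tx, ty = x.lower().split(), y.lower().split()
    shared = len(set(tx) & set(ty))
    return 1.0 - shared / min(len(tx), len(ty))

def single_link_clusters(answers, threshold=0.5):
    """Merge the two closest clusters while their single-link distance
    (minimum distance between any two members) is under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        d, i, j = min((overlap_distance(a, b), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))
                      for a in clusters[i] for b in clusters[j])
        if d >= threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

clusters = single_link_clusters(["Mogadishu", "Mogadishu Somalia", "Rome"])
reps = [max(c, key=len) for c in clusters]   # each representative: longest answer
print(reps)   # → ['Mogadishu Somalia', 'Rome']
```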
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to the evaluation of QA systems [43], are used:
Precision:
\[
\mathrm{Precision} = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \quad (10)
\]

Recall:
\[
\mathrm{Recall} = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \quad (11)
\]

F-measure (F1):
\[
F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (12)
\]

Accuracy:
\[
\mathrm{Accuracy} = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \quad (13)
\]
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, and the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:
\[
\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i} \quad (14)
\]
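These measures translate directly into code; a minimal sketch (the counts and rank lists here are illustrative):

```python
def precision(correct: int, retrieved: int) -> float:
    """Correctly judged items / relevant (non-null) items retrieved."""
    return correct / retrieved if retrieved else 0.0

def recall(retrieved: int, total: int) -> float:
    """Relevant (non-null) items retrieved / items in the test corpus."""
    return retrieved / total if total else 0.0

def f_measure(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(correct: int, total: int) -> float:
    return correct / total if total else 0.0

def mrr(first_correct_ranks) -> float:
    """Rank of the first correct answer per query; 0 when none is correct."""
    return sum(1 / r for r in first_correct_ranks if r > 0) / len(first_correct_ranks)

# e.g. three queries whose first correct answers appear at ranks 1, 2 and 4:
print(round(mrr([1, 2, 4]), 3))   # → 0.583
```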
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]: a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1 Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding one or more references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
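The core loop of this tool can be sketched as follows (the function name and data layout are illustrative; the real application also strips the other attributes and writes the list to an output file):

```python
def collect_references(questions, snippets_by_question, read_input=input):
    """For each question, show each snippet's Content and record any typed
    reference that actually occurs in that snippet."""
    references = []                          # an 'array of arrays', one per question
    for qi, question in enumerate(questions):
        refs = []
        for snippet in snippets_by_question.get(qi, []):
            print(question)
            print(snippet)                   # only the Content attribute is shown
            answer = read_input("reference> ")
            if answer and answer in snippet: # accepted only if found in the content
                refs.append(answer)
        references.append(refs)
    return references
```

Passing `read_input` as a parameter keeps the loop testable without a live command line.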
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) - answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.
The first and most simple solution was to filter out the questions that contained unknown third-party references - pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' - plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
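This pronoun-based filtering step can be sketched as follows (a simple sketch of the idea; the actual filtering application may have worked differently):

```python
# Third-party pronouns whose antecedent was lost by the filtering step.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unknown_reference(question):
    """True if the question contains a pronoun whose referent
    is not recoverable from the question itself."""
    tokens = question.lower().rstrip("?").split()
    return any(t in PRONOUNS for t in tokens)

def filter_questions(questions):
    """Keep only the questions without dangling pronoun references."""
    return [q for q in questions if not has_unknown_reference(q)]
```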
But much less obvious cases were still present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy - candidate answers not appearing as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.
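The switch from document-level to paragraph-level passages can be sketched as follows (a toy sketch of the idea only; the thesis uses Lucene for the actual indexing and retrieval):

```python
def paragraph_passages(documents, keywords):
    """Instead of returning whole articles, return only the
    paragraphs that contain at least one query keyword."""
    passages = []
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            if any(k.lower() in paragraph.lower() for k in keywords):
                passages.append(paragraph)
    return passages
```

This keeps the answer-bearing paragraph close to the query terms, and discards the rest of a long article that may drown the correct answer in unrelated entities.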
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44.0%     50.3%      87.5%   0.37  35.4%
Table 1: JustAsk baseline best results for the test corpus of 200 questions.
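The figures in Table 1 follow directly from the counts in its first row (attempted = correct + wrong = total - no answer); as a quick sanity check of the arithmetic:

```python
def evaluate(total, correct, wrong, no_answer):
    """Standard QA evaluation ratios from raw answer counts."""
    attempted = correct + wrong        # also equals total - no_answer
    accuracy = correct / total         # 88/200  = 0.44
    precision = correct / attempted    # 88/175  ~ 0.503
    recall = attempted / total         # 175/200 = 0.875
    return accuracy, precision, recall

acc, prec, rec = evaluate(200, 88, 87, 25)
```

The ratio Precision@1 / Precision (0.354 / 0.503) gives the ~70.4% of attempted-and-correct questions whose correct answer was pushed to the top.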
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one final answer is extracted from the list of candidate answers. The tables also present the distribution of questions by number of extracted answers.
Extracted      Google - 8 passages          Google - 32 passages
Answers (#)    Questions  CAE      AS       Questions  CAE      AS
                          success  success             success  success
0              21         0        -        19         0        -
1 to 10        73         62       53       16         11       11
11 to 20       32         29       22       13         9        7
21 to 40       11         10       8        61         57       39
41 to 60       1          1        1        30         25       17
61 to 100      0          -        -        18         14       11
> 100          0          -        -        5          4        3
All            138        102      84       162        120      88
Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted      Google - 64 passages
Answers (#)    Questions  CAE success  AS success
0              17         0            0
1 to 10        18         9            9
11 to 20       2          2            2
21 to 40       16         12           8
41 to 60       42         38           25
61 to 100      38         32           18
> 100          35         30           17
All            168        123          79
Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
           Google - 8 passages    Google - 32 passages   Google - 64 passages
     Total  Q    CAE   AS         Q    CAE   AS          Q    CAE   AS
abb  4      3    1     1          3    1     1           4    1     1
des  9      6    4     4          6    4     4           6    4     4
ent  21     15   1     1          16   1     0           17   1     0
hum  64     46   42    34         55   48    36          56   48    31
loc  39     27   24    19         34   27    22          35   29    21
num  63     41   30    25         48   35    25          50   40    22
All  200    138  102   84         162  120   88          168  123   79
Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages across these categories, showing that more effort should be put into a certain stage for a certain category - such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall
scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions  232 questions  1210 questions
Precision   26.7%           28.2%          27.4%
Recall      77.2%           67.2%          76.9%
Accuracy    20.6%           19.0%          21.1%
Table 5: Baseline results for the three sets of questions that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module - mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted - for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.
2. Parts of person names/proper nouns - several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs 'Elizabeth II'
– 'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
– 'River Glomma' vs 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs 'Utiel-Requena' vs 'Requena'
– 'Calvin Broadus' vs 'Broadus'
Solution: we might write a small set of hand-crafted rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words in each answer: LogSim gives us approximately 0.014, while Levenshtein actually gives us a better score (0.59).
With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity - some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs 'Mount Kenya National Park'
– 'Bavaria' vs 'Oberbayern (Upper Bavaria)'
– '2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either answer correct, with 'Ferrara' included in Italy (and being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers into a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues - many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues - such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc).
6. Misspellings/formatting issues - many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, a few remain unresolved, such as:
– 'Kotashaan' vs 'Khotashan'
– 'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such
as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms - seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include these acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships - valuing relationships of equivalence over inclusion, alternative and especially contradictory answers - which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis - a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers - namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency - i.e., the number of times the exact same answer appears, without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
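A sketch of how this relation-based boosting can work; only the weights come from the text, while the data layout and function names are illustrative assumptions:

```python
# Weights from the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion (in either direction).
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0,
           "includes": 0.5, "included_by": 0.5}

def boost(candidates):
    """candidates: dict mapping an answer to the list of
    (relation_type, other_answer) pairs it participates in.
    Returns a score per answer, summing the weight of each relation."""
    return {answer: sum(WEIGHTS[rel] for rel, _ in relations)
            for answer, relations in candidates.items()}
```

For instance, 'April 3, 2009' appearing once and including '2009' would score 1.0 + 0.5 = 1.5 under this scheme.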
Figure 5: Relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
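The conversion normalizer can be sketched as follows; the conversion factors are standard, while function names and the choice of reference units are illustrative:

```python
# One converter per normalized Numeric type.
def km_to_miles(km):    return km * 0.621371
def m_to_feet(m):       return m * 3.28084
def kg_to_pounds(kg):   return kg * 2.20462
def c_to_fahrenheit(c): return c * 9.0 / 5.0 + 32.0

def normalize(value, unit):
    """Convert a (value, unit) answer into a single reference unit,
    so that e.g. '100 km' and '62.1 miles' can fall in the same cluster."""
    table = {"km": km_to_miles,   "miles": lambda v: v,
             "m": m_to_feet,      "feet": lambda v: v,
             "kg": kg_to_pounds,  "pounds": lambda v: v,
             "C": c_to_fahrenheit, "F": lambda v: v}
    return table[unit](value)
```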
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T', or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided yet another example we would probably be missing: 'Mr King' - in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a cautionary reminder that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' - for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it only evaluates very particular cases for the moment, leaving room for further extensions.
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters long), we boost that answer's score (by 5 points).
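A minimal sketch of this proximity boost, working over character offsets within a single passage; the 20-character threshold and 5-point bonus come from the text, everything else is an illustrative assumption:

```python
THRESHOLD = 20   # maximum average distance, in characters
BONUS = 5        # score boost for nearby answers

def average_distance(passage, answer, keywords):
    """Average character distance between the answer's position
    and each query keyword found in the passage."""
    pos = passage.find(answer)
    distances = [abs(passage.find(k) - pos) for k in keywords if k in passage]
    return sum(distances) / len(distances) if distances else float("inf")

def boosted_score(score, passage, answer, keywords):
    """Boost the score when the answer sits close to the keywords."""
    if average_distance(passage, answer, keywords) < THRESHOLD:
        return score + BONUS
    return score
```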
4.3 Results
                  200 questions               1210 questions
Before features   P: 50.6  R: 87.0  A: 44.0   P: 28.0  R: 79.5  A: 22.2
After features    P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, considering the effort and hope put into the relations module (Section 4.1) in particular.
The variation in results is minimal - only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the input command fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he or she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
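The mapping from typed commands to relation labels can be sketched as follows (names are illustrative; a `None` result signals the script to ask for the judgement again):

```python
RELATIONS = {"E": "EQUIVALENCE",
             "I1": "INCLUSION (first includes second)",
             "I2": "INCLUSION (second includes first)",
             "A": "COMPLEMENTARY ANSWERS",
             "C": "CONTRADICTORY ANSWERS",
             "D": "IN DOUBT"}

def judge(command):
    """Map a command-line judgement to a relation label,
    or pass through the 'skip' command."""
    if command == "skip":
        return "skip"
    return RELATIONS.get(command)  # None -> unrecognized, ask again
```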
Figure 6: Execution of the application to create relations corpora in the command line.
In the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to run new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) - 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] - the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):
J(A, B) = |A ∩ B| / |A ∪ B|    (15)
In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's coefficient [6] - related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)
Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of the two strings [15]. These sets are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven distinct bigrams.
– Jaro [13, 12] - this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:
Jaro(s1, s2) = (1/3) (m/|s1| + m/|s2| + (m - t)/m)    (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] - an extension of the Jaro metric. This distance is given by:
dw = dj + (l p (1 - dj))    (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance does. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] - a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):
D(i, j) = min( D(i-1, j-1) + G,  D(i-1, j) + G,  D(i, j-1) + G )    (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] - with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] - a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm - 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) - in order to evaluate how much the threshold affects the performance of each metric.
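For illustration, the bigram-based Jaccard and Dice measures from the example above, and a Needleman-Wunsch distance following equation (19) with a zero cost for matching characters, can be sketched as follows (our own sketches, not the SimMetrics implementations):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def needleman_wunsch(s1, s2, G=1):
    """Edit distance with variable gap cost G; matches cost 0,
    so with G = 1 this reduces to the Levenshtein distance."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * G
    for j in range(1, m + 1):
        D[0][j] = j * G
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + G,
                          D[i][j - 1] + G)
    return D[n][m]
```

On the 'Kennedy'/'Keneddy' pair, `dice` reproduces the 5/6 of the worked example and `jaccard` the 5/7.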
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that gives the system its best results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient      P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed as follows:
distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
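Reading formula (22) as halving the ratio of first-letter matches to unmatched tokens, the metric can be sketched as follows (an illustrative simplification: whitespace tokenization stands in for the actual normalization step):

```python
def jfk_distance(a1, a2):
    # distance = 1 - commonTermsScore - firstLetterMatchesScore (eqs. 20-22)
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))          # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter = sum(1 for a in rest1 for b in rest2
                       if a[0].lower() == b[0].lower())
    non_common = len(rest1) + len(rest2)
    # half the weight of a perfect token match, as described above (eq. 22)
    fl_score = (first_letter / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score
```

For 'J F Kennedy' vs. 'John Fitzgerald Kennedy' this gives 1 − 1/3 − 1/4 ≈ 0.42, low enough to cluster the pair, where Levenshtein and Overlap stay far above any useful threshold.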
A few flaws of this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions             1210 questions
Before              P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric that offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), Table 10 shows the results for every metric incorporated, over the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunsch     42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how each metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After inserting missing references such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
ndash Use the rules as a distance metric of its own
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 could help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume (NAACL-Short '03), pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. US patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
2.2.3 Canonicalization as a complement to lexical features
Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kinds of equivalences involving Numeric Entities that can be considered answers for certain types of questions are already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.
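The future-tense mapping just described can be illustrated with a toy rule (a hypothetical sketch; Zhang and Patrick's canonicalization operates over parse structures rather than a single regular expression):

```python
import re

def canonicalize_future(sentence):
    # Map future-tense constructions such as 'plans to' / 'expects to'
    # onto the modal 'will', producing a 'generic' canonical form.
    return re.sub(r"\b(?:plans|expects)\s+to\b", "will", sentence)
```

For example, canonicalize_future("The company plans to expand") yields "The company will expand", which can then be compared with lexical matching features.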
2.2.4 Use of context to extract similar phrases
While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:
sim(S1, S2) = cos(S1, S2) − mismatch_penalty   (4)
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:
s(S1, S2) = max {
    s(S1, S2−1) − skip_penalty,
    s(S1−1, S2) − skip_penalty,
    s(S1−1, S2−1) + sim(S1, S2),
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1),
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2),
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)
}   (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections, from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' of two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus they used the British National Corpus (http://www.natcorp.ox.ac.uk), a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:
sim(T1, T2) = 1/2 · ( Σ_{w∈T1} maxSim(w, T2) · idf(w) / Σ_{w∈T1} idf(w)
            + Σ_{w∈T2} maxSim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and to weight it by its specificity via IDF; these values are summed and normalized by the total IDF of the segment in question. The same process is applied in the direction of T2, and in the end the two directed scores are averaged. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22] and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics:
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.
ndash Lin ndash takes the Resnik measure and normalizes it by using also the Infor-mation Content of the two nodes received as input
ndash Jcn ndash uses the same information as Lin but combined in a different way
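Formula (6) can be sketched as follows, with maxSim left pluggable; exact string match is used here as a stand-in for the eight corpus- and knowledge-based measures listed above:

```python
def segment_similarity(t1, t2, idf, maxsim):
    # eq. (6): average of the two directed similarities; each direction sums
    # the best word matches weighted by IDF, normalized by the total IDF.
    def directed(src, tgt):
        num = sum(maxsim(w, tgt) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den

    return 0.5 * (directed(t1, t2) + directed(t2, t1))

def exact_maxsim(word, segment):
    # placeholder for maxSim(w, T): 1.0 on an exact match, 0.0 otherwise
    return 1.0 if word in segment else 0.0
```

Swapping exact_maxsim for a WordNet-based measure (lch, wup, res, lin, jcn or lesk) recovers the knowledge-based variants described above.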
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.
– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.
ndash Cosine measure ndash by using Relevant Domains instead of word frequency
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, ~a and ~b, and the similarity is calculated using the following formula:
sim(~a, ~b) = (~a W ~b) / (|~a| |~b|)   (7)
with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than those of Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three metrics.
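Formula (7) in code form (an illustrative sketch; in practice W is filled by one of the six WordNet metrics). When W is the identity matrix, the measure reduces to the plain cosine similarity of the binary vectors:

```python
import math

def matrix_similarity(a, b, W):
    # a, b: binary occurrence vectors over the sentence pair's vocabulary;
    # W[i][j]: similarity in [0, 1] between word i and word j.
    num = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / norm
```

A non-identity W lets two sentences score above their purely lexical overlap, which is exactly the point of the approach.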
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted (syntactic) dependency relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are therefore non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to that of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and positions in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
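Under simplifying assumptions (uniform dependency weights, no WordNet lookup), the score of Equation 8 and its threshold test can be sketched as follows; the `epsilon` guard is a hypothetical addition to handle identical dependency sets:

```python
def paraphrase_score(deps1, deps2, epsilon=0.1):
    """Equation 8: Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2).

    deps1, deps2: sets of (head, modifier, label) dependency triples.
    Uniform weights are used here; Lintean and Rus weight each dependency
    by its label and by its depth in the dependency tree.
    """
    sim = len(deps1 & deps2)              # paired (common) dependencies
    diss = len(deps1 ^ deps2) or epsilon  # unpaired dependencies
    return sim / diss

T = 1.0  # threshold; tuned on training data in the original work
d1 = {('ate', 'cat', 'subj'), ('ate', 'fish', 'obj')}
d2 = {('ate', 'cat', 'subj'), ('ate', 'tuna', 'obj')}
score = paraphrase_score(d1, d2)  # 1 common / 2 unpaired = 0.5
print(score >= T)  # False: not deemed paraphrases
```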
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics that deal with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these are several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the paraphrase identification formula.
For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two with negative pairs, in order to balance the numbers of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. Figure 1 shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives a natural language question as input and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that represents the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives the interpreted question from the preceding module as input. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric-type questions), named entities (recognizing four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish answer variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
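A minimal sketch of such a date normalizer follows; the 'D21 M3 Y1961' target format is the one quoted above, but the parsing here covers only these two date layouts, whereas JustAsk's actual normalizer covers more:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'])}

def normalize_date(text):
    """Map 'March 21, 1961' or '21 March 1961' to 'D21 M3 Y1961'."""
    day = month = year = None
    for tok in re.findall(r'[a-z]+|\d+', text.lower()):
        if tok in MONTHS:
            month = MONTHS[tok]
        elif tok.isdigit():
            if len(tok) == 4:
                year = int(tok)   # four digits: treat as the year
            else:
                day = int(tok)    # otherwise: treat as the day
    return f'D{day} M{month} Y{year}'

print(normalize_date('March 21, 1961'))  # D21 M3 Y1961
print(normalize_date('21 March 1961'))   # D21 M3 Y1961
```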
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of one cluster and any member of the other. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, provided they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
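The two pieces just described – the overlap distance of Equation 9 and the threshold-based single-link clustering – can be sketched as follows (a simplified reimplementation for illustration, not JustAsk's actual code):

```python
def overlap_distance(x, y):
    """Equation 9: Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xs & ys) / min(len(xs), len(ys))

def single_link_cluster(answers, distance, threshold):
    """Merge the two closest clusters while their single-link distance
    (minimum pairwise distance between members) is below the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        d, i, j = min((distance(a, b), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))
                      for a in clusters[i] for b in clusters[j])
        if d >= threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

answers = ['Mogadishu', 'Mogadishu Somalia', 'Washington']
clusters = single_link_cluster(answers, overlap_distance, threshold=0.5)
print([len(c) for c in clusters])  # [2, 1]: 'Washington' stays apart
```

Note how the contained answer 'Mogadishu' gets distance 0.0 to 'Mogadishu Somalia', exactly the behaviour described above for answers contained in one another.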
3.2 Experimental Setup
Over this section we will describe the evaluation measures used and the corpora created to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:
Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)
Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)
F-measure (F1):

    F = 2 · (precision · recall) / (precision + recall)    (12)
Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, and the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) · Σ_{i=1}^{N} (1 / rank_i)    (14)
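As a concrete illustration, a small sketch of Equation 14, here representing a query with no correct answer as `None` (one common convention: such queries contribute 0 to the sum):

```python
def mean_reciprocal_rank(ranks):
    """Equation 14: MRR = (1/N) * sum over i of 1/rank_i.

    ranks: for each query, the rank of the first correct answer,
    or None when no correct answer was returned (contributes 0).
    """
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Three queries: correct at rank 1, correct at rank 3, never correct.
print(round(mean_reciprocal_rank([1, 3, None]), 3))  # 0.444
```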
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also contains the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more precise), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, as opposed to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, taking the file with all 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing their original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referent, when the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
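This filtering step amounts to a simple token test over each question; a sketch, using the pronoun list given above:

```python
PRONOUNS = {'he', 'she', 'it', 'her', 'him', 'hers', 'his'}

def has_dangling_reference(question):
    """True if the question contains a third-party pronoun whose
    referent presumably lives in a previous question of the topic."""
    return any(tok.strip('?,.') in PRONOUNS
               for tok in question.lower().split())

questions = ['When did she die?', 'Where was Mozart born?',
             'Who was his father?']
kept = [q for q in questions if not has_dangling_reference(q)]
print(kept)  # ['Where was Mozart born?']
```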
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

    question 1129: Q: Where was the festival held?
    question 1130: Q: What is the name of the artistic director of the festival?
    question 1131: Q: When was the festival held?
    question 1132: Q: Which actress appeared in two films shown at the festival?
    question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
    question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed, using Lucene Search as the tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also wondered whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with the Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                     Google – 8 passages           Google – 32 passages
Extracted Answers    Questions   CAE       AS      Questions   CAE       AS
                     (#)         success   success (#)         success   success
0                    21          0         –       19          0         –
1 to 10              73          62        53      16          11        11
11 to 20             32          29        22      13          9         7
21 to 40             11          10        8       61          57        39
41 to 60             1           1         1       30          25        17
61 to 100            0           –         –       18          14        11
> 100                0           –         –       5           4         3
All                  138         102       84      162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 extracted answers. When that number increases, the accuracies show no concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity the system only extracts one final answer no
                     Google – 64 passages
Extracted Answers    Questions   CAE       AS
                     (#)         success   success
0                    17          0         0
1 to 10              18          9         9
11 to 20             2           2         2
21 to 40             16          12        8
41 to 60             42          38        25
61 to 100            38          32        18
> 100                35          30        17
All                  168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.
          Google – 8 passages    Google – 32 passages   Google – 64 passages
    Total Q     CAE    AS        Q     CAE    AS        Q     CAE    AS
abb 4     3     1      1         3     1      1         4     1      1
des 9     6     4      4         6     4      4         6     4      4
ent 21    15    1      1         16    1      0         17    1      0
hum 64    46    42     34        55    48     36        56    48     31
loc 39    27    24     19        34    27     22        35    29     21
num 63    41    30     25        48    35     25        50    40     22
    200   138   102    84        162   120    88        168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results at the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also makes evident the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with the TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions   232 questions   1210 questions
Precision  26.7%            28.2%           27.4%
Recall     77.2%            67.2%           76.9%
Accuracy   20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long to run as just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs. 'Elizabeth II'
– 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
– 'River Glomma' vs. 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
– 'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy'), out of three words for each answer; so LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
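The proposed token-level metric can be sketched as follows. This is an illustrative implementation of the heuristic above, not JustAsk's final code: the `TITLES` set stands in for the hand-written normalization rules, and the alignment test accepts a token of the shorter answer if it matches a full token or, when it is a single letter, an initial of the longer one:

```python
TITLES = {'mr', 'mrs', 'queen', 'mt', 'mount'}  # stand-in for hand-written rules

def _tokens(answer):
    """Lowercase, drop periods, strip title words."""
    return [t for t in answer.replace('.', ' ').lower().split()
            if t not in TITLES]

def same_entity(a, b):
    """Heuristic: a and b likely name the same entity if they share a
    token and every remaining token of the shorter answer matches a
    token, or an initial, of the longer one."""
    ta, tb = _tokens(a), _tokens(b)
    if not set(ta) & set(tb):
        return False
    short, long_ = sorted((ta, tb), key=len)
    initials = [t[0] for t in long_]
    return all(t in long_ or (len(t) == 1 and t in initials) for t in short)

print(same_entity('J. F. Kennedy', 'John Fitzgerald Kennedy'))  # True
print(same_entity('Queen Elizabeth', 'Elizabeth II'))           # True
print(same_entity('Calvin Broadus', 'Broadus'))                 # True
print(same_entity('Imola', 'Italy'))                            # False
```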
3. Places/dates/numbers that differ in specificity – some candidates over the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs. 'Mount Kenya National Park'
– 'Bavaria' vs. 'Overbayern, Upper Bavaria'
– '2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. Even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
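A sketch of such a normalizer for distances, converting every value to a canonical unit (meters) before comparison; the conversion table is illustrative and would need to be extended to weights and other measures:

```python
UNIT_TO_METERS = {'m': 1.0, 'km': 1000.0, 'kilometers': 1000.0,
                  'mi': 1609.344, 'miles': 1609.344,
                  'ft': 0.3048, 'feet': 0.3048}

def to_meters(value, unit):
    """Convert a distance answer to meters so that, e.g., '1000 miles'
    and '300 km' become directly comparable."""
    return value * UNIT_TO_METERS[unit.lower()]

# The Okavango example above: 1000 miles vs. 300 km.
print(to_meters(1000, 'miles') > to_meters(300, 'km'))  # True
```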
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?
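Since Soundex keys are cheap to compute, the effect on the pairs above can be checked directly. Below is an illustrative sketch of the standard American Soundex algorithm in Python (JustAsk itself is a Java system, so this is not its actual implementation):

```python
def soundex(word):
    """Standard American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not reset the previous code
            prev = code
    return (result + "000")[:4]
```

Both misspelled pairs discussed here collapse to the same key: 'Kotashaan' and 'Khotashan' map to K325, and 'Glomma' and 'Glama' map to G450.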
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the final answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely, the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
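The boosting scheme just described can be sketched in a few lines (illustrative Python; the weights are the ones from the first experiment, and the relation labels are stand-ins for the module's actual types):

```python
# Sketch of the score boosting described above: 1.0 per frequency or
# equivalence relation, 0.5 per inclusion relation.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boost_score(base_score, relations):
    """relations: (relation_type, related_answer) pairs attached to a
    candidate answer by the rules of the relations module."""
    return base_score + sum(WEIGHTS.get(rel, 0.0) for rel, _ in relations)

# e.g. a candidate with one equivalence and one inclusion relation:
# boost_score(2.0, [("equivalence", "JFK"), ("inclusion", "John F. Kennedy")])
# = 2.0 + 1.0 + 0.5 = 3.5
```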
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
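A minimal sketch of such a conversion normalizer follows (illustrative Python; the unit names and rounding are assumptions, and the real normalizer belongs to JustAsk's Java code base):

```python
# Illustrative conversion normalizer: each supported unit maps to the
# target unit used for comparison, with standard conversion factors.
CONVERSIONS = {
    "km": ("miles", lambda v: v / 1.609344),
    "meters": ("feet", lambda v: v / 0.3048),
    "kg": ("pounds", lambda v: v / 0.45359237),
    "celsius": ("fahrenheit", lambda v: v * 9.0 / 5.0 + 32.0),
}

def normalize_measure(value, unit):
    """Return (value, unit) converted to the target unit, if supported."""
    unit = unit.lower()
    if unit in CONVERSIONS:
        target, convert = CONVERSIONS[unit]
        return round(convert(value), 3), target
    return value, unit  # unknown unit: leave the answer untouched
```

With this, '100 Celsius' and '212 Fahrenheit' normalize to the same value and can be clustered together.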
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
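The guarded removal just described – title stripping with an exception list – can be sketched as follows (illustrative Python; the exception entries are the examples from the text, not an actual list from the system):

```python
# Illustrative title-stripping normalizer with an exception list for
# answers where the 'title' is actually part of the entity's name.
TITLES = ("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.")
EXCEPTIONS = {"Mr. T", "Queen"}  # answers that must be kept whole

def strip_titles(answer):
    if answer in EXCEPTIONS:
        return answer
    for title in TITLES:
        # only strip a title at the very beginning of the answer
        if answer.startswith(title + " "):
            return answer[len(title) + 1:]
    return answer
```

This keeps 'Queen' (the band) intact while still reducing 'Dr. John Smith' and 'King Harald V' to the bare names.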
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only evaluates very particular cases, but it allows room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
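A sketch of this boost, under the simplifying assumption that distances are measured between the first character offsets of the answer and of each keyword in the passage (the thesis does not spell out the exact offsets used):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Return the score boost (in points) for an answer whose average
    character distance to the query keywords is below the threshold.
    The values (20 characters, 5 points) are the ones described above."""
    a_pos = passage.find(answer)
    positions = [passage.find(k) for k in keywords if k in passage]
    if a_pos < 0 or not positions:
        return 0
    avg = sum(abs(a_pos - p) for p in positions) / len(positions)
    return boost if avg < threshold else 0
```

For the passage "Innsbruck hosted the Winter Olympics in 1964", the answer '1964' sits on average 15.5 characters away from the keywords 'Winter' and 'Olympics', so its score is boosted.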
4.3 Results
                   200 questions               1210 questions
Before features    P: 50.6  R: 87.0  A: 44.0   P: 28.0  R: 79.5  A: 22.2
After features     P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
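The annotation loop can be condensed into a short sketch (illustrative Python; the labels and output format follow the text, while file handling and the single-reference shortcut are omitted):

```python
# Condensed sketch of the annotation loop described above.
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def judge(question_id, ref_a, ref_b, read=input):
    """Ask for a judgment until a valid option (or 'skip') is given."""
    while True:
        cmd = read("E/I1/I2/A/C/D or skip? ").strip()
        if cmd == "skip":
            return None
        if cmd in LABELS:
            return ("QUESTION %d %s - %s AND %s"
                    % (question_id, LABELS[cmd], ref_a, ref_b))
        # anything else: repeat the judgment, as the script does
```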
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|   (15)
In other words, the idea is to divide the number of elements the two sequences have in common by the number of distinct elements across both sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)
Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed and dy. Dice gives us (2 × 5)/(6 + 6) = 10/12 ≈ 0.83; Jaccard, dividing instead by the seven elements of the union, gives 5/7 ≈ 0.71.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:
dw = dj + l·p·(1 − dj)   (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):
D(i, j) = min{ D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
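The set-based coefficients above, taken over character bigrams, can be sketched in a few lines (illustrative Python; the thesis uses the SimMetrics Java implementations):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    # double weight on the shared bigrams
    x, y = bigrams(a), bigrams(b)
    return 2.0 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    # shared bigrams over the union of both sets
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / float(len(x | y))
```

For 'Kennedy' vs. 'Keneddy' this yields a Dice score of 10/12 and a Jaccard score of 5/7, confirming that Dice rewards the overlap more generously.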
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best system results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                      0.33                      0.5
Levenshtein          P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap              P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index        P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Dice coefficient     P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Jaro                 P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler         P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Needleman-Wunsch     P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim               P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex              P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman       P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A: 8.0    P: 10.4 R: 57.5 A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)
in which the scores for the common terms and for the first-letter matches of each token are given by:
commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
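Equations (20)–(22) can be sketched as follows (illustrative Python; whitespace tokenization after stripping periods, and counting one first-letter match per unmatched token, are our reading of the description, not necessarily the thesis' exact implementation):

```python
def jfk_distance(a1, a2):
    """Sketch of the JFKDistance described above (equations 20-22)."""
    t1 = a1.replace(".", " ").split()
    t2 = a2.replace(".", " ").split()
    common = set(t1) & set(t2)
    common_score = len(common) / float(max(len(t1), len(t2)))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # first-letter matches among the unmatched tokens, halved (eq. 22)
    matches = sum(1 for x in rest1 if any(y[0] == x[0] for y in rest2))
    non_common = len(rest1) + len(rest2)
    fl_score = (matches / float(non_common)) / 2.0 if non_common else 0.0
    return 1.0 - common_score - fl_score
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − (2/4)/2 ≈ 0.42, low enough to cluster the two answers, whereas Levenshtein stays near 0.5.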
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions               1210 questions
Before             P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
After new metric   P: 51.7  R: 87.0  A: 45.0   P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
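The decision tree sketched above amounts to a per-category dispatch table (illustrative Python; the metric names are placeholders for the actual distance functions):

```python
# Hypothetical per-category metric dispatch implementing the scheme above.
CATEGORY_METRIC = {
    "Human": "Levenshtein",
    "Location": "LogSim",
}

def metric_for(category, default="Levenshtein"):
    """Pick the clustering metric to use for a question category."""
    return CATEGORY_METRIC.get(category, default)
```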
For that effort, we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, over the three main categories Human, Location and Numeric that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.
Metric \ Category   Human   Location  Numeric
Levenshtein         50.00   53.85     4.84
Overlap             42.42   41.03     4.84
Jaccard index       7.58    0.00      0.00
Dice                7.58    0.00      0.00
Jaro                9.09    0.00      0.00
Jaro-Winkler        9.09    0.00      0.00
Needleman-Wunsch    42.42   46.15     4.84
LogSim              42.42   46.15     4.84
Soundex             42.42   46.15     3.23
Smith-Waterman      9.09    2.56      0.00
JFKDistance         54.55   53.85     4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single static answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents all the possible 'correct' answers, which may be missing viable references.
After filling in gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways of improving JustAsk's results with the Answer Extraction module;
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max {
    s(S1, S2 - 1) - skip_penalty,
    s(S1 - 1, S2) - skip_penalty,
    s(S1 - 1, S2 - 1) + sim(S1, S2),
    s(S1 - 1, S2 - 2) + sim(S1, S2) + sim(S1, S2 - 1),
    s(S1 - 2, S2 - 1) + sim(S1, S2) + sim(S1 - 1, S2),
    s(S1 - 2, S2 - 2) + sim(S1, S2 - 1) + sim(S1 - 1, S2)
}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
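The recurrence in formula (5) can be sketched as a dynamic program over two lists of sentences. This is a minimal illustration, not the original implementation: the `sim` function below is a toy word-overlap stand-in for whichever sentence similarity measure is used, and the penalty value is an arbitrary assumption.

```python
def sim(a, b):
    """Toy sentence similarity: word overlap ratio (a stand-in for the real measure)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def align_score(doc1, doc2, skip_penalty=0.3):
    """Weight of the optimal alignment between two sentence lists, per formula (5)."""
    n, m = len(doc1), len(doc2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                s[i][j - 1] - skip_penalty,                       # skip a sentence of doc2
                s[i - 1][j] - skip_penalty,                       # skip a sentence of doc1
                s[i - 1][j - 1] + sim(doc1[i - 1], doc2[j - 1]),  # 1-1 alignment
            ]
            if j >= 2:  # 1-2 alignment
                cands.append(s[i - 1][j - 2] + sim(doc1[i - 1], doc2[j - 1])
                             + sim(doc1[i - 1], doc2[j - 2]))
            if i >= 2:  # 2-1 alignment
                cands.append(s[i - 2][j - 1] + sim(doc1[i - 1], doc2[j - 1])
                             + sim(doc1[i - 2], doc2[j - 1]))
            if i >= 2 and j >= 2:  # 2-2 alignment
                cands.append(s[i - 2][j - 2] + sim(doc1[i - 1], doc2[j - 2])
                             + sim(doc1[i - 2], doc2[j - 1]))
            s[i][j] = max(cands)
    return s[n][m]
```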
Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.
This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it alerts us to the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).
2.2.5 Use of dissimilarities and semantics to provide better similarity measures
In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, to determine whether the two sentences can be deemed paraphrases.
Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.
They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given
by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) * [ ( Σ_{w∈T1} maxSim(w, T2) · idf(w) ) / ( Σ_{w∈T1} idf(w) )
            + ( Σ_{w∈T2} maxSim(w, T1) · idf(w) ) / ( Σ_{w∈T2} idf(w) ) ]    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF; these weighted scores are summed and normalized with the sum of the IDF values of the words in the segment. The same process is applied to T2, and in the end we take the average of the two directions, T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most
4 http://www.natcorp.ox.ac.uk
specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
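The text-to-text similarity of formula (6) can be sketched independently of which of the eight word-to-word metrics is plugged in. In the sketch below, `word_sim` stands for any of those measures and `idf` for a precomputed IDF dictionary; both stand-ins are assumptions for illustration.

```python
def max_sim(w, segment, word_sim):
    """maxSim(w, T): best word-to-word similarity of w against a segment."""
    return max((word_sim(w, w2) for w2 in segment), default=0.0)

def text_similarity(t1, t2, word_sim, idf):
    """Formula (6): IDF-weighted, bidirectional average of maxSim scores.

    t1, t2 are token lists; word_sim is any word-to-word metric in [0, 1];
    idf maps words to IDF weights (missing words default to 1.0)."""
    def directed(src, dst):
        num = sum(max_sim(w, dst, word_sim) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

With an exact-match `word_sim`, identical segments score 1.0 and disjoint segments score 0.0, as the formula requires.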
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories. Comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with each element equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| · |b|)    (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with a better precision rating than those three as well.
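Formula (7) can be sketched directly for binary word vectors. The vocabulary ordering and the toy matrix in the test are illustrative assumptions; in the actual approach W is filled with WordNet-based word-pair similarities.

```python
import math

def matrix_similarity(vec_a, vec_b, W):
    """Formula (7): sim(a, b) = (a . W . b) / (|a| * |b|).

    vec_a, vec_b are binary word-presence vectors over the same vocabulary;
    W[i][j] holds the semantic similarity (0..1) of vocabulary words i and j."""
    n = len(vec_a)
    a_W_b = sum(vec_a[i] * W[i][j] * vec_b[j]
                for i in range(n) for j in range(n))
    norm = (math.sqrt(sum(x * x for x in vec_a))
            * math.sqrt(sum(x * x for x in vec_b)))
    return a_W_b / norm if norm else 0.0
```

With W as the identity matrix the measure reduces to plain cosine similarity of the binary vectors; off-diagonal entries are what let semantically related (non-identical) words contribute.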
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences S1 and S2, and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information.

To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
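The ratio of formula (8) and its thresholding can be sketched over plain dependency sets. This is a deliberately simplified assumption: dependencies are reduced to unweighted (head, modifier) pairs, whereas Lintean and Rus weight them by type and tree depth, and the threshold value here is illustrative rather than learned from training data.

```python
def paraphrase_score(deps1, deps2):
    """Formula (8) over sets of (head, modifier) dependency pairs.

    Sim counts shared dependencies; Diss counts unpaired ones on both sides.
    A zero dissimilarity yields an infinite score (perfect paraphrase)."""
    common = deps1 & deps2
    unpaired = (deps1 - deps2) | (deps2 - deps1)
    return len(common) / len(unpaired) if unpaired else float("inf")

def is_paraphrase(deps1, deps2, threshold=1.0):
    """Threshold T is illustrative; in the paper it is tuned on training data."""
    return paraphrase_score(deps1, deps2) > threshold
```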
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the sake of synthesis, the work presented here is the work we found to have the most potential for being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of the article, which is returned as the answer.
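The keyword-based strategy can be sketched in a few lines. The stopword list below is a small illustrative assumption; JustAsk's actual formulator may keep or drop different words.

```python
# Hypothetical, minimal stopword list for illustration only.
STOPWORDS = {"what", "who", "when", "where", "which", "how",
             "is", "was", "are", "were", "the", "a", "an", "of",
             "did", "do", "does"}

def keyword_query(question):
    """Keyword-based query formulation: keep the content words of the question."""
    tokens = question.lower().strip("?").split()
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, 'When was Mozart born?' is reduced to the query 'mozart born'.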
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The former deals with extracting the candidate answers, the latter with normalization, clustering and filtering of the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to reduce the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
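The date normalization step can be sketched as follows. This is a hedged illustration, not JustAsk's actual normalizer: only the two surface patterns mentioned above are handled, and unrecognized strings are passed through unchanged.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(text):
    """Map equivalent date strings to one canonical form, e.g. 'D21 M3 Y1961'."""
    t = text.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", t)      # 'March 21, 1961'
    if m and m.group(1) in MONTHS:
        return f"D{m.group(2)} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", t)      # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{m.group(1)} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return text  # unrecognized formats are left untouched
```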
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:
Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned earlier, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
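The overlap distance of formula (9) and the single-link clustering it feeds can be sketched together. The threshold value below is illustrative; how JustAsk tokenizes answers is also a simplifying assumption here.

```python
def overlap_distance(x, y):
    """Formula (9): 1 - |X ∩ Y| / min(|X|, |Y|) over token sets."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_cluster(answers, dist=overlap_distance, threshold=0.33):
    """Single-link agglomerative clustering: repeatedly merge the two closest
    clusters while their (minimum pairwise) distance is under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        i, j = best
        if min(dist(a, b) for a in clusters[i] for b in clusters[j]) > threshold:
            break  # remaining clusters are all farther apart than the threshold
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Note how 'Mogadishu' and 'Mogadishu, Somalia' get distance 0.0 (one is contained in the other) and end up in the same cluster, while an unrelated answer stays apart.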
3.2 Experimental Setup

In this section we will describe the evaluation measures and the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = 2 * Precision * Recall / (Precision + Recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * Σ_{i=1}^{N} 1/rank_i    (14)
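Formulas (10) to (14) translate directly into code. A brief sketch, where `answered` is the number of non-null answers returned, `correct` how many of those were judged correct, `total` the number of questions in the test corpus, and `ranks` holds the rank of the first correct answer per question (`None` when no correct answer was returned):

```python
def precision(correct, answered):
    return correct / answered                      # formula (10)

def recall(answered, total):
    return answered / total                        # formula (11)

def f_measure(p, r):
    return 2 * p * r / (p + r)                     # formula (12)

def accuracy(correct, total):
    return correct / total                         # formula (13)

def mrr(ranks):
    """Formula (14): unanswered questions contribute a reciprocal rank of 0."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

Plugging in the baseline counts reported later (88 correct out of 175 answered, over 200 questions) reproduces the precision, recall and accuracy figures of Table 1.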
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
   URL=http://en.wikipedia.org/wiki/Mogadishu
   Title=Mogadishu - Wikipedia, the free encyclopedia
   Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...
   Rank=1
   Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each of them presents all the snippets, one at a time, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we took before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
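The core loop of the tool in Figures 2 and 3 can be condensed as below. File formats, prompts and function names are simplified assumptions; the one faithful detail is the acceptance rule: an input is stored as a reference only if it is contained in the snippet's content.

```python
def collect_references(questions, snippets_per_question, read_input=input):
    """Interactive reference gathering: show each snippet, accept any typed
    string that occurs in the snippet's content as a new reference.

    `read_input` is injectable so the loop can be driven non-interactively."""
    references = {q: [] for q in questions}
    for q in questions:
        for snippet in snippets_per_question.get(q, []):
            print(f"Q: {q}\nContent: {snippet}")
            answer = read_input("reference (empty to skip): ").strip()
            if answer and answer in snippet:   # accepted only if in the content
                references[q].append(answer)
    return references
```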
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we knew the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder how much we can improve something that has no clear reference, because the question itself references the previous one, and so forth.
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
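This filtering step amounts to a token-level check against the pronoun list above. A minimal sketch (the tokenization is a simplifying assumption):

```python
# Pronoun list mirroring the one given in the text.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_dangling_reference(question):
    """True when the question contains an unresolved third-person pronoun."""
    tokens = question.lower().strip("?").split()
    return any(t in PRONOUNS for t in tokens)

def filter_questions(questions):
    """Keep only the questions whose referents are self-contained."""
    return [q for q in questions if not has_dangling_reference(q)]
```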
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after correcting a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene as the indexing tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy: candidate answers do not appear as often, and in as many variations, as in Google.
We also questioned whether the big problem lies in the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
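The paragraph-level alternative can be illustrated as follows, assuming paragraphs separated by blank lines; this is a sketch of the idea, not the Lucene-based implementation used in JustAsk:

```python
# Instead of returning a whole NYT article as one passage, keep only the
# paragraphs that actually contain at least one of the query keywords.
def paragraph_passages(document: str, keywords: list[str]) -> list[str]:
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    hits = []
    for paragraph in paragraphs:
        low = paragraph.lower()
        if any(k.lower() in low for k in keywords):
            hits.append(paragraph)
    return hits

article = ("Baghdad is the capital of Iraq.\n\n"
           "Washington pressed for sanctions this week.")
passages = paragraph_passages(article, ["Iraq", "capital"])
```

In the 'capital of Iraq' scenario above, only the paragraph that mentions the keywords is returned, so the repeated 'Washington' mentions elsewhere in the article never reach the answer extractor.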
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%
Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. They also present a distribution of the number of questions with a certain amount of extracted answers.
                    Google – 8 passages        Google – 32 passages
Extracted Answers   Questions   CAE    AS      Questions   CAE    AS
0                   21          0      –       19          0      –
1 to 10             73          62     53      16          11     11
11 to 20            32          29     22      13          9      7
21 to 40            11          10     8       61          57     39
41 to 60            1           1      1       30          25     17
61 to 100           0           –      –       18          14     11
> 100               0           –      –       5           4      3
All                 138         102    84      162         120    88
Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                    Google – 64 passages
Extracted Answers   Questions   CAE    AS
0                   17          0      0
1 to 10             18          9      9
11 to 20            2           2      2
21 to 40            16          12     8
41 to 60            42          38     25
61 to 100           38          32     18
> 100               35          30     17
All                 168         123    79
Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
       Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total   Q    CAE   AS      Q    CAE   AS            Q    CAE   AS
abb  4       3    1     1       3    1     1             4    1     1
des  9       6    4     4       6    4     4             6    4     4
ent  21      15   1     1       16   1     0             17   1     0
hum  64      46   42    34      55   48    36            56   48    31
loc  39      27   24    19      34   27    22            35   29    21
num  63      41   30    25      48   35    25            50   40    22
All  200     138  102   84      162  120   88            168  123   79
Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category, such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall
scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, it also takes more than three times longer than running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module, mainly while clustering the answer candidates (failing to cluster equal or inclusive answers into the same cluster), but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1,000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1,000 miles.'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'
'Mt. Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt. Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Overbayern, Upper Bavaria'
'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' unclustered, but more generally pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we solved all of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and we have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers, namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
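Under the weights just described (1.0 for frequency and equivalence, 0.5 for inclusion), the boosting step can be sketched as follows; the data layout and names are hypothetical, not the actual Scorer interface:

```python
# Each candidate answer carries the list of relations it participates in;
# its cluster score is the weighted sum of those relations.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boost(candidates: dict[str, list[str]]) -> dict[str, float]:
    return {answer: sum(WEIGHTS.get(rel, 0.0) for rel in relations)
            for answer, relations in candidates.items()}

scores = boost({
    "Ferrara": ["frequency", "inclusion"],            # included in 'Italy'
    "Italy": ["frequency", "frequency", "inclusion"],  # appears twice
})
```

Swapping the equivalence and inclusion weights for the Entity category, as tested above, is just a per-category copy of `WEIGHTS` with the 1.0 and 0.5 exchanged.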
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
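A sketch of such a conversion normalizer, using standard conversion factors; the class and function names are illustrative, not the JustAsk code (note that temperature needs an affine conversion, not a plain factor, so it gets its own rule):

```python
# Numeric answers are rewritten in a canonical unit so that e.g. '300 km'
# and '186.4 miles' can fall into the same cluster.
CONVERSIONS = {
    "km": ("miles", 0.621371),
    "meters": ("feet", 3.28084),
    "kilograms": ("pounds", 2.20462),
}

def normalize_measure(value: float, unit: str) -> tuple[float, str]:
    """Convert a (value, unit) pair to its canonical counterpart."""
    if unit in CONVERSIONS:
        target, factor = CONVERSIONS[unit]
        return (round(value * factor, 1), target)
    return (value, unit)

def celsius_to_fahrenheit(degrees: float) -> float:
    # Affine, not multiplicative: F = C * 9/5 + 32.
    return degrees * 9 / 5 + 32
```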
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King'. In this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
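The title-stripping normalizer, together with the exception list suggested above, might look like this; the names and the exact exception set are assumptions for illustration:

```python
# Leading titles are stripped so 'Queen Elizabeth' and 'Elizabeth II' get a
# chance to cluster; known problem answers are whitelisted as-is.
TITLES = {"Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr."}
EXCEPTIONS = {"Mr. T", "Queen"}  # hypothetical exception list

def strip_title(answer: str) -> str:
    if answer in EXCEPTIONS:
        return answer
    tokens = answer.split()
    # Only strip a title at the very beginning, and never leave an empty answer.
    if len(tokens) > 1 and tokens[0] in TITLES:
        return " ".join(tokens[1:])
    return answer
```

Note how 'Queen Elizabeth' loses its title while the single-token 'Queen' and the whitelisted 'Mr. T' survive intact.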
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also covers only very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
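A sketch of this heuristic with the thresholds given above (20 characters, 5-point boost); reading 'distance' as character offsets between first occurrences is one possible interpretation, and the names are ours:

```python
# Boost an answer's score when it sits, on average, close to the query
# keywords inside the retrieved passage.
THRESHOLD, BOOST = 20, 5

def average_keyword_distance(passage: str, answer: str,
                             keywords: list[str]) -> float:
    low = passage.lower()
    pos_answer = low.find(answer.lower())
    distances = []
    for keyword in keywords:
        pos_kw = low.find(keyword.lower())
        if pos_answer >= 0 and pos_kw >= 0:
            distances.append(abs(pos_answer - pos_kw))
    return sum(distances) / len(distances) if distances else float("inf")

def boosted_score(score: float, passage: str, answer: str,
                  keywords: list[str]) -> float:
    if average_keyword_distance(passage, answer, keywords) < THRESHOLD:
        return score + BOOST
    return score
```

For a passage like 'Paris is the capital of France', the answer 'Paris' averages 18.5 characters from the keywords 'capital' and 'France', so its score gets the 5-point boost.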
4.3 Results
                  200 questions                  1210 questions
Before features   P 50.6%   R 87.0%   A 44.0%    P 28.0%   R 79.5%   A 22.2%
After features    P 51.1%   R 87.0%   A 44.5%    P 28.2%   R 79.8%   A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal: only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
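The annotation loop can be sketched as follows; the prompt text and function signature are hypothetical, but the options and the output format follow the description above:

```python
import itertools

# Option codes accepted from the annotator, mapped to the output labels.
LABELS = {"E": "EQUIVALENCE",
          "I1": "INCLUSION (1 includes 2)",
          "I2": "INCLUSION (2 includes 1)",
          "A": "COMPLEMENTARY ANSWERS",
          "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(question_id, references, ask=input, out=print):
    if len(references) < 2:
        return  # single reference: advance to the next question
    for first, second in itertools.combinations(references, 2):
        cmd = ask(f"Relation between '{first}' and '{second}'? ")
        while cmd not in LABELS and cmd != "skip":
            cmd = ask("Please repeat the judgment: ")
        if cmd != "skip":
            out(f"QUESTION {question_id}: {LABELS[cmd]} - {first} AND {second}")
```

Running it on question 5 with references 'Lee Myung-bak' and 'Lee' and answering 'E' reproduces the example line above.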
Figure 6: Execution of the application to create the relations corpora in the command line.
Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library): 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):
J(A, B) = |A ∩ B| / |A ∪ B|  (15)
In other words, the idea is to divide the number of characters in common by the total number of characters of the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)  (16)
Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven elements.
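Both coefficients over bigram sets can be computed in a few lines; this is a generic implementation for illustration, not the SimMetrics one used in JustAsk:

```python
def bigrams(s: str) -> set[str]:
    # Set of overlapping two-character substrings of s.
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a: str, b: str) -> float:
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```

On the 'Kennedy'/'Keneddy' pair this yields 5/6 for Dice and 5/7 for Jaccard, matching the worked example.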
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:
Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)  (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:
dw = dj + l · p · (1 − dj)  (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.
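A compact implementation of Jaro and the Winkler prefix boost (p = 0.1, prefix capped at four characters, as in Winkler's work); this is a generic sketch, not the SimMetrics code:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    # Characters match only within a sliding window of half the longer string.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the number of matched characters out of order.
    k = mismatched = 0
    for i, flagged in enumerate(m1):
        if flagged:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                mismatched += 1
            k += 1
    t = mismatched / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    dj = jaro(s1, s2)
    prefix = 0
    while prefix < min(4, len(s1), len(s2)) and s1[prefix] == s2[prefix]:
        prefix += 1
    return dj + prefix * p * (1 - dj)
```

On the classic 'MARTHA'/'MARHTA' pair, Jaro gives 17/18 ≈ 0.944 and the three-character common prefix lifts Jaro-Winkler to about 0.961.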
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):
D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }  (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much it affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                       0.33                       0.5
Levenshtein          P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index        P 9.8   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5      P 9.9   R 55.5  A 5.5
Dice coefficient     P 9.8   R 56.0  A 5.5      P 9.9   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5
Jaro                 P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Jaro-Winkler         P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Needleman-Wunsch     P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A 8.0      P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in %, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers. Levenshtein will be closer to 0.5; still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.
The new distance is computed in the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore  (20)
in which the scores for the common terms and for the matches of the first letter of each token are given by:
commonTermsScore = commonTerms / max(length(a1), length(a2))  (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2  (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
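A sketch of the metric following equations (20)–(22); the way unmatched tokens are paired up for the first-letter check is one possible reading of the description, not necessarily the thesis implementation:

```python
def jfk_distance(a1: str, a2: str) -> float:
    # Tokenize after a minimal normalization (drop periods from initials).
    t1, t2 = a1.replace(".", "").split(), a2.replace(".", "").split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # Pair leftover tokens positionally and count first-letter agreements;
    # this pairing strategy is an assumption of this sketch.
    first_letter_matches = sum(
        1 for x, y in zip(rest1, rest2) if x[0].lower() == y[0].lower())
    non_common = len(rest1) + len(rest2)
    fl_score = ((first_letter_matches / non_common) / 2       # eq. (22)
                if non_common else 0.0)
    return 1 - common_score - fl_score                        # eq. (20)
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this gives 1 − 1/3 − 0.25 ≈ 0.42, well below the Levenshtein and Overlap values quoted above, so the pair can finally cluster.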
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions               1210 questions
Before              P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0      P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric that offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
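Such a per-category choice amounts to a dispatch table from question category to metric. A hypothetical sketch (the table contents and the fallback choice are our assumptions; `logsim` is left as a stand-in for the LogSim metric described earlier):

```python
def levenshtein(a: str, b: str) -> int:
    # standard dynamic-programming edit distance (stand-in for the real metric)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def logsim(a: str, b: str) -> float:
    # placeholder for the LogSim metric
    raise NotImplementedError

# hypothetical category-to-metric table
METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim}

def metric_for(category: str):
    # fall back to Levenshtein for categories without a dedicated metric
    return METRIC_BY_CATEGORY.get(category, levenshtein)
```

The table makes the decision tree data-driven: adding a rule for a new category is one dictionary entry rather than another branch of conditionals.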
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis too much), Table 10 shows the results for every metric, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric/Category     Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance         54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through a) ignoring 'one' to 'three' when written out in words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
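For the filtering problem above, where a candidate answer merely echoes part of the question, one possible rule can be sketched as follows. This is our own illustration, not JustAsk's actual implementation: it drops any candidate whose tokens are all contained in the question.

```python
import re

def filter_question_echoes(question: str, answers: list) -> list:
    """Drop candidates whose tokens all appear in the question,
    e.g. 'John Famalaro' for 'Who is John J. Famalaro accused of
    having killed?'."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    kept = []
    for ans in answers:
        a_tokens = set(re.findall(r"\w+", ans.lower()))
        # keep an answer unless every one of its tokens echoes the question
        if not a_tokens or not a_tokens <= q_tokens:
            kept.append(ans)
    return kept
```

A real rule set would likely need to be gentler (e.g. allow partial overlap for list questions), but the subset test captures the pattern observed in the error analysis.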
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact over this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 could help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus they used the British National Corpus (http://www.natcorp.ox.ac.uk), a collection of 100 million words that gathers both spoken-language and written samples.
The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 × ( Σ_{w∈T1} maxsim(w, T2) · idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxsim(w, T1) · idf(w) / Σ_{w∈T2} idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxsim(w, T2), and then apply the specificity via idf(w); these weighted matches are summed and normalized by the total idf of the text segment in question. The same process is applied to T2, and in the end we generate an average between T1 to T2 and vice versa. They offer us eight different similarity measures to calculate maxsim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.
– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.
– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.
– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.
– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).
– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have a high Information Content.
– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.
– Jcn – uses the same information as Lin, but combined in a different way.
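Equation 6 can be sketched as follows, with `maxsim` and `idf` passed in as functions, so any of the eight measures above could be plugged in for `maxsim` (the stand-ins used in the usage note are toy functions of our own):

```python
def text_similarity(t1, t2, maxsim, idf):
    """Sketch of Equation 6: average of the two directional,
    idf-weighted best-match scores between token lists t1 and t2."""
    def directional(src, tgt):
        # for each word in src, take its best match in tgt, weighted by idf
        num = sum(maxsim(w, tgt) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den
    return 0.5 * (directional(t1, t2) + directional(t2, t1))
```

For instance, with a binary `maxsim` (1.0 if the word occurs in the other segment, 0.0 otherwise) and a constant `idf`, the formula degenerates into a simple word-overlap average, which makes the role of the two plug-in functions visible.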
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.
– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.
– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| |b|)   (7)
with W being a semantic similarity matrix containing information about the similarity level of a word pair – in other words, a number between 0 and 1 for each W_ij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three metrics.
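A sketch of Equation 7, assuming W is given as a dense list-of-lists; building W from the WordNet metrics is the hard part and is omitted here, so this is an illustration of the formula only:

```python
import math

def matrix_similarity(a, b, W):
    """Equation 7: sim(a, b) = (a . W . b) / (|a| |b|),
    with a and b binary occurrence vectors over the word pool."""
    n = len(a)
    num = sum(a[i] * W[i][j] * b[j] for i in range(n) for j in range(n))
    # for binary vectors, the Euclidean norm is sqrt of the number of 1s
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return num / norm
```

With W as the identity matrix the measure reduces to plain cosine similarity of the binary vectors; the off-diagonal entries are what let near-synonyms contribute.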
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies than that of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
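A toy instantiation of Equation 8 over sets of dependency tuples can make the ratio concrete. This is an illustrative reduction of our own: the actual approach weights each dependency by its type and tree depth, whereas here every dependency counts equally and a +1 in the denominator avoids division by zero for identical sets.

```python
def paraphrase_score(deps1: set, deps2: set) -> float:
    """Equation 8 on (head, label, modifier) dependency sets:
    Sim = shared dependencies, Diss = unpaired ones (+1)."""
    sim = len(deps1 & deps2)
    diss = len(deps1 ^ deps2) + 1
    return sim / diss

def is_paraphrase(deps1: set, deps2: set, threshold: float = 1.0) -> bool:
    # sentences are paraphrases when the ratio exceeds the tuned threshold T
    return paraphrase_score(deps1, deps2) > threshold
```

The threshold value here is an arbitrary placeholder; in the original work T is fitted on training data.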
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, we saw several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (http://www.inesc-id.pt) that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the Answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask in this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers; the latter with the normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric-type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
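The date normalization can be sketched as follows. This is an illustrative reconstruction of the canonical form described above, not JustAsk's actual code, and it only handles the simple "day month year" patterns of the example:

```python
import re

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9, "october": 10,
          "november": 11, "december": 12}

def normalize_date(answer: str) -> str:
    """Map date strings onto the canonical 'D<day> M<month> Y<year>' form."""
    text = answer.lower().replace(",", "")
    day = month = year = None
    for tok in text.split():
        if tok in MONTHS:
            month = MONTHS[tok]
        elif re.fullmatch(r"\d{4}", tok):
            year = int(tok)
        elif re.fullmatch(r"\d{1,2}", tok):
            day = int(tok)
    return f"D{day} M{month} Y{year}"
```

Both orderings of the example collapse to the same string, which is exactly what lets the clusterer treat them as one answer.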
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup

In this section we describe the corpora used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved     (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus     (11)

F-measure (F1):

F = 2 × Precision × Recall / (Precision + Recall)     (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus     (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = (1/N) Σ_{i=1..N} 1/rank_i     (14)
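For illustration, the five measures can be computed directly from the relevant counts (toy helper functions of our own; the numbers plugged in below are the baseline counts reported in Table 1):

```python
# Assumed helpers mirroring Eqs. 10-14; not part of JustAsk itself.
def precision(correct, answered):        # Eq. 10
    return correct / answered

def recall(answered, total):             # Eq. 11
    return answered / total

def f_measure(p, r):                     # Eq. 12
    return 2 * p * r / (p + r)

def accuracy(correct, total):            # Eq. 13
    return correct / total

def mrr(first_correct_ranks):            # Eq. 14; rank 0 = no correct answer
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

p = precision(88, 175)    # 88 correct out of 175 answered -> ~0.503
r = recall(175, 200)      # 175 answered out of 200 questions -> 0.875
a = accuracy(88, 200)     # -> 0.44
```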
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also contains the correct answer for each question, searched in the Los Angeles Times of 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found in the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu – Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one at a time, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content that has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
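The core loop of this application can be sketched as follows (a Python sketch; the function names, data shapes and the injectable `read_input` hook are our assumptions, not the actual script):

```python
# Sketch of the reference-collection loop: for each question, show each
# snippet and accept a typed reference only if it occurs in the snippet.
def collect_references(questions, snippets_by_question, read_input=input):
    references = {}                       # question id -> list of references
    for qid, question in enumerate(questions):
        references[qid] = []
        for snippet in snippets_by_question.get(qid, []):
            answer = read_input(f"{question}\n{snippet}\n> ").strip()
            if answer and answer in snippet:   # must be part of the content
                references[qid].append(answer)
    return references

refs = collect_references(
    ["What is the capital of Somalia?"],
    {0: ["Mogadishu is the largest city in Somalia and the nation's capital."]},
    read_input=lambda prompt: "Mogadishu")   # simulate the user typing
```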
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost referent: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referents, because each question references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carries topic information that is often not explicit in the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.
We also wondered whether the big problem lay in the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see in the next section.
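The intuition behind this change can be illustrated with a toy sketch (plain keyword counting of our own; the actual system relies on Lucene's indexing and ranking):

```python
import re

# Return only the best-matching paragraph of a document instead of the
# whole document, scoring paragraphs by query-keyword occurrences.
def best_paragraph(document, keywords):
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    def score(paragraph):
        words = re.findall(r"\w+", paragraph.lower())
        return sum(words.count(k.lower()) for k in keywords)
    return max(paragraphs, key=score)

doc = ("Washington pressed Iraq on sanctions. Washington officials said...\n\n"
       "Baghdad, the capital of Iraq, was quiet on Tuesday.")
passage = best_paragraph(doc, ["capital", "Iraq"])   # picks the second paragraph
```

Document-level retrieval could surface the first, off-topic portion; paragraph-level scoring isolates the passage that actually mentions the answer.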
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200   Correct: 88   Wrong: 87   No Answer: 25

Accuracy: 44.0%   Precision: 50.3%   Recall: 87.5%   MRR: 0.37   Precision@1: 35.4%

Table 1: JustAsk baseline: best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final correct answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
                 Google – 8 passages        Google – 32 passages
Extracted        Questions  CAE      AS     Questions  CAE      AS
Answers (#)                 success  success           success  success
0                21         0        –      19         0        –
1 to 10          73         62       53     16         11       11
11 to 20         32         29       22     13         9        7
21 to 40         11         10       8      61         57       39
41 to 60         1          1        1      30         25       17
61 to 100        0          –        –      18         14       11
> 100            0          –        –      5          4        3
All              138        102      84     162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies show no concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                 Google – 64 passages
Extracted        Questions  CAE      AS
Answers (#)                 success  success
0                17         0        0
1 to 10          18         9        9
11 to 20         2          2        2
21 to 40         16         12       8
41 to 60         42         38       25
61 to 100        38         32       18
> 100            35         30       17
All              168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
            Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total  Q    CAE   AS          Q    CAE   AS           Q    CAE   AS
abb  4      3    1     1           3    1     1            4    1     1
des  9      6    4     4           6    4     4            6    4     4
ent  21     15   1     1           16   1     0            17   1     0
hum  64     46   42    34          55   48    36           56   48    31
loc  39     27   24    19          34   27    22           35   29    21
num  63     41   30    25          48   35    25           50   40    22
     200    138  102   84          162  120   88           168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions but, at the same time, gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more attention to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we currently have (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates of categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy'), out of three words for each answer: LogSim gives approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens differ, should we reject the pair? The answer is likely to be 'yes'.
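The proposed check might be sketched as follows (our reading of the idea above, with a strict positional alignment of tokens; thresholds and looser alignments are left open, as discussed):

```python
# Token-level entity matching: same number of tokens, at least one shared
# token, and matching initials for the remaining tokens, in order.
def same_entity(a, b):
    ta = a.replace(".", " ").split()
    tb = b.replace(".", " ").split()
    if len(ta) != len(tb):
        return False
    shared = any(x.lower() == y.lower() for x, y in zip(ta, tb))
    initials_match = all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))
    return shared and initials_match

same_entity("J. F. Kennedy", "John Fitzgerald Kennedy")   # True
same_entity("Mt Kilimanjaro", "Mount Kilimanjaro")        # True
```

Pairs like 'Queen Elizabeth' vs. 'Elizabeth II' would still need the hand-written abbreviation rules to fire first, since no aligned token matches there.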
3. Places/dates/numbers that differ in specificity – some candidates of the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other) but, instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern (Upper Bavaria)'

'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even similar names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: Gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
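For illustration, the standard (American) Soundex algorithm does collapse this pair into the same code (a sketch of our own; in practice a library implementation would be used):

```python
# Standard Soundex: keep the first letter, encode the rest as digits,
# dropping vowels and collapsing adjacent duplicates; pad to 4 characters.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name):
    name = name.lower()
    prev = CODES.get(name[0], "")
    digits = []
    for ch in name[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":        # 'h' and 'w' do not reset the previous code
            prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

soundex("Glomma"), soundex("Glama")    # both 'G450'
```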
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend abbreviation normalization to include acronyms.
In this section we have looked at several points for improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we try to solve the potential problems detected during the error analysis (see Section 3.4) through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do in the first part of this thesis – a module in JustAsk to detect relations between answers and, from there, use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all, for all the corpora.
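The scoring scheme of this first experiment can be sketched as follows (the weights come from the text above; the data layout and names are ours, not the actual Scorer interface):

```python
# Boost an answer's score with the relations it shares with other answers:
# frequency and equivalence weigh 1.0, inclusion weighs 0.5.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boosted_score(base_score, relations):
    """relations: list of (relation_type, related_answer) pairs."""
    return base_score + sum(WEIGHTS.get(rtype, 0.0) for rtype, _ in relations)

score = boosted_score(1.0, [("equivalence", "Elizabeth II"),
                            ("inclusion", "Queen Elizabeth II of England")])
# 1.0 + 1.0 + 0.5 = 2.5
```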
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: a Baseline Scorer, to count the frequency, and a Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
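Such a converter might look like the following (a sketch with assumed regexes, rounding and canonical units, covering three of the conversions mentioned):

```python
import re

# Normalize measures to canonical units: km -> miles, kg -> pounds,
# Celsius -> Fahrenheit. Anything unrecognized is returned unchanged.
def normalize_measure(answer):
    m = re.match(r"([\d.]+)\s*(km|kilometres?|kg|kilograms?|°?C)\b", answer)
    if not m:
        return answer
    value, unit = float(m.group(1)), m.group(2)
    if unit.startswith("k") and "g" not in unit:       # kilometres -> miles
        return f"{value * 0.621371:.1f} miles"
    if "g" in unit:                                    # kilograms -> pounds
        return f"{value * 2.20462:.1f} pounds"
    return f"{value * 9 / 5 + 32:.1f} F"               # Celsius -> Fahrenheit

normalize_measure("300 km")    # '186.4 miles'
```

With this in place, '300 km' and '186.4 miles' would collapse to the same canonical form before clustering.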
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not remove. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, the lesson stands: we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system's results, but it also only covers very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
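Our reading of this feature, sketched with character offsets (the exact distance computation in JustAsk may differ; 20 and 5 are the values stated above):

```python
# Boost an answer's score when its average character distance to the
# query keywords found in the passage is below a threshold.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    pos = passage.find(answer)
    if pos < 0:
        return 0
    gaps = []
    for kw in keywords:
        k = passage.find(kw)
        if k >= 0:
            # character gap between the answer span and the keyword span
            gaps.append(max(k - (pos + len(answer)), pos - (k + len(kw)), 0))
    if gaps and sum(gaps) / len(gaps) < threshold:
        return boost
    return 0

bonus = proximity_boost("The capital of Somalia is Mogadishu.",
                        "Mogadishu", ["capital", "Somalia"])   # -> 5
```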
4.3 Results
                   200 questions                1210 questions
Before features    P 50.6%  R 87.0%  A 44.0%    P 28.0%  R 79.5%  A 22.2%
After features     P 51.1%  R 87.0%  A 44.5%    P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
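The annotation loop just described could be sketched along these lines (a hypothetical sketch, not the actual application; the in-memory input format and the label names are assumptions, except for the output line format, which follows the example above):

```python
# Hypothetical sketch of the relation-annotation loop: for each question with
# two or more references, ask the user to label every pair, allow 'skip',
# and re-ask on invalid input.
from itertools import combinations

LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(questions: dict[int, list[str]], ask=input) -> list[str]:
    lines = []
    for qid, refs in questions.items():
        if len(refs) < 2:          # single reference: advance to the next question
            continue
        for a, b in combinations(refs, 2):
            cmd = ask(f"Q{qid}: '{a}' vs '{b}' [E/I1/I2/A/C/D/skip] ")
            while cmd not in LABELS and cmd != "skip":
                cmd = ask("Please repeat the judgment: ")
            if cmd == "skip":
                continue
            lines.append(f"QUESTION {qid} {LABELS[cmd]} - {a} AND {b}")
    return lines
```

Passing a function as `ask` keeps the loop testable without an interactive terminal.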
Figure 6: Execution of the application to create relations' corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

    J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements in common by the number of distinct elements across the two sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

    Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's Coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union holds seven distinct bigrams.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – is an extension of the Jaro metric. This distance is given by:

    dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

    D(i, j) = min{ D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G }    (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – is a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
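The classic American Soundex coding mentioned above can be sketched compactly (for illustration; the thesis uses the SimMetrics implementation, whose exact rule variant may differ):

```python
# Classic American Soundex: keep the first letter, encode the rest as digits,
# collapse adjacent equal codes, drop vowels, pad/truncate to four characters.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"),
         "r": "6"}

def soundex(word: str) -> str:
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":          # 'h'/'w' do not reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]
```

Misspellings that preserve the sound, such as 'Robert'/'Rupert', receive the same code, which is exactly the behaviour the text above hopes to exploit for misspelled answers.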
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
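The single-link clustering step with a pluggable distance metric can be sketched as follows (a simplified stand-in for JustAsk's clusterer; the normalized Levenshtein shown is just one choice of metric):

```python
# Single-link agglomerative clustering of candidate answers: two answers end up
# in the same cluster if some chain of pairwise distances stays under the threshold.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_lev(a: str, b: str) -> float:
    return levenshtein(a, b) / max(len(a), len(b))

def single_link(answers: list[str], threshold: float, dist=norm_lev) -> list[set[str]]:
    clusters = [{a} for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: the minimum pairwise distance between two clusters
                if min(dist(x, y) for x in clusters[i] for y in clusters[j]) < threshold:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With a 0.33 threshold, near-identical answers like 'lisbon'/'lisboa' merge, while unrelated ones stay apart.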
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold    0.17                    0.33                    0.5
Levenshtein           P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index         P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice Coefficient      P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                  P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler          P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

    distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

    commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
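The description above can be turned into a short sketch. Note that the tokenization and the greedy pairing of unmatched tokens are assumptions on our part; the thesis does not pin those details down:

```python
# Sketch of 'JFKDistance' as defined by equations (20)-(22).
# Normalization (lowercasing, stripping punctuation) is an assumed detail.
import re

def tokens(answer: str) -> list[str]:
    return re.sub(r"[^\w\s]", " ", answer.lower()).split()

def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = tokens(a1), tokens(a2)
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))            # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # Greedily pair unmatched tokens that share a first letter ('J' ~ 'John').
    matches, used = 0, set()
    for x in rest1:
        for j, y in enumerate(rest2):
            if j not in used and x[0] == y[0]:
                matches, used = matches + 1, used | {j}
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_score - fl_score                            # eq. (20)
```

For the running example, commonTermsScore = 1/3 and firstLetterMatchesScore = (2/4)/2 = 1/4, giving a distance of 1 − 1/3 − 1/4 = 5/12, low enough to cluster the two variants together.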
A few flaws in this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                      200 questions              1210 questions
Before                P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric      P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
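Such a decision rule amounts to a simple lookup from category to metric. A minimal sketch follows (the mapping shown is purely illustrative – it is not the outcome of the evaluation, which, as discussed below, ended up favouring a single metric):

```python
# Hypothetical per-category metric dispatch; the mapping below is
# illustrative only, not a result from the thesis.
METRIC_BY_CATEGORY = {"Human": "levenshtein", "Location": "logsim"}

def pick_metric(category: str, default: str = "levenshtein") -> str:
    # Fall back to the overall best metric for categories not listed.
    return METRIC_BY_CATEGORY.get(category, default)
```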
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best result of the system in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are easily prone to failure, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
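The third point above (answers that merely echo part of the question) could be handled with a filter along these lines (a sketch under our own assumptions; the rule set actually needed may be richer, and the candidate names below are placeholders):

```python
# Drop candidate answers whose tokens are all contained in the question,
# so that variations of the question itself are never returned as answers.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def filter_question_echoes(question: str, candidates: list[str]) -> list[str]:
    q_tokens = _tokens(question)
    return [c for c in candidates if not _tokens(c) <= q_tokens]
```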
After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and paraphrase identification, in order to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN pages 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
specific node two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.
– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have a high Information Content;

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input;

– Jcn – uses the same information as Lin, but combined in a different way.
Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended, for instance, to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'). They measure similarity using variations of features such as:
– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion;

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource;

– Cosine measure – by using Relevant Domains instead of word frequency.
Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].
Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2, we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

    sim(a, b) = (a · W · b) / (|a| |b|)    (7)
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with a better precision rating than those three metrics.
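The vector-matrix-vector computation in equation (7) can be sketched in a few lines (pure Python, for illustration; with W as the identity matrix the measure reduces to the plain cosine of the two binary vectors):

```python
# Sketch of equation (7): sim(a, b) = (a · W · b) / (|a| |b|),
# with a and b binary word-occurrence vectors and W a word-similarity matrix.
from math import sqrt

def matrix_similarity(a: list[int], b: list[int], W: list[list[float]]) -> float:
    awb = sum(a[i] * W[i][j] * b[j]
              for i in range(len(a)) for j in range(len(b)))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return awb / norm
```

In the actual approach, the entries of W would come from the WordNet metrics listed above; the identity matrix is only a degenerate test case.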
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
2.3 Overview
Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential for being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt
Question analysis includes tasks such as transforming the question into a setof tokens part-of-speech tagging and identification of a questions headwordie a word in the given question that represents the information we are lookingfor Question Classification is also a subtask in this module In JustAsk Ques-tion Classification uses Li and Rothrsquos [20] taxonomy with 6 main categories(Abbreviation Description Entity Human Location and Numeric)plus 50 more detailed classes The framework allows the usage of different clas-sification techniques hand-built rules and machine-learning JustAsk alreadyattains state of the art results in question classification [39]
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
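A minimal sketch of the keyword-based strategy (the stop-word list and the `keyword_query` helper are illustrative, not JustAsk's actual code):

```python
# Keyword-based query formulation sketch for factoid questions: drop stop
# words and punctuation, keep content words (stop-word list is illustrative).
import re

STOP_WORDS = {"the", "a", "an", "is", "was", "were", "did", "do", "does",
              "of", "in", "to", "when", "where", "who", "what", "which", "how"}

def keyword_query(question):
    tokens = re.findall(r"\w+", question.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(keyword_query("When was Mozart born?"))  # mozart born
```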
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
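This date normalization step can be sketched as follows (the patterns and the `normalize_date` helper are hypothetical, covering only the two formats exemplified above):

```python
# Sketch of date normalisation to the canonical 'D<day> M<month> Y<year>'
# form used above (illustrative patterns, not JustAsk's actual code).
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(text):
    t = text.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", t)      # 'March 21 1961'
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", t)      # '21 March 1961'
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    return text  # leave unrecognised forms untouched

print(normalize_date("March 21, 1961"))  # D21 M3 Y1961
```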
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|) (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned earlier, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
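A minimal sketch of the overlap distance and the threshold-based single-link clustering just described (simplified: token sets and uniform scores; the threshold value is illustrative):

```python
# Sketch of the overlap distance and single-link clustering described above.

def overlap(x, y):
    """Overlap distance between two candidate answers (token sets)."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xs & ys) / min(len(xs), len(ys))

def single_link(answers, threshold=0.5):
    """Merge the two closest clusters while their distance is under threshold."""
    clusters = [[a] for a in answers]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(overlap(a, b) for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)

clusters = single_link(["Lisbon", "Lisbon Portugal", "Paris"])
# 'Lisbon' and 'Lisbon Portugal' merge (overlap distance 0.0); 'Paris' stays apart
```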
3.2 Experimental Setup
In this section, we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus (11)

F-measure (F1):

F = (2 * Precision * Recall) / (Precision + Recall) (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) Σ_{i=1}^{N} 1/rank_i (14)
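These measures translate directly into code; a hedged sketch, using JustAsk's 200-question counts from the baseline experiments (88 correct out of 175 answered) as a sanity check:

```python
# The evaluation measures above, as straightforward functions (sketch
# following Equations 10-14; function names are our own).

def precision(correct, answered):      # answered = non-null items retrieved
    return correct / answered

def recall(answered, total):           # fraction of questions attempted
    return answered / total

def f1(p, r):
    return 2 * p * r / (p + r)

def accuracy(correct, total):
    return correct / total

def mrr(first_correct_ranks):          # rank 0 (or None) = no correct answer
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

# JustAsk baseline counts: 88 correct, 175 answered, 200 questions
print(round(precision(88, 175), 3), recall(175, 200), accuracy(88, 200))
# 0.503 0.875 0.44
```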
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input, and then seeing if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found through each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu
Title=Mogadishu - Wikipedia, the free encyclopedia
Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...
Rank=1
Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each one of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping to construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and, in addition to the 200 test questions, we found a different set at TREC, which provided a set of 1768 questions, which we filtered to extract the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources, according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
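This pronoun filter can be sketched in a few lines (the word list is the one given above; the `is_self_contained` helper name is our own):

```python
# Sketch of the pronoun filter used to discard context-dependent TREC
# questions (word list taken from the text above).
import re

PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def is_self_contained(question):
    tokens = re.findall(r"[a-z']+", question.lower())
    return not any(t in PRONOUNS for t in tokens)

print(is_self_contained("When did she die?"))            # False
print(is_self_contained("Where was the festival held?")) # True
```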
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions
In Tables 2 and 3, we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                    Google – 8 passages        Google – 32 passages
Extracted Answers   Questions  CAE   AS        Questions  CAE   AS
0                      21       0    –            19       0    –
1 to 10                73      62    53           16      11    11
11 to 20               32      29    22           13       9     7
21 to 40               11      10     8           61      57    39
41 to 60                1       1     1           30      25    17
61 to 100               0       –     –           18      14    11
> 100                   0       –     –            5       4     3
All                   138     102    84          162     120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4, we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system extracts at most one final answer, no matter how many passages are used.
                    Google – 64 passages
Extracted Answers   Questions  CAE   AS
0                      17       0     0
1 to 10                18       9     9
11 to 20                2       2     2
21 to 40               16      12     8
41 to 60               42      38    25
61 to 100              38      32    18
> 100                  35      30    17
All                   168     123    79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
            Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total    Q   CAE   AS           Q   CAE   AS            Q   CAE   AS
abb     4     3    1     1           3    1     1            4    1     1
des     9     6    4     4           6    4     4            6    4     4
ent    21    15    1     1          16    1     0           17    1     0
hum    64    46   42    34          55   48    36           56   48    31
loc    39    27   24    19          34   27    22           35   29    21
num    63    41   30    25          48   35    25           50   40    22
      200   138  102    84         162  120    88          168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
We found out that the Entity category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision        26.7%           28.2%           27.4%
Recall           77.2%           67.2%           76.9%
Accuracy         20.6%           19.0%           21.1%

Table 5: Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus
Later, we tried to make further experiments with this corpus, by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
             20       30       50       100
Accuracy   21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, this also takes more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage ...'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link (here, 'Kennedy') out of three words for each answer: LogSim gives us approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy', and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject the match? The answer is likely to be 'yes'.
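A hedged sketch of the proposed token-level metric (the honorific list and alignment rules are our own illustrative choices, not a finished design):

```python
# Sketch of the proposed token-level metric: two person-name answers match
# if they share a token and the remaining tokens align by initial.

def normalize(name):
    honorifics = {"mr", "mrs", "queen", "king", "dr"}  # illustrative list
    tokens = [t.strip(".").lower() for t in name.split()]
    return [t for t in tokens if t not in honorifics]

def same_entity(a, b):
    ta, tb = normalize(a), normalize(b)
    if not set(ta) & set(tb):
        return False                     # need at least one full-token match
    short, long_ = sorted([ta, tb], key=len)
    if set(short) <= set(long_):
        return True                      # e.g. 'Clinton' within 'Bill Clinton'
    if len(ta) != len(tb):
        return False
    # every position must match fully or at least share the initial, in order
    return all(x == y or x[0] == y[0] for x, y in zip(ta, tb))

print(same_entity("J. F. Kennedy", "John Fitzgerald Kennedy"))  # True
print(same_entity("Bill Clinton", "Mr. Clinton"))               # True
```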
3. Places/dates/numbers that differ in specificity – some candidates over the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
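Such a normalizer might be sketched as follows (the conversion factors are standard; the unit list and helper name are illustrative):

```python
# Sketch of a measure normaliser that converts distances to a canonical
# unit (kilometres) before comparing candidate answers.

FACTORS_TO_KM = {"km": 1.0, "kilometers": 1.0, "miles": 1.609344,
                 "feet": 0.0003048, "m": 0.001}

def to_km(value, unit):
    return value * FACTORS_TO_KM[unit.lower()]

# '1000 miles' vs. '300 km' from the Okavango example are clearly different:
print(round(to_km(1000, "miles")))  # 1609
print(round(to_km(300, "km")))      # 300
```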
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
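For reference, a simplified Soundex sketch (a variant of the standard algorithm, not JustAsk's code) maps the misspelled pair above, and 'Glomma'/'Glama', to the same codes, while lower-casing alone handles 'Grozny' vs. 'GROZNY':

```python
# Simplified Soundex: first letter kept, remaining consonants mapped to
# digit classes, adjacent duplicate codes collapsed, padded/cut to 4 chars.

CODES = {c: d for d, letters in
         {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
          "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word):
    w = word.lower()
    first = w[0].upper()
    digits = [CODES.get(c, "") for c in w]   # vowels/h/w/y map to ""
    collapsed = [d for i, d in enumerate(digits)
                 if d and (i == 0 or d != digits[i - 1])]
    if digits[0]:                            # drop the first letter's own code
        collapsed = collapsed[1:]
    return (first + "".join(collapsed) + "000")[:4]

print(soundex("Kotashaan"), soundex("Khotashan"))  # K325 K325
```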
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.
Over this section, we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section, we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses a set of relations it shares with other answers – namely the type of relation and the answer it is related to, plus a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
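As an illustration, the scoring scheme just described reduces to a few lines; the function and relation names here are ours, not JustAsk's actual API:

```python
# Hypothetical sketch of the relation-based score boosting: each candidate
# accumulates 1.0 per exact repetition (frequency) or equivalence relation,
# and 0.5 per inclusion relation, as in the first experiment described above.
from collections import defaultdict

WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boost_scores(relations):
    """relations: list of (answer, relation_type) pairs -> answer scores."""
    scores = defaultdict(float)
    for answer, rel in relations:
        scores[answer] += WEIGHTS[rel]
    return dict(scores)
```

Swapping the equivalence and inclusion weights for the Entity rules is then a one-line change to the table, which is how the second experiment above could be run.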
Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
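For illustration, the conversion step can be sketched as follows; the subtype labels and the function signature are assumptions, not JustAsk's real classes (factors: 1 km ≈ 0.621371 mi, 1 m ≈ 3.28084 ft, 1 kg ≈ 2.20462 lb, °F = °C · 9/5 + 32):

```python
# Hypothetical sketch of the unit-conversion normalizer.
def normalize_numeric(value: float, unit: str) -> tuple[float, str]:
    conversions = {
        "km": (lambda v: v * 0.621371, "miles"),   # Numeric:Distance
        "m":  (lambda v: v * 3.28084, "feet"),     # Numeric:Size
        "kg": (lambda v: v * 2.20462, "pounds"),   # Numeric:Weight
        "C":  (lambda v: v * 9 / 5 + 32, "F"),     # Numeric:Temperature
    }
    if unit in conversions:
        convert, target_unit = conversions[unit]
        return convert(value), target_unit
    return value, unit  # unknown units pass through unchanged
```

Normalizing every candidate to one reference unit per subtype means '100 C' and '212 F' end up as the same string before clustering.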
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
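A minimal sketch of this honorific-stripping normalizer, with the exception cases discussed above (the exception list is illustrative, not JustAsk's actual one):

```python
import re

TITLES = ("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr")
EXCEPTIONS = {"Mr T", "Queen", "Mr King"}  # answers kept intact

def strip_titles(answer: str) -> str:
    if answer in EXCEPTIONS:
        return answer
    # remove a leading title (optionally dotted) only when more text follows
    pattern = r"^(?:%s)\.?\s+(?=\S)" % "|".join(TITLES)
    return re.sub(pattern, "", answer)
```

The lookahead `(?=\S)` keeps a bare 'Queen' or 'Sir' untouched, since the title is only stripped when it is a prefix of a longer answer.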
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
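In its simplest form such a normalizer is just a lookup table; the entries below are only the two examples from the text, and a real table would need to be much larger:

```python
# Minimal sketch of the generic acronym normalizer for Location:Country and
# Human:Group answers; the mapping shown is illustrative.
ACRONYMS = {
    "US": "United States",
    "U.S.": "United States",
    "UN": "United Nations",
}

def expand_acronym(answer: str) -> str:
    return ACRONYMS.get(answer.strip(), answer)
```

Running candidates through it before clustering makes 'US' and 'United States of America' at least share a token, which the overlap-based metrics can then pick up.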
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
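A sketch of this heuristic; the 20-character threshold and 5-point boost follow the text, while the use of first occurrences and the exact distance definition are assumptions:

```python
# Hypothetical sketch of the proximity boost: average character distance
# between the answer and each query keyword found in the passage.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    a_pos = passage.find(answer)
    if a_pos < 0:
        return 0
    distances = []
    for kw in keywords:
        k_pos = passage.find(kw)  # first occurrence only, for simplicity
        if k_pos >= 0:
            distances.append(abs(k_pos - a_pos))
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0
```

For the passage "Innsbruck hosted the Winter Olympics in 1964" and the keyword 'Olympics', the answer '1964' sits 12 characters away and earns the boost; an answer 40+ characters away would not.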
4.3 Results
                  200 questions               1210 questions
Before features   P: 50.6  R: 87.0  A: 44.0   P: 28.0  R: 79.5  A: 22.2
After features    P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that were put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt about the relation type.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
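The annotation loop just described can be reimplemented in a few lines; the prompts and data shapes below are assumptions based on the text, not the original script:

```python
# Bare-bones sketch of the relations-corpus annotation tool.
from itertools import combinations

LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(questions, ask=input):
    """questions: list of (question_id, [references]); returns output lines."""
    lines = []
    for qid, refs in questions:
        # a single-reference question yields no pairs and is skipped
        for a, b in combinations(refs, 2):
            cmd = ask(f"Relation between '{a}' and '{b}'? ")
            while cmd not in LABELS and cmd != "skip":
                cmd = ask("Please repeat the judgment: ")
            if cmd == "skip":
                continue
            lines.append(f"QUESTION {qid} {LABELS[cmd]} -{a} AND {b}")
    return lines
```

Passing a stub for `ask` makes the loop testable without a terminal, which is also how the output format above ('QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee') can be reproduced.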
Figure 6: Execution of the application to create the relations corpus in the command line.
Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of characters in common by the size of the union of the two sequences' character sets.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five shared bigrams over a union of seven).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)
The package contained other metrics of particular interest, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
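Assuming the textbook formulations just given (rather than the SimMetrics library implementations), a few of these measures can be sketched directly from equations (15)–(18); the function names are ours:

```python
def bigrams(s):
    """Set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(s1, s2):
    A, B = bigrams(s1), bigrams(s2)
    return len(A & B) / len(A | B)

def dice(s1, s2):
    A, B = bigrams(s1), bigrams(s2)
    return 2 * len(A & B) / (len(A) + len(B))

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched = []                      # matching characters of s1, in order
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched.append(c)
                break
    m = len(matched)
    if m == 0:
        return 0.0
    other = [s2[j] for j in range(len(s2)) if used[j]]
    t = sum(a != b for a, b in zip(matched, other)) // 2   # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    l = 0                             # common-prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)
```

For instance, `jaro("MARTHA", "MARHTA")` evaluates to 17/18 ≈ 0.944 (m = 6, t = 1), and the shared 'MAR' prefix lifts the Jaro-Winkler score to ≈ 0.961.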
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                      0.33                      0.5
Levenshtein          P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap              P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index        P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Dice Coefficient     P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Jaro                 P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler         P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Needleman-Wunsch     P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim               P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex              P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman       P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A: 8.0    P: 10.4 R: 57.5 A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and, therefore, will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim, also using a tokenizer to break the strings into sets of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight to strings that not only share at least one perfect match, but also have another token sharing the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
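A sketch of equations (20)–(22); one point the text leaves open is whether length() counts characters or tokens, so tokens are assumed here, and the greedy first-letter pairing is our reading of the description:

```python
# Hypothetical sketch of the JFKDistance measure described above.
def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # greedy first-letter matching among the leftover tokens
    matches, used = 0, set()
    for x in rest1:
        for j, y in enumerate(rest2):
            if j not in used and x[0].lower() == y[0].lower():
                matches += 1
                used.add(j)
                break
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    non_common = len(rest1) + len(rest2)
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_score - first_letter_score             # eq. (20)
```

For 'J F Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − (2/4)/2 ≈ 0.42, noticeably below the ≈ 0.5 Levenshtein gives and the 2/3 from Overlap, which is what allows the pair to be clustered.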
A few flaws in this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                  200 questions               1210 questions
Before            P: 51.1  R: 87.0  A: 44.5   P: 28.2  R: 79.8  A: 22.5
After new metric  P: 51.7  R: 87.0  A: 45.0   P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
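The decision tree sketched above amounts to a simple dispatch table; the mapping below is the hypothetical example from the text, not a tuned configuration:

```python
# Illustrative sketch of per-category metric selection.
CATEGORY_METRIC = {
    "Human": "Levenshtein",
    "Location": "LogSim",
}

def pick_metric(category: str, default: str = "Levenshtein") -> str:
    """Return the name of the distance metric to use for a question category."""
    return CATEGORY_METRIC.get(category, default)
```

The clusterer would then look up the metric once per question, after the question classifier assigns the category, and fall back to the overall best metric for categories without an entry.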
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, relying on retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category   Human   Location   Numeric
Levenshtein         50.00   53.85      4.84
Overlap             42.42   41.03      4.84
Jaccard index       7.58    0.00       0.00
Dice                7.58    0.00       0.00
Jaro                9.09    0.00       0.00
Jaro-Winkler        9.09    0.00       0.00
Needleman-Wunsch    42.42   46.15      4.84
LogSim              42.42   46.15      4.84
Soundex             42.42   46.15      3.23
Smith-Waterman      9.09    2.56       0.00
JFKDistance         54.55   53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system bolded, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking best metrics per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected is in the reference file, which presents all the possible 'correct' answers and is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the scores of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three metrics.
More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2) (8)
If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases: Lintean and Rus decided to add a constant value (the value 14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies over the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
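This scoring scheme can be sketched in a few lines. The sketch below is an unweighted simplification over plain dependency sets: the real approach weights each dependency by its type/label and depth and uses WordNet word-to-word similarity, and the threshold value here is arbitrary.

```python
def paraphrase_score(deps1, deps2, threshold=0.5):
    """Sketch of the Lintean and Rus paraphrase score: the ratio between
    similarity and dissimilarity, computed here over plain dependency sets."""
    common = deps1 & deps2
    unpaired1, unpaired2 = deps1 - deps2, deps2 - deps1
    sim = len(common)                       # stands in for Sim(S1, S2)
    diss = len(unpaired1) + len(unpaired2)  # stands in for Diss(S1, S2)
    # penalty (the constant 14) when the longer sentence has more than
    # 6 extra unpaired dependencies: no two-way entailment
    if abs(len(unpaired1) - len(unpaired2)) > 6:
        diss += 14
    score = sim / diss if diss else float("inf")
    return score, score > threshold
```

With identical dependency sets, the dissimilarity is zero and the pair is trivially classified as a paraphrase.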
2.3 Overview
Thanks to the work initiated by Webber et al. [44], we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,
expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential of being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). To these, several interesting features can be attached, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
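As an illustration of the kind of hand-built rule the framework supports, a toy classifier over the six coarse categories might look as follows. The rules below are purely illustrative assumptions; JustAsk combines much richer rules with machine-learned classifiers.

```python
def classify_question(question):
    """Toy hand-built-rule classifier over Li and Roth's six coarse
    categories (illustrative only)."""
    q = question.lower()
    if q.startswith("who"):
        return "HUMAN"
    if q.startswith("where"):
        return "LOCATION"
    if q.startswith(("when", "how many", "how much", "how long")):
        return "NUMERIC"
    if q.startswith("what is") and len(q.split()) <= 5:
        return "DESCRIPTION"  # short 'What is X?' definition questions
    return "ENTITY"
```

A real rule set would also inspect the headword and part-of-speech tags produced by question analysis, rather than only the leading wh-word.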
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of the article, which is returned as the answer.
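The keyword-based strategy can be sketched as follows. The stopword list is an assumption made for illustration; JustAsk's actual formulators are category-specific.

```python
# illustrative stopword list; a real formulator would use a curated one
STOPWORDS = {"what", "who", "when", "where", "which", "how", "is", "was",
             "the", "a", "an", "of", "did", "do", "does"}

def keyword_query(question):
    """Keyword-based query formulation: keep the content words of the
    question and drop function words."""
    return " ".join(t for t in question.lower().rstrip("?").split()
                    if t not in STOPWORDS)
```

For 'When was Mozart born?', this yields the query 'mozart born', which is then sent to the search engine.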
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
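A hypothetical re-implementation of this date normalization might look like the sketch below; the exact rules JustAsk uses are not shown here, only the canonical target form from the example above.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalize_date(text):
    """Map date variants such as 'March 21, 1961' or '21 March 1961'
    to the canonical form 'D21 M3 Y1961' (hypothetical sketch)."""
    # 'Month DD, YYYY'
    m = re.search(r"(\w+)\s+(\d{1,2}),?\s+(\d{4})", text)
    if m and m.group(1).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(2), MONTHS[m.group(1).lower()], m.group(3))
    # 'DD Month YYYY'
    m = re.search(r"(\d{1,2})\s+(\w+)\s+(\d{4})", text)
    if m and m.group(2).lower() in MONTHS:
        return "D%s M%d Y%s" % (m.group(1), MONTHS[m.group(2).lower()], m.group(3))
    return text  # leave unrecognized strings untouched
```

Both surface forms now collapse to one string, so exact comparison suffices during clustering.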
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|) (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
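The overlap distance and the single-link procedure just described can be sketched as follows; this is a token-level simplification that omits cluster scoring and representatives.

```python
def overlap_distance(x, y):
    """Formula (9): 1 - |X ∩ Y| / min(|X|, |Y|); returns 0.0 when one
    tokenized answer is contained in the other."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xs & ys) / min(len(xs), len(ys))

def single_link_clusters(answers, distance, threshold):
    """Standard single-link agglomerative clustering: start with singleton
    clusters and repeatedly merge the two closest ones while their minimum
    pairwise distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break  # no pair of clusters is close enough to merge
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

For instance, 'Mount Everest' and 'Everest' end up in the same cluster (overlap distance 0.0), while an unrelated answer stays in its own cluster.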
3.2 Experimental Setup
Over this section, we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus (11)

F-measure (F1):

F = (2 * Precision * Recall) / (Precision + Recall) (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * Σ_{i=1}^{N} (1/rank_i) (14)
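These measures are straightforward to compute. The sketch below checks them against the baseline numbers reported later in Table 1 (88 correct answers out of 175 answered, 200 questions in total); questions with no correct answer in the ranked list contribute zero to the MRR.

```python
def evaluate(correct, answered, total):
    """Formulas (10)-(13): correct = correctly judged items, answered =
    relevant (non-null) items retrieved, total = items in the test corpus."""
    precision = correct / answered
    recall = answered / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    return precision, recall, f1, accuracy

def mrr(ranks):
    """Formula (14): ranks[i] is the rank of the first correct answer for
    query i, or None when no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)
```

Note that precision divides by the answered questions only, while recall and accuracy divide by the whole test corpus.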
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input, and then we check whether the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1 Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu
Title=Mogadishu - Wikipedia, the free encyclopedia
Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...
Rank=1
Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process of adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word of the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
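The core loop of this reference-collection tool can be sketched as follows. This is a simplified, hypothetical reconstruction: the real application also strips the 'URL' and 'Title' attributes and writes the growing reference list to an output file after each question.

```python
def collect_references(questions, snippets, read_input=input):
    """For each question, show each snippet's content and accept a typed
    reference only if it occurs verbatim in that content."""
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets.get(q, []):
            print(content)
            ref = read_input("reference (blank to skip): ").strip()
            if ref and ref in content and ref not in references[q]:
                references[q].append(ref)
    return references
```

As in the text, any substring of the content is accepted, so the tool trusts the annotator to type an actual answer rather than an arbitrary word.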
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, which we filtered to extract the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system into clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
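This pronoun filter is simple to express; the sketch below covers only the pronoun check, since dead-end questions such as 'How many are there now?' were removed by hand.

```python
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """Flag questions whose tokens include a third-person pronoun with no
    antecedent inside the question itself."""
    tokens = question.lower().rstrip("?").split()
    return any(t in PRONOUNS for t in tokens)
```

Questions flagged by this check were excluded from the 232-question set.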
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution would be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.
We also wondered whether the big problem lay within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph or phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and which ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline

3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision of the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44%       50.3%      87.5%   0.37  35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3, we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages (8, 32 or 64) retrieved from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
            Google – 8 passages        Google – 32 passages
Extracted   Questions  CAE    AS       Questions  CAE    AS
Answers     (#)        succ.  succ.    (#)        succ.  succ.
0           21         0      –        19         0      –
1 to 10     73         62     53       16         11     11
11 to 20    32         29     22       13         9      7
21 to 40    11         10     8        61         57     39
41 to 60    1          1      1        30         25     17
61 to 100   0          –      –        18         14     11
> 100       0          –      –        5          4      3
All         138        102    84       162        120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4, we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
            Google – 64 passages
Extracted   Questions  CAE      AS
Answers     (#)        success  success
0           17         0        0
1 to 10     18         9        9
11 to 20    2          2        2
21 to 40    16         12       8
41 to 60    42         38       25
61 to 100   38         32       18
> 100       35         30       17
All         168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.
      Google – 8 passages    Google – 32 passages   Google – 64 passages
      Total  Q    CAE  AS    Q    CAE  AS           Q    CAE  AS
abb   4      3    1    1     3    1    1            4    1    1
des   9      6    4    4     6    4    4            6    4    4
ent   21     15   1    1     16   1    0            17   1    0
hum   64     46   42   34    55   48   36           56   48   31
loc   39     27   24   19    34   27   22           35   29   21
num   63     41   30   25    48   35   25           50   40   22
All   200    138  102  84    162  120  88           168  123  79

Table 4: Answer extraction intrinsic evaluation, for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results.
The 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions  232 questions  1210 questions
Precision  26.7%           28.2%          27.4%
Recall     77.2%           67.2%          76.9%
Accuracy   20.6%           19.0%          21.1%

Table 5: Baseline results for the three sets of corpora created using the TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus, by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
           20     30     50     100
Accuracy   21.1%  22.4%  23.7%  23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long to process as 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
- 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
- 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer, by a certain predefined threshold, to the keywords used in the search.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
- 'Queen Elizabeth' vs. 'Elizabeth II'
- 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
- 'River Glomma' vs. 'Glomma'
- 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
- 'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer: LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy', and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
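The proposed token-level metric can be sketched as follows. The title list and the tie-breaking choices are assumptions made for illustration; a real implementation would also need the threshold discussed above.

```python
# illustrative title/abbreviation list for the normalization rules
TITLES = {"mr", "mrs", "ms", "dr", "queen", "king", "president"}

def same_entity(a, b):
    """Token-level matching for proper names: strip titles, require at
    least one shared token, and align the remaining tokens by initials."""
    ta = [t.strip(".").lower() for t in a.split()
          if t.strip(".").lower() not in TITLES]
    tb = [t.strip(".").lower() for t in b.split()
          if t.strip(".").lower() not in TITLES]
    if not set(ta) & set(tb):
        return False            # no common token at all
    if len(ta) != len(tb):
        return True             # e.g. 'Calvin Broadus' vs. 'Broadus'
    # same length: require matching initials, in order ('J. F. Kennedy'
    # vs. 'John Fitzgerald Kennedy')
    return all(x[0] == y[0] for x, y in zip(ta, tb))
```

Note that the last rule rejects pairs like 'Robert Kennedy' vs. 'John Kennedy', where the shared surname alone would be a false positive.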
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern Upper Bavaria'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (and being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – the same entity may appear under completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
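For illustration, a minimal sketch of the classic American Soundex coding (the system uses the SimMetrics implementation; this sketch only shows why 'Glomma' and 'Glama' receive the same code):

```python
def soundex(name):
    # classic American Soundex: keep the first letter, encode the rest as digits
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    name = name.lower()
    digits, prev = [], code.get(name[0], "")
    for c in name[1:]:
        d = code.get(c, "")
        if d and d != prev:      # skip vowels and adjacent repeats of a code
            digits.append(d)
        if c not in "hw":        # 'h' and 'w' do not separate same-coded letters
            prev = d
    return (name[0].upper() + "".join(digits) + "000")[:4]
```

Both 'Glomma' and 'Glama' map to the code G450, so a Soundex comparison would put them in the same cluster.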
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, a few remain unresolved, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' and 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module – Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics drawn from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and another for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all over all the corpora.
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
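A minimal sketch of the scoring scheme of the first experiment (a Python rendering with hypothetical names; the actual module is the Java Scorer interface described above, and inclusion is there split into its two directed variants):

```python
# weights for each relation type, as in the first experiment described above
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def score_candidates(answers, relations):
    """answers: list of candidate answer strings (with repetitions);
    relations: list of (type, answer_a, answer_b) triples.
    Returns a score per distinct answer."""
    scores = {}
    for a in answers:
        # baseline scorer: frequency of the exact same answer
        scores[a] = answers.count(a) * WEIGHTS["frequency"]
    for rel, a, b in relations:
        # rules scorer: boost both members of an equivalence/inclusion pair
        scores[a] = scores.get(a, 0) + WEIGHTS[rel]
        scores[b] = scores.get(b, 0) + WEIGHTS[rel]
    return scores
```

For example, with candidates ['Ferrara', 'Italy', 'Ferrara'] and one detected inclusion between 'Ferrara' and 'Italy', 'Ferrara' ends up with score 2.5 and 'Italy' with 1.5.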
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
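A sketch of such a conversion normalizer, under the assumption of standard conversion factors (names like `normalize_measure` are illustrative, not the actual class names):

```python
# target unit and multiplicative factor for each recognised source unit
CONVERSIONS = {
    "km": ("miles", 0.621371),
    "m":  ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
}

def normalize_measure(value, unit):
    """Convert a (value, unit) pair to a canonical unit for its measure."""
    if unit == "celsius":  # temperature is affine, not a simple scaling
        return value * 9 / 5 + 32, "fahrenheit"
    target, factor = CONVERSIONS.get(unit, (unit, 1.0))
    return value * factor, target
```

With all candidate answers mapped to a single unit per category, two answers such as '10 km' and '6.2 miles' become directly comparable at clustering time.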
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we found a particular issue worth noting: we could be removing something we should not remove. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a telling observation that we must not, under any circumstances, underestimate the richness of a natural language.
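The title-stripping normalizer, with an exception list for the problematic cases above, could be sketched as follows (the exception entries are illustrative):

```python
TITLES = {"Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"}
# answers that look like 'title + name' but must be kept whole
EXCEPTIONS = {"Mr T", "Queen"}  # e.g. 'Queen' the band

def strip_title(answer):
    tokens = answer.split()
    # only strip a title occurring right at the beginning of the answer
    if answer in EXCEPTIONS or len(tokens) < 2 or tokens[0] not in TITLES:
        return answer
    return " ".join(tokens[1:])
```

This keeps 'Mr T' and the lone 'Queen' intact while reducing 'Queen Elizabeth' and 'Mr Clinton' to their bare names for clustering.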
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
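The case-folding and acronym normalizers can be sketched together as follows (the acronym table is a small illustrative subset, not the full list used by the system):

```python
ACRONYMS = {"US": "United States", "USA": "United States", "UN": "United Nations"}

def normalize_answer(answer, category=None):
    """Produce a canonical comparison key for a candidate answer."""
    key = answer.strip()
    # acronyms: expand only for the categories where it makes sense
    if category in ("Location:Country", "Human:Group"):
        key = ACRONYMS.get(key.replace(".", "").upper(), key)
    # formatting: compare answers case-insensitively ('Grozny' vs 'GROZNY')
    return key.lower()
```

Clustering on the normalized key then merges 'GROZNY' with 'Grozny', and 'U.S.' with 'United States', without touching the surface forms shown to the user.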
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
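A sketch of this proximity boost, using character offsets and the 20-character/5-point values mentioned above (function and parameter names are hypothetical):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Return the extra score for an answer lying close to the query keywords,
    measured as the average character distance in the passage."""
    pos = passage.find(answer)
    if pos == -1:
        return 0
    distances = [abs(passage.find(k) - pos) for k in keywords if k in passage]
    if not distances:
        return 0
    return boost if sum(distances) / len(distances) < threshold else 0
```

A real implementation would consider every occurrence of the answer and of each keyword; this sketch only uses the first occurrence of each.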
4.3 Results
                   200 questions               1210 questions
Before features    P 50.6  R 87.0  A 44.0      P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, so a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
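The core of the annotation loop can be sketched as follows (the option-to-label mapping mirrors the list above; function names are hypothetical):

```python
LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",   # the first answer includes the second
    "I2": "INCLUSION",   # the second answer includes the first
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def format_relation(question_id, option, answer_a, answer_b):
    """Render one output line of the relations corpus; rejects unknown options
    so that the caller can ask the user to repeat the judgment."""
    if option not in LABELS:
        raise ValueError("unknown option, repeat the judgment")
    return f"QUESTION {question_id} {LABELS[option]} - {answer_a} AND {answer_b}"
```

The driving loop would read each judgment from standard input, retry on a `ValueError`, and append the formatted line to the output file.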
Figure 6: Execution of the application to create relations corpora on the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result is therefore (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven distinct elements.
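Computed over character bigrams, as in the example above (a small illustrative sketch):

```python
def bigrams(s):
    # set of overlapping character bigrams of a string
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)
```

Running both on 'Kennedy' vs 'Keneddy' reproduces the values worked out above: Dice 5/6 and Jaccard 5/7.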
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant cost 1 of Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)
The package contained other metrics of particular interest to test, and the following two measures were also added and tested:

– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike while being spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described over the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference is virtually nonexistent in most cases.
Metric / Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5     P 9.9   R 55.5  A 5.5
Dice coefficient      P 9.8   R 56.0  A 5.5     P 9.9   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A 8.0     P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance and will therefore not group the two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have other tokens sharing their first letter with tokens from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
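A Python sketch of the metric as defined by equations (20)-(22); the actual implementation lives in the Java distances package, and the pairing of leftover tokens is here simplified to an in-order alignment:

```python
def jfk_distance(a1, a2):
    """Distance = 1 - commonTermsScore - firstLetterMatchesScore."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # leftover tokens paired in order; count pairs whose initials match
    matches = sum(1 for x, y in zip(rest1, rest2) if x[0].lower() == y[0].lower())
    non_common = len(rest1) + len(rest2)
    # a first-letter match weighs half of a full token match
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - first_letter_score
```

For 'J F Kennedy' vs 'John Fitzgerald Kennedy' this gives 1 − 1/3 − (2/4)/2 = 5/12 ≈ 0.42, low enough to cluster the two answers where Overlap (2/3) and Levenshtein (around 0.5) would not.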
A few flaws of this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.
                   200 questions               1210 questions
Before             P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0      P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric offering the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
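Such a dispatch could be as simple as the following hypothetical sketch (the mapping shown is illustrative only, since the analysis below ends up not supporting per-category metrics):

```python
# hypothetical per-category metric dispatch table
METRIC_BY_CATEGORY = {"Human": "Levenshtein", "Location": "LogSim"}

def pick_metric(category):
    # fall back to the baseline metric for unlisted categories
    return METRIC_BY_CATEGORY.get(category, "Levenshtein")
```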
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every distance metric used (so they should not much affect the analysis we are looking for), Table 10 shows the results for every metric over the three main categories – Human, Location and Numeric – which can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric             Human    Location   Numeric
Levenshtein        50.00    53.85      4.84
Overlap            42.42    41.03      4.84
Jaccard index       7.58     0.00      0.00
Dice coefficient    7.58     0.00      0.00
Jaro                9.09     0.00      0.00
Jaro-Winkler        9.09     0.00      0.00
Needleman-Wunsch   42.42    46.15      4.84
LogSim             42.42    46.15      4.84
Soundex            42.42    46.15      3.23
Smith-Waterman      9.09     2.56      0.00
JFKDistance        54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, permanent answer; the answer changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when an answer of type Numeric:Date is expected, the system can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules removing these variations from the final answers.
– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only candidates between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks worth trying, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implement solutions to the problems and questions left unanswered over Section 5.4, which can help improve these results further.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14:491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).
We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.
There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Several interesting features are attached to these, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.
For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
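The date normalization step described above can be sketched as follows. This is a minimal illustration of the idea, not JustAsk's actual code; the function name and the fallback behaviour are assumptions.

```python
import re

# Map month names to their number, so 'March' becomes 3.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer: str) -> str:
    """Return a canonical 'D<day> M<month> Y<year>' form, or the input unchanged."""
    tokens = re.findall(r"[A-Za-z]+|\d+", answer.lower())
    day = month = year = None
    for tok in tokens:
        if tok in MONTHS:
            month = MONTHS[tok]
        elif tok.isdigit():
            n = int(tok)
            if n > 31:
                year = n   # values above 31 are read as years
            else:
                day = n
    if None in (day, month, year):
        return answer  # not a full date; leave the answer as-is
    return f"D{day} M{month} Y{year}"
```

With this sketch, both `normalize_date('March 21, 1961')` and `normalize_date('21 March 1961')` yield `'D21 M3 Y1961'`, so the two variants fall into the same cluster.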
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
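The overlap distance of equation (9) and the single-link clustering loop just described can be sketched as follows. This is an illustrative reimplementation, not JustAsk's code; names and the tokenization by whitespace are assumptions.

```python
def overlap(x: str, y: str) -> float:
    """Overlap distance of equation (9): 0.0 when one answer contains the other."""
    xt, yt = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xt & yt) / min(len(xt), len(yt))

def single_link_cluster(answers, dist, threshold):
    """Repeatedly merge the two closest clusters while their distance is under the threshold."""
    clusters = [[a] for a in answers]      # every answer starts in its own cluster
    while len(clusters) > 1:
        # single-link: cluster distance = minimum over cross-cluster pairs
        d, i, j = min(
            (min(dist(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        if d >= threshold:
            break                          # remaining clusters are too far apart
        clusters[i].extend(clusters.pop(j))
    return clusters
```

For example, 'Mogadishu' and 'Mogadishu Somalia' have an overlap distance of 0.0 and end up in the same cluster, while 'Washington' stays in a cluster of its own.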
3.2 Experimental Setup
Over this section we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = 2 * Precision * Recall / (Precision + Recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = (1/N) * sum_{i=1}^{N} 1/rank_i    (14)
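Equation (14) translates directly into code. The sketch below is illustrative (the function name and input format are assumptions): each query contributes the inverse rank of its first correct answer, or 0 if none is correct.

```python
def mean_reciprocal_rank(ranked_answers, references):
    """MRR over a set of queries: ranked_answers[i] is the system's ordered
    answer list for query i, references[i] the set of accepted answers."""
    total = 0.0
    for answers, refs in zip(ranked_answers, references):
        for rank, answer in enumerate(answers, start=1):
            if answer in refs:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)
```

For three queries whose first correct answers appear at ranks 2, 1 and nowhere, the MRR is (1/2 + 1 + 0) / 3 = 0.5.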
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu – Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, until the final snippet of each question. Figure 2 illustrates the process,
28
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping construct new corpora in the future.
Figure 3: Execution of the application to create answers corpora on the command line.
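The annotation loop of Figures 2 and 3 can be sketched as below. This is a simplified stand-in for the actual application (function and variable names are illustrative): for each question, every snippet's 'Content' field is shown, and a typed reference is accepted only if it actually occurs in that content.

```python
def collect_references(questions, snippets_per_question, read_input=input):
    """Prompt for a reference per snippet; keep it only if it occurs in the content."""
    references = {q: [] for q in questions}  # the 'references list': array of arrays
    for q in questions:
        for content in snippets_per_question.get(q, []):
            print(f"Q: {q}\n{content}")
            ref = read_input("reference (empty to skip): ").strip()
            if ref and ref in content and ref not in references[q]:
                references[q].append(ref)
    return references
```

Passing a stub for `read_input` makes the loop testable without a terminal; the real application additionally writes the reference list to an output file after each question.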
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered out and kept only the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also questioned whether the big problem lay within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
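The shift from document-level to paragraph-level passages can be sketched as follows. This is a simplified stand-in for the actual Lucene setup (names, the blank-line paragraph split, and the keyword-count ranking are assumptions): only paragraphs mentioning query keywords are kept as passages.

```python
def paragraph_passages(document: str, keywords, max_passages=20):
    """Return paragraphs (not the whole article) that contain query keywords,
    ranked by how many distinct keywords they mention."""
    kws = {k.lower() for k in keywords}
    scored = []
    for par in document.split("\n\n"):          # naive paragraph split
        hits = sum(1 for k in kws if k in par.lower())
        if hits:                                 # drop keyword-free paragraphs
            scored.append((hits, par))
    scored.sort(key=lambda p: -p[0])
    return [par for _, par in scored[:max_passages]]
```

On the 'capital of Iraq' example above, a paragraph mentioning both 'Iraq' and 'capital' outranks a political paragraph that mentions only 'Iraq', so Washington stops dominating the candidate list.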
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
31
Questions: 200    Correct: 88    Wrong: 87    No Answer: 25
Accuracy: 44%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.
Extracted        Google – 8 passages         Google – 32 passages
answers      Questions   CAE     AS       Questions   CAE     AS
0                21        0      –           19        0      –
1 to 10          73       62     53           16       11     11
11 to 20         32       29     22           13        9      7
21 to 40         11       10      8           61       57     39
41 to 60          1        1      1           30       25     17
61 to 100         0        –      –           18       14     11
> 100             0        –      –            5        4      3
All             138      102     84          162      120     88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted     Google – 64 passages
answers      Questions   CAE     AS
0                17        0      0
1 to 10          18        9      9
11 to 20          2        2      2
21 to 40         16       12      8
41 to 60         42       38     25
61 to 100        38       32     18
> 100            35       30     17
All             168      123     79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.
        Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total    Q   CAE   AS        Q   CAE   AS             Q   CAE   AS
abb      4    3     1    1        3     1    1             4     1    1
des      9    6     4    4        6     4    4             6     4    4
ent     21   15     1    1       16     1    0            17     1    0
hum     64   46    42   34       55    48   36            56    48   31
loc     39   27    24   19       34    27   22            35    29   21
num     63   41    30   25       48    35   25            50    40   22
All    200  138   102   84      162   120   88           168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.
These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus, increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though the 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls | Okavango River | Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns: several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
– 'Queen Elizabeth' vs. 'Elizabeth II'
– 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
– 'River Glomma' vs. 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
– 'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match.

To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59).

With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'Yes'.
3. Places/dates/numbers that differ in specificity: some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these mistakes are similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
– 'Mt Kenya' vs. 'Mount Kenya National Park'
– 'Bavaria' vs. 'Overbayern Upper Bavaria'
– '2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues: many entities have completely different names for the same referent; we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues: such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues: many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved, such as:
– 'Kotashaan' vs. 'Khotashan'
– 'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms: seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend the abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have automated this process and extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses a set of relations it shares with other answers, namely the type of relation and the answer it is related to, plus a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
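As an illustration, the scoring scheme just described can be sketched as follows. The function and data shapes are hypothetical (the real Scorer interface is a Java component not reproduced here); only the weights come from the experiment described above.

```python
# Illustrative weights from the first experiment: frequency and equivalence
# count 1.0 each, inclusion 0.5. Names here are ours, not JustAsk's API.
WEIGHTS = {"FREQUENCY": 1.0, "EQUIVALENCE": 1.0, "INCLUSION": 0.5}

def boost_scores(candidates, relations):
    """candidates: dict answer -> base score.
    relations: iterable of (answer_a, answer_b, relation_type) triples.
    Returns a new dict with each answer's score boosted per shared relation."""
    scores = dict(candidates)
    for a, b, rel in relations:
        w = WEIGHTS.get(rel, 0.0)
        scores[a] = scores.get(a, 0.0) + w
        scores[b] = scores.get(b, 0.0) + w
    return scores
```

For instance, the 'Ferrara'/'Italy' pair from Section 3.4, tied at 3.0, would both be boosted by 0.5 through their inclusion relation, keeping them tied but ranking them above unrelated candidates.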
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
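A minimal sketch of such a conversion normalizer, assuming standard conversion factors and a single canonical unit per measure type (the real normalizer's class structure and direction of conversion may differ):

```python
# Hypothetical converter normalizing measures to a canonical unit
# (km, m, kg, Celsius); the factors are standard conversions.
CANONICAL = {
    "miles":  ("km", 1.609344),
    "feet":   ("m", 0.3048),
    "pounds": ("kg", 0.45359237),
}

def normalize_measure(value: float, unit: str):
    """Return (value, unit) expressed in the canonical unit for comparison."""
    unit = unit.lower()
    if unit in CANONICAL:
        target, factor = CANONICAL[unit]
        return value * factor, target
    if unit in ("fahrenheit", "f"):
        return (value - 32) * 5.0 / 9.0, "celsius"
    return value, unit  # already canonical or unknown unit
```

With candidates normalized this way, '1000 miles' and '1609 km' land on comparable values and can be matched by a numeric tolerance instead of string distance.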
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it stands as a reminder that we must not, under any circumstances, underestimate the richness of a natural language.
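The title-stripping logic with a begin-of-answer restriction and an exception list might look like the following sketch; the exception list and the regular expression are our assumptions, not the actual implementation.

```python
import re

# Titles stripped by the normalizer, per the text; 'Mr T' and 'Queen'
# (the band) are preserved via a hypothetical exception list, and titles
# are only removed at the start of the answer.
TITLES = ("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr")
EXCEPTIONS = {"Mr T", "Queen"}

def strip_titles(answer: str) -> str:
    """Remove a leading title (with optional period) unless excepted."""
    if answer in EXCEPTIONS:
        return answer
    pattern = r"^(?:" + "|".join(TITLES) + r")\.?\s+"
    return re.sub(pattern, "", answer)
```

Under this scheme 'Mr. Clinton' normalizes to 'Clinton' (clustering with 'Bill Clinton'), while 'Mr T' and the bare 'Queen' survive untouched.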
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it leaves room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
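A sketch of this proximity heuristic follows. The character-offset averaging is our reading of the description above, and the helper names are hypothetical; a real implementation would also handle multiple occurrences of the answer and keywords.

```python
# Boost an answer's score when its average character distance to the
# query keywords in the passage is under the threshold (20 chars in the
# text); the boost value of 5 matches the description above.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    pos = passage.find(answer)
    if pos < 0:
        return 0
    dists = []
    for kw in keywords:
        k = passage.find(kw)
        if k >= 0:
            dists.append(abs(k - pos))
    if dists and sum(dists) / len(dists) <= threshold:
        return boost
    return 0
```

For the Okavango example, the correct candidate '1000 miles' sits a few characters from 'Okavango' and 'River' in its passage and receives the boost, breaking the tie with '300 km'.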
4.3 Results
                  200 questions           1210 questions
Before features   P 50.6 R 87.0 A 44.0    P 28.0 R 79.5 A 22.2
After features    P 51.1 R 87.0 A 44.5    P 28.2 R 79.8 A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, considering the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
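The input handling of the annotation loop can be sketched as below; the function and dictionary names are ours, not the original script's, and the label strings follow the output format described above.

```python
# Judgment labels accepted by the annotation tool, per the description:
# E, I1, I2, A, C, D, plus 'skip' to advance past a pair.
LABELS = {
    "E":  "EQUIVALENCE",
    "I1": "INCLUSION (first includes second)",
    "I2": "INCLUSION (second includes first)",
    "A":  "COMPLEMENTARY ANSWERS",
    "C":  "CONTRADICTORY ANSWERS",
    "D":  "IN DOUBT",
}

def read_judgment(raw: str):
    """Map raw user input to a relation label, 'SKIP', or None (reprompt)."""
    cmd = raw.strip().upper()
    if cmd == "SKIP":
        return "SKIP"
    return LABELS.get(cmd)
```

Returning None for unrecognized input mirrors the script's behavior of asking the user to repeat the judgment.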
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library: 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard index [11]: the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across them.
– Dice's coefficient [6]: related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)

Dice's coefficient basically gives double weight to the intersecting elements, boosting the final score. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71.
– Jaro [13, 12]: this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45]: an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30]: a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
The package contained other metrics of particular interest, such as the following two measures, which were also added and tested:
– Soundex [36]: with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40]: a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
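Several of the metrics above are easy to sketch. The following minimal Python versions are our own simplified implementations (the SimMetrics library's versions differ in details such as multiset handling); they reproduce the bigram Dice/Jaccard computation and the claim that 'Glomma' and 'Glama' share a Soundex code.

```python
def bigrams(s):
    """Set of character bigrams of a string, case-folded."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def soundex(word):
    """Classic 4-character Soundex code (vowels reset, h/w transparent)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":
            prev = code
    return (out + "000")[:4]
```

Note that 'Glomma' and 'Glama' both encode to G450, which is exactly the kind of case Soundex is expected to catch where edit distance fails.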
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, namely 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
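Single-link clustering with a distance threshold, as used here, can be sketched with a union-find structure: two answers end up in the same cluster whenever a chain of pairs, each within the threshold, connects them. This is an illustrative Python version, not JustAsk's implementation.

```python
# Single-link agglomerative clustering: merge clusters whenever any
# cross pair falls within the distance threshold.
def single_link(answers, dist, threshold):
    parent = list(range(len(answers)))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if dist(answers[i], answers[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, a in enumerate(answers):
        clusters.setdefault(find(i), []).append(a)
    return list(clusters.values())
```

Any of the metrics in this section can be plugged in as `dist`, which is what makes the threshold comparison across metrics straightforward.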
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually non-existent in most cases.
Metric / Threshold   0.17                   0.33                   0.5
Levenshtein          P 51.1 R 87.0 A 44.5   P 43.7 R 87.5 A 38.5   P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0   P 33.0 R 87.0 A 33.0   P 36.4 R 86.5 A 31.5
Jaccard index        P 9.8  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5    P 9.9  R 55.5 A 5.5
Dice coefficient     P 9.8  R 56.0 A 5.5    P 9.9  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5
Jaro                 P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Jaro-Winkler         P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0   P 35.9 R 83.5 A 30.0   P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0   P 13.4 R 59.5 A 8.0    P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric: 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we halve what would otherwise be the value of a full match between two tokens for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
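Equations (20) to (22) can be turned into a direct sketch. Note one assumption the text leaves open: we read nonCommonTerms as the total number of unmatched tokens on both sides (four in the Kennedy example).

```python
# Sketch of the proposed 'JFKDistance' (equations 20-22). Token lengths
# are counted in tokens, and nonCommonTerms is taken as the combined count
# of unmatched tokens from both answers - an assumption on our part.
def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter_matches = sum(
        1 for x in rest1 if any(y[0] == x[0] for y in rest2))
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1.0 - common_score - fl_score                     # eq. (20)
```

Under this reading, 'J F Kennedy' vs. 'John Fitzgerald Kennedy' scores 1 − 1/3 − 0.25 ≈ 0.42, comfortably below the distances Levenshtein or Overlap assign to the same pair.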
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions           1210 questions
Before             P 51.1 R 87.0 A 44.5    P 28.2 R 79.8 A 22.5
After new metric   P 51.7 R 87.0 A 45.0    P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features shared with the rules and with certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features: Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts, which would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), Table 10 shows the results for every metric incorporated, for the three main categories: Human, Location and Numeric. These can be easily evaluated, whereas categories such as Description (answered by retrieving the Wikipedia abstract) are simply marked as wrong/nil given the reference input.
Metric / Category   Human   Location   Numeric
Levenshtein         50.00   53.85      4.84
Overlap             42.42   41.03      4.84
Jaccard index       7.58    0.00       0.00
Dice                7.58    0.00       0.00
Jaro                9.09    0.00       0.00
Jaro-Winkler        9.09    0.00       0.00
Needleman-Wunsch    42.42   46.15      4.84
LogSim              42.42   46.15      4.84
Soundex             42.42   46.15      3.23
Smith-Waterman      9.09    2.56       0.00
JFKDistance         54.55   53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric per category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers: some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written out as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions: in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected: some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion): for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references: the most easily correctable error detected lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references. After inserting such missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answer module of JustAsk, our own QA system.
Here are the main contributions achieved:

– We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect possible new ways to improve JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers
ndash Use the rules as a distance metric of its own
ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico (ISSN 1870-4069), 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
3 JustAsk architecture and work points
The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. In this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.
3.1 JustAsk
This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.
Figure 1: JustAsk architecture, mirroring the architecture of a general QA system
3.1.1 Question Interpretation
This module receives as input a natural language question and outputs an interpreted question. At this stage, the idea is to gather important information
5 http://www.inesc-id.pt
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).
Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of that article, which is returned as the answer.
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
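As an illustration, this date normalization step could be sketched as follows (a hypothetical re-implementation of the behavior described above, not JustAsk's actual code; the function name is ours):

```python
import re

# Map month names to their number, e.g. 'march' -> 3.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer: str):
    """Return the canonical 'D<day> M<month> Y<year>' form, or None."""
    text = answer.lower().replace(",", "")
    # '21 March 1961' style: day, month name, year
    m = re.search(r"(\d{1,2})\s+([a-z]+)\s+(\d{4})", text)
    if m and m.group(2) in MONTHS:
        return f"D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}"
    # 'March 21 1961' style: month name, day, year
    m = re.search(r"([a-z]+)\s+(\d{1,2})\s+(\d{4})", text)
    if m and m.group(1) in MONTHS:
        return f"D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}"
    return None
```

Both surface forms of the example above map to the same canonical string, which is exactly what lets the clustering stage group them.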
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
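The procedure above can be sketched in a few lines (a simplified illustration, not JustAsk's code: the overlap distance of Equation (9) plus a greedy single-link merge loop; function names and the 0.33 default threshold are assumptions):

```python
def overlap(x: str, y: str) -> float:
    """Overlap distance of Equation (9): 0.0 when one answer contains the other."""
    xt, yt = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xt & yt) / min(len(xt), len(yt))

def single_link_cluster(answers, dist=overlap, threshold=0.33):
    """Merge the two closest clusters (single link = minimum pairwise
    distance) while that distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        i, j, d = min(
            ((i, j, min(dist(a, b) for a in clusters[i] for b in clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if d >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    # each cluster is represented by its longest member, as described above
    return [(max(c, key=len), c) for c in clusters]
```

For example, 'Mount Kilimanjaro' and 'Kilimanjaro' end up in one cluster (their overlap distance is 0.0), represented by the longer form, while an unrelated answer stays in its own cluster.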
3.2 Experimental Setup
In this section we describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = (2 × Precision × Recall) / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) × Σ_{i=1}^{N} (1/rank_i)    (14)
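These measures follow directly from the counts above; a minimal sketch (function and argument names are our own, not part of JustAsk):

```python
def evaluate(judged_correct, answered, total, ranks=None):
    """Compute the measures of Equations (10)-(14).
    judged_correct: correctly judged items
    answered:       relevant (non-null) items retrieved
    total:          items in the test corpus
    ranks:          rank of the first correct answer per query (0 = none),
                    used only for MRR
    """
    precision = judged_correct / answered
    recall = answered / total
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = judged_correct / total
    mrr = None
    if ranks is not None:
        mrr = sum(1.0 / r for r in ranks if r > 0) / len(ranks)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy, "mrr": mrr}
```

Plugging in the baseline counts reported later (88 correct, 175 answered, 200 questions) reproduces the 50.3% precision, 87.5% recall and 44% accuracy of Table 1.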
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1 Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line
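The core acceptance check of the annotation tool – accept a typed candidate as a reference only if it occurs in the snippet's content – could be sketched as follows (a hypothetical simplification; names are ours):

```python
def add_reference(snippet_content: str, candidate: str, references: list) -> bool:
    """Accept `candidate` as a reference iff it occurs verbatim in the
    snippet content, and record it in the references list."""
    if candidate in snippet_content:
        references.append(candidate)
        return True
    return False
```

As noted above, this check only guarantees that the typed string occurs in the snippet, not that it is actually a correct answer; the human annotator remains responsible for that judgement.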
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
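This filtering step can be sketched as follows (a hypothetical re-implementation using the pronoun list above; names are ours):

```python
import re

# Pronouns that signal an unresolved third-party reference.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_dangling_reference(question: str) -> bool:
    """True if the question contains a pronoun whose referent is unknown."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who was his father?",
             "What is the capital of Somalia?"]
kept = [q for q in questions if not has_dangling_reference(q)]
```

Only the self-contained question survives the filter; the two anaphoric ones are discarded, mirroring the reduction from 1134 to 232 questions described above.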
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore composed of a set of questions for each topic. Noticing that each of these TREC questions carried topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after correcting a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.
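Lucene itself is a Java library; purely to illustrate the paragraph-over-document idea (not the actual indexing code), here is a toy keyword scorer in Python that ranks paragraphs rather than whole documents, so an off-topic article cannot win on sheer length:

```python
def best_passages(documents, keywords, k=3):
    """Score each paragraph (not each document) by keyword occurrences
    and return the k highest-scoring paragraphs."""
    paragraphs = [p for doc in documents for p in doc.split("\n\n")]
    scored = sorted(
        paragraphs,
        key=lambda p: sum(p.lower().count(w.lower()) for w in keywords),
        reverse=True)
    return scored[:k]
```

In the 'capital of Iraq' example above, the paragraph that actually states the capital outranks the longer political discussion, even if both come from the same article.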
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over the 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions
In Tables 2 and 3 we show the results of the Answer Extraction module, using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                 Google – 8 passages         Google – 32 passages
Extracted        Questions  CAE      AS      Questions  CAE      AS
Answers          ()         success  success ()         success  success
0                21         0        –       19         0        –
1 to 10          73         62       53      16         11       11
11 to 20         32         29       22      13         9        7
21 to 40         11         10       8       61         57       39
41 to 60         1          1       1        30         25       17
61 to 100        0          –        –       18         14       11
> 100            0          –        –       5          4        3
All              138        102      84      162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
                 Google – 64 passages
Extracted        Questions  CAE      AS
Answers          ()         success  success
0                17         0        0
1 to 10          18         9        9
11 to 20         2          2        2
21 to 40         16         12       8
41 to 60         42         38       25
61 to 100        38         32       18
> 100            35         30       17
All              168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
       Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total   Q   CAE   AS     Q    CAE   AS           Q    CAE   AS
abb  4       3   1     1      3    1     1            4    1     1
des  9       6   4     4      6    4     4            6    4     4
ent  21      15  1     1      16   1     0            17   1     0
hum  64      46  42    34     55   48    36           56   48    31
loc  39      27  24    19     34   27    22           35   29    21
num  63      41  30    25     48   35    25           50   40    22
     200     138 102   84     162  120   88           168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
matter how many passages are used. We found out that this category achieves the best results with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus
Later, we made further experiments with this corpus, increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
Even though 100 passages offer the best results, they also take more than thrice the time of running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls - Okavango River - Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer: LogSim is approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
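The token-level check sketched in this solution can be illustrated as follows (a minimal sketch: the function name and the strict one-to-one alignment policy, requiring equal token counts, are assumptions made for illustration):

```python
def tokens_match(a1, a2):
    """Treat two candidate answers as the same entity if they share at least
    one full token and the remaining tokens match on their initials, in the
    same order (e.g. 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy')."""
    t1 = a1.replace('.', '').split()
    t2 = a2.replace('.', '').split()
    if len(t1) != len(t2):          # perfect alignment only, as in the example
        return False
    full = sum(1 for x, y in zip(t1, t2) if x.lower() == y.lower())
    initials = all(x[0].lower() == y[0].lower() for x, y in zip(t1, t2))
    return full >= 1 and initials
```

Note that this sketch rejects pairs such as 'Queen Elizabeth' vs. 'Elizabeth II', which share no full token in the same position.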
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (being the more specific answer), but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that composes our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want our system to do: detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module contains the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which would possibly allow a better clustering of answers.

The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 of Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers, namely the type of each relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we tested the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all, for all the corpora.
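The scoring scheme just described can be sketched as follows (a minimal sketch, assuming a flat list of relation pairs per answer; the names and data layout are illustrative, not JustAsk's actual API):

```python
# Each candidate gets 1.0 per exact repetition (frequency) or equivalence
# relation, and 0.5 per inclusion relation, as in the first experiment above.
EQUIVALENCE, INCLUDES, INCLUDED_BY = "equivalence", "includes", "included_by"

def score_candidate(answer, relations, all_answers):
    """relations: list of (relation_type, other_answer) pairs for `answer`."""
    score = 1.0 * all_answers.count(answer)          # frequency baseline
    for rel_type, _other in relations:
        if rel_type == EQUIVALENCE:
            score += 1.0
        elif rel_type in (INCLUDES, INCLUDED_BY):
            score += 0.5
    return score
```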
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
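The conversion normalizer can be sketched as follows (a minimal sketch: the conversion factors, rounding, and the choice of target units are assumptions made for illustration):

```python
# Multiplicative conversions; temperature is handled separately (affine).
LINEAR = {"km": ("miles", 0.621371),
          "meters": ("feet", 3.28084),
          "kg": ("pounds", 2.20462)}

def normalize_measure(value, unit):
    """Convert a (value, unit) pair into the unit used as clustering standard."""
    if unit in LINEAR:
        target, factor = LINEAR[unit]
        return round(value * factor, 1), target
    if unit == "celsius":                       # affine, not purely linear
        return round(value * 9 / 5 + 32, 1), "fahrenheit"
    return value, unit
```

With such a converter, '300 km' and '186 miles' can land in the same cluster after normalization.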
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably miss: 'Mr King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
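A sketch of this honorific/abbreviation stripper, including the safeguards discussed above (the exception list and the single-token guard are assumptions for illustration):

```python
# Strip a leading honorific, but never reduce an answer to nothing and never
# touch explicit exceptions such as 'Queen' (the band).
HONORIFICS = ("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr")

def strip_honorifics(answer, exceptions=("Queen", "Mr T")):
    if answer in exceptions:
        return answer
    tokens = answer.replace(".", "").split()
    if len(tokens) > 1 and tokens[0] in HONORIFICS:
        return " ".join(tokens[1:])
    return answer
```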
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, leaving room for further extensions.
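A minimal sketch of such an acronym normalizer (the mapping entries are just the illustrative examples from the text, not an exhaustive resource):

```python
# Token-wise acronym expansion toward a single canonical form per entity.
ACRONYMS = {"US": "United States", "U.S.": "United States",
            "UN": "United Nations", "USA": "United States of America"}

def normalize_acronyms(answer):
    return " ".join(ACRONYMS.get(token, token) for token in answer.split())
```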
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (defined as 20 characters), we boost that answer's score (by 5 points).
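This proximity boost can be sketched as follows (the 20-character threshold and the 5-point bonus come from the text; the character-offset distance and first-occurrence matching are simplifying assumptions):

```python
def proximity_boost(passage, answer, keywords, threshold=20, bonus=5.0):
    """Boost an answer that appears near the query keywords in a passage."""
    a_pos = passage.find(answer)
    if a_pos < 0:
        return 0.0
    distances = [abs(passage.find(k) - a_pos) for k in keywords if k in passage]
    if distances and sum(distances) / len(distances) < threshold:
        return bonus
    return 0.0
```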
4.3 Results
                    200 questions                  1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%     P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
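The core annotation loop of this application can be sketched as follows (the labels and output format follow the text; the I/O details and function name are assumptions):

```python
# Map the one- or two-letter commands onto the relation labels written out.
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate_pair(question_id, ref_a, ref_b, read=input):
    """Ask for a judgment until a valid label (or 'skip') is given."""
    while True:
        cmd = read(f"Q{question_id}: '{ref_a}' vs '{ref_b}'? ").strip()
        if cmd == "skip":
            return None                      # pair skipped by the annotator
        if cmd in LABELS:
            return f"QUESTION {question_id} {LABELS[cmd]} -{ref_a} AND {ref_b}"
```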
Figure 6: Execution of the application used to create the relations corpora, in the command line.
In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to try, as we had also proposed. Therefore, we decided to run new experiments, bringing new similarity metrics into the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. Several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library, 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]: {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: {Ke, en, ne, ed, dy}. The result is therefore (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/(6 + 6 − 5) = 5/7.
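The bigram-set computation in this example can be sketched as follows (a minimal sketch of the string versions of both coefficients; set-based, so repeated bigrams are counted once):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```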
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric, given by:

dw = dj + l × p × (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it gives a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G }   (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
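The Soundex algorithm mentioned above can be sketched as follows (a minimal version of the classic coding rules; input is assumed to be a single alphabetic word):

```python
def soundex(word):
    """Classic Soundex: first letter kept, consonants coded, 4-char code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":          # h and w do not separate duplicate codes
            last = code
    return (result + "000")[:4]
```

Note that it indeed maps 'Glomma' and 'Glama' (Section 3.4) to the same code.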
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm, namely 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that gives us the best system results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold      0.17                      0.33                      0.5
Levenshtein        P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice coefficient   P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch   P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric: 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein is closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score given to a perfect match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
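Equations 20-22 can be sketched as follows (a minimal sketch: tokenization on whitespace after dropping periods, case-insensitive matching, and positional pairing of the leftover tokens are assumptions about details the text leaves open):

```python
def jfk_distance(a1, a2):
    """Token-based distance rewarding full common terms (eq. 21) and
    half-rewarding first-letter matches of the remaining tokens (eq. 22)."""
    t1, t2 = a1.replace('.', ' ').split(), a2.replace('.', ' ').split()
    common = {x.lower() for x in t1} & {x.lower() for x in t2}
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [x for x in t1 if x.lower() not in common]
    rest2 = [x for x in t2 if x.lower() not in common]
    non_common = len(rest1) + len(rest2)
    first_letter = sum(1 for x, y in zip(rest1, rest2)
                       if x[0].lower() == y[0].lower())
    fl_score = (first_letter / non_common) / 2 if non_common else 0.0
    return 1.0 - common_score - fl_score
```

For 'J F Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 0.25 ≈ 0.42, low enough to cluster the pair under the thresholds tested above.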
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separating measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions                  1210 questions
Before            P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%
After new metric  P 51.7%  R 87.0%  A 45.0%     P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As these results show, even with the redundancy of features also covered by the rules and by certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features: Combination of Distance Metrics to Enhance Overall Results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric incorporated, by the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, which retrieve the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category     Human    Location   Numeric
Levenshtein           50.00     53.85      4.84
Overlap               42.42     41.03      4.84
Jaccard index          7.58      0.00      0.00
Dice                   7.58      0.00      0.00
Jaro                   9.09      0.00      0.00
Jaro-Winkler           9.09      0.00      0.00
Needleman-Wunsch      42.42     46.15      4.84
LogSim                42.42     46.15      4.84
Soundex               42.42     46.15      3.23
Smith-Waterman         9.09      2.56      0.00
JFKDistance           54.55     53.85      4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except for when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always proportional to how the metrics perform overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses (as in the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords, and, in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].
3.1.2 Passage Retrieval
The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.
As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists of generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.
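As an illustration, the keyword-based strategy can be sketched as follows. This is a minimal sketch, not JustAsk's actual implementation, and the stop-word list is an assumption made here for the example:

```python
import re

# Hypothetical stop-word list; the real system's list is not given in the text.
STOP_WORDS = {"what", "when", "where", "who", "which", "how", "is", "was",
              "did", "does", "do", "the", "a", "an", "of", "in", "to"}

def keyword_query(question):
    """Keyword-based query formulation: keep the content words of the
    question, dropping punctuation and common function words."""
    tokens = re.findall(r"\w+", question.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

For example, `keyword_query('When was Mozart born?')` yields `'mozart born'`.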
3.1.3 Answer Extraction
The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.
JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and Gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
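A date normalizer along these lines can be sketched in Python. This is an illustrative re-implementation of the canonical form described above, not the module's actual code:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    """Convert dates such as 'March 21, 1961' or '21 March 1961' into the
    canonical form 'D21 M3 Y1961'; return None if no full date is found."""
    tokens = re.findall(r"\w+", answer.lower())
    day = month = year = None
    for t in tokens:
        if t in MONTHS:
            month = MONTHS[t]
        elif t.isdigit() and len(t) == 4:
            year = int(t)
        elif t.isdigit():
            day = int(t)
    if None in (day, month, year):
        return None
    return f"D{day} M{month} Y{year}"
```

Both orderings of the example dates map to the same canonical string, which is exactly what makes the two variants cluster together.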
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
3.2 Experimental Setup
In this section, we describe the evaluation measures and the corpora we used to evaluate our system.
3.2.1 Evaluation measures
To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = 2 × (precision × recall) / (precision + recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) Σ_{i=1..N} 1/rank_i    (14)
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' for the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one, and for each of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest city in Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, as opposed to the completely manual approach we followed before writing the application. In its current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
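The acceptance check at the heart of this annotation tool can be sketched as follows; `add_reference` is a hypothetical name used here for illustration, not the application's actual code:

```python
def add_reference(snippet_content, references, answer):
    """Accept 'answer' as a new reference only if it occurs verbatim in
    the snippet's Content field, mirroring the validation described
    above. Returns True when the reference is accepted."""
    if answer and answer in snippet_content:
        if answer not in references:
            references.append(answer)
        return True
    return False
```

As noted in the text, any substring of the content passes this check, so the human annotator is still responsible for picking strings that are actual answers.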
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing their original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?
The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
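This filtering step can be sketched as follows; the pronoun list comes from the text, while the function names are illustrative rather than the actual application's:

```python
import re

# Third-party pronouns listed in the text, pointing to a lost antecedent.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_dangling_reference(question):
    """True when the question contains a pronoun whose antecedent was
    lost during filtering (e.g. 'When did she die?')."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(t in PRONOUNS for t in tokens)

def filter_questions(questions):
    """Keep only questions with no dangling third-party references."""
    return [q for q in questions if not has_dangling_reference(q)]
```

As the text goes on to note, this catches only the obvious cases; anaphors like 'the company' or 'the festival' slip through a purely lexical filter.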
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
30
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application that filters the questions for which the New York Times had a correct answer, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene as the search tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
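The paragraph-level strategy can be sketched as a simple filter. The thesis used Lucene for the actual indexing, so this Python fragment only illustrates the idea of keeping keyword-bearing paragraphs instead of whole documents:

```python
def paragraph_passages(document, keywords):
    """Split a document into paragraphs (blank-line separated) and keep
    only those containing at least one query keyword -- the
    paragraph-level indexing strategy described above, in miniature."""
    keywords = [k.lower() for k in keywords]
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p for p in paragraphs if any(k in p.lower() for k in keywords)]
```

On the 'capital of Iraq' example, a paragraph about Washington with no query keyword would simply never reach the answer extractor.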
3.3 Baseline

3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44.0%     50.3%      87.5%   0.37  35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance, with a threshold of 0.33, to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
             Google – 8 passages        Google – 32 passages
Extracted    Questions  CAE      AS       Questions  CAE      AS
Answers      (#)        success  success  (#)        success  success
0            21         0        –        19         0        –
1 to 10      73         62       53       16         11       11
11 to 20     32         29       22       13         9        7
21 to 40     11         10       8        61         57       39
41 to 60     1          1        1        30         25       17
61 to 100    0          –        –        18         14       11
> 100        0          –        –        5          4        3
All          138        102      84       162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
             Google – 64 passages
Extracted    Questions  CAE      AS
Answers      (#)        success  success
0            17         0        0
1 to 10      18         9        9
11 to 20     2          2        2
21 to 40     16         12       8
41 to 60     42         38       25
61 to 100    38         32       18
> 100        35         30       17
All          168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
            Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total  Q    CAE  AS          Q    CAE  AS           Q    CAE  AS
abb  4      3    1    1           3    1    1            4    1    1
des  9      6    4    4           6    4    4            6    4    4
ent  21     15   1    1           16   1    0            17   1    0
hum  64     46   42   34          55   48   36           56   48   31
loc  39     27   24   19          34   27   22           35   29   21
num  63     41   30   25          48   35   25           50   40   22
All  200    138  102  84          162  120  88           168  123  79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions  232 questions  1210 questions
Precision  26.7%           28.2%          27.4%
Recall     77.2%           67.2%          76.9%
Accuracy   20.6%           19.0%          21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus, by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20     30     50     100
Accuracy   21.1%  22.4%  23.7%  23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Utiel-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'

Solution: we might extend a set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, ie common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
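The token-level alignment test just described can be sketched in a few lines (an illustrative Python sketch; the actual JustAsk implementation is in Java, and the function name `likely_same_entity` is ours):

```python
def likely_same_entity(a, b):
    """Token-level alignment test: same number of tokens, at least one
    perfect token match, and every remaining token pair shares its initial."""
    t1, t2 = a.split(), b.split()
    if len(t1) != len(t2):
        return False
    # require at least one perfect match between tokens
    if not set(w.lower() for w in t1) & set(w.lower() for w in t2):
        return False
    # remaining tokens must share the same initial, in the same order
    return all(x.lower() == y.lower() or x[0].lower() == y[0].lower()
               for x, y in zip(t1, t2))
```

Under this test, 'J F Kennedy' and 'John Fitzgerald Kennedy' align, while 'J F Kennedy' and 'Robert F Kennedy' are rejected on the first token.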
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (ie one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Oberbayern (Upper Bavaria)'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the line for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. Even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' vs 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module – Relations Module Implementation

Figure 4 represents what we want our system to do: detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over those of inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automatic and we have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – ie, the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
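The scoring scheme of this first experiment can be expressed as follows (an illustrative Python sketch; the names are ours, not those of the actual Scorer interface, which is written in Java):

```python
# weights from the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion (in either direction)
RELATION_WEIGHTS = {
    'FREQUENCY': 1.0,
    'EQUIVALENCE': 1.0,
    'INCLUDES': 0.5,
    'INCLUDED_BY': 0.5,
}

def boosted_score(base_score, relations):
    """Boost a candidate's score with the weights of the relations it shares."""
    return base_score + sum(RELATION_WEIGHTS[r] for r in relations)
```

The variant tested later for the category Entity simply swaps the equivalence and inclusion weights in the table.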
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
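Such a conversion normalizer might look like the following (a Python sketch with standard conversion factors; the function names are ours, the real normalizer being a Java class):

```python
# standard conversion factors to the canonical imperial units
KM_TO_MILES = 0.621371
M_TO_FEET = 3.28084
KG_TO_POUNDS = 2.20462

def km_to_miles(value_km):
    """Convert a distance in kilometres to miles."""
    return value_km * KM_TO_MILES

def kg_to_pounds(value_kg):
    """Convert a weight in kilograms to pounds."""
    return value_kg * KG_TO_POUNDS

def celsius_to_fahrenheit(value_c):
    """Temperature needs an affine conversion rather than a single factor."""
    return value_c * 9.0 / 5.0 + 32.0
```

With all candidates reduced to one canonical unit, '300 km' and its equivalent in miles land in the same cluster.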
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
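The title-stripping behaviour, with the 'King'-at-the-beginning restriction and an exception list for answers such as 'Queen' (the band) or 'Mr T', might be sketched as follows (illustrative only):

```python
TITLES = {'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir', 'Dr'}
# whole answers that must never be touched, e.g. the band 'Queen' or 'Mr T'
EXCEPTIONS = {'Queen', 'Mr T'}

def strip_title(answer):
    """Remove a leading title token from a multi-token answer."""
    if answer in EXCEPTIONS:
        return answer
    tokens = answer.split()
    # only strip a title at the beginning, and never leave an empty answer
    if len(tokens) > 1 and tokens[0].rstrip('.') in TITLES:
        return ' '.join(tokens[1:])
    return answer
```

So 'Mr Clinton' becomes 'Clinton' and 'Mr King' becomes 'King', while the standalone answer 'Queen' is left intact.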
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
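Both the formatting and acronym normalizations reduce answers to a canonical form before comparison; a combined sketch (illustrative; the acronym table only covers the two expansions mentioned above):

```python
ACRONYMS = {'US': 'United States', 'UN': 'United Nations'}

def canonical(answer):
    """Canonical form: periods removed, known acronyms expanded, case folded."""
    stripped = answer.replace('.', '').strip()
    expanded = ACRONYMS.get(stripped, stripped)
    return expanded.lower()
```

Two answers are then treated as equal whenever their canonical forms match, which handles both the 'Grozny'/'GROZNY' and the 'U.S.'/'US' cases.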
4.2.2 Proximity Measure
As for the problem in 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
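A sketch of this proximity boost, using character offsets, the 20-character threshold and the 5-point boost (the helper name is ours):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Boost the answer's score when its average character distance to the
    query keywords found in the passage is under the threshold."""
    pos_answer = passage.find(answer)
    if pos_answer < 0:
        return 0
    distances = [abs(passage.find(k) - pos_answer)
                 for k in keywords if passage.find(k) >= 0]
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0
```

An answer sitting right next to the query keywords receives the boost, while one buried far from them in the passage does not.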
4.3 Results
                   200 questions              1210 questions
Before features    P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
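The command-to-label mapping and the output line format above can be sketched as follows (illustrative Python; the actual tool wraps this in an interactive loop):

```python
RELATION_LABELS = {
    'E': 'EQUIVALENCE',
    'I1': 'INCLUSION',   # first answer includes the second
    'I2': 'INCLUSION',   # second answer includes the first
    'A': 'COMPLEMENTARY ANSWERS',
    'C': 'CONTRADITORY ANSWERS',  # sic, as printed by the tool
    'D': 'IN DOUBT',
}

def output_line(question_id, command, answer1, answer2):
    """Build one output line; returns None for an unknown command,
    signalling the tool to ask for the judgment again."""
    label = RELATION_LABELS.get(command.upper())
    if label is None:
        return None
    return 'QUESTION %d %s -%s AND %s' % (question_id, label, answer1, answer2)
```

Feeding it the judgment from the example above reproduces the line shown for the 5th question.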
Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter, we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across them.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, boosting the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated from the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant cost 1 of Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
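As a sanity check on the formulas above, the simpler character-level metrics can be sketched in a few lines (illustrative Python; JustAsk itself uses the SimMetrics Java implementations):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0
    # count matching characters within the allowed window
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters appearing in a different order
    k = out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    # l = length of the common prefix, capped at 4 as in Winkler's work
    l = 0
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)
```

Running these on the worked examples of this section ('Kennedy' vs 'Keneddy', 'Martha' vs 'Marhta') confirms the behaviour described in the text.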
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.
Metric             | 0.17                  | 0.33                  | 0.5
Levenshtein        | P 51.1 R 87.0 A 44.5  | P 43.7 R 87.5 A 38.5  | P 37.3 R 84.5 A 31.5
Overlap            | P 37.9 R 87.0 A 33.0  | P 33.0 R 87.0 A 33.0  | P 36.4 R 86.5 A 31.5
Jaccard index      | P 9.8  R 56.0 A 5.5   | P 9.9  R 55.5 A 5.5   | P 9.9  R 55.5 A 5.5
Dice coefficient   | P 9.8  R 56.0 A 5.5   | P 9.9  R 56.0 A 5.5   | P 9.9  R 55.5 A 5.5
Jaro               | P 11.0 R 63.5 A 7.0   | P 11.9 R 63.0 A 7.5   | P 9.8  R 56.0 A 5.5
Jaro-Winkler       | P 11.0 R 63.5 A 7.0   | P 11.9 R 63.0 A 7.5   | P 9.8  R 56.0 A 5.5
Needleman-Wunsch   | P 43.7 R 87.0 A 38.0  | P 43.7 R 87.0 A 38.0  | P 16.9 R 59.0 A 10.0
LogSim             | P 43.7 R 87.0 A 38.0  | P 43.7 R 87.0 A 38.0  | P 42.0 R 87.0 A 36.5
Soundex            | P 35.9 R 83.5 A 30.0  | P 35.9 R 83.5 A 30.0  | P 28.7 R 82.0 A 23.5
Smith-Waterman     | P 17.3 R 63.5 A 11.0  | P 13.4 R 59.5 A 8.0   | P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but additionally gives extra weight to strings that, besides sharing at least one perfect token match, have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to that of a match between two tokens.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
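Under the assumption that length() in (21) counts tokens (which matches the 'J F Kennedy' example), the metric can be sketched in Python as follows (illustrative; the actual implementation lives in JustAsk's Java distances package):

```python
def jfk_distance(a1, a2):
    """Token-based distance with extra weight for first-letter matches."""
    t1, t2 = a1.split(), a2.split()
    common = {w.lower() for w in t1} & {w.lower() for w in t2}
    common_score = len(common) / max(len(t1), len(t2))
    # tokens left unmatched on either side
    rest1 = [w for w in t1 if w.lower() not in common]
    rest2 = [w for w in t2 if w.lower() not in common]
    # first-letter matches between the unmatched tokens (each used once)
    used = [False] * len(rest2)
    first_letter_matches = 0
    for w in rest1:
        for j, v in enumerate(rest2):
            if not used[j] and w[0].lower() == v[0].lower():
                used[j] = True
                first_letter_matches += 1
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score
```

For 'J F Kennedy' vs 'John Fitzgerald Kennedy' this yields a distance well below the Levenshtein-based one, letting the two answers fall into the same cluster.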
A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.
                   200 questions              1210 questions
Before             P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
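In code, this decision tree reduces to a simple category-to-metric lookup; a sketch (the mapping shown is hypothetical, pending the per-category results):

```python
# hypothetical mapping from question category to best-performing metric
METRIC_BY_CATEGORY = {
    'Human': 'Levenshtein',
    'Location': 'LogSim',
}

def pick_metric(category, default='Levenshtein'):
    """Pick the clustering metric configured for a category."""
    return METRIC_BY_CATEGORY.get(category, default)
```

Categories without a dedicated entry fall back to the system's baseline metric.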
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), Table 10 shows the results for every metric incorporated, over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric             Human    Location   Numeric
Levenshtein        50.00    53.85      4.84
Overlap            42.42    41.03      4.84
Jaccard index      7.58     0.00       0.00
Dice coefficient   7.58     0.00       0.00
Jaro               9.09     0.00       0.00
Jaro-Winkler       9.09     0.00       0.00
Needleman-Wunsch   42.42    46.15      4.84
LogSim             42.42    46.15      4.84
Soundex            42.42    46.15      3.23
Smith-Waterman     9.09     2.56       0.00
JFKDistance        54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, ie results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And, in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix

Appendix A – Rules for detecting relations between answers
a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.
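This date-normalization step can be sketched in Python as follows (the function name and the list of accepted formats are illustrative, not JustAsk's actual code):

```python
from datetime import datetime

# Candidate input formats for date answers (an illustrative subset;
# a real normalizer would cover many more variants).
DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%B %d %Y", "%Y-%m-%d"]

def normalize_date(text):
    """Map a date expressed in any known format to a canonical form."""
    for fmt in DATE_FORMATS:
        try:
            d = datetime.strptime(text, fmt)
            return "D%d M%d Y%d" % (d.day, d.month, d.year)
        except ValueError:
            continue
    return None  # not a recognizable date

print(normalize_date("March 21, 1961"))  # D21 M3 Y1961
print(normalize_date("21 March 1961"))   # D21 M3 Y1961
```

Both surface forms now collapse into the same string, so the clustering step below can treat them as equal.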
Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as
Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)
where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will, in many cases, be directly influenced by its length.
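A minimal sketch of the overlap distance and the single-link clustering it feeds (whitespace tokenization and the threshold value are illustrative assumptions, not JustAsk's actual implementation):

```python
def overlap(x, y):
    """Overlap distance: 0.0 when one answer equals or contains the other."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(xs & ys) / min(len(xs), len(ys))

def single_link(answers, dist, threshold):
    """Agglomerative single-link clustering: repeatedly merge the two
    closest clusters while their distance stays below the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # single-link: cluster distance = minimum member-to-member distance
        i, j = min(pairs, key=lambda p: min(
            dist(a, b) for a in clusters[p[0]] for b in clusters[p[1]]))
        d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
        if d >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

answers = ["Mogadishu", "Mogadishu Somalia", "Washington"]
print(single_link(answers, overlap, threshold=0.5))
```

Here 'Mogadishu' and 'Mogadishu Somalia' merge (their overlap distance is 0.0), while 'Washington' stays in its own cluster.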
3.2 Experimental Setup
Over this section, we will describe the corpora we used to evaluate our system.
3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:
Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 × (Precision × Recall) / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) Σ_{i=1}^{N} (1 / rank_i)    (14)
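These measures are straightforward to compute; a sketch, using as an example the baseline figures reported later in Table 1 (200 questions, 175 attempted, 88 correct):

```python
def precision(correct, attempted):
    return correct / attempted

def recall(attempted, total):
    return attempted / total

def accuracy(correct, total):
    return correct / total

def f_measure(p, r):
    return 2 * p * r / (p + r)

def mrr(ranks):
    """ranks[i] is the rank of the first correct answer for query i,
    or None when no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

p = precision(88, 175)
a = accuracy(88, 200)
print(round(p, 3), round(a, 2))  # 0.503 0.44
```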
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.
We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call 'references' the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.
The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].
For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each one of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In this current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially constructing new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
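The acceptance check at the core of this tool can be sketched as follows (function and variable names are ours, not the actual script's):

```python
def add_reference(snippet_content, candidate, references):
    """Accept a typed-in candidate as a reference only if it occurs
    verbatim in the snippet's Content field."""
    if candidate in snippet_content:
        references.append(candidate)
        return True
    return False

refs = []
content = "Mogadishu is the largest city in Somalia and the nation's capital."
add_reference(content, "Mogadishu", refs)  # accepted: substring of the content
add_reference(content, "Nairobi", refs)    # rejected: not in the content
print(refs)  # ['Mogadishu']
```

As noted above, this check only validates that the typed answer occurs in the snippet, which is why any word of the content can still be (wrongly) entered as a reference.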
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions we knew the New York Times had a correct answer to. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?
The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
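A minimal sketch of this pronoun-based filter (the pronoun list is the one above; the tokenization is an assumption of ours):

```python
import re

# Third-party pronouns that signal the question depends on a lost antecedent.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def is_self_contained(question):
    """Keep only questions with no dangling third-party reference."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return not any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who was his father?",
             "What is the capital of Somalia?"]
print([q for q in questions if is_self_contained(q)])
# ['What is the capital of Somalia?']
```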
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after fixing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also questioned whether the big problem was within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
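The paragraph-level idea can be sketched as follows, in Python rather than the Lucene/Java pipeline actually used (splitting paragraphs on blank lines is an assumption of ours):

```python
def paragraph_passages(document, keywords):
    """Return only the paragraphs containing at least one query keyword,
    instead of returning the whole document as a passage."""
    keywords = [k.lower() for k in keywords]
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p for p in paragraphs
            if any(k in p.lower() for k in keywords)]

doc = ("The conflict with Washington dominated the headlines.\n\n"
       "Baghdad, the capital of Iraq, saw further unrest.")
print(paragraph_passages(doc, ["capital", "Iraq"]))
# ['Baghdad, the capital of Iraq, saw further unrest.']
```

In the 'capital of Iraq' scenario above, this keeps the paragraph mentioning Baghdad and discards the one that only mentions Washington.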
3.3 Baseline

3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over the 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions  Correct  Wrong  No Answer
      200       88     87         25

Accuracy  Precision  Recall   MRR  Precision@1
     44%      50.3%   87.5%  0.37        35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3, we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.
                   Google – 8 passages         Google – 32 passages
Extracted answers  Questions  CAE   AS         Questions  CAE   AS
0                      21       0    –             19       0    –
1 to 10                73      62   53             16      11   11
11 to 20               32      29   22             13       9    7
21 to 40               11      10    8             61      57   39
41 to 60                1       1    1             30      25   17
61 to 100               0       –    –             18      14   11
> 100                   0       –    –              5       4    3
All                   138     102   84            162     120   88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4, we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. Table 4 also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

                   Google – 64 passages
Extracted answers  Questions  CAE   AS
0                      17       0    0
1 to 10                18       9    9
11 to 20                2       2    2
21 to 40               16      12    8
41 to 60               42      38   25
61 to 100              38      32   18
> 100                  35      30   17
All                   168     123   79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

         Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total    Q  CAE   AS          Q  CAE   AS             Q  CAE   AS
abb      4    3    1    1          3    1    1             4    1    1
des      9    6    4    4          6    4    4             6    4    4
ent     21   15    1    1         16    1    0            17    1    0
hum     64   46   42   34         55   48   36            56   48   31
loc     39   27   24   19         34   27   22            35   29   21
num     63   41   30   25         48   35   25            50   40   22
All    200  138  102   84        162  120   88           168  123   79

Table 4: Answer extraction intrinsic evaluation, for different amounts of retrieved passages, according to the question category.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions  232 questions  1210 questions
Precision           26.7%          28.2%           27.4%
Recall              77.2%          67.2%           76.9%
Accuracy            20.6%          19.0%           21.1%

Table 5: Baseline results for the three sets of questions created using the TREC questions and the New York Times corpus.
Later, we made further experiments with this corpus, by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
             20     30     50    100
Accuracy  21.1%  22.4%  23.7%  23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though the 100 passages offer the best results, they also take more than thrice the time of running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal, measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer, by a certain predefined threshold, to the keywords used in the search.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

– 'Queen Elizabeth' vs. 'Elizabeth II'

– 'Mt. Kilimanjaro' vs. 'Mount Kilimanjaro'

– 'River Glomma' vs. 'Glomma'

– 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

– 'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim is approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59).

With the new proposed metric, we detect the common token 'Kennedy', and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
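The proposed token-level matcher can be sketched as follows (the policy is the one argued above; the function name and the rejection rules are illustrative, not the implemented metric):

```python
def likely_same_entity(a, b):
    """Heuristic: the answers share at least one full token, and every
    remaining token pair agrees at least on its initial, in order."""
    ta, tb = a.replace(".", "").split(), b.replace(".", "").split()
    if len(ta) != len(tb):
        return False
    full_match = any(x.lower() == y.lower() for x, y in zip(ta, tb))
    initials_ok = all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))
    return full_match and initials_ok

print(likely_same_entity("J. F. Kennedy", "John Fitzgerald Kennedy"))  # True
print(likely_same_entity("J. F. Kennedy", "John Paul Jones"))          # False
```

In the second call there is no full-token match, so the pair is rejected, as argued above.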
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

– 'Mt. Kenya' vs. 'Mount Kenya National Park'

– 'Bavaria' vs. 'Oberbayern, Upper Bavaria'

– '2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4 Language issue ndash many entities share completely different names for thesame entity we can find the name in the original language of the entityrsquoslocation or in a normalized English form And even small names such aslsquoGlommarsquo and lsquoGlamarsquo remain in separate clusters
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
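Such a converter might look like the following sketch (Python for illustration; the conversion factors are standard, while the function name and the choice of canonical units are our assumptions):

```python
# Conversions named in the text: km -> miles, metres -> feet,
# kg -> pounds; Celsius -> Fahrenheit needs an offset as well.
FACTORS = {'km': ('miles', 0.621371),
           'm':  ('feet', 3.28084),
           'kg': ('pounds', 2.20462)}

def normalize_measure(value, unit):
    """Rewrite a numeric answer in a canonical unit, so that e.g.
    '8 km' and '4.97 miles' can end up in the same cluster."""
    if unit == 'celsius':                      # temperature has an offset
        return round(value * 9 / 5 + 32, 2), 'fahrenheit'
    if unit in FACTORS:
        target, factor = FACTORS[unit]
        return round(value * factor, 2), target
    return value, unit                          # already canonical
```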
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

– 'Kotashaan' vs. 'Khotashan'

– 'Grozny' vs. 'GROZNY'
Solution: the solution should simply be to normalize all answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
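The standard American Soundex algorithm mentioned above can be sketched as follows (illustrative Python; JustAsk would use the SimMetrics implementation instead):

```python
def soundex(word):
    """Classic American Soundex: first letter plus three digits; vowels
    are dropped, and 'h'/'w' do not separate equal consonant codes."""
    codes = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
             **dict.fromkeys('dt', '3'), 'l': '4',
             **dict.fromkeys('mn', '5'), 'r': '6'}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            result += code
        if ch not in 'hw':       # h/w keep the previous code alive
            prev = code
    return (result + '000')[:4]
```

Both misspellings above collapse to the same code ('Kotashaan' and 'Khotashan' are both K325), which is exactly the kind of match Levenshtein misses at this threshold.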
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend the abbreviation normalization to also cover acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that composes our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module

4.1.1 Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics drawn from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context surrounding an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the final answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and uses them for a (potentially) better clustering, by boosting the score of candidates sharing certain relations. This also addresses point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
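The scoring scheme of this first experiment can be sketched as follows (illustrative Python, not the actual Java Scorer interface; the answer strings and base scores are invented for the example):

```python
# Illustrative weights: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion (first experiment above).
WEIGHTS = {'frequency': 1.0, 'equivalence': 1.0,
           'includes': 0.5, 'included_by': 0.5}

def boost(candidates, relations):
    """candidates: {answer: base score}; relations: list of
    (answer_a, answer_b, relation_type) triples found by the rules.
    Every relation an answer takes part in raises its score."""
    scores = dict(candidates)
    for a, b, rel in relations:
        w = WEIGHTS.get(rel, 0.0)
        scores[a] = scores.get(a, 0.0) + w
        scores[b] = scores.get(b, 0.0) + w
    return scores

cands = {'Ferrara': 3.0, 'Italy': 3.0, 'Ferrara, Italy': 1.0}
rels = [('Ferrara', 'Ferrara, Italy', 'included_by'),
        ('Italy', 'Ferrara, Italy', 'included_by')]
scores = boost(cands, rels)
```

Here each inclusion adds 0.5, so 'Ferrara, Italy' rises from 1.0 to 2.0, illustrating how a more complete answer gains from the relations it participates in.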
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, metres to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of natural language.
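A minimal version of this title-stripping normalizer, with the exceptions discussed above (Python sketch; the exception list is ours and deliberately tiny):

```python
TITLES = ['Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir', 'Dr.']
# Whole answers that must survive untouched ('Queen' the band, 'Mr. T').
EXCEPTIONS = {'Queen', 'Mr. T'}

def strip_title(answer):
    """Remove a leading title so that 'Mr. Clinton' and 'Clinton'
    can share a cluster, without touching exception answers."""
    if answer in EXCEPTIONS:
        return answer
    for t in TITLES:
        if answer.startswith(t + ' '):   # only at the very beginning
            return answer[len(t) + 1:]
    return answer
```

Note that 'Mr. King' correctly loses only the leading 'Mr.', the behaviour suggested in the text.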
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it leaves room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
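The boost just described can be sketched like this (illustrative Python; the 20-character threshold and the 5-point bonus come from the text, while the function name and the character-offset approximation are ours):

```python
def proximity_boost(passage, answer, keywords, threshold=20, bonus=5.0):
    """Return the extra score for an answer whose average character
    distance to the query keywords in the passage is under threshold."""
    pos = passage.find(answer)
    if pos < 0:
        return 0.0
    dists = [abs(passage.find(kw) - pos)
             for kw in keywords if passage.find(kw) >= 0]
    if dists and sum(dists) / len(dists) < threshold:
        return bonus
    return 0.0
```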
4.3 Results
                   200 questions                1210 questions
Before features    P: 50.6  R: 87.0  A: 44.0    P: 28.0  R: 79.5  A: 22.2
After features     P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.
As with the 'answers corpus', in order to create this corpus of relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input, then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE – Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora on the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
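All the experiments below go through the same clustering step, so it is worth fixing the picture of single-link clustering with a distance threshold (a minimal Python sketch; the real clusterer operates on JustAsk answer objects rather than plain strings):

```python
def single_link(answers, distance, threshold):
    """Greedy single-link clustering: two clusters are merged whenever
    ANY pair of their members is closer than the threshold."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(x, y) < threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With a toy case-insensitive distance and threshold 0.5, ['Grozny', 'GROZNY', 'Paris'] yields two clusters; the experiments below only swap the distance function and the threshold.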
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics were added that had been used in a Natural Language project in 2010/2011 at Tagus Park. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the number of distinct elements across the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)
Dice's coefficient basically gives double weight to the intersecting elements, roughly doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient may be calculated over the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard (intersection over union, with seven distinct bigrams in the union), we would have 5/7 ≈ 0.71.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
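To make the behaviour of these metrics concrete, here is a small sketch of the bigram-based Dice coefficient and of Jaro/Jaro-Winkler (Python for illustration only; the system itself uses the SimMetrics Java implementations):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient over character bigrams (eq. 16)."""
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaro(s1, s2):
    """Jaro similarity (eq. 17)."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                     # find matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, j = 0, 0                                    # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler (eq. 18): bonus for a common prefix (max 4 chars)."""
    d = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return d + l * p * (1 - d)
```

For instance, dice('Kennedy', 'Keneddy') = 10/12, while jaro('MARTHA', 'MARHTA') ≈ 0.944 and jaro_winkler('MARTHA', 'MARHTA') ≈ 0.961, showing the prefix bonus of equation (18).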
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics; there are cases where it is only the second best, but the difference itself is virtually inexistent in most cases.
Metric \ Threshold   0.17                       0.33                       0.5
Levenshtein          P: 51.1  R: 87.0  A: 44.5  P: 43.7  R: 87.5  A: 38.5  P: 37.3  R: 84.5  A: 31.5
Overlap              P: 37.9  R: 87.0  A: 33.0  P: 33.0  R: 87.0  A: 33.0  P: 36.4  R: 86.5  A: 31.5
Jaccard index        P:  9.8  R: 56.0  A:  5.5  P:  9.9  R: 55.5  A:  5.5  P:  9.9  R: 55.5  A:  5.5
Dice coefficient     P:  9.8  R: 56.0  A:  5.5  P:  9.9  R: 56.0  A:  5.5  P:  9.9  R: 55.5  A:  5.5
Jaro                 P: 11.0  R: 63.5  A:  7.0  P: 11.9  R: 63.0  A:  7.5  P:  9.8  R: 56.0  A:  5.5
Jaro-Winkler         P: 11.0  R: 63.5  A:  7.0  P: 11.9  R: 63.0  A:  7.5  P:  9.8  R: 56.0  A:  5.5
Needleman-Wunsch     P: 43.7  R: 87.0  A: 38.0  P: 43.7  R: 87.0  A: 38.0  P: 16.9  R: 59.0  A: 10.0
LogSim               P: 43.7  R: 87.0  A: 38.0  P: 43.7  R: 87.0  A: 38.0  P: 42.0  R: 87.0  A: 36.5
Soundex              P: 35.9  R: 83.5  A: 30.0  P: 35.9  R: 83.5  A: 30.0  P: 28.7  R: 82.0  A: 23.5
Smith-Waterman       P: 17.3  R: 63.5  A: 11.0  P: 13.4  R: 59.5  A:  8.0  P: 10.4  R: 57.5  A:  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms, and for the matches on the first letter of each token, are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
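Under one possible reading of equations (20)-(22) – the exact handling of nonCommonTerms is not fully pinned down above, so the counting below is an assumption – the metric can be sketched as:

```python
def jfk_distance(a1, a2):
    """Sketch of the proposed token-level distance. Assumptions: tokens
    are lower-cased with dots stripped, and nonCommonTerms counts the
    unmatched tokens of both answers taken together."""
    t1 = [t.strip('.').lower() for t in a1.split() if t.strip('.')]
    t2 = [t.strip('.').lower() for t in a2.split() if t.strip('.')]
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))          # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    first_letter = sum(1 for x, y in zip(rest1, rest2) if x[0] == y[0])
    fl_score = first_letter / non_common / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_score - fl_score                          # eq. (20)
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 1/4 ≈ 0.42, low enough to relate the two answers where Overlap (2/3) and Levenshtein (≈ 0.5) fail.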
A few flaws of this measure were identified later on, namely the scope of the measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions                1210 questions
Before             P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
After new metric   P: 51.7  R: 87.0  A: 45.0    P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that also exist in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.
Metric \ Category   Human    Location   Numeric
Levenshtein         50.00    53.85      4.84
Overlap             42.42    41.03      4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch    42.42    46.15      4.84
LogSim              42.42    46.15      4.84
Soundex             42.42    46.15      3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance         54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking best metrics per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are error-prone, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references. After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art study on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results somewhat.
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
53
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) * sum_{i=1}^{N} 1/rank_i    (14)
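As a concrete illustration, the MRR computation above can be sketched in a few lines of Python (the data representation – a ranked answer list per query plus a set of correct references – is an assumption for illustration):

```python
def mean_reciprocal_rank(ranked_answers, references):
    """MRR over a test collection: for each query, add the multiplicative
    inverse of the rank (1-based) of the first correct answer (0 if no
    correct answer appears), then average over the N queries."""
    total = 0.0
    for answers, gold in zip(ranked_answers, references):
        for rank, answer in enumerate(answers, start=1):
            if answer in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)
```

For three queries whose first correct answers appear at ranks 1, 2 and nowhere, this yields (1 + 1/2 + 0) / 3 = 0.5.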
3.2.2 Multisix corpus
To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also contains the correct answer for each question, searched in the Los Angeles Times over 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:
1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).
An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
Figure 3: Execution of the application to create answers corpora in the command line.
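The reference-picking loop just described can be sketched roughly as follows (a minimal sketch: the data structures, prompt text and the injectable `read` function are assumptions for illustration; the actual application may differ):

```python
def collect_references(questions, snippets_per_question, read=input):
    """For each question, show each snippet's content and read a candidate
    reference from the command line; accept it only if it occurs verbatim
    in the snippet, mirroring the substring check described in the text."""
    references = []  # an 'array of arrays': one list of references per question
    for qid, question in enumerate(questions):
        refs = []
        for content in snippets_per_question.get(qid, []):
            candidate = read(f"{question}\n{content}\nreference (empty to skip): ").strip()
            if candidate and candidate in content:  # the substring check
                refs.append(candidate)
        references.append(refs)
    return references
```

Passing a custom `read` function makes the loop testable without an interactive terminal.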
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example – which could explain, in part, the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referent, when the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
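A filter of this kind is straightforward to sketch (the tokenization and the exact pronoun list handling are simplifications; the real filter also dropped a few hand-picked dead-end questions):

```python
import re

# Pronouns from the text that signal an unresolved third-party reference.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """True if the question contains one of the tell-tale pronouns."""
    return any(tok in PRONOUNS for tok in re.findall(r"[a-z']+", question.lower()))

def filter_questions(questions):
    """Keep only questions without unknown third-party references."""
    return [q for q in questions if not has_unresolved_reference(q)]
```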
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and by fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see over the next section.
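The paragraph-level idea can be illustrated in plain Python rather than Lucene (the paragraph splitting on blank lines and the keyword-overlap test are simplifying assumptions; Lucene's actual analysis chain is richer):

```python
def paragraph_passages(document, keywords):
    """Split a document into paragraphs and keep only those containing
    at least one query keyword, instead of returning the whole article
    as a single passage."""
    wanted = {k.lower() for k in keywords}
    passages = []
    for paragraph in document.split("\n\n"):
        words = set(paragraph.lower().split())
        if words & wanted:  # paragraph mentions at least one keyword
            passages.append(paragraph.strip())
    return passages
```

This avoids the 'Washington vs. Baghdad' failure mode above: paragraphs that never mention the query keywords are simply never returned as passages.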
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions: 200   Correct: 88   Wrong: 87   No Answer: 25
Accuracy: 44%   Precision: 50.3%   Recall: 87.5%   MRR: 0.37   Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
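The figures in Table 1 follow from the usual definitions (precision over attempted questions, recall as attempted over total, accuracy as correct over total); a quick sketch makes the arithmetic explicit:

```python
def qa_metrics(total, correct, wrong):
    """Precision over attempted questions, recall as attempted over total,
    accuracy as correct over total (matching the figures in Table 1)."""
    attempted = correct + wrong  # questions where some answer was returned
    return correct / attempted, attempted / total, correct / total

precision, recall, accuracy = qa_metrics(200, 88, 87)
# 88 / 175, 175 / 200 and 88 / 200, i.e. roughly 50.3%, 87.5% and 44%
```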
In Tables 2 and 3 we show the results of the Answer Extraction module, using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions per amount of extracted answers.
Extracted       Google – 8 passages       Google – 32 passages
answers      Questions   CAE    AS      Questions   CAE    AS
0                21        0     –          19        0     –
1–10             73       62    53          16       11    11
11–20            32       29    22          13        9     7
21–40            11       10     8          61       57    39
41–60             1        1     1          30       25    17
61–100            0        –     –          18       14    11
>100              0        –     –           5        4     3
All             138      102    84         162      120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
Extracted       Google – 64 passages
answers      Questions   CAE    AS
0                17        0     0
1–10             18        9     9
11–20             2        2     2
21–40            16       12     8
41–60            42       38    25
61–100           38       32    18
>100             35       30    17
All             168      123    79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

          Google – 8 passages   Google – 32 passages   Google – 64 passages
      Total    Q   CAE   AS        Q   CAE   AS           Q   CAE   AS
abb      4     3    1     1        3    1     1           4    1     1
des      9     6    4     4        6    4     4           6    4     4
ent     21    15    1     1       16    1     0          17    1     0
hum     64    46   42    34       55   48    36          56   48    31
loc     39    27   24    19       34   27    22          35   29    21
num     63    41   30    25       48   35    25          50   40    22
All    200   138  102    84      162  120    88         168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages      20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd "Popa Falls | Okavango River | Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'
'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer: LogSim comes out approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59).

With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
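A minimal sketch of the token-level heuristic just described (the tokenization and the exact matching rules are simplified assumptions; the real metric would need the normalization rules and a tunable threshold):

```python
def same_entity(a, b):
    """Token-level comparison: require the same number of tokens, at least
    one exact common token, and matching initials for the aligned tokens,
    in order (as in 'J. F. Kennedy' vs 'John Fitzgerald Kennedy')."""
    ta = [t.strip(".") for t in a.split()]
    tb = [t.strip(".") for t in b.split()]
    if len(ta) != len(tb):
        return False
    common = any(x.lower() == y.lower() for x, y in zip(ta, tb))
    initials = all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))
    return common and initials
```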
3. Places/dates/numbers that differ in specificity – some candidates over the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Oberbayern, Upper Bavaria'
'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
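As a rough illustration only, a token-subset test captures the simplest of these inclusion cases (this stands in for, and is much cruder than, the rules taken from [26]; it does not handle 'Mt' vs. 'Mount', for example):

```python
def includes(a, b):
    """True if answer `a` includes answer `b`, approximated here as b's
    tokens being a proper subset of a's tokens (e.g. 'April 3, 2009'
    includes '2009')."""
    ta = {t.strip(",.").lower() for t in a.split()}
    tb = {t.strip(",.").lower() for t in b.split()}
    return tb != ta and tb.issubset(ta)
```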
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even small name variations, such as 'Glomma' and 'Glama', remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
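Such a converter reduces each measure to a canonical unit per dimension, so that '1000 miles' and '1609 km' become directly comparable (the conversion factors are standard; the table of units shown here is a small illustrative subset):

```python
# Conversion factors to a canonical unit per dimension.
TO_CANONICAL = {
    "km": ("distance", 1.0),           # canonical distance unit: km
    "miles": ("distance", 1.609344),   # 1 mile = 1.609344 km
    "feet": ("distance", 0.0003048),   # 1 foot = 0.0003048 km
    "kg": ("weight", 1.0),             # canonical weight unit: kg
    "pounds": ("weight", 0.45359237),  # 1 pound = 0.45359237 kg
}

def normalize_measure(value, unit):
    """Convert a (value, unit) answer to its dimension's canonical unit."""
    dimension, factor = TO_CANONICAL[unit]
    return dimension, value * factor
```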
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'

Solution: for the formatting issue, the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
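A compact sketch of the classical Soundex code shows why it handles exactly this kind of misspelling – 'Kotashaan' and 'Khotashan' collapse to the same code:

```python
def soundex(name):
    """Classical Soundex: keep the first letter, map the remaining letters
    to digits, skip vowels, collapse adjacent equal codes ('h'/'w' do not
    break adjacency), and pad the result to 4 characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    last = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            last = code
    return (result + "000")[:4]
```

Since the code is computed on the lowercased string, it is also insensitive to the 'Grozny' vs. 'GROZNY' formatting issue.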
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' remaining unclustered but, more generally, in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over those of inclusion, alternative and, especially, contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
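The scoring scheme of the first experiment can be sketched as follows (the data representation and relation names are illustrative assumptions; in JustAsk itself, this logic lives behind the Scorer interface):

```python
# Weights from the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each side of an inclusion relation.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "includes": 0.5, "included": 0.5}

def boost_scores(answers, relations):
    """`relations` maps an answer to a list of (relation_type, other_answer)
    pairs; each detected relation adds its weight to the answer's score."""
    scores = {a: 0.0 for a in answers}
    for answer, rels in relations.items():
        for rel_type, _other in rels:
            scores[answer] += WEIGHTS[rel_type]
    return scores
```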
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created that deals with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, and is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
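A minimal sketch of such a conversion normalizer (the table, function name and rounding are illustrative; the real normalizer is a Java class of JustAsk):

```python
# Illustrative unit-conversion table for the Numeric:* normalizer sketch.
CONVERSIONS = {
    'km':      ('miles',      lambda v: v * 0.621371),
    'm':       ('feet',       lambda v: v * 3.28084),
    'kg':      ('pounds',     lambda v: v * 2.20462),
    'celsius': ('fahrenheit', lambda v: v * 9 / 5 + 32),
}

def normalize_unit(value, unit):
    """Convert a (value, unit) pair to its imperial/Fahrenheit counterpart."""
    target_unit, convert = CONVERSIONS[unit]
    return convert(value), target_unit
```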
Another normalizer was created to remove several common honorific abbreviations from the candidate answers, hoping that (with the help of new measures) clustering would improve too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we found a particular issue worth noting: we could be removing something we should not remove. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably miss: 'Mr. King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a telling reminder that we must never, under any circumstances, underestimate the richness of a natural language.
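The honorific-stripping normalizer, with the exception list suggested above, could be sketched as follows (the exception set and function name are illustrative, not the actual implementation):

```python
TITLES = ('Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir', 'Dr')
# Answers that must survive intact, e.g. 'Queen' (the band) or 'Mr. T'.
EXCEPTIONS = {'Mr. T', 'Queen'}

def strip_title(answer):
    """Remove a leading honorific unless the answer is a known exception."""
    if answer in EXCEPTIONS:
        return answer
    for title in TITLES:
        for prefix in (title + ' ', title + '. '):
            if answer.startswith(prefix):
                return answer[len(prefix):]
    return answer
```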
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it leaves room for further extensions.
4.2.2 Proximity Measure
To address problem 1, we implemented the solution of giving extra weight to answers that appear closer to the query keywords occurring in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
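A sketch of this proximity boost, assuming character offsets for the answer and the keywords within a passage (function name and signature are illustrative):

```python
def proximity_boost(answer_offset, keyword_offsets, score,
                    threshold=20, boost=5):
    """Boost the score when the answer lies, on average, within
    `threshold` characters of the query keywords in the passage."""
    avg_distance = (sum(abs(answer_offset - k) for k in keyword_offsets)
                    / len(keyword_offsets))
    return score + boost if avg_distance < threshold else score
```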
4.3 Results
                     200 questions                1210 questions
Before features      P: 50.6  R: 87.0  A: 44.0    P: 28.0  R: 79.5  A: 22.2
After features       P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the relations module described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.
As happened with the 'answers corpus', in order to create this corpus of relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' when in doubt about the relation type.
If the input fails to match any of these options, the script asks the user to repeat the judgment. If a question has only a single reference, the script simply advances to the next question. If the user wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file. Each line contains: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
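The core of the annotation loop can be sketched like this (the label strings follow the output format above; the function and error handling are illustrative):

```python
LABELS = {
    'E':  'EQUIVALENCE',
    'I1': 'INCLUSION',             # first answer includes the second
    'I2': 'INCLUSION',             # second answer includes the first
    'A':  'COMPLEMENTARY ANSWERS',
    'C':  'CONTRADICTORY ANSWERS',
    'D':  'IN DOUBT',
}

def judge(question_id, answer1, answer2, command):
    """Turn one user judgment into an output line; 'skip' skips the pair."""
    if command == 'skip':
        return None
    if command not in LABELS:
        raise ValueError('unknown option, repeat the judgment')
    return 'QUESTION %d %s - %s AND %s' % (
        question_id, LABELS[command], answer1, answer2)
```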
Figure 6: Execution of the application to create relations corpora in the command line.
In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to run new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. Several lexical metrics previously used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library, 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)
Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result is therefore (2 × 5)/(6 + 6) = 10/12 ≈ 0.83; if we were using Jaccard, we would have 5/7 ≈ 0.71.
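Both coefficients over bigram sets can be written in a few lines (an illustrative sketch; the thesis uses the SimMetrics Java implementations):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)
```

This sketch reproduces the 'Kennedy' vs. 'Keneddy' numbers worked out above.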
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences being compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it gives a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable cost G (instead of Levenshtein's constant 1):

D(i, j) = min( D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G )    (19)
The package contained other metrics of particular interest, and the following two measures were also added and tested:
– Soundex [36] – an algorithm focused on matching words that sound alike but differ lexically. With it we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure. This led us to believe it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus on which the system achieves its best results at this point, and we would expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only second best, even if the difference is virtually nonexistent in most of them.
Metric / Threshold    0.17                        0.33                        0.5
Levenshtein           P: 51.1  R: 87.0  A: 44.5   P: 43.7  R: 87.5  A: 38.5   P: 37.3  R: 84.5  A: 31.5
Overlap               P: 37.9  R: 87.0  A: 33.0   P: 33.0  R: 87.0  A: 33.0   P: 36.4  R: 86.5  A: 31.5
Jaccard index         P: 9.8   R: 56.0  A: 5.5    P: 9.9   R: 55.5  A: 5.5    P: 9.9   R: 55.5  A: 5.5
Dice Coefficient      P: 9.8   R: 56.0  A: 5.5    P: 9.9   R: 56.0  A: 5.5    P: 9.9   R: 55.5  A: 5.5
Jaro                  P: 11.0  R: 63.5  A: 7.0    P: 11.9  R: 63.0  A: 7.5    P: 9.8   R: 56.0  A: 5.5
Jaro-Winkler          P: 11.0  R: 63.5  A: 7.0    P: 11.9  R: 63.0  A: 7.5    P: 9.8   R: 56.0  A: 5.5
Needleman-Wunsch      P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 16.9  R: 59.0  A: 10.0
LogSim                P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 42.0  R: 87.0  A: 36.5
Soundex               P: 35.9  R: 83.5  A: 30.0   P: 35.9  R: 83.5  A: 30.0   P: 28.7  R: 82.0  A: 23.5
Smith-Waterman        P: 17.3  R: 63.5  A: 11.0   P: 13.4  R: 59.5  A: 8.0    P: 10.4  R: 57.5  A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric - 'JFKDistance'

5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance and therefore does not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these, so a new measure was created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight when, besides at least one perfect match, one answer has a token that shares its first letter with a token of the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein is close to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight is obviously not as big as the score given to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we halve what would be the value of a full match between two tokens for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
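Equations 20-22 can be sketched directly in Python (an illustrative rendition; the tokenization details of the real Java implementation may differ):

```python
def jfk_distance(a1, a2):
    """1 - commonTermsScore - firstLetterMatchesScore (Equations 20-22)."""
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # count pairs of unmatched tokens sharing their first letter
    used = [False] * len(rest2)
    first_letter_matches = 0
    for tok in rest1:
        for j, other in enumerate(rest2):
            if not used[j] and tok[0] == other[0]:
                used[j] = True
                first_letter_matches += 1
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this gives 1 − 1/3 − (2/4)/2 ≈ 0.42, low enough to cluster the pair, where Overlap gives 2/3.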
A few flaws of this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separating measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.
                     200 questions                1210 questions
Before               P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
After new metric     P: 51.7  R: 87.0  A: 45.0    P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As these results show, even with the redundancy of sharing features already present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features - Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric offering the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labeled as correct by the framework that are in fact correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), Table 10 shows the results for every metric over the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers: some questions do not have one fixed answer that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than the then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of the article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions: in some of them we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules that remove such variations from the final answers.
– Answers in formats different from expected: some questions imply multiple answers, but we only search for a single one. A good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in such cases?
– Difficulty tracing certain types of answer (category confusion): for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords; and in the case of films and other works, perhaps trace only candidates between quotation marks, when possible.
– Incomplete references: the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which may be missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly: the system now answers 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we might have hoped for, but we still managed to improve the system's precision, to as high as 91%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem does not lie exclusively with the Answers module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which could further improve these results.
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN pages 1870–4069, 2007.
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995
[35] Joana Ribeiro. Final report in question-answer systems. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2010.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J. Silva. QA+ML, Wikipedia & Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
Appendix
Appendix A - Rules for detecting relations between answers
"... city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ..." Rank=1 Structured=false
The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single reference (or multiple references). The only attribute that really matters to us here is 'Content': our snippet (or relevant passage).
An application was made to automate part of the process involved in adding references to a given question, creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one at a time, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.
Figure 2: How the software to create the 'answers corpus' works.
This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process,
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially constructing new corpora in the future.
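The core check the application performs can be sketched as follows (a minimal sketch; the function name and data layout are assumptions, not the actual implementation):

```python
def add_reference(snippet_content, typed, references):
    # accept the typed string only if it occurs verbatim in the snippet content
    if typed in snippet_content:
        references.append(typed)
        return True
    return False

refs = []
ok = add_reference("Mogadishu is the largest capital of Somalia ...", "Mogadishu", refs)
bad = add_reference("Mogadishu is the largest capital of Somalia ...", "Paris", refs)
```

As described above, any substring of the content passes the check, so the quality of the captured references still depends on the annotator.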
Figure 3: Execution of the application to create answers corpora in the command line.
3.2.3 TREC corpus
In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example – which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: how much can we improve something that has no clear references, when the question itself references the previous one, and so forth?
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
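This pronoun filter can be sketched in a few lines (a sketch; the pronoun list follows the one quoted above, while the function name is an assumption):

```python
import re

PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    # token-level matching, so that 'she' does not fire inside e.g. 'shellfish'
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who was his father?", "Where was the festival held?"]
kept = [q for q in questions if not has_unresolved_reference(q)]
```

Only the last question survives the filter; the other two depend on an antecedent that the filtering step has discarded.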
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it' in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
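These figures follow directly from the counts in Table 1, and the relations between them can be checked with a few lines (measure names as used in this section):

```python
total, attempted, correct = 200, 175, 88

recall = attempted / total        # fraction of questions attempted: 0.875
precision = correct / attempted   # correct among attempted: ~0.503
accuracy = correct / total        # correct among all questions: 0.44

precision_at_1 = 0.354            # from Table 1
top_rate = precision_at_1 / precision   # ~0.704: how often the correct answer is ranked first
```

This also makes explicit that Accuracy = Precision × Recall under these definitions.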
Questions: 200    Correct: 88    Wrong: 87    No Answer: 25
Accuracy: 44.0%   Precision: 50.3%   Recall: 87.5%   MRR: 0.37   Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module, using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.
Extracted        Google – 8 passages             Google – 32 passages
Answers       Questions  CAE      AS          Questions  CAE      AS
              (#)        success  success     (#)        success  success
0                21         0       –            19         0       –
1 to 10          73        62      53            16        11      11
11 to 20         32        29      22            13         9       7
21 to 40         11        10       8            61        57      39
41 to 60          1         1       1            30        25      17
61 to 100         0         –       –            18        14      11
> 100             0         –       –             5         4       3
All             138       102      84           162       120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted        Google – 64 passages
Answers       Questions (#)  CAE success  AS success
0                 17             0            0
1 to 10           18             9            9
11 to 20           2             2            2
21 to 40          16            12            8
41 to 60          42            38           25
61 to 100         38            32           18
> 100             35            30           17
All              168           123           79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.
                Google – 8 passages    Google – 32 passages    Google – 64 passages
       Total      Q    CAE    AS         Q    CAE    AS           Q    CAE    AS
abb      4        3     1      1         3     1      1           4     1      1
des      9        6     4      4         6     4      4           6     4      4
ent     21       15     1      1        16     1      0          17     1      0
hum     64       46    42     34        55    48     36          56    48     31
loc     39       27    24     19        34    27     22          35    29     21
num     63       41    30     25        48    35     25          50    40     22
All    200      138   102     84       162   120     88         168   123     79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times – which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
             1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages      20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, that run also takes more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'
'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
'River Glomma' vs 'Glomma'
'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words in each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
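The token-level test proposed here can be sketched as follows (a hypothetical helper, not JustAsk code; it assumes titles and honorifics were already normalized away):

```python
def likely_same_entity(a, b):
    ta, tb = a.replace(".", "").split(), b.replace(".", "").split()
    if len(ta) != len(tb):
        return False
    # require at least one exact common token (e.g. 'Kennedy')...
    if not {t.lower() for t in ta} & {t.lower() for t in tb}:
        return False
    # ...and matching initials for every aligned token pair
    return all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))
```

'J. F. Kennedy' and 'John Fitzgerald Kennedy' pass both conditions, while a pair with mismatched initials, as discussed above, is rejected.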
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern, Upper Bavaria'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
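A minimal sketch of such a converter (the conversion table and function name are assumptions for illustration; the real normalizer is described in Section 4.2.1):

```python
# target unit and multiplicative factor for each supported source unit
CONVERSIONS = {
    "km": ("miles", 0.621371),
    "m": ("feet", 3.28084),
    "kg": ("pounds", 2.20462),
}

def to_reference_unit(value, unit):
    # convert a (value, unit) pair to the chosen reference unit for its kind
    target, factor = CONVERSIONS[unit]
    return value * factor, target
```

Normalizing '300 km' to roughly 186 miles makes it directly comparable with an answer such as '1000 miles'.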
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such
as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend abbreviation normalization to include these acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existent metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After this process is automated and the answering module extended, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes the other, and one for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
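The scoring scheme of this first experiment can be summarized as follows (a sketch; the actual Scorer interface is part of the Java codebase, and the relation labels here are assumptions):

```python
# relation weights from the first experiment:
# inclusion gets 0.5, frequency and equivalence get 1.0
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "includes": 0.5, "included_by": 0.5}

def relation_score(relations):
    # `relations` holds (type, related_answer) pairs assigned to one candidate
    return sum(WEIGHTS[rtype] for rtype, _ in relations)

score = relation_score([("equivalence", "Elizabeth II"), ("includes", "Elizabeth")])
```

Testing the variant for the Entity category amounts to swapping the two weights in this table.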
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of note: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only handles very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
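This proximity heuristic can be sketched as below (a sketch; the function name and the use of first-occurrence character offsets are assumptions, while the thresholds follow the values quoted above):

```python
def proximity_boost(passage, answer, keywords, max_avg_dist=20, boost=5):
    # boost the answer's score when its average character distance
    # to the query keywords in the passage is below the threshold
    a = passage.find(answer)
    if a < 0:
        return 0
    dists = [abs(passage.find(k) - a) for k in keywords if k in passage]
    if dists and sum(dists) / len(dists) <= max_avg_dist:
        return boost
    return 0
```

In the Okavango example of Section 3.4, the passage whose keywords surround '1000 miles' earns the boost, while the '300 km' passage, lacking the query keywords nearby, does not.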
4.3 Results
                        200 questions                     1210 questions
Before features   P: 50.6%  R: 87.0%  A: 44.0%    P: 28.0%  R: 79.5%  A: 22.2%
After features    P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE – Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)
In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)
Dice's coefficient basically gives double weight to the intersecting elements, raising the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams of each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five shared bigrams out of seven distinct ones).
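The bigram computation can be reproduced in a few lines (a sketch; the SimMetrics implementations may differ in tokenization details):

```python
def bigrams(s):
    # set of overlapping two-character substrings
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```

Note that Dice always scores at least as high as Jaccard on the same pair, since it counts the shared elements twice against the plain sum of set sizes.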
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
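Formulas (17) and (18) can be sketched as follows (a standard textbook implementation for illustration; the SimMetrics versions may differ in edge-case handling):

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    # count matching characters within the allowed window
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: matched characters that are out of order, halved
    matched1 = [c for f, c in zip(m1, s1) if f]
    matched2 = [c for f, c in zip(m2, s2) if f]
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    l = 0  # common prefix length, capped at 4 as in Winkler's work
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)
```

On the classic 'MARTHA' vs. 'MARHTA' pair, m = 6 and t = 1, so Jaro gives 17/18 ≈ 0.944, and the three-character common prefix lifts Jaro-Winkler to about 0.961.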
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant 1):

D(i, j) = min{ D(i-1, j-1) + G, D(i-1, j) + G, D(i, j-1) + G }    (19)
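A sketch of the recurrence with a configurable cost G (here matches are taken as free, so G = 1 reduces to the plain Levenshtein distance; the thesis' exact cost scheme may differ):

```python
def needleman_wunsch(s1, s2, G=1):
    """Edit distance following equation (19), with gap/mismatch cost G.
    Matches cost 0, so G = 1 gives the Levenshtein distance."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * G
    for j in range(1, m + 1):
        D[0][j] = j * G
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution / match
                          D[i - 1][j] + G,        # gap in s2
                          D[i][j - 1] + G)        # gap in s1
    return D[n][m]

print(needleman_wunsch("kitten", "sitting"))       # → 3
print(needleman_wunsch("kitten", "sitting", G=2))  # → 6
```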
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – an algorithm focused on matching words that sound alike but differ lexically. With it, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
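Cutting a single-link clustering at a distance threshold is equivalent to taking the connected components of the graph that links every pair of answers within that threshold. A minimal sketch, using a normalized Levenshtein distance as the pairwise metric for illustration (names and details are ours, not JustAsk's code):

```python
import itertools

def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer string."""
    n, m = len(a), len(b)
    d = list(range(m + 1))
    for i in range(1, n + 1):
        prev, d[0] = d[0], i
        for j in range(1, m + 1):
            prev, d[j] = d[j], min(d[j] + 1,                   # deletion
                                   d[j - 1] + 1,               # insertion
                                   prev + (a[i-1] != b[j-1]))  # substitution
    return d[m] / max(n, m, 1)

def single_link_clusters(answers, dist, threshold):
    """Union-find over all pairs within `threshold` (single-link cut)."""
    parent = list(range(len(answers)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in itertools.combinations(range(len(answers)), 2):
        if dist(answers[i], answers[j]) <= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, a in enumerate(answers):
        groups.setdefault(find(i), []).append(a)
    return list(groups.values())

print(single_link_clusters(["Lisbon", "Lisboa", "Porto"],
                           normalized_levenshtein, 0.33))
# → [['Lisbon', 'Lisboa'], ['Porto']]
```

A lower threshold (0.17) only merges near-identical strings, while a higher one (0.5) merges far more aggressively, which is exactly the trade-off evaluated in Table 8.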
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we would expect proportional results between the two corpora either way.
5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold | 0.17                  | 0.33                  | 0.5
Levenshtein        | P 51.1 R 87.0 A 44.5  | P 43.7 R 87.5 A 38.5  | P 37.3 R 84.5 A 31.5
Overlap            | P 37.9 R 87.0 A 33.0  | P 33.0 R 87.0 A 33.0  | P 36.4 R 86.5 A 31.5
Jaccard index      | P 9.8  R 56.0 A 5.5   | P 9.9  R 55.5 A 5.5   | P 9.9  R 55.5 A 5.5
Dice Coefficient   | P 9.8  R 56.0 A 5.5   | P 9.9  R 56.0 A 5.5   | P 9.9  R 55.5 A 5.5
Jaro               | P 11.0 R 63.5 A 7.0   | P 11.9 R 63.0 A 7.5   | P 9.8  R 56.0 A 5.5
Jaro-Winkler       | P 11.0 R 63.5 A 7.0   | P 11.9 R 63.0 A 7.5   | P 9.8  R 56.0 A 5.5
Needleman-Wunsch   | P 43.7 R 87.0 A 38.0  | P 43.7 R 87.0 A 38.0  | P 16.9 R 59.0 A 10.0
LogSim             | P 43.7 R 87.0 A 38.0  | P 43.7 R 87.0 A 38.0  | P 42.0 R 87.0 A 36.5
Soundex            | P 35.9 R 83.5 A 30.0  | P 35.9 R 83.5 A 30.0  | P 28.7 R 82.0 A 23.5
Smith-Waterman     | P 17.3 R 63.5 A 11.0  | P 13.4 R 59.5 A 8.0   | P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are in bold in the original.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it adds an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have the remaining tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches on the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
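A minimal sketch of this metric following equations (20)-(22), under stated assumptions (answers are pre-normalized strings, tokens are whitespace-separated, matching is case-insensitive); JustAsk's actual implementation may differ in its tokenization and matching rules:

```python
def jfk_distance(a1, a2):
    """Sketch of the JFKDistance of equations (20)-(22)."""
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # non-common tokens whose first letter matches a token on the other side
    first_letters = {t[0] for t in rest2}
    matches = sum(1 for t in rest1 if t[0] in first_letters)
    non_common = len(rest1) + len(rest2)
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - first_letter_score

print(round(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"), 3))
# → 0.417
```

The first-letter bonus pulls the pair below the distance Levenshtein would give (around 0.5), which is what allows the two variants to fall into the same cluster.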
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.
                 | 200 questions         | 1210 questions
Before           | P 51.1 R 87.0 A 44.5  | P 28.2 R 79.8 A 22.5
After new metric | P 51.7 R 87.0 A 45.0  | P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (%) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results

We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric that offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
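This per-category dispatch could be sketched as follows (the metric functions are illustrative stand-ins, not JustAsk's actual implementations):

```python
def levenshtein(a, b):
    """Stand-in for the real Levenshtein-based metric."""
    raise NotImplementedError

def logsim(a, b):
    """Stand-in for the real LogSim metric."""
    raise NotImplementedError

# hypothetical mapping from question category to clustering metric
METRIC_BY_CATEGORY = {
    "Human": levenshtein,
    "Location": logsim,
}

def metric_for(category):
    # fall back to Levenshtein, the best-performing metric overall
    return METRIC_BY_CATEGORY.get(category, levenshtein)

print(metric_for("Location").__name__)  # → logsim
print(metric_for("Numeric").__name__)   # → levenshtein
```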
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category | Human  | Location | Numeric
Levenshtein       | 50.00  | 53.85    | 4.84
Overlap           | 42.42  | 41.03    | 4.84
Jaccard index     |  7.58  |  0.00    | 0.00
Dice              |  7.58  |  0.00    | 0.00
Jaro              |  9.09  |  0.00    | 0.00
Jaro-Winkler      |  9.09  |  0.00    | 0.00
Needleman-Wunsch  | 42.42  | 46.15    | 4.84
LogSim            | 42.42  | 46.15    | 4.84
Soundex           | 42.42  | 46.15    | 3.23
Smith-Waterman    |  9.09  |  2.56    | 0.00
JFKDistance       | 54.55  | 53.85    | 4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best result per category is in bold in the original, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, but one that changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – for some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected concerns the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we had hoped for, but we still managed to improve the system's precision to 55.2%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: a formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918) and 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between rules
while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answer corpora in the command line.
3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions. We filtered these to extract the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we knew the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing the original reference after this filtering step) and therefore lose precious entity information, for example by using pronouns, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
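This filtering step can be sketched as follows (a hypothetical helper; the actual filter used in the thesis may differ in its tokenization and pronoun list):

```python
import re

# third-party references that signal a question depends on an earlier one
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def is_self_contained(question):
    tokens = re.findall(r"[a-z']+", question.lower())
    return not any(t in PRONOUNS for t in tokens)

questions = ["When did she die?", "Who was his father?",
             "Where was the festival held?"]
print([q for q in questions if is_self_contained(q)])
# → ['Where was the festival held?']
```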
But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each question from the TREC corpus carried topic information that was often not explicit in the question itself, the best possible solution was to somehow pass this information to the query in the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) as well as possible, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.
New York Times corpus
20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph or phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped, as we will see over the next section.
3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over 200 questions, JustAsk attempted to answer 175 questions, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions | Correct | Wrong | No Answer
200       | 88      | 87    | 25

Accuracy | Precision | Recall | MRR  | Precision@1
44.0%    | 50.3%     | 87.5%  | 0.37 | 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
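The figures in Table 1 are mutually consistent under the usual definitions (precision = correct/answered, recall = answered/total, accuracy = correct/total); a quick check, with variable names of our own:

```python
total, correct, wrong, no_answer = 200, 88, 87, 25
answered = correct + wrong            # 175 questions attempted

precision = correct / answered        # fraction of attempted answers that are right
recall = answered / total             # fraction of questions attempted
accuracy = correct / total            # fraction of all questions answered right

print(round(precision, 3), recall, accuracy)  # → 0.503 0.875 0.44
```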
In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one final answer is extracted from the list of candidate answers. The tables also show the distribution of the number of questions by the amount of extracted answers.
          | Google – 8 passages      | Google – 32 passages
Extracted | Questions  CAE     AS    | Questions  CAE     AS
Answers   | (#)        success succ. | (#)        success succ.
0         | 21         0       –     | 19         0       –
1 to 10   | 73         62      53    | 16         11      11
11 to 20  | 32         29      22    | 13         9       7
21 to 40  | 11         10      8     | 61         57      39
41 to 60  | 1          1       1     | 30         25      17
61 to 100 | 0          –       –     | 18         14      11
> 100     | 0          –       –     | 5          4       3
All       | 138        102     84    | 162        120     88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 extracted answers. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no matter how many passages are used.

          | Google – 64 passages
Extracted | Questions  CAE     AS
Answers   | (#)        success success
0         | 17         0       0
1 to 10   | 18         9       9
11 to 20  | 2          2       2
21 to 40  | 16         12      8
41 to 60  | 42         38      25
61 to 100 | 38         32      18
> 100     | 35         30      17
All       | 168        123     79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

     |       | Google – 8 passages | Google – 32 passages | Google – 64 passages
     | Total | Q    CAE   AS       | Q    CAE   AS        | Q    CAE   AS
abb  | 4     | 3    1     1        | 3    1     1         | 4    1     1
des  | 9     | 6    4     4        | 6    4     4         | 6    4     4
ent  | 21    | 15   1     1        | 16   1     0         | 17   1     0
hum  | 64    | 46   42    34       | 55   48    36        | 56   48    31
loc  | 39    | 27   24    19       | 34   27    22        | 35   29    21
num  | 63    | 41   30    25       | 48   35    25        | 50   40    22
All  | 200   | 138  102   84       | 162  120   88        | 168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that the Entity category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. Table 4 also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers) over the previous approach of using whole documents as passages. This paragraph-indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
             1134 questions   232 questions   1210 questions
Precision    26.7%            28.2%           27.4%
Recall       77.2%            67.2%           76.9%
Accuracy     20.6%            19.0%           21.1%
Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.
Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long to run as 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these settings is viable, with 30 passages offering the best middle ground.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates (failing to cluster equal or inclusive answers into the same cluster), but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls | Okavango River | Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer lies within a certain predefined distance threshold of the keywords used in the search.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'
Solution: we might write a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one perfect match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer; as it is, LogSim is approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to those described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern Upper Bavaria'

'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) on the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, a few remain unresolved, such as:
'Kotashaan' vs 'Khotashan'

'Grozny' vs 'GROZNY'
Solution: normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend abbreviation normalization to cover acronyms as well.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want our system to do: detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers for a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module contains the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can study how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalence for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This also addresses point 3 of Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and another for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we tested the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
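The scoring step above can be sketched as follows. This is a simplified Python sketch of the Java Scorer module (function and variable names are illustrative, not JustAsk's actual identifiers), using the weights stated above: 1.0 for frequency and equivalence, 0.5 for each inclusion.

```python
# Hypothetical sketch of relation-based score boosting; the real module is a
# Java interface (Scorer) with Baseline and Rules components (see Figure 5).

WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boost_scores(candidates, relations):
    """candidates: dict mapping answer -> base score;
    relations: list of (answer_a, answer_b, relation_type) triples."""
    scores = dict(candidates)
    for a, b, rel in relations:
        w = WEIGHTS.get(rel, 0.0)
        # both members of the pair get the boost for the shared relation
        scores[a] = scores.get(a, 0.0) + w
        scores[b] = scores.get(b, 0.0) + w
    return scores

# e.g. 'Ferrara' is included in 'Italy' (point 3 of Section 3.4)
candidates = {"Ferrara": 3.0, "Italy": 3.0, "Rome": 1.0}
relations = [("Ferrara", "Italy", "inclusion")]
boosted = boost_scores(candidates, relations)
```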
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: a Baseline Scorer to count the frequency, and a Rules Scorer to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created that deals with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, and is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it leaves room for further extensions.
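The three normalizers can be sketched as below. This is a simplified, hypothetical Python rendering (the real normalizers are Java classes inside JustAsk); the conversion factor, title list and acronym table are illustrative rather than exhaustive.

```python
# Hedged sketch of the conversion, abbreviation and acronym normalizers.
import re

KM_TO_MILES = 0.621371
TITLES = {"Mr", "Mr.", "Mrs", "Mrs.", "Ms", "Ms.",
          "King", "Queen", "Dame", "Sir", "Dr", "Dr."}
ACRONYMS = {"US": "United States", "U.S.": "United States",
            "UN": "United Nations"}

def normalize_distance(answer):
    """Convert 'N km' into miles, e.g. '300 km' -> '186.4 miles'."""
    m = re.match(r"([\d.]+)\s*km\b", answer)
    if m:
        return "%.1f miles" % (float(m.group(1)) * KM_TO_MILES)
    return answer

def strip_title(answer):
    """Remove a leading title, but only when something else remains,
    so that 'Queen' (the band) is left untouched."""
    tokens = answer.split()
    if len(tokens) > 1 and tokens[0] in TITLES:
        return " ".join(tokens[1:])
    return answer

def normalize_case(answer):
    """Treat 'Grozny' and 'GROZNY' as the same answer."""
    return answer.lower()

def expand_acronym(answer):
    return ACRONYMS.get(answer, answer)
```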
4.2.2 Proximity Measure
As for problem 1, we implemented the solution of giving extra weight to answers that appear closer to the query keywords in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
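A minimal sketch of this proximity boost, assuming character offsets within a single passage (function names are hypothetical; the threshold and bonus are the values stated above):

```python
# Hedged sketch of the proximity measure: boost an answer's score when its
# average character distance to the query keywords is under the threshold.
PROXIMITY_THRESHOLD = 20  # characters
BONUS = 5.0

def proximity_boost(passage, answer, keywords, score):
    pos = passage.find(answer)
    if pos < 0:
        return score
    distances = []
    for kw in keywords:
        k = passage.find(kw)
        if k >= 0:
            distances.append(abs(k - pos))
    if distances and sum(distances) / len(distances) < PROXIMITY_THRESHOLD:
        return score + BONUS
    return score

passage = "The Okavango River flows over 1000 miles"
new_score = proximity_boost(passage, "1000 miles", ["River", "flows"], 1.0)
```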
4.3 Results
                  200 questions             1210 questions
Before features   P 50.6  R 87.0  A 44.0    P 28.0  R 79.5  A 22.2
After features    P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results (%) before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the input fails to match any of these options, the script asks the user to repeat the judgment. If a question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file as well. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE – Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora on the command line.
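The annotation loop can be sketched as follows. This is a hypothetical Python rendering of the application just described (the option codes and output format follow the text; file handling is simplified and the prompt wording is invented):

```python
# Hedged sketch of the relations-corpus annotation tool.
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(question_id, references, ask=input):
    """Iterate over reference pairs, asking the user for a relation label.
    A question with a single reference yields no pairs and is skipped."""
    lines = []
    pairs = [(a, b) for i, a in enumerate(references)
             for b in references[i + 1:]]
    for a, b in pairs:
        cmd = ask("%s vs %s> " % (a, b)).strip()
        while cmd not in LABELS and cmd != "skip":
            cmd = ask("please repeat> ").strip()  # unrecognized input
        if cmd == "skip":
            continue
        lines.append("QUESTION %d: %s - %s AND %s"
                     % (question_id, LABELS[cmd], a, b))
    return lines
```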
In the next chapter, we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|                (15)
In other words, the idea is to divide the number of elements (e.g., character bigrams) the two sequences have in common by the total number of distinct elements across the two sequences.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:
Dice(A, B) = 2 |A ∩ B| / (|A| + |B|)                (16)
Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of the seven in the union).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:
Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)                (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:
dw = dj + l · p · (1 − dj)                (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it gives a much better score to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta', than Jaro does.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min( D(i−1, j−1) + d(ai, bj),  D(i−1, j) + G,  D(i, j−1) + G )                (19)

where d(ai, bj) is the substitution cost for the characters at positions i and j (0 in the case of a match).
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words that sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
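To make the Soundex behaviour concrete, here is a minimal sketch of the simplified variant (it does not implement the special rules for 'h'/'w' separators in the full specification), which is enough for the examples discussed in this section:

```python
# Simplified Soundex sketch: keep the first letter, map consonants to digits,
# collapse adjacent identical codes, drop vowels, pad to four characters.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = [CODES.get(ch, "") for ch in word]
    out = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:  # skip vowels and adjacent duplicate codes
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]
```

Under this coding, the misspelled pair from point 6 of Section 3.4 and the 'Glomma'/'Glama' pair from point 4 each collapse to a single code.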
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                    0.33                    0.5
Levenshtein           P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index         P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice coefficient      P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                  P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler          P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while additionally giving an extra weight to strings that not only share at least one perfect match but also have another token sharing the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way
distance = 1 − commonTermsScore − firstLetterMatchesScore                (20)
in which the scores for the common terms and matches of the first letter ofeach token are given by
commonTermsScore = commonTerms / max(length(a1), length(a2))                (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2                (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
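Equations (20)-(22) can be sketched as below. This is a hypothetical Python rendering (the real metric is Java); in particular, pairing the unmatched tokens in order when counting first-letter matches is one plausible reading of the text, and the names are illustrative.

```python
# Hedged sketch of the JFKDistance of equations (20)-(22).

def jfk_distance(a1, a2):
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_terms_score = len(common) / max(len(t1), len(t2))   # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # unmatched tokens compared in order; a first-letter match gets half weight
    first_letter_matches = sum(
        1 for x, y in zip(rest1, rest2) if x[0].upper() == y[0].upper())
    first_letter_score = ((first_letter_matches / non_common) / 2
                          if non_common else 0.0)              # eq. (22)
    return 1.0 - common_terms_score - first_letter_score       # eq. (20)
```

For 'J F Kennedy' vs 'John Fitzgerald Kennedy', the sketch yields a distance well under Levenshtein's, allowing the pair to fall inside the clustering threshold.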
A few flaws in this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions             1210 questions
Before             P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric   P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results (%) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that also exist in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
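The decision-tree idea above amounts to a simple dispatch table, sketched below (the mapping is illustrative only; as the analysis in this section shows, a single metric ended up sufficing):

```python
# Hypothetical per-category metric dispatch for the clustering stage.
METRIC_BY_CATEGORY = {
    "Human": "Levenshtein",
    "Location": "LogSim",
}

def pick_metric(category, default="Levenshtein"):
    return METRIC_BY_CATEGORY.get(category, default)
```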
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are
looking for), Table 10 shows the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84
Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy bolded, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses (as in the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single answer that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
After the insertion of gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art study on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
3.2.3 TREC corpus
In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered to extract the questions we knew the New York Times had a correct answer to. 1134 questions remained after this filtering stage.
TREC questions
The TREC data is composed of a set of questions, and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore losing precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'
The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features was collected and then tested.
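This filtering step can be sketched in a few lines. The pronoun list is the one given in the text; the tokenization is a simplification of ours.

```python
# Minimal sketch of the pronoun-based question filter described above.
PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    tokens = question.lower().replace("?", "").split()
    return any(t in PRONOUNS for t in tokens)

questions = ["When did she die?",
             "Who was his father?",
             "Who is John J. Famalaro accused of having killed?"]
kept = [q for q in questions if not has_unresolved_reference(q)]
# kept -> only the Famalaro question survives
```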
But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:
question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?
In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.
New York Times corpus
The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.
We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
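The change from document-level to paragraph-level passages can be illustrated independently of Lucene with a toy inverted index. Everything here is a simplification of ours (Lucene handles tokenization, scoring and storage in the real system).

```python
from collections import defaultdict

# Toy illustration of the unit-of-indexing change discussed above:
# indexing paragraphs instead of whole documents.
def index_paragraphs(documents):
    index, paragraphs = defaultdict(set), []
    for doc in documents:
        for para in doc.split("\n\n"):        # paragraph, not whole document
            pid = len(paragraphs)
            paragraphs.append(para)
            for word in para.lower().replace(".", "").split():
                index[word].add(pid)
    return index, paragraphs

def retrieve(index, paragraphs, keywords):
    hits = set.intersection(*(index.get(k.lower(), set()) for k in keywords))
    return [paragraphs[i] for i in sorted(hits)]

doc = ("Baghdad is the capital of Iraq.\n\n"
       "Washington pressured Iraq over the conflict.")
index, paras = index_paragraphs([doc])
retrieve(index, paras, ["capital", "Iraq"])
# -> only the paragraph actually containing both keywords
```

With document-level indexing, the 'Washington' paragraph would be dragged in alongside the relevant one; paragraph-level indexing keeps the noisy context out of the candidate-extraction stage.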
3.3 Baseline

3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over 200 questions, JustAsk attempted to answer 175 of them, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44%       50.3%      87.5%   0.37  35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions.
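The figures in Table 1 can be reproduced from the raw counts. The definitions below are inferred from the numbers themselves (precision over attempted questions, recall as coverage), not quoted from the thesis.

```python
# Reproducing Table 1 from the raw counts. Definitions inferred from the
# reported numbers: precision = correct/answered, recall = answered/total,
# accuracy = correct/total.
total, correct, wrong = 200, 88, 87
answered = correct + wrong        # 175 questions attempted

accuracy  = correct / total       # 0.44  -> 44%
precision = correct / answered    # ~0.503 -> 50.3%
recall    = answered / total      # 0.875 -> 87.5%
top_rate  = 0.354 / precision     # Precision@1 / Precision ~ 0.704
```

The last line shows where the "approximately 70.4%" in the text comes from: the ratio between Precision@1 and Precision.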
In Tables 2 and 3 we show the results of the Answer Extraction module, using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. They also present a distribution of the number of questions by the amount of extracted answers.
Extracted    Google – 8 passages       Google – 32 passages
Answers      Questions  CAE      AS        Questions  CAE      AS
             (#)        success  success   (#)        success  success
0            21         0        –         19         0        –
1 to 10      73         62       53        16         11       11
11 to 20     32         29       22        13         9        7
21 to 40     11         10       8         61         57       39
41 to 60     1          1        1         30         25       17
61 to 100    0          –        –         18         14       11
> 100        0          –        –         5          4        3
All          138        102      84        162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted    Google – 64 passages
Answers      Questions  CAE      AS
             (#)        success  success
0            17         0        0
1 to 10      18         9        9
11 to 20     2          2        2
21 to 40     16         12       8
41 to 60     42         38       25
61 to 100    38         32       18
> 100        35         30       17
All          168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
         Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total   Q    CAE   AS      Q    CAE   AS          Q    CAE   AS
abb  4       3    1     1       3    1     1           4    1     1
des  9       6    4     4       6    4     4           6    4     4
ent  21      15   1     1       16   1     0           17   1     0
hum  64      46   42    34      55   48    36          56   48    31
loc  39      27   24    19      34   27    22          35   29    21
num  63      41   30    25      48   35    25          50   40    22
     200     138  102   84      162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation, for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
           1134 questions  232 questions  1210 questions
Precision  26.7%           28.2%          27.4%
Recall     77.2%           67.2%          76.9%
Accuracy   20.6%           19.0%          21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages   20     30     50     100
Accuracy   21.1%  22.4%  23.7%  23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'
– 'Namibia has built a water canal, measuring about 300 km long.' (Siyabona Africa Travel (Pty) Ltd, 'Popa Falls – Okavango River – Botswana' webpage)
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
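A proximity boost of this kind could be sketched as below. The window size and boost value are made-up, untuned constants, and the token-position heuristic is a simplification of ours.

```python
# Sketch of the proposed fix: boost a candidate whose first token appears
# within a fixed window of a search keyword in its supporting passage.
# The window and boost values are illustrative, untuned constants.
def proximity_boost(passage, answer, keywords, window=10, boost=0.5):
    tokens = passage.lower().split()
    first = answer.lower().split()[0]
    if first not in tokens:
        return 0.0
    a_pos = tokens.index(first)
    for kw in (k.lower() for k in keywords):
        if kw in tokens and abs(tokens.index(kw) - a_pos) <= window:
            return boost
    return 0.0

passage = ("The Okavango River rises in the Angolan highlands "
           "and flows over 1000 miles")
proximity_boost(passage, "1000 miles", ["Okavango", "river"])
# -> 0.5 (a keyword falls within the window of '1000')
```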
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link (as we have here, 'Kennedy') out of three words for each answer; as it stands, LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy', and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
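The proposed token-level check can be sketched as follows. Positional alignment via `zip` is a simplifying assumption of this sketch (the real metric would need to handle reordered tokens).

```python
# Sketch of the token-level metric proposed above: accept a pair of names
# when they have the same number of tokens, share at least one full token,
# and agree on the initials of the aligned tokens.
def names_match(a, b):
    ta, tb = a.replace(".", "").split(), b.replace(".", "").split()
    if len(ta) != len(tb):
        return False
    shared = any(x.lower() == y.lower() for x, y in zip(ta, tb))
    initials = all(x[0].lower() == y[0].lower() for x, y in zip(ta, tb))
    return shared and initials

names_match("J. F. Kennedy", "John Fitzgerald Kennedy")  # -> True
names_match("Calvin Broadus", "Carl Bowie")              # -> False (initials match, no shared token)
```

The second call illustrates the rejection rule discussed above: matching initials alone, without a shared full token, are not enough.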
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw a limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
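A standard Soundex coding would indeed merge the misspelled pair above. This is a simplified sketch (it skips 'h' and 'w' without resetting the previous code, as in the usual formulation), not JustAsk's eventual implementation.

```python
# Simplified standard Soundex, to illustrate how it clusters the
# misspelled pair discussed above.
def soundex(name):
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    result, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h/w do not reset the previous code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

soundex("Kotashaan"), soundex("Khotashan")  # -> ('K325', 'K325')
```

Case-folding in the first line also handles the 'Grozny' vs. 'GROZNY' formatting case for free.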
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include these acronyms.
Over this section, we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
Over this section, we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: equivalence, inclusion, alternative and contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, the other for the answer that is included by the other).
For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
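The weighting scheme above can be sketched as follows. This is a minimal illustration, not JustAsk's actual Java code: the function name, the relation representation and the weight table are assumptions mirroring the first experiment described above (inclusion 0.5, frequency and equivalence 1.0).

```python
# Hypothetical sketch of the relation-based boosting described above.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boosted_score(base_score, relations):
    """relations: list of (relation_type, count) pairs attached to an answer."""
    return base_score + sum(WEIGHTS[rtype] * count for rtype, count in relations)
```

An answer with two equivalences and one inclusion would thus gain 2.5 points over its base score.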
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
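The four conversions named above can be sketched with standard factors; the function names are illustrative, not the actual class/method names used in JustAsk.

```python
# Sketch of the conversion normalizer's four conversions (standard factors).
def km_to_miles(km):
    return km * 0.621371

def meters_to_feet(m):
    return m * 3.28084

def kg_to_pounds(kg):
    return kg * 2.20462

def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0
```

With such helpers, '100 degrees Celsius' and '212 degrees Fahrenheit' can be normalized to the same value before clustering.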
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
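The title-stripping rule with an exception list, as discussed above, can be sketched as follows. The exact title spellings and exception entries are assumptions for illustration; only titles at the beginning of the answer are removed.

```python
# Sketch of the abbreviation normalizer with exceptions for answers such as
# 'Mr. T' or 'Queen' (the band); the actual lists in JustAsk may differ.
TITLES = ["Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr."]
EXCEPTIONS = {"Mr. T", "Queen"}

def strip_title(answer):
    if answer in EXCEPTIONS:
        return answer
    for title in TITLES:
        # only remove the title when it starts the answer
        if answer.startswith(title + " "):
            return answer[len(title) + 1:]
    return answer
```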
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, allowing room for further extensions.
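A sketch of this acronym expansion follows; the mapping shown is the small illustrative sample from the text, and the category labels are assumptions about JustAsk's taxonomy.

```python
# Sketch of the acronym normalizer: expand only for categories where an
# acronym and its expansion denote the same entity.
ACRONYMS = {"US": "United States", "UN": "United Nations"}
EXPANDABLE_CATEGORIES = {"Location:Country", "Human:Group"}

def expand_acronym(answer, category):
    if category in EXPANDABLE_CATEGORIES:
        return ACRONYMS.get(answer, answer)
    return answer
```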
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
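The proximity boost just described can be sketched as below, assuming character positions from a simple substring search; how JustAsk locates answers and keywords in a passage may differ.

```python
# Sketch of the proximity boost: if the average character distance between
# the answer and the query keywords in a passage is under 20, add 5 points.
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    answer_pos = passage.find(answer)
    if answer_pos < 0:
        return 0
    distances = [abs(passage.find(kw) - answer_pos)
                 for kw in keywords if kw in passage]
    if distances and sum(distances) / len(distances) < threshold:
        return boost
    return 0
```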
4.3 Results
                    200 questions            1210 questions
Before features     P 50.6  R 87.0  A 44.0   P 28.0  R 79.5  A 22.2
After features      P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create the relations corpus in the command line.
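The core of the annotation script (mapping judgments to labels and formatting the output line) can be sketched as below; the real application reads judgments interactively, so the helper shown here and its exact output spacing are assumptions.

```python
# Minimal sketch of the annotation tool's judgment mapping and output format.
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def output_line(question_id, judgment, answer1, answer2):
    if judgment not in LABELS:
        # the interactive script would ask the user to repeat the judgment
        raise ValueError("unknown judgment: %r" % judgment)
    return "QUESTION %d %s - %s AND %s" % (
        question_id, LABELS[judgment], answer1, answer2)
```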
Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea will be to divide the number of characters in common by the sum of the total number of characters of the two sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)

Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five common bigrams out of seven in the union).
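The bigram arithmetic above is easy to verify with a few lines of code; this is a generic sketch of set-of-bigrams Dice and Jaccard, not the SimMetrics implementation used in JustAsk.

```python
# Bigram-based Dice and Jaccard similarity, as in the worked example above.
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```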
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.
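Both metrics can be sketched compactly under the definitions above; this follows the usual formulation (matching window of max(|s1|, |s2|)/2 − 1, transpositions counted as half the matched characters out of order, prefix capped at 4), which the thesis text does not spell out.

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(n1, n2) // 2 - 1
    matched1, matched2 = [False] * n1, [False] * n2
    m = 0
    for i in range(n1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: matched characters appearing in a different order
    t, k = 0, 0
    for i in range(n1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / n1 + m / n2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    dj = jaro(s1, s2)
    l = 0  # length of the common prefix, capped at 4
    while l < min(len(s1), len(s2), 4) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)
```

For the classic pair 'MARTHA'/'MARHTA', Jaro gives about 0.944 and Jaro-Winkler about 0.961, illustrating the prefix bonus.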
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G } (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1  R 87.0  A 44.5   P 43.7  R 87.5  A 38.5   P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0   P 33.0  R 87.0  A 33.0   P 36.4  R 86.5  A 31.5
Jaccard index        P  9.8  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5   P  9.9  R 55.5  A  5.5
Dice Coefficient     P  9.8  R 56.0  A  5.5   P  9.9  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5
Jaro                 P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Jaro-Winkler         P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Needleman-Wunch      P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0   P 35.9  R 83.5  A 30.0   P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0   P 13.4  R 59.5  A  8.0   P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adding an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
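Equations (20)-(22) can be sketched as follows. The tokenization and the greedy pairing of non-common terms by first letter (each pair counted once in firstLetterMatches) are assumptions where the text leaves details open; this is not JustAsk's actual implementation.

```python
import re

def _tokens(s):
    # assumed tokenization: lowercase, split on non-alphanumeric characters
    return [t for t in re.split(r"[^a-z0-9]+", s.lower()) if t]

def jfk_distance(a1, a2):
    t1, t2 = _tokens(a1), _tokens(a2)
    common = set(t1) & set(t2)
    common_terms_score = len(common) / max(len(t1), len(t2))    # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # pair leftover tokens sharing a first letter ('J' with 'John', ...)
    matches, used = 0, [False] * len(rest2)
    for t in rest1:
        for j, u in enumerate(rest2):
            if not used[j] and t[0] == u[0]:
                used[j] = True
                matches += 1
                break
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1 - common_terms_score - first_letter_score           # eq. (20)
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 0.25 ≈ 0.417, low enough to cluster the two answers under the thresholds discussed earlier.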
A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions            1210 questions
Before              P 51.1  R 87.0  A 44.5   P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0   P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
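Such a dispatch would amount to little more than a lookup table; a minimal sketch follows, with a hypothetical mapping (as discussed below, no per-category winner actually emerged, so this design was not kept).

```python
# Hypothetical per-category metric dispatch table.
METRIC_BY_CATEGORY = {"Human": "Levenshtein", "Location": "LogSim"}

def pick_metric(category, default="Levenshtein"):
    return METRIC_BY_CATEGORY.get(category, default)
```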
For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric, grouped by the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.
Metric \ Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunch      42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
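The proposed rule can be sketched as a token-containment filter; the tokenization and the second candidate answer in the test are illustrative assumptions.

```python
import re

# Sketch of the proposed rule: drop candidate answers whose tokens are all
# already contained in the question (e.g. 'John Famalaro' above).
def _words(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def filter_question_echoes(question, answers):
    q = _words(question)
    return [a for a in answers if not _words(a) <= q]
```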
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we treat the multiple answers, separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those candidates between quotation marks, if possible.
– Incomplete references – the most easily correctable error is in the reference file, which presents us all the possible 'correct' answers and is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995
[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009
[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
Appendix
Appendix A – Rules for detecting relations between answers
in this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.
We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution would be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.
New York Times corpus
Twenty years of New York Times editions were indexed using Lucene Search as a tool. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started to return the whole article as a passage. Results were far from being as good as with the Google corpus.
The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.
We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.
With this in mind, we tried another approach, in which the indexed passages would not be whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
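The paragraph-level strategy can be sketched as follows (a minimal illustration; the actual system indexes with Lucene, and the blank-line paragraph splitting and keyword matching used here are assumptions):

```python
def paragraph_passages(document, keywords):
    """Split a document into paragraphs (blank-line separated) and keep
    only those containing at least one of the query keywords."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    keyset = {k.lower() for k in keywords}
    return [p for p in paragraphs if keyset & set(p.lower().split())]
```

Returning only the keyword-bearing paragraph avoids rewarding a candidate that merely dominates an otherwise off-topic document.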
3.3 Baseline
3.3.1 Results with Multisix Corpus
Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%
Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
Extracted        Google – 8 passages         Google – 32 passages
answers (#)    Questions  CAE      AS        Questions  CAE      AS
                          success  success              success  success
0                 21        0       –           19        0       –
1 to 10           73       62      53           16       11      11
11 to 20          32       29      22           13        9       7
21 to 40          11       10       8           61       57      39
41 to 60           1        1       1           30       25      17
61 to 100          0        –       –           18       14      11
> 100              0        –       –            5        4       3
All              138      102      84          162      120      88
Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no
Extracted        Google – 64 passages
answers (#)    Questions  CAE success  AS success
0                 17         0            0
1 to 10           18         9            9
11 to 20           2         2            2
21 to 40          16        12            8
41 to 60          42        38           25
61 to 100         38        32           18
> 100             35        30           17
All              168       123           79
Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.
            Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total    Q    CAE    AS         Q    CAE    AS         Q    CAE    AS
abb     4     3     1      1         3     1      1         4     1      1
des     9     6     4      4         6     4      4         6     4      4
ent    21    15     1      1        16     1      0        17     1      0
hum    64    46    42     34        55    48     36        56    48     31
loc    39    27    24     19        34    27     22        35    29     21
num    63    41    30     25        48    35     25        50    40     22
All   200   138   102     84       162   120     88       168   123     79
Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.
            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%
Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%
Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though the 100 passages offer the best results, it also takes more than thrice the time of running just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'
'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
'River Glomma' vs 'Glomma'
'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Overbayern Upper Bavaria'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw a limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issue – many entities are named differently across languages: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' vs 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such
as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' unclustered, but most generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction Module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existent metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we make this process automatic and extend the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely, the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
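The scoring scheme of this first experiment can be sketched as follows (a simplified illustration; the real module sits behind the Java Scorer interface, and the function name here is ours):

```python
from collections import defaultdict

# Weights from the first experiment described above:
# 1.0 for frequency and equivalence, 0.5 for each relation of inclusion.
WEIGHTS = {"FREQUENCY": 1.0, "EQUIVALENCE": 1.0, "INCLUSION": 0.5}

def score_candidates(relations):
    """relations: iterable of (answer, relation_type) pairs detected
    between that answer and some other candidate answer."""
    scores = defaultdict(float)
    for answer, relation in relations:
        scores[answer] += WEIGHTS.get(relation, 0.0)
    return dict(scores)
```

Swapping the weights of EQUIVALENCE and INCLUSION for the Entity rules reproduces the second experiment.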
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
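A minimal sketch of such a converter (the conversion factors are standard; the table layout and function name are our assumptions, not the actual class design):

```python
# Conversion factors for the unit pairs mentioned above.
CONVERSIONS = {
    ("km", "miles"): lambda v: v * 0.621371,
    ("m", "feet"): lambda v: v * 3.28084,
    ("kg", "pounds"): lambda v: v * 2.20462,
    ("celsius", "fahrenheit"): lambda v: v * 9.0 / 5.0 + 32.0,
}

def convert(value, unit, target):
    """Convert a numeric answer to the unit used for comparison."""
    return CONVERSIONS[(unit, target)](value)
```

Normalizing every Numeric candidate to a single reference unit lets '1000 miles' and '1609 km' fall into the same cluster.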
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it only covers very particular cases for the moment, leaving room for further extensions.
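The abbreviation and acronym normalizers above can be sketched together (the title list follows the text; the acronym table, the leading-token heuristic for the 'Mr. King' caveat, and the period stripping are assumptions):

```python
TITLES = {"Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"}
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_answer(answer):
    """Strip a leading title and expand known acronyms."""
    tokens = answer.replace(".", "").split()
    # Only strip the title when something else follows it, so that
    # answers like 'Queen' (the band) survive intact.
    if len(tokens) > 1 and tokens[0] in TITLES:
        tokens = tokens[1:]
    return " ".join(ACRONYMS.get(t, t) for t in tokens)
```

The length guard is what keeps a bare 'Queen' untouched while still reducing 'Mr. Clinton' to 'Clinton'.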
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
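A sketch of this heuristic, using character offsets as the distance (the threshold and boost match the text, but the exact distance computation inside JustAsk may differ):

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Return the score boost for an answer whose average character
    distance to the query keywords in the passage is under the threshold."""
    pos_answer = passage.find(answer)
    if pos_answer < 0:
        return 0
    distances = [abs(passage.find(k) - pos_answer)
                 for k in keywords if passage.find(k) >= 0]
    if distances and sum(distances) / len(distances) <= threshold:
        return boost
    return 0
```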
4.3 Results
                  200 questions               1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%   P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%   P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
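The judgment handling can be sketched as follows (the option-to-label mapping follows the text; the exact output formatting of the real script is an assumption):

```python
# Mapping from the annotator's option to the label written to the output file.
LABELS = {
    "E": "EQUIVALENCE",
    "I1": "INCLUSION",
    "I2": "INCLUSION",
    "A": "COMPLEMENTARY ANSWERS",
    "C": "CONTRADICTORY ANSWERS",
    "D": "IN DOUBT",
}

def format_judgment(question_id, option, answer1, answer2):
    """Return the output line for one judged pair, or None to skip it.
    An unknown option raises KeyError, prompting the user to retry."""
    if option == "skip":
        return None
    return "QUESTION %d %s - %s AND %s" % (
        question_id, LABELS[option], answer1, answer2)
```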
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter we will try to improve the results, working directly with the several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A,B) = |A ∩ B| / |A ∪ B|  (15)

In other words, when applied to strings, the idea is to divide the number of elements (for instance, character bigrams) that the two sequences share by the total number of distinct elements across both.
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)  (16)

Dice's coefficient will basically give double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: Ke, en, ne, ed, dy. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, as the union of the two sets has seven distinct elements.
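A quick check of the bigram computation (a small sketch; the function names are ours, not the SimMetrics API):

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    # Eq. (16) over bigram sets.
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    # Eq. (15) over bigram sets.
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)
```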
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)  (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)  (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
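A sketch of Equations (17) and (18) (our own implementation, not the SimMetrics one; the conventional prefix cap of 4 and the half-window matching rule are assumptions):

```python
def jaro(s1, s2):
    """Jaro similarity per Equation (17)."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched2 = [False] * len(s2)
    matches1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched2[j] = True
                matches1.append(c)
                break
    matches2 = [s2[j] for j in range(len(s2)) if matched2[j]]
    m = len(matches1)
    if m == 0:
        return 0.0
    t = sum(a != b for a, b in zip(matches1, matches2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler per Equation (18), with the prefix length capped at 4."""
    dj = jaro(s1, s2)
    l = 0
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1.0 - dj)
```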
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 of Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }  (19)
The package contained other metrics of particular interest to test, and the following two measures were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold    0.17                    0.33                    0.5
Levenshtein           P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index         P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice coefficient      P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                  P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler          P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
47
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
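The definition above can be sketched as follows. Note that the thesis does not specify exactly how first-letter pairs are matched; the greedy pairing of leftover tokens below is an assumption for illustration, and the function name is ours:

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Sketch of the proposed 'JFKDistance' between two normalized answers."""
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)                       # perfect token matches
    common_terms_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]       # unmatched tokens, side 1
    rest2 = [t for t in t2 if t not in common]       # unmatched tokens, side 2
    non_common = len(rest1) + len(rest2)
    # Greedily pair unmatched tokens that agree on their first letter
    # (assumed interpretation of 'firstLetterMatches').
    first_letter_matches, used = 0, set()
    for t in rest1:
        for j, u in enumerate(rest2):
            if j not in used and t[0] == u[0]:
                first_letter_matches += 1
                used.add(j)
                break
    score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1 - common_terms_score - score
```

For the Kennedy example, the common term 'Kennedy' yields 1/3 and the two first-letter pairs out of four leftover tokens yield (2/4)/2 = 0.25, so the distance drops to roughly 0.42, well below Levenshtein's.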
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions          1210 questions
Before              P51.1 R87.0 A44.5      P28.2 R79.8 A22.5
After new metric    P51.7 R87.0 A45.0      P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that also exist in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
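The decision-tree idea above amounts to a simple dispatch table. A minimal sketch (metric names and the fallback choice are illustrative, not JustAsk's actual configuration):

```python
# Hypothetical per-category metric selector; the mapping mirrors the
# example above, and the default fallback is an assumption.
CATEGORY_METRIC = {
    "Human": "levenshtein",
    "Location": "logsim",
}

def pick_metric(category: str) -> str:
    """Return the clustering metric to use for a question category,
    falling back to the overall best metric (Levenshtein)."""
    return CATEGORY_METRIC.get(category, "levenshtein")
```

The clusterer would then look up `pick_metric(question.category)` once per question before computing pairwise distances.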
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories that can be easily evaluated: Human, Location and Numeric (answers for categories such as Description, retrieved from the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunch      42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
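Two of the numeric-answer fixes suggested above (ignoring small spelled-out numbers and excluding the publishing date) can be sketched as a candidate filter. The regular expression and all names here are illustrative assumptions, not JustAsk's actual code:

```python
import re

SMALL_WORDS = {"one", "two", "three"}  # point a): ignore these when spelled out

# Simplistic pattern for digit-based numbers plus a few spelled-out ones.
NUMBER_RE = re.compile(
    r"\b(?:\d[\d,.]*|one|two|three|four|five|six|seven|eight|nine|ten)\b",
    re.IGNORECASE)

def numeric_candidates(snippet, publication_year=None):
    """Extract numeric answer candidates from a passage, dropping
    spelled-out 'one'..'three' and, optionally, the publishing year."""
    candidates = [m.group(0) for m in NUMBER_RE.finditer(snippet)]
    candidates = [c for c in candidates if c.lower() not in SMALL_WORDS]
    if publication_year is not None:   # point b): drop the posting date
        candidates = [c for c in candidates if c != str(publication_year)]
    return candidates
```

Point c), sharpening the expressions per expected answer type (Date vs. plain number vs. measure), would refine `NUMBER_RE` into one pattern per category.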
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2010.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
56
Appendix
Appendix A – Rules for detecting relations between answers
Questions    Correct    Wrong    No Answer
200          88         87       25

Accuracy    Precision    Recall    MRR     Precision@1
44.0        50.3         87.5      0.37    35.4

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
In Tables 2 and 3 we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.
                   Google – 8 passages            Google – 32 passages
Extracted          Questions  CAE      AS         Questions  CAE      AS
Answers            (#)        success  success    (#)        success  success
0                  21         0        –          19         0        –
1 to 10            73         62       53         16         11       11
11 to 20           32         29       22         13         9        7
21 to 40           11         10       8          61         57       39
41 to 60           1          1        1          30         25       17
61 to 100          0          –        –          18         14       11
> 100              0          –        –          5          4        3
All                138        102      84         162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.
We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.
In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.
With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category, such as giving more focus to Answer Selection for the Numeric category, for instance.

                   Google – 64 passages
Extracted          Questions  CAE      AS
Answers            (#)        success  success
0                  17         0        0
1 to 10            18         9        9
11 to 20           2          2        2
21 to 40           16         12       8
41 to 60           42         38       25
61 to 100          38         32       18
> 100              35         30       17
All                168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

        Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total  Q    CAE  AS       Q    CAE  AS           Q    CAE  AS
abb  4      3    1    1        3    1    1            4    1    1
des  9      6    4    4        6    4    4            6    4    4
ent  21     15   1    1        16   1    0            17   1    0
hum  64     46   42   34       55   48   36           56   48   31
loc  39     27   24   19       34   27   22           35   29   21
num  63     41   30   25       48   35   25           50   40   22
     200    138  102  84       162  120  88           168  123  79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
             1134 questions    232 questions    1210 questions
Precision    26.7              28.2             27.4
Recall       77.2              67.2             76.9
Accuracy     20.6              19.0             21.1

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus.
Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages     20      30      50      100
Accuracy     21.1    22.4    23.7    23.8

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module, mainly while clustering the answer candidates (failing to cluster equal or inclusive answers into the same cluster), but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their context:
– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs. 'Elizabeth II'
'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer; as it is, LogSim is approximately 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens differ, should we reject it? The answer is likely to be 'Yes'.
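The hand-written normalization plus token-level matching proposed in this solution could look roughly like the sketch below. The title list and the tie-breaking choices (accepting pure-inclusion pairs, rejecting mismatched initials) are assumptions for illustration, not JustAsk's actual rules:

```python
# Illustrative honorific/title list for normalization.
TITLES = {"mr", "mrs", "queen", "king", "mt", "mount", "dr"}

def name_tokens(answer: str):
    """Lowercase, strip trailing periods and drop titles/honorifics."""
    tokens = [t.strip(".").lower() for t in answer.split()]
    return [t for t in tokens if t not in TITLES]

def same_entity(a: str, b: str) -> bool:
    """Token-level heuristic from the text: at least one full token in
    common; leftover tokens must pair up on their initial letters."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not set(ta) & set(tb):
        return False                      # no common token: no relation
    rest_a = [t for t in ta if t not in tb]
    rest_b = [t for t in tb if t not in ta]
    if not rest_a or not rest_b:
        return True                       # e.g. 'Calvin Broadus' vs 'Broadus'
    if len(rest_a) != len(rest_b):
        return False
    return all(x[0] == y[0] for x, y in zip(rest_a, rest_b))
```

On the examples above this clusters 'Queen Elizabeth' with 'Elizabeth II' (title stripped, inclusion accepted) and 'J. F. Kennedy' with 'John Fitzgerald Kennedy' (initials aligned).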
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Overbayern, Upper Bavaria'
'2009' vs. 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where, exactly, do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend abbreviation normalization to include such acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics drawn from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the metrics already existent to more efficientsemantically-based ones using notions not only of synonymy but also hiponymyand antonymy that we can access through WordNet Moreover we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
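The score-boosting scheme just described can be sketched as follows (the names are illustrative, not JustAsk's actual Java API; only the weights come from the text):

```python
# Sketch of relation-based score boosting: 1.0 for frequency and
# equivalence, 0.5 for each inclusion found (first experiment's weights).
FREQUENCY, EQUIVALENCE, INCLUSION = "freq", "equiv", "incl"
WEIGHTS = {FREQUENCY: 1.0, EQUIVALENCE: 1.0, INCLUSION: 0.5}

def boost(answers, relations):
    """answers: {answer: base score}; relations: (a1, a2, type) triples."""
    scores = dict(answers)
    for a1, a2, rel in relations:
        w = WEIGHTS[rel]
        scores[a1] += w          # both ends of the relation get the boost
        scores[a2] += w
    return scores

scores = boost({"JFK": 1.0, "John F. Kennedy": 1.0, "Nixon": 1.0},
               [("JFK", "John F. Kennedy", EQUIVALENCE)])
# 'JFK' and 'John F. Kennedy' now outscore the unrelated 'Nixon'
```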
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
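As an illustration, the conversion logic can be sketched like this (the type names follow the text; the conversion table itself is a hypothetical reconstruction, not JustAsk's code):

```python
# Standard-unit conversions for the four Numeric subtypes named above.
CONVERSIONS = {
    "Numeric:Distance":    lambda km: km * 0.621371,   # kilometres -> miles
    "Numeric:Size":        lambda m:  m * 3.28084,     # meters -> feet
    "Numeric:Weight":      lambda kg: kg * 2.20462,    # kilograms -> pounds
    "Numeric:Temperature": lambda c:  c * 9 / 5 + 32,  # Celsius -> Fahrenheit
}

def normalize(value, answer_type):
    """Convert a numeric answer into the standard unit for its type."""
    convert = CONVERSIONS.get(answer_type)
    return convert(value) if convert else value

print(normalize(100, "Numeric:Temperature"))  # → 212.0
```

With both '100 Celsius' and '212 Fahrenheit' normalized to the same value, the two answers can fall into the same cluster.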
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
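A minimal sketch of such a normalizer, with an exception list for cases like 'Mr. T' and 'Queen' (the exception entries and helper names are illustrative):

```python
import re

TITLES = ["Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr."]
EXCEPTIONS = {"Mr. T", "Queen"}   # answers that must survive intact

def strip_titles(answer):
    """Remove leading titles, keeping the listed exceptions untouched."""
    if answer in EXCEPTIONS:
        return answer
    for title in TITLES:
        # only strip the title at the very beginning of the answer
        answer = re.sub(r"^" + re.escape(title) + r"\s+", "", answer)
    return answer

print(strip_titles("Dr. John Smith"))  # → John Smith
print(strip_titles("Mr. T"))           # → Mr. T
```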
A normalizer was also made to solve the formatting issue of point 6, so that 'Grozny' and 'GROZNY' are treated as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
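A sketch of this proximity boost, assuming distances are measured from the first occurrence of each term in the passage (helper names are hypothetical; the 20-character threshold and 5-point boost are the values from the text):

```python
THRESHOLD = 20   # maximum average distance, in characters
BOOST = 5        # score points added

def proximity_boost(passage, answer, keywords, score):
    """Boost `score` when `answer` appears close to the query keywords."""
    pos = passage.find(answer)
    if pos < 0:
        return score
    distances = [abs(passage.find(kw) - pos)
                 for kw in keywords if passage.find(kw) >= 0]
    if distances and sum(distances) / len(distances) < THRESHOLD:
        score += BOOST
    return score

passage = "Innsbruck hosted the Winter Olympics in 1964"
print(proximity_boost(passage, "1964", ["Winter", "Olympics"], 10))  # → 15
```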
4.3 Results
                  200 questions                1210 questions
Before features   P: 50.6  R: 87.0  A: 44.0    P: 28.0  R: 79.5  A: 22.2
After features    P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given in the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
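The labeling loop of this application can be sketched as follows (prompts and I/O details are illustrative; the option letters and the output format follow the text):

```python
LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(questions, ask=input):
    """questions: list of (question id, [references]) pairs."""
    lines = []
    for qid, refs in questions:
        if len(refs) < 2:            # single reference: advance to next question
            continue
        for i in range(len(refs)):
            for j in range(i + 1, len(refs)):
                answer = ask(f"{refs[i]} vs {refs[j]}? ")
                while answer not in LABELS and answer != "skip":
                    answer = ask("Please repeat the judgment: ")
                if answer == "skip":
                    continue
                lines.append(f"QUESTION {qid} {LABELS[answer]} - "
                             f"{refs[i]} AND {refs[j]}")
    return lines
```

Calling `annotate([(5, ["Lee Myung-bak", "Lee"])])` and typing 'E' produces exactly the example line above.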
Figure 6: Execution of the application to create the relations corpus in the command line.
Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):
J(A, B) = |A ∩ B| / |A ∪ B|   (15)
In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters between them.
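A character-set version of Equation (15) can be sketched as follows (illustrative, not the SimMetrics implementation):

```python
def jaccard(s1, s2):
    """Equation (15) over the sets of characters of the two strings."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

print(jaccard("abc", "abd"))  # 2 shared / 4 distinct → 0.5
```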
– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)
Dice's coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of seven distinct ones).
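A small bigram-based sketch of both measures as string similarities (illustrative):

```python
def bigrams(s):
    """Set of adjacent character pairs of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(s1, s2):
    a, b = bigrams(s1), bigrams(s2)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard_bigrams(s1, s2):
    a, b = bigrams(s1), bigrams(s2)
    return len(a & b) / len(a | b)

print(dice("Kennedy", "Keneddy"))             # 10/12 ≈ 0.83
print(jaccard_bigrams("Kennedy", "Keneddy"))  # 5/7 ≈ 0.71
```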
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:
Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:
dw = dj + l · p · (1 − dj)   (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.
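Both metrics can be sketched as follows (a standard implementation of Equations (17) and (18), with the common prefix capped at four characters as is usual in Winkler's work; illustrative, not the SimMetrics code):

```python
def jaro(s1, s2):
    """Jaro similarity, Equation (17)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i in range(len1):                       # find matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s1[i] == s2[j]:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    k = t = 0                                   # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler, Equation (18); common prefix l capped at 4."""
    dj = jaro(s1, s2)
    l = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l == 4:
            break
        l += 1
    return dj + l * p * (1 - dj)

print(jaro("MARTHA", "MARHTA"))          # ≈ 0.944
print(jaro_winkler("MARTHA", "MARHTA"))  # ≈ 0.961
```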
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant cost of 1):
D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
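A compact sketch of the classic American Soundex encoding (JustAsk uses the SimMetrics implementation; this version is illustrative only):

```python
def soundex(word):
    """Classic American Soundex: first letter + three digit codes."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    out = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:     # drop repeated adjacent codes
            out.append(code)
        if ch not in "hw":            # 'h' and 'w' do not separate equal codes
            prev = code
    return (word[0].upper() + "".join(out) + "000")[:4]

# answers that sound alike collapse to the same code despite
# a high Levenshtein distance
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```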
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
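Threshold-based single-link clustering can be sketched as follows (illustrative, not JustAsk's actual clusterer, which works on answer objects rather than plain strings):

```python
def single_link(answers, similarity, threshold):
    """Greedy single-link clustering: merge two clusters while any
    cross-cluster pair is at least `threshold` similar."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: the closest pair across two clusters decides
                if any(similarity(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# toy similarity: 1.0 when the answers share their first word
share_first_word = lambda a, b: 1.0 if a.split()[0] == b.split()[0] else 0.0
print(single_link(["Lee", "Lee Myung-bak", "Kim"], share_first_word, 0.5))
```

Raising the threshold makes merges rarer and clusters smaller, which is exactly the behavior the three tested values probe.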
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                  0.33                  0.5
Levenshtein          P51.1 R87.0 A44.5     P43.7 R87.5 A38.5     P37.3 R84.5 A31.5
Overlap              P37.9 R87.0 A33.0     P33.0 R87.0 A33.0     P36.4 R86.5 A31.5
Jaccard index        P9.8  R56.0 A5.5      P9.9  R55.5 A5.5      P9.9  R55.5 A5.5
Dice Coefficient     P9.8  R56.0 A5.5      P9.9  R56.0 A5.5      P9.9  R55.5 A5.5
Jaro                 P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Jaro-Winkler         P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Needleman-Wunsch     P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P16.9 R59.0 A10.0
LogSim               P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P42.0 R87.0 A36.5
Soundex              P35.9 R83.5 A30.0     P35.9 R83.5 A30.0     P28.7 R82.0 A23.5
Smith-Waterman       P17.3 R63.5 A11.0     P13.4 R59.5 A8.0      P10.4 R57.5 A6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally adds an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with an unmatched token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed in the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)
in which the scores for the common terms and for the matches of the first letter of each token are given by:
commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
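Equations (20)-(22) can be sketched as follows, assuming length() counts tokens (which the worked example implies; the pairing of leftover tokens by first letter is an illustrative reading of the description):

```python
def jfk_distance(a1, a2):
    """Sketch of Equations (20)-(22); length() taken as token count."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))            # Eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    matches = 0
    for tok in rest1:                # pair leftover tokens by first letter
        for other in rest2:
            if tok[0] == other[0]:
                matches += 1
                rest2.remove(other)
                break
    fl_score = (matches / non_common) / 2 if non_common else 0.0  # Eq. (22)
    return 1 - common_score - fl_score                            # Eq. (20)

print(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"))
# 1 - 1/3 - (2/4)/2 ≈ 0.417, low enough to cluster the two answers
```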
A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                   200 questions                1210 questions
Before             P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
After new metric   P: 51.7  R: 87.0  A: 45.0    P: 29.2  R: 79.8  A: 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort we used the best thresholds for each metric, over the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are
looking for), here are the results for every metric incorporated, over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help to improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers, Volume 2, NAACL-Short '03), pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Google – 64 passages

Extracted Answers   Questions   CAE success   AS success
0                   17          0             0
1 to 10             18          9             9
11 to 20            2           2             2
21 to 40            16          12            8
41 to 60            42          38            25
61 to 100           38          32            18
> 100               35          30            17
All                 168         123           79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.
                  Google – 8 passages    Google – 32 passages   Google – 64 passages
Category   Total  Q     CAE   AS         Q     CAE   AS         Q     CAE   AS
abb        4      3     1     1          3     1     1          4     1     1
des        9      6     4     4          6     4     4          6     4     4
ent        21     15    1     1          16    1     0          17    1     0
hum        64     46    42    34         55    48    36         56    48    31
loc        39     27    24    19         34    27    22         35    29    21
num        63     41    30    25         48    35    25         50    40    22
All        200    138   102   84         162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.
matter how many passages are used. We found that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into certain stages for certain categories – such as giving more focus to Answer Selection for the Numeric category, for instance.
3.3.2 Results with TREC Corpus
For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times. This represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.
These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.
            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora created using TREC questions and the New York Times corpus.
Later, we ran further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, it also takes more than thrice the time of running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the correct one:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between the two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59).
With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same thing: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to include acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from there, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
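As a rough illustration, the relation-based score boosting performed over the candidates might be sketched as follows. The class, field and method names here are illustrative assumptions, not JustAsk's actual API; the 1.0/0.5 weights are the ones used in our first experiment.

```java
import java.util.*;

// Hedged sketch of relation-based boosting: each answer carries the relations it
// shares with other answers, and its score is raised per relation found.
public class RelationScoring {
    enum Relation { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Answer {
        final String text;
        double score;
        final Map<Answer, Relation> relations = new HashMap<>();
        Answer(String text) { this.text = text; }
    }

    // Weights from the first experiment: 1.0 for equivalence, 0.5 for inclusion.
    static final double EQUIVALENCE_WEIGHT = 1.0;
    static final double INCLUSION_WEIGHT = 0.5;

    static void boostScores(List<Answer> candidates) {
        for (Answer a : candidates)
            for (Relation r : a.relations.values())
                a.score += (r == Relation.EQUIVALENCE) ? EQUIVALENCE_WEIGHT : INCLUSION_WEIGHT;
    }

    public static void main(String[] args) {
        Answer a = new Answer("Mount Kenya");
        Answer b = new Answer("Mount Kenya National Park");
        a.relations.put(b, Relation.INCLUDED_BY);
        b.relations.put(a, Relation.INCLUDES);
        boostScores(Arrays.asList(a, b));
        System.out.println(a.score + " " + b.score); // both boosted by 0.5
    }
}
```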
4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created to deal with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of note: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.
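The three normalizations above (title stripping, case folding and acronym expansion) can be sketched together as follows. The class name, title list and acronym map are illustrative, and a real implementation would need the exceptions discussed above (e.g. 'Mr. T', or 'Queen' the band):

```java
import java.util.*;

// Hedged sketch of the answer normalizers: acronym expansion, leading-title
// removal and case folding, applied in that order.
public class AnswerNormalizer {
    static final List<String> TITLES =
        Arrays.asList("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.");
    static final Map<String, String> ACRONYMS = new HashMap<>();
    static {
        ACRONYMS.put("US", "United States");
        ACRONYMS.put("U.S.", "United States");
        ACRONYMS.put("UN", "United Nations");
    }

    static String normalize(String answer) {
        String result = answer.trim();
        // Expand acronyms first, so a bare 'US' is not mistaken for anything else.
        String expanded = ACRONYMS.get(result);
        if (expanded != null) return expanded.toLowerCase();
        // Strip a title only at the beginning of the answer (see the 'Mr. King' note).
        for (String title : TITLES) {
            if (result.startsWith(title + " ")) {
                result = result.substring(title.length() + 1);
                break;
            }
        }
        // Case folding: 'Grozny' and 'GROZNY' become equal answers.
        return result.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Queen Elizabeth")); // elizabeth
        System.out.println(normalize("GROZNY"));          // grozny
    }
}
```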
4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
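A sketch of this proximity heuristic, under the assumption that distances are measured between character offsets in the passage; the 20-character threshold and 5-point boost are the values stated above, everything else (names, offset handling) is illustrative:

```java
import java.util.*;

// Hedged sketch: boost an answer when its average character distance to the
// query keywords found in the passage is at or below a threshold.
public class ProximityBooster {
    static final int DISTANCE_THRESHOLD = 20; // characters
    static final double BOOST = 5.0;          // points added to the answer score

    static double boost(String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) return 0.0;
        double total = 0;
        int found = 0;
        for (String kw : keywords) {
            int kwPos = passage.indexOf(kw);
            if (kwPos >= 0) { total += Math.abs(kwPos - answerPos); found++; }
        }
        if (found == 0) return 0.0;
        return (total / found <= DISTANCE_THRESHOLD) ? BOOST : 0.0;
    }

    public static void main(String[] args) {
        String passage = "Mount Everest is 8848 m high";
        // 'Everest' starts 11 characters before the answer, so the boost applies.
        System.out.println(boost(passage, "8848 m", Arrays.asList("Everest"))); // prints 5.0
    }
}
```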
4.3 Results

                  200 questions                 1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%     P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt about the relation type.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relation corpora in the command line.
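The heart of the application is the mapping from the user's command to the label written to the output file. A hedged sketch (the I/O loop is omitted and the names are illustrative; unrecognized commands return null so the caller can re-prompt):

```java
// Illustrative command-to-label mapping for the relation annotation tool.
public class RelationAnnotator {
    static String label(String command) {
        switch (command.toUpperCase()) {
            case "E":  return "EQUIVALENCE";
            case "I1": return "INCLUSION";              // first answer includes the second
            case "I2": return "INCLUSION";              // second answer includes the first
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADICTORY ANSWERS";
            case "D":  return "IN DOUBT";
            default:   return null;                     // unrecognized: ask the user again
        }
    }

    public static void main(String[] args) {
        System.out.println("QUESTION 5: " + label("e") + " - Lee Myung-bak AND Lee");
    }
}
```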
In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. Several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets, A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, raising the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient may be calculated over the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.
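A direct computation of both bigram-based measures (an illustrative sketch, not the SimMetrics implementation):

```java
import java.util.*;

// Bigram-based Dice and Jaccard, as in the 'Kennedy' / 'Keneddy' example.
public class BigramSimilarity {
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> A = bigrams(a), B = bigrams(b);
        Set<String> inter = new HashSet<>(A);
        inter.retainAll(B);
        return 2.0 * inter.size() / (A.size() + B.size());
    }

    static double jaccard(String a, String b) {
        Set<String> A = bigrams(a), B = bigrams(b);
        Set<String> inter = new HashSet<>(A);
        inter.retainAll(B);
        Set<String> union = new HashSet<>(A);
        union.addAll(B);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 10/12 ≈ 0.833
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7 ≈ 0.714
    }
}
```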
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant 1):

D(i, j) = min( D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G )   (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
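A compact Soundex sketch (the classic four-character code, with the h/w rule simplified), which indeed assigns the same code to the misspelling pair above:

```java
// Simplified Soundex: keep the first letter, encode the rest as digits,
// drop vowels and collapse adjacent duplicate codes, pad/truncate to 4 chars.
public class Soundex {
    // Digit codes for letters a..z (0 = vowel or ignored letter).
    static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = CODES.charAt(s.charAt(i) - 'A');
            if (c != '0' && c != prev) out.append(c);
            prev = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Massachussets")); // M222
        System.out.println(encode("Machassucets")); // M222
    }
}
```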
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 27 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
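The thresholded single-link clustering used here can be sketched as follows, with a normalized Levenshtein standing in for any of the metrics above (illustrative code, not the JustAsk implementation): two answers end up in the same cluster whenever a chain of pairs at or below the threshold connects them.

```java
import java.util.*;

// Single-link clustering over a normalized [0,1] string distance.
public class SingleLinkClusterer {
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    static double distance(String a, String b) { // normalized to [0,1]
        return (double) levenshtein(a, b) / Math.max(a.length(), b.length());
    }

    static List<List<String>> cluster(List<String> answers, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> target = null;
            Iterator<List<String>> it = clusters.iterator();
            while (it.hasNext()) {
                List<String> c = it.next();
                boolean linked = false;
                for (String member : c)
                    if (distance(answer, member) <= threshold) { linked = true; break; }
                if (!linked) continue;
                if (target == null) target = c;
                else { target.addAll(c); it.remove(); } // a new link may merge clusters
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(answer);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // 'Grozny' vs 'grozny' has distance 1/6 <= 0.17, so they cluster together.
        List<List<String>> c = cluster(Arrays.asList("Grozny", "grozny", "Moscow"), 0.17);
        System.out.println(c.size()); // prints 2
    }
}
```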
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that gives the system the best results at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold   0.17                  0.33                  0.5
Levenshtein          P51.1 R87.0 A44.5     P43.7 R87.5 A38.5     P37.3 R84.5 A31.5
Overlap              P37.9 R87.0 A33.0     P33.0 R87.0 A33.0     P36.4 R86.5 A31.5
Jaccard index        P9.8  R56.0 A5.5      P9.9  R55.5 A5.5      P9.9  R55.5 A5.5
Dice's Coefficient   P9.8  R56.0 A5.5      P9.9  R56.0 A5.5      P9.9  R55.5 A5.5
Jaro                 P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Jaro-Winkler         P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Needleman-Wunsch     P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P16.9 R59.0 A10.0
LogSim               P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P42.0 R87.0 A36.5
Soundex              P35.9 R83.5 A30.0     P35.9 R83.5 A30.0     P28.7 R82.0 A23.5
Smith-Waterman       P17.3 R63.5 A11.0     P13.4 R59.5 A8.0      P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance, and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds extra weight when, besides at least one perfect token match, another token shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches on the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
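A direct transcription of equations (20)-(22) might look as follows; the tokenization and the handling of the nonCommonTerms = 0 edge case are assumptions, since the text does not fully specify them:

```java
import java.util.*;

// Sketch of JFKDistance: common-token score plus a halved first-letter bonus
// for tokens that remain unmatched, per equations (20)-(22).
public class JFKDistance {
    static List<String> tokens(String answer) {
        List<String> t = new ArrayList<>();
        for (String tok : answer.toLowerCase().split("[\\s.]+"))
            if (!tok.isEmpty()) t.add(tok);
        return t;
    }

    static double distance(String a1, String a2) {
        List<String> t1 = tokens(a1), t2 = tokens(a2);
        Set<String> common = new HashSet<>(t1);
        common.retainAll(new HashSet<>(t2));
        double commonTermsScore = (double) common.size() / Math.max(t1.size(), t2.size());

        List<String> rest1 = new ArrayList<>(t1), rest2 = new ArrayList<>(t2);
        rest1.removeAll(common);
        rest2.removeAll(common);
        int nonCommonTerms = rest1.size() + rest2.size();
        int firstLetterMatches = 0;
        for (int i = 0; i < Math.min(rest1.size(), rest2.size()); i++)
            if (rest1.get(i).charAt(0) == rest2.get(i).charAt(0)) firstLetterMatches++;
        double firstLetterScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterScore;
    }

    public static void main(String[] args) {
        // 1 common token of 3, plus 2 first-letter matches over 4 unmatched tokens:
        // 1 - 1/3 - (2/4)/2 ≈ 0.4167, low enough to cluster the pair.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}
```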
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of the Numeric type, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                 1210 questions
Before             P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%     P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features shared with the rules and certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
53 Possible new features ndash Combination of distance met-rics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
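In code, such a decision tree could be as simple as a per-category lookup with a fallback; the mapping below is purely illustrative, since the actual choices would come out of the per-category evaluation:

```python
# Hypothetical dispatch table: question category -> preferred metric name.
METRIC_BY_CATEGORY = {
    "Human": "Levenshtein",
    "Location": "LogSim",
}

def pick_metric(category, default="Levenshtein"):
    """Return the name of the distance metric to use for a question category."""
    return METRIC_BY_CATEGORY.get(category, default)
```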
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for that much), the results for every metric incorporated, by the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.
Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunch      42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results through the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. US patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below:
            1134 questions   232 questions   1210 questions
Precision   26.7             28.2            27.4
Recall      77.2             67.2            76.9
Accuracy    20.6             19.0            21.1

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.
Later we tried to make further experiments with this corpus, by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.
Passages    20     30     50     100
Accuracy    21.1   22.4   23.7   23.8

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.
Even though 100 passages offer the best results, they also take more than three times as long to run as just 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.
3.4 Error Analysis over the Answer Module
This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).
1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the reason:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'
– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend abbreviation normalization to include acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics drawn from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
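The scoring scheme just described can be summarized in a short sketch. Names and data layout here are ours, not JustAsk's (where this logic lives in the Scorer interface and its Baseline and Rules scorers); only the weights come from the text.

```python
# Weights from the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion (in either direction).
WEIGHTS = {"EQUIVALENCE": 1.0, "INCLUDES": 0.5, "INCLUDED_BY": 0.5}

def score_answer(answer, all_answers, relations):
    """Boost an answer's score from its frequency and its relations.

    relations: list of (answer_a, relation_type, answer_b) triples.
    """
    score = 1.0 * all_answers.count(answer)  # baseline: frequency
    for a, rel, _b in relations:
        if a == answer and rel in WEIGHTS:
            score += WEIGHTS[rel]
    return score
```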
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, allowing room for further extensions.
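A minimal sketch of the three normalizers follows. The conversion factor is standard, but the function names, the exact abbreviation list and the acronym table are illustrative only:

```python
import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr."}
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_units(answer):
    """Convert '<number> km' to miles; other unit pairs follow the same shape."""
    m = re.fullmatch(r"([\d.]+)\s*km", answer.strip())
    if m:
        return "%.1f miles" % (float(m.group(1)) * 0.621371)
    return answer

def normalize_abbreviations(answer):
    """Drop titles such as 'Queen' or 'Mr.' from an answer."""
    return " ".join(t for t in answer.split() if t not in ABBREVIATIONS)

def normalize_acronyms(answer):
    """Expand a small table of known acronyms."""
    return ACRONYMS.get(answer, answer)
```

Note how the 'Queen' (band) caveat from the text shows up directly: `normalize_abbreviations("Queen")` would wrongly return an empty string, which is why exceptions or positional restrictions are needed.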
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
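This proximity boost can be sketched as follows, with the 20-character threshold and the 5-point boost taken from the text; measuring distances between first-occurrence character offsets is our simplification:

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5.0):
    """Return the extra score for an answer close, on average, to the keywords."""
    pos = passage.find(answer)
    if pos < 0:
        return 0.0
    # Character distance from the answer to each keyword found in the passage
    distances = [
        abs(passage.find(kw) - pos) for kw in keywords if passage.find(kw) >= 0
    ]
    if distances and sum(distances) / len(distances) <= threshold:
        return boost
    return 0.0
```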
4.3 Results
                  200 questions             1210 questions
Before features   P 50.6  R 87.0  A 44.0    P 28.0  R 79.5  A 22.2
After features    P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular. The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create these corpora containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
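The annotation loop described above can be sketched as follows (a hypothetical re-implementation; the `judge` callable stands in for the interactive prompt, and the relation labels follow the output format shown above):

```python
from itertools import combinations

RELATIONS = {"E": "EQUIVALENCE",
             "I1": "INCLUSION",
             "I2": "INCLUSION",
             "A": "COMPLEMENTARY ANSWERS",
             "C": "CONTRADICTORY ANSWERS",
             "D": "IN DOUBT"}

def annotate(questions, judge):
    """questions: list of (question_id, [references]);
    judge: callable (qid, a, b) -> a RELATIONS key or 'skip'."""
    lines = []
    for qid, refs in questions:
        if len(refs) < 2:
            continue  # questions with a single reference are skipped
        for a, b in combinations(refs, 2):
            cmd = judge(qid, a, b)
            while cmd not in RELATIONS and cmd != "skip":
                cmd = judge(qid, a, b)  # unrecognized command: ask again
            if cmd != "skip":
                lines.append(f"QUESTION {qid} {RELATIONS[cmd]} -{a} AND {b}")
    return lines
```

Wiring `judge` to `input()` reproduces the command-line behaviour of the original application.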
Figure 6: Execution of the application to create relations' corpora in the command line.
Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
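As a reference point, the baseline Levenshtein edit distance [19] can be sketched as follows (a generic textbook dynamic-programming implementation, not JustAsk's actual code):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

For clustering, this raw count is typically normalized by the length of the longer string so it can be compared against a threshold.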
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|    (15)
In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)
Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: Ke, en, ne, ed, dy. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of seven distinct ones).
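The bigram-based version of both coefficients can be sketched in a few lines (an illustrative implementation over character bigram sets):

```python
def bigrams(s):
    """Set of overlapping character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))
```

On the 'Kennedy' / 'Keneddy' example this reproduces the values worked out above.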
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:
Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)    (17)
in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:
dw = dj + l·p·(1 − dj)    (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 of Levenshtein):
D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the alignment that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
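A simplified Soundex encoder (ignoring the classic 'h'/'w' separator rule, which does not affect these examples) illustrates why such misspellings collapse to the same code:

```python
def soundex(word):
    """Simplified Soundex: first letter plus three digits encoding
    the consonant classes of the remaining letters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    # vowels, h, w and y map to "0" and are dropped after deduplication
    digits = [codes.get(ch, "0") for ch in word]
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (word[0].upper() + "".join(tail) + "000")[:4]
```

Both 'Massachussets' and 'Machassucets' encode to M222, so a Soundex-based comparison would cluster them even though their edit distance is large.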
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
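The single-link clustering step with a distance threshold can be sketched as follows (an illustrative greedy implementation, not the system's actual clusterer; `distance` is any of the metrics above, normalized to [0, 1]):

```python
def single_link_clusters(answers, distance, threshold):
    """Single-link clustering: two clusters are merged whenever ANY pair
    of members across them is closer than `threshold`."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(x, y) < threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break  # restart the scan after a merge
    return clusters
```

Lowering the threshold makes the clusterer stricter (fewer merges), which is exactly the knob the three tested values (0.17, 0.33, 0.5) turn.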
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1 R 87.0 A 44.5     P 43.7 R 87.5 A 38.5     P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0     P 33.0 R 87.0 A 33.0     P 36.4 R 86.5 A 31.5
Jaccard index        P  9.8 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5     P  9.9 R 55.5 A  5.5
Dice Coefficient     P  9.8 R 56.0 A  5.5     P  9.9 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5
Jaro                 P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Jaro-Winkler         P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0     P 35.9 R 83.5 A 30.0     P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0     P 13.4 R 59.5 A  8.0     P 10.4 R 57.5 A  6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight for strings that not only share at least a perfect match but also have another token sharing the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed as follows:
distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)
in which the scores for the common terms and for the matches of the first letter of each token are given by:
commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
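Under one plausible reading of Equations 20–22 (lengths counted in tokens, and first-letter matches taken greedily over all unmatched tokens), the metric can be sketched as:

```python
def jfk_distance(a1, a2):
    """Sketch of the proposed token-level distance (Equations 20-22).
    Token handling and tie-breaking are assumptions, not the thesis code."""
    t1, t2 = a1.split(), a2.split()
    common = {w.lower() for w in t1} & {w.lower() for w in t2}
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [w for w in t1 if w.lower() not in common]
    rest2 = [w for w in t2 if w.lower() not in common]
    # greedily pair unmatched tokens that share the same first letter
    first_letter_matches, used = 0, [False] * len(rest2)
    for w in rest1:
        for j, v in enumerate(rest2):
            if not used[j] and w[0].lower() == v[0].lower():
                used[j] = True
                first_letter_matches += 1
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - fl_score
```

On 'J F Kennedy' vs 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 0.25 ≈ 0.42, low enough to fall under the clustering thresholds where Levenshtein fails.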
A few flaws of this measure were identified later on, namely the scope of the measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions               1210 questions
Before              P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0      P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features shared with the rules and with certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible New Features – Combining Distance Metrics to Enhance Overall Results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric, split by the three main categories that can be easily evaluated, Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input), are shown in Table 10.
Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references. After inserting missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect possible new ways to improve JustAsk's results within the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision to as high as 91% using a modified version of the Multisix corpus, composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 may help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870–4069, 2007.
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
Appendix
Appendix A – Rules for detecting relations between answers
– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls | Okavango River | Botswana" webpage.'
As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.
Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.
2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:
'Queen Elizabeth' vs 'Elizabeth II'
'Mt Kilimanjaro' vs 'Mount Kilimanjaro'
'River Glomma' vs 'Glomma'
'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'
'Calvin Broadus' vs 'Broadus'
Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words per answer: LogSim comes out approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.
3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:
'Mt Kenya' vs 'Mount Kenya National Park'
'Bavaria' vs 'Oberbayern (Upper Bavaria)'
'2009' vs 'April 3, 2009'
Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"
Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.
Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.
5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.
Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases at the minimum threshold tested, there are still a few unresolved ones, such as:
'Kotashaan' vs 'Khotashan'
'Grozny' and 'GROZNY'
Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case we may extend abbreviation normalization to include these acronyms.
Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics drawn from ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module is extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency, i.e., the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
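The weighting scheme of this first experiment can be sketched as follows (a simplified illustration of the boosting idea; the actual module operates over answer objects and their relation sets, not plain strings):

```python
# Relation weights from the first experiment: 1.0 for frequency and
# equivalence, 0.5 for each relation of inclusion.
RELATION_WEIGHTS = {"FREQUENCY": 1.0, "EQUIVALENCE": 1.0, "INCLUSION": 0.5}

def boosted_scores(relations_per_answer):
    """Map each candidate answer to the total boost earned from the
    relations it shares with other candidates."""
    return {answer: sum(RELATION_WEIGHTS.get(rel, 0.0) for rel in relations)
            for answer, relations in relations_per_answer.items()}
```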
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of type Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T', or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system's results, since for the moment it only evaluates very particular cases, but it allows room for further extensions.
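Taken together, these normalizers might look like the following sketch (the titles, acronyms and category labels are the ones mentioned above; the function itself is a hypothetical simplification of the real classes):

```python
# Sketch combining the title-stripping, case and acronym normalizers.
TITLES = ("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.")
ACRONYMS = {"US": "United States", "UN": "United Nations"}

def normalize_answer(answer, category=None):
    # expand acronyms only for categories where it is safe to do so
    if category in ("Location:Country", "Human:Group"):
        answer = ACRONYMS.get(answer, answer)
    # strip a leading title, but never empty the answer (so 'Queen' the
    # band survives; 'Mr. T' remains an acknowledged problem case)
    for title in TITLES:
        if answer.startswith(title + " ") and len(answer) > len(title) + 1:
            answer = answer[len(title) + 1:]
            break
    # case-normalize so 'Grozny' and 'GROZNY' compare as equal
    return answer.lower()
```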
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
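A sketch of this heuristic, using the threshold (20 characters) and boost (5 points) stated above; the function shape and names are assumed:

```python
def boosted_score(passage, answer, keywords, base_score,
                  max_avg_distance=20, boost=5):
    """Boost an answer's score when it lies close, on average, to the
    query keywords occurring in the passage (character distances)."""
    answer_pos = passage.find(answer)
    if answer_pos < 0:
        return base_score
    distances = [abs(passage.find(kw) - answer_pos)
                 for kw in keywords if kw in passage]
    if distances and sum(distances) / len(distances) <= max_avg_distance:
        return base_score + boost
    return base_score
```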
4.3 Results
                   200 questions                1210 questions
Before features    P: 50.6  R: 87.0  A: 44.0    P: 28.0  R: 79.5  A: 22.2
After features     P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', a new application was made in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
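The core of the application's annotation loop can be sketched as follows (the option letters and output format follow the description above; input/output handling is deliberately simplified and hypothetical):

```python
from itertools import combinations

OPTIONS = {"E": "EQUIVALENCE",
           "I1": "INCLUSION (first includes second)",
           "I2": "INCLUSION (second includes first)",
           "A": "COMPLEMENTARY ANSWERS",
           "C": "CONTRADICTORY ANSWERS",
           "D": "IN DOUBT"}

def annotate_question(question_id, references, read, write):
    """Ask the user for a relation for every pair of references;
    `read`/`write` stand in for console and file I/O."""
    if len(references) < 2:      # single reference: nothing to annotate
        return
    for first, second in combinations(references, 2):
        command = read("%s AND %s? " % (first, second))
        while command not in OPTIONS and command != "skip":
            command = read("Please repeat the judgment: ")
        if command != "skip":
            write("QUESTION %s: %s - %s AND %s"
                  % (question_id, OPTIONS[command], first, second))
```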
Figure 6: Execution of the application to create relations corpora in the command line
Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics were added that had been used for a Natural Language project in 2010/2011 at Tagus Park. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83; using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of seven distinct ones).
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · ( m/|s1| + m/|s2| + (m − t)/m )   (17)

in which s1 and s2 are the two sequences to be compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
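Two of the metrics above are easy to illustrate concretely. The following sketch (not the SimMetrics implementation) shows bigram-based Dice and Jaccard coefficients, plus a standard Soundex encoding, which indeed maps 'Glomma' and 'Glama' – and 'Kotashaan' and 'Khotashan' – to the same code:

```python
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2.0 * len(x & y) / (len(x) + len(y))

def jaccard(a, b):
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / float(len(x | y))

# Standard Soundex: keep the first letter, encode the rest into digit
# classes, drop vowels and repeated adjacent codes, pad to 4 characters.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    word = word.lower()
    code = word[0].upper()
    previous = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":      # 'h' and 'w' do not separate duplicates
            previous = digit
    return (code + "000")[:4]
```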
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
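The role of the threshold in single-link clustering can be illustrated with this naive sketch (the actual clusterer operates on JustAsk's answer objects and its own distance implementations):

```python
def single_link_clusters(answers, distance, threshold):
    """Greedy single-link clustering: two clusters are merged while ANY
    cross-cluster pair of answers is within the distance threshold."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(a, b) <= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

For instance, with a toy case-insensitive distance, the answers 'Grozny', 'GROZNY' and 'Ferrara' collapse into two clusters at any of the thresholds tested.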
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric / Threshold        0.17                       0.33                       0.5
Levenshtein        P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap            P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index      P:  9.8 R: 56.0 A:  5.5   P:  9.9 R: 55.5 A:  5.5   P:  9.9 R: 55.5 A:  5.5
Dice Coefficient   P:  9.8 R: 56.0 A:  5.5   P:  9.9 R: 56.0 A:  5.5   P:  9.9 R: 55.5 A:  5.5
Jaro               P: 11.0 R: 63.5 A:  7.0   P: 11.9 R: 63.0 A:  7.5   P:  9.8 R: 56.0 A:  5.5
Jaro-Winkler       P: 11.0 R: 63.5 A:  7.0   P: 11.9 R: 63.0 A:  7.5   P:  9.8 R: 56.0 A:  5.5
Needleman-Wunsch   P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim             P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex            P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman     P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A:  8.0   P: 10.4 R: 57.5 A:  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein gives us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight to token pairs that do not match exactly yet share the same first letter. To exemplify, consider the example above: we have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common with the new metric, we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to token strings after normalization, and nonCommonTerms being the number of terms left unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
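A sketch of the metric, following equations (20)–(22); tokenization details are assumed here, and the real implementation runs after normalization:

```python
def jfk_distance(answer1, answer2):
    """Sketch of the 'JFKDistance' described by equations (20)-(22)."""
    tokens1 = answer1.lower().split()
    tokens2 = answer2.lower().split()
    common = set(tokens1) & set(tokens2)
    common_score = len(common) / float(max(len(tokens1), len(tokens2)))
    rest1 = [t for t in tokens1 if t not in common]
    rest2 = [t for t in tokens2 if t not in common]
    # pair up leftover tokens whose first letters match ('j' ~ 'john')
    matches, used = 0, set()
    for token in rest1:
        for j, other in enumerate(rest2):
            if j not in used and token[0] == other[0]:
                matches += 1
                used.add(j)
                break
    non_common = len(rest1) + len(rest2)
    first_letter_score = ((matches / float(non_common)) / 2
                          if non_common else 0.0)
    return 1.0 - common_score - first_letter_score
```

For 'J F Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 1/4 ≈ 0.42, noticeably lower than Levenshtein or Overlap for the same pair.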
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separate measures for each type of answer (Numeric, Entity, Location, etc.) were needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions                1210 questions
Before new metric   P: 51.1  R: 87.0  A: 44.5    P: 28.2  R: 79.8  A: 22.5
After new metric    P: 51.7  R: 87.0  A: 45.0    P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features shared with the rules and with certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts, that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not much affect the analysis we are looking for), Table 10 shows the results for every metric over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category    Human    Location   Numeric
Levenshtein          50.00     53.85      4.84
Overlap              42.42     41.03      4.84
Jaccard index         7.58      0.00      0.00
Dice                  7.58      0.00      0.00
Jaro                  9.09      0.00      0.00
Jaro-Winkler          9.09      0.00      0.00
Needleman-Wunsch     42.42     46.15      4.84
LogSim               42.42     46.15      4.84
Soundex              42.42     46.15      3.23
Smith-Waterman        9.09      2.56      0.00
JFKDistance          54.55     53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across the categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior Analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single, unchanging answer. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to a) ignore 'one' to 'three' when written in words, b) exclude the publishing date from the snippet somehow, and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords; and in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references. After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
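Improvements (a) and part of (b) for the numeric issues above could start with a filter along these lines (the helper and its parameters are hypothetical):

```python
# Sketch of a filter for numeric candidates: drop spelled-out small
# numbers and any candidate equal to the page's publishing year, both
# of which are frequent false positives.
SPELLED_OUT_SMALL_NUMBERS = {"one", "two", "three"}

def filter_numeric_candidates(candidates, publishing_year=None):
    kept = []
    for candidate in candidates:
        if candidate.lower() in SPELLED_OUT_SMALL_NUMBERS:
            continue
        if publishing_year is not None and candidate == str(publishing_year):
            continue
        kept.append(candidate)
    return kept
```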
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module;
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;
– We developed new metrics to improve the clustering of answers;
– We developed new answer normalizers to relate two equal answers in different formats;
– We developed new corpora that we believe will be extremely useful in the future;
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noémie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003
[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

– 'Mt Kenya' vs 'Mount Kenya National Park'

– 'Bavaria' vs 'Oberbayern (Upper Bavaria)'

– '2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.
4. Language issues – completely different names may refer to the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).
6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

– 'Kotashaan' vs 'Khotashan'

– 'Grozny' vs 'GROZNY'

Solution: the formatting issue should simply be solved by normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviations normalization to include these acronyms.
In this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.
4 Relating Answers
In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module – Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
Each answer now possesses the set of relations it shares with other answers, namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all, for any of the corpora.
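The scoring idea above can be sketched as follows. The weights are the ones from the first experiment; the token-containment check is a toy stand-in for the full category rule set, and all names are illustrative rather than JustAsk's actual classes:

```python
# Weights from the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion found.
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def includes(a, b):
    # Toy inclusion rule: every token of b occurs in a
    # (e.g. 'April 3 2009' includes '2009').
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return tb < ta

def boost(candidate, others):
    """Score a candidate from the relations it shares with the other candidates."""
    score = 0.0
    for other in others:
        if other == candidate:
            score += WEIGHTS["frequency"]
        elif includes(candidate, other) or includes(other, candidate):
            score += WEIGHTS["inclusion"]
    return score
```

Given the other candidates ['2009', 'April 3 2009', 'Paris'], the candidate 'April 3 2009' collects 1.0 for the exact repetition plus 0.5 for the inclusion of '2009', for a boost of 1.5.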
Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created to deal with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
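A sketch of what such a converter might look like. The conversion factors are standard; the unit names and the choice of canonical units are our own assumptions, not JustAsk's actual types:

```python
# Factors to convert each imperial unit back to a canonical metric unit.
TO_CANONICAL = {
    "miles": ("km", 1 / 0.621371),
    "feet": ("m", 1 / 3.28084),
    "pounds": ("kg", 1 / 2.20462),
}

def normalize_measure(value, unit):
    """Rewrite a numeric answer in a canonical unit, so that '100 km'
    and '62.14 miles' can end up in the same cluster."""
    if unit == "fahrenheit":  # temperature is an affine conversion, not a factor
        return (value - 32) * 5 / 9, "celsius"
    if unit in TO_CANONICAL:
        target, factor = TO_CANONICAL[unit]
        return value * factor, target
    return value, unit
```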
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T', or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.
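One way to keep the 'Mr T' / 'Queen' / 'Mr King' pitfalls in check is to remove a title only when it starts the answer and is followed by more text, combined with an explicit exception list. A sketch (the exception entries are illustrative):

```python
import re

TITLES = ["Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"]
# Answers that are legitimate on their own and must never be stripped.
EXCEPTIONS = {"Mr T", "Queen"}

def strip_title(answer):
    if answer in EXCEPTIONS:
        return answer
    # Remove a title only at the start of the answer, when followed by more text.
    pattern = r"^(?:" + "|".join(TITLES) + r")\.?\s+(?=\S)"
    return re.sub(pattern, "", answer)
```

Because the pattern is anchored at the start, 'Mr King' loses only the leading 'Mr', and an answer like 'Martin Luther King' keeps its inner 'King' untouched.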
A normalizer was also built to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, since for the moment it only covers very particular cases, but it leaves room for further extensions.
4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords present in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
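A sketch of this boost. The 20-character threshold and the 5-point boost are the values stated above; measuring the distance between first occurrences is our simplification of the actual passage processing:

```python
def proximity_boost(passage, answer, keywords, threshold=20, boost=5):
    """Boost an answer whose average character distance to the query
    keywords found in the passage is below the threshold."""
    a_pos = passage.find(answer)
    positions = [passage.find(k) for k in keywords if k in passage]
    if a_pos < 0 or not positions:
        return 0
    avg = sum(abs(p - a_pos) for p in positions) / len(positions)
    return boost if avg <= threshold else 0
```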
4.3 Results

                    200 questions             1210 questions
Before features     P 50.6  R 87.0  A 44.0    P 28.0  R 79.5  A 22.2
After features      P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular. The variation in results is minimal: only one to three more correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, so a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', a new application was built in order to create this corpus of relationships between answers more efficiently. It receives the questions and references file as a single input, then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the input fails to match any of these options, the script asks the user to repeat the judgment. If a question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example:
QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
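The annotation loop just described might be sketched as follows. The option codes and the output line format follow the description and example above; the function shape (and the `ask` callback standing in for command-line input) is our own assumption:

```python
RELATIONS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
             "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
             "D": "IN DOUBT"}

def annotate(question_id, reference_pairs, ask):
    """Collect one judgment per pair of references and format the output
    lines; `ask(a, b)` stands in for reading the user's choice from stdin."""
    lines = []
    for a, b in reference_pairs:
        choice = ask(a, b)
        while choice != "skip" and choice not in RELATIONS:
            choice = ask(a, b)  # repeat the judgment until it is valid
        if choice == "skip":
            continue
        lines.append(f"QUESTION {question_id} {RELATIONS[choice]} - {a} AND {b}")
    return lines
```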
Figure 6: Execution of the application to create relations corpora in the command line.

In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. Several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library, 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements in common by the total number of distinct elements across the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements the two sequences share. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since there are five common bigrams out of seven distinct ones.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm, similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }    (19)
The package contained other metrics of particular interest, and the following two measures were also added and tested:

– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the alignment that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
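To make the formulas above concrete, here is a sketch of the set-of-bigrams versions of Jaccard and Dice, of Jaro and Jaro-Winkler, and of Soundex. These are simplified illustrations, not the SimMetrics implementations the system actually uses:

```python
def bigrams(s):
    # Character bigrams, as in the 'Kennedy' / 'Keneddy' example above.
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(a, b):
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):           # find matching characters in the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = trans = 0                        # count out-of-order matches (halved)
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            trans += s1[i] != s2[k]
            k += 1
    m, t = matches, trans / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Equation (18): reward a common prefix (capped at 4 characters).
    dj = jaro(s1, s2)
    l = 0
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return dj + l * p * (1 - dj)

def soundex(word):
    # Classic Russell Soundex: first letter plus three digits.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":               # h and w do not break a run of equal codes
            prev = code
    return (out + "000")[:4]
```

On the worked examples from the text: dice('Kennedy', 'Keneddy') gives 10/12, jaro('MARTHA', 'MARHTA') gives 17/18 ≈ 0.944, and soundex maps both 'Kotashaan' and 'Khotashan' to K325.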
Then the LogSim metric by Cordeiro and his team, described in the related work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
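Cutting a single-link clustering at a fixed threshold amounts to taking the connected components of the graph whose edges join answers closer than that threshold. A sketch using a normalized Levenshtein distance as the metric (the threshold values are the ones tested; the incremental merging below is our simplification):

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_lev(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

def single_link(answers, dist=norm_lev, threshold=0.17):
    """Two answers end up in the same cluster if they are connected by a
    chain of pairs whose distance is below the threshold."""
    clusters = []
    for a in answers:
        linked = [c for c in clusters if any(dist(a, b) < threshold for b in c)]
        merged = [a] + [b for c in linked for b in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters
```

At the 0.17 threshold, 'Grozny' and 'grozny' (distance 1/6 ≈ 0.167) fall into one cluster while 'Paris' stays apart.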
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference is virtually nonexistent in most of those cases.
Metric \ Threshold    0.17                     0.33                     0.5
Levenshtein           P 51.1  R 87.0  A 44.5   P 43.7  R 87.5  A 38.5   P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0   P 33.0  R 87.0  A 33.0   P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5   P  9.9  R 55.5  A  5.5
Dice coefficient      P  9.8  R 56.0  A  5.5   P  9.9  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0   P 35.9  R 83.5  A 30.0   P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0   P 13.4  R 59.5  A  8.0   P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system: Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance, and will therefore not group the two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for tokens that do not match exactly but share their first letter with an unmatched token of the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score given to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became clear that separating measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
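A sketch of the metric in equations (20)-(22). The whitespace tokenization and the greedy first-letter pairing of leftover tokens are our reading of the description, not the exact implementation:

```python
def jfk_distance(a1, a2):
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    # Equation (21): perfect token matches, over the longest answer.
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # Greedily pair leftover tokens that share a first letter.
    matches = 0
    for t in rest1:
        for u in rest2:
            if t[0] == u[0]:
                matches += 1
                rest2.remove(u)
                break
    # Equation (22): half the weight of a perfect match.
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0
    # Equation (20).
    return 1 - common_score - first_letter_score
```

On 'JF Kennedy' vs 'John Fitzgerald Kennedy' this gives 1 − 1/3 − 1/6 = 0.5, already lower than the Overlap value of 2/3 mentioned above.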
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions             1210 questions
Before              P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that also exist in the rules and in certain normalizers, this metric actually improves our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – combining distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric offering the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
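Such a decision tree reduces to a lookup table. A minimal sketch, in which the metric assignments are the hypothetical example above, not conclusions drawn from the results:

```python
# Hypothetical category-to-metric assignments (illustrative only).
METRIC_BY_CATEGORY = {
    "Human": "levenshtein",
    "Location": "logsim",
}

def pick_metric(category, default="levenshtein"):
    # Fall back to the overall best metric for unlisted categories.
    return METRIC_BY_CATEGORY.get(category, default)
```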
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every distance metric used (so they should not greatly affect the analysis we are looking for), Table 10 shows the results for every metric, over the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).

Metric \ Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category shown by JFKDistance, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results per category are always 'proportional' to how each metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis of the system's output for the test corpus of 200 questions was made, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single answer that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers
ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case
ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible
– Incomplete references – the most easily correctable error lies in the reference file, which presents all the possible 'correct' answers but is potentially missing viable references.
After the insertion of gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
– We developed new answer normalizers to relate two equal answers in different formats.
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision, as high as 91, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;
– Use the rules as a distance metric of their own;
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?
7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.
Solution: in this case, we may extend the abbreviation normalization to include these acronyms.
Over this section, we have looked at several points of improvement. Next, we will describe how we solved these points.
4 Relating Answers
Over this section, we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module:
Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – to value relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:
– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] – that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity – and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not at all different for all the corpora.
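A minimal sketch of this boosting scheme (class and method names are hypothetical, not the actual JustAsk sources; only the weights 1.0 and 0.5 come from the text above):

```java
import java.util.List;

// Sketch of the first experiment's scoring: exact repetitions (frequency) and
// equivalences weigh 1.0 each; either direction of inclusion weighs 0.5.
public class RelationScorer {
    enum Relation { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static double weight(Relation r) {
        return r == Relation.EQUIVALENCE ? 1.0 : 0.5;
    }

    // frequency counts exact repetitions of the answer (weight 1.0 each)
    static double score(int frequency, List<Relation> relations) {
        double s = frequency;
        for (Relation r : relations) s += weight(r);
        return s;
    }

    public static void main(String[] args) {
        // an answer seen twice, equivalent to one candidate and including another
        System.out.println(score(2, List.of(Relation.EQUIVALENCE, Relation.INCLUDES)));
    }
}
```

Swapping the two weights for the Entity category, as in the later experiment, only requires changing the `weight` function.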
Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
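A minimal sketch of such a conversion normalizer (the class name is an assumption; the conversion factors are the standard ones):

```java
// Sketch of a conversion normalizer: brings the unit pairs mentioned above
// into a single unit so that two numerically equal answers can be clustered.
public class UnitNormalizer {
    public static double kmToMiles(double km)          { return km * 0.621371; }
    public static double metersToFeet(double m)        { return m * 3.28084; }
    public static double kgToPounds(double kg)         { return kg * 2.20462; }
    public static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    public static void main(String[] args) {
        System.out.println(kmToMiles(100));          // ~62.14
        System.out.println(celsiusToFahrenheit(100)); // 212.0
    }
}
```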
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, leaving room for further extensions.
4.2.2 Proximity Measure
As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
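The feature can be sketched as follows (class, method names and the passage-scanning details are assumptions; only the 20-character threshold and the 5-point boost come from the text above):

```java
import java.util.List;

// Sketch of the proximity feature: if the average character distance between
// the answer's position and the query keywords found in the passage is at
// most 20, the answer's score is boosted by 5 points.
public class ProximityBoost {
    static final int THRESHOLD = 20;  // characters
    static final double BOOST = 5.0;  // points

    public static double boostedScore(double score, String passage,
                                      String answer, List<String> keywords) {
        int aPos = passage.indexOf(answer);
        if (aPos < 0) return score;
        double sum = 0;
        int found = 0;
        for (String kw : keywords) {
            int kPos = passage.indexOf(kw);
            if (kPos >= 0) { sum += Math.abs(kPos - aPos); found++; }
        }
        return (found > 0 && sum / found <= THRESHOLD) ? score + BOOST : score;
    }
}
```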
4.3 Results
                    200 questions            1210 questions
Before features     P50.6  R87.0  A44.0      P28.0  R79.5  A22.2
After features      P51.1  R87.0  A44.5      P28.2  R79.8  A22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results (%) before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, so a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
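The annotation loop can be sketched as follows (the I/O handling and names are assumptions; the judgment options, the re-asking behaviour and the output format are the ones described above):

```java
import java.util.Map;
import java.util.Scanner;

// Sketch of the relation-annotation loop: read one of E/I1/I2/A/C/D for a pair
// of references, re-asking on unrecognized input, and build the output line.
public class RelationAnnotator {
    static final Map<String, String> LABELS = Map.of(
            "E",  "EQUIVALENCE",
            "I1", "INCLUSION",
            "I2", "INCLUSION",
            "A",  "COMPLEMENTARY ANSWERS",
            "C",  "CONTRADICTORY ANSWERS",
            "D",  "IN DOUBT");

    public static String annotate(Scanner in, int questionId, String ref1, String ref2) {
        while (true) {
            System.out.printf("QUESTION %d: %s vs %s (E/I1/I2/A/C/D or skip)%n",
                    questionId, ref1, ref2);
            String cmd = in.nextLine().trim();
            if (cmd.equalsIgnoreCase("skip")) return null;  // pair skipped by the user
            String label = LABELS.get(cmd.toUpperCase());
            if (label != null)
                return "QUESTION " + questionId + ": " + label + " - " + ref1 + " AND " + ref2;
            // unrecognized command: loop and ask for the judgment again
        }
    }
}
```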
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B| (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) (16)

Dice's coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five shared bigrams out of seven distinct ones).
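A sketch of both coefficients over character bigrams (note that a direct count for 'Kennedy'/'Keneddy' yields five shared bigrams, {Ke, en, ne, ed, dy}, so Dice = 10/12 and Jaccard = 5/7):

```java
import java.util.HashSet;
import java.util.Set;

// Character-bigram Jaccard and Dice similarity over two strings.
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++)
            set.add(s.substring(i, i + 2));
        return set;
    }

    // |A ∩ B| / |A ∪ B|, as in equation (15)
    public static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) inter.size() / union.size();
    }

    // 2|A ∩ B| / (|A| + |B|), as in equation (16)
    public static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        return 2.0 * inter.size() / (x.size() + y.size());
    }
}
```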
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m) (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
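A sketch of both metrics (here t is computed as half the number of matched characters that are out of order, the usual reading of 'transpositions'; the common prefix is capped at 4 characters, as in Winkler's work):

```java
// Jaro and Jaro-Winkler similarity, following equations (17) and (18).
public class JaroWinkler {

    public static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        // characters only match within this sliding window
        int window = Math.max(s1.length(), s2.length()) / 2 - 1;
        boolean[] m1 = new boolean[s1.length()], m2 = new boolean[s2.length()];
        int m = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window), hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = m2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0;
        // count matched characters that are out of order; t is half of that
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (m1[i]) {
                while (!m2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
                k++;
            }
        }
        double t = outOfOrder / 2.0;
        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    public static double jaroWinkler(String s1, String s2) {
        double dj = jaro(s1, s2);
        int l = 0;  // common prefix length, capped at 4
        while (l < Math.min(4, Math.min(s1.length(), s2.length()))
                && s1.charAt(l) == s2.charAt(l)) l++;
        return dj + l * 0.1 * (1 - dj);  // p = 0.1
    }
}
```

For 'MARTHA' vs. 'MARHTA', all six characters match with one transposition, giving Jaro = (1 + 1 + 5/6)/3 ≈ 0.944, and the 3-character prefix lifts Jaro-Winkler above that.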
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable cost G (instead of Levenshtein's constant 1):

D(i, j) = min{ D(i−1, j−1) + c(i, j), D(i−1, j) + G, D(i, j−1) + G } (19)

where c(i, j) is 0 if the characters at positions i and j match, and G otherwise.
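A sketch of the recurrence above (assuming a match costs 0 and any mismatch or gap costs G; with G = 1 this reduces to the Levenshtein distance):

```java
// Needleman-Wunsch distance with a configurable gap/mismatch cost G.
public class NeedlemanWunsch {

    public static double distance(String s1, String s2, double g) {
        double[][] d = new double[s1.length() + 1][s2.length() + 1];
        // base cases: transforming from/to the empty prefix costs one gap each
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i * g;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j * g;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                double sub = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0.0 : g;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + g, d[i][j - 1] + g));
            }
        }
        return d[s1.length()][s2.length()];
    }
}
```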
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
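The single-link criterion used here can be sketched as follows (names are hypothetical; the distance function is pluggable, so any of the metrics above can be tested at any threshold):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Single-link clustering sketch: two clusters merge while the smallest
// pairwise distance between their members is below the chosen threshold
// (e.g. 0.17, 0.33 or 0.5, as tested here).
public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers,
            BiFunction<String, String, Double> distance, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(List.of(a)));
        boolean merged = true;
        while (merged) {
            merged = false;
            search:
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    if (minDistance(clusters.get(i), clusters.get(j), distance) < threshold) {
                        clusters.get(i).addAll(clusters.remove(j));
                        merged = true;
                        break search;  // restart the scan after each merge
                    }
                }
            }
        }
        return clusters;
    }

    // single link: the distance between two clusters is their closest pair
    static double minDistance(List<String> c1, List<String> c2,
            BiFunction<String, String, Double> distance) {
        double best = Double.MAX_VALUE;
        for (String a : c1)
            for (String b : c2)
                best = Math.min(best, distance.apply(a, b));
        return best;
    }
}
```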
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                  0.33                  0.5
Levenshtein           P51.1 R87.0 A44.5     P43.7 R87.5 A38.5     P37.3 R84.5 A31.5
Overlap               P37.9 R87.0 A33.0     P33.0 R87.0 A33.0     P36.4 R86.5 A31.5
Jaccard index         P9.8  R56.0 A5.5      P9.9  R55.5 A5.5      P9.9  R55.5 A5.5
Dice Coefficient      P9.8  R56.0 A5.5      P9.9  R56.0 A5.5      P9.9  R55.5 A5.5
Jaro                  P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Jaro-Winkler          P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Needleman-Wunsch      P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P16.9 R59.0 A10.0
LogSim                P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P42.0 R87.0 A36.5
Soundex               P35.9 R83.5 A30.0     P35.9 R83.5 A30.0     P28.7 R82.0 A23.5
Smith-Waterman        P17.3 R63.5 A11.0     P13.4 R59.5 A8.0      P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, at each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are achieved by Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally gives an extra weight to strings that not only share at least one perfect match, but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:
distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
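A sketch of one plausible reading of this metric (tokenization on whitespace and the decision to count first-letter matches as matched pairs of leftover tokens are assumptions; the formulas are (20)-(22) above):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Sketch of the JFKDistance: exact token matches are the common terms, and
// leftover tokens are paired up when they share a first letter, each pair
// contributing half the weight of a full match.
public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = new LinkedList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new LinkedList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // count and remove exact token matches (the common terms)
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { it.remove(); commonTerms++; }
        }

        // t1 and t2 now hold the non-common terms; pair those sharing a first letter
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            String tok = it.next();
            for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                if (jt.next().charAt(0) == tok.charAt(0)) {
                    jt.remove(); it.remove(); firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;              // (21)
        double firstLetterMatchesScore = nonCommonTerms == 0 ? 0.0
                : ((double) firstLetterMatches / nonCommonTerms) / 2.0;          // (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                 // (20)
    }
}
```

Under this reading, 'John F Kennedy' vs. 'John Fitzgerald Kennedy' shares two common terms out of three, plus the paired leftovers 'F'/'Fitzgerald', yielding a distance low enough to cluster the pair.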
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
48
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions            1210 questions
Before              P51.1  R87.0  A44.5      P28.2  R79.8  A22.5
After new metric    P51.7  R87.0  A45.0      P29.2  R79.8  A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (%) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric over the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.
Metric \ Category     Human     Location    Numeric
Levenshtein           50.00     53.85       4.84
Overlap               42.42     41.03       4.84
Jaccard index          7.58      0.00       0.00
Dice                   7.58      0.00       0.00
Jaro                   9.09      0.00       0.00
Jaro-Winkler           9.09      0.00       0.00
Needleman-Wunsch      42.42     46.15       4.84
LogSim                42.42     46.15       4.84
Soundex               42.42     46.15       3.23
Smith-Waterman         9.09      2.56       0.00
JFKDistance           54.55     53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e. results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to a) ignore 'one' to 'three' when written in words, b) somehow exclude the publishing date from the snippet, and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules that remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references. After inserting the missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
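The third point above (final answers that are mere variations of the question) can be attacked with a token-overlap filter. A minimal sketch, with invented names, assuming we drop any candidate whose tokens all occur in the question:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QuestionOverlapFilter {
    // Drops candidates fully contained in the question, e.g. 'John Famalaro'
    // for 'Who is John J Famalaro accused of having killed?'.
    static List<String> filter(String question, List<String> candidates) {
        Set<String> qTokens = new HashSet<>(Arrays.asList(
                question.toLowerCase().replaceAll("[^a-z0-9 ]", " ").split("\\s+")));
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            boolean allInQuestion = true;
            for (String t : c.toLowerCase().split("\\s+"))
                if (!qTokens.contains(t)) { allInQuestion = false; break; }
            if (!allInQuestion) kept.add(c);   // keep only genuinely new answers
        }
        return kept;
    }
}
```

A production version would additionally handle morphological variations (e.g. possessives) rather than exact token matches.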
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 may help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
4 Relating Answers
In this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.
4.1 Improving the Answer Extraction module: Relations Module Implementation
Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.
Figure 4: Proposed approach to automate the detection of relations between answers.
We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.
The goal here is to extend the already existing metrics into more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.
The work in [26] taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:
– Frequency – i.e. the number of times the exact same answer appears without any variation;
– Equivalence;
– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
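The scoring scheme just described (1.0 for frequency and equivalence, 0.5 per inclusion) can be sketched as follows; the class mirrors the role of the Rules Scorer, but all names are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RulesScorer {
    enum Relation { EQUIVALENCE, INCLUSION }

    // Combines the frequency count (1.0 per occurrence) with relation boosts:
    // 1.0 per equivalence, 0.5 per inclusion, as in the first experiment.
    static Map<String, Double> score(List<String> answers,
                                     Map<String, List<Relation>> relations) {
        Map<String, Double> scores = new HashMap<>();
        for (String a : answers)                       // frequency component
            scores.merge(a, 1.0, Double::sum);
        for (Map.Entry<String, List<Relation>> e : relations.entrySet())
            for (Relation r : e.getValue())            // relation boosts
                scores.merge(e.getKey(),
                        r == Relation.EQUIVALENCE ? 1.0 : 0.5, Double::sum);
        return scores;
    }
}
```

Swapping the two weights per category (as tested for Entity) would only require parameterizing the two constants.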
Figure 5: Relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not under any circumstances underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, leaving room for further extensions.
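A minimal sketch of two of these normalizers (an acronym table and a Celsius-to-Fahrenheit conversion); the table contents and all names are illustrative, and the real normalizers cover more cases:

```java
import java.util.HashMap;
import java.util.Map;

public class AnswerNormalizers {
    // Acronym expansions, used for categories such as Location:Country
    // and Human:Group.
    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("US", "United States");
        EXPANSIONS.put("UN", "United Nations");
    }

    // Returns the expanded form when one is known, else the answer unchanged.
    static String expandAcronym(String answer) {
        return EXPANSIONS.getOrDefault(answer, answer);
    }

    // One of the Numeric:Temperature conversions (Celsius -> Fahrenheit).
    static double celsiusToFahrenheit(double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }
}
```

With both answers normalized to a canonical form, the clusterer can treat '100 Celsius' and '212 Fahrenheit', or 'US' and 'United States', as the same candidate.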
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing through the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
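The proximity boost can be sketched as below, using character positions within the passage; the 20-character threshold and 5-point boost follow the text, while the class and method names are invented:

```java
public class ProximityScorer {
    static final double THRESHOLD = 20.0;  // average character distance
    static final double BOOST = 5.0;       // score points added

    // Boosts the score when the answer sits close, on average, to the
    // query keywords found in the passage.
    static double boost(String passage, String answer, String[] keywords,
                        double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;            // answer not in passage
        double total = 0;
        int found = 0;
        for (String k : keywords) {
            int keywordPos = passage.indexOf(k);
            if (keywordPos >= 0) { total += Math.abs(keywordPos - answerPos); found++; }
        }
        if (found > 0 && total / found < THRESHOLD) score += BOOST;
        return score;
    }
}
```

A fuller implementation would consider every occurrence of each keyword rather than just the first, but the principle is the same.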
4.3 Results
                   200 questions               1210 questions
Before features    P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.
The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;
– 'I' for a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);
– 'A' for alternative (complementary) answers;
– 'C' for contradictory answers;
– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
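The judgment-to-label mapping at the core of this annotation tool can be sketched as follows (names invented, labels following the output format described above):

```java
public class RelationAnnotator {
    // Maps the one-letter judgments typed by the annotator to the labels
    // written in the output file; null signals an unrecognized command,
    // prompting the tool to ask for the judgment again.
    static String label(String cmd) {
        switch (cmd.toUpperCase()) {
            case "E":  return "EQUIVALENCE";
            case "I1": return "INCLUSION (first includes second)";
            case "I2": return "INCLUSION (second includes first)";
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADICTORY ANSWERS";
            case "D":  return "IN DOUBT";
            default:   return null;
        }
    }
}
```

The surrounding loop would read reference pairs, call this mapping on each typed command, and write one labeled line per pair.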
Figure 6: Execution of the application to create relations corpora on the command line.
Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven elements.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.
ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)
D(i j) =
D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G
(19)
This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested
– Soundex [36] – with this algorithm, focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                       0.33                       0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient      P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'JF Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is in a way a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common with the new metric, we can assume with some security that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way
distance = 1 - commonTermsScore - firstLetterMatchesScore  (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))  (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2  (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separating measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
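Concretely, equations (20)-(22) can be sketched as below. This is an illustrative reimplementation, not the actual JustAsk code; it assumes that length() counts tokens and that an initialism such as 'JF' is split into 'J' and 'F' during normalization (as the nonCommonTerms in the example suggest):

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Sketch of equations (20)-(22); assumptions noted above."""
    def tokenize(answer):
        tokens = []
        for tok in answer.replace('.', ' ').split():
            if tok.isupper() and tok.isalpha() and len(tok) <= 3:
                tokens.extend(tok)        # assumed normalization: 'JF' -> 'J', 'F'
            else:
                tokens.append(tok)
        return tokens

    t1, t2 = tokenize(a1), tokenize(a2)
    common = set(t1) & set(t2)                                 # perfect matches
    common_terms_score = len(common) / max(len(t1), len(t2))   # eq. (21)

    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    first_letter_matches = sum(1 for x in rest1 for y in rest2
                               if x[0].lower() == y[0].lower())
    # half the weight of a perfect token match, as described in the text
    first_letter_score = (first_letter_matches / non_common / 2) if non_common else 0.0
    return 1 - common_terms_score - first_letter_score         # eq. (20)

print(jfk_distance('JF Kennedy', 'John Fitzgerald Kennedy'))   # ~0.417
```

For the example pair this yields 1 - 1/3 - 2/4/2, i.e. roughly 0.42, low enough to cluster under the loosest threshold tested later.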
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                    200 questions               1210 questions
Before              P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0      P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features that this metric shares with the rules and with certain normalizers, it can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features: combination of distance metrics to enhance overall results
We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
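Such a decision tree amounts to a per-category dispatch table. A minimal sketch follows; the LogSim stand-in here is a simplified token-overlap distance, not the real LogSim metric described in Section 5.1.1:

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def logsim_standin(a: str, b: str) -> float:
    # illustrative stand-in only: token-overlap distance, not the real LogSim
    ta, tb = set(a.split()), set(b.split())
    return 1 - len(ta & tb) / max(len(ta), len(tb))

METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim_standin}

def category_distance(category: str, a: str, b: str):
    # fall back to the overall best metric when a category has no entry
    return METRIC_BY_CATEGORY.get(category, levenshtein)(a, b)

print(category_distance("Human", "kitten", "sitting"))  # 3
```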
For that effort we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis much), Table 10 shows the results for every metric incorporated, for the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunch      42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance          54.55    53.85       4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy bolded, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the 200-question test corpus, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, but one that changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written as words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate, single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.
After filling such gaps, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
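The third point above (answers echoing part of the question) could be handled with a simple token-overlap filter over the candidate list. A hypothetical sketch, with an illustrative threshold value:

```python
def echoes_question(question: str, answer: str, threshold: float = 0.8) -> bool:
    # True when nearly all answer tokens already occur in the question
    q_tokens = {t.strip('?.,!').lower() for t in question.split()}
    a_tokens = [t.strip('?.,!').lower() for t in answer.split()]
    if not a_tokens:
        return False
    overlap = sum(t in q_tokens for t in a_tokens) / len(a_tokens)
    return overlap >= threshold

question = "Who is John J Famalaro accused of having killed?"
candidates = ["John Famalaro", "Denise Huber"]
kept = [c for c in candidates if not echoes_question(question, c)]
print(kept)  # ['Denise Huber']
```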
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% (from 50.6% to 55.2%), using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric in their own right.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology / North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241-272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491-498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448-453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask: a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.
Appendix
Appendix A - Rules for detecting relations between answers
analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.
After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.
As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.
The work in [26] has taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, creating what we proposed to do during the first part of this thesis: a module in JustAsk to detect relations between answers, and from there use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').
These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.
The answers now possess a set of relations they share with other answers, namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency, i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).
For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
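As a sketch (not the actual Java implementation), the scoring scheme just described could look like this, with the weights from the first experiment as defaults:

```python
from collections import Counter

# weights from the first experiment: inclusion 0.5, frequency/equivalence 1.0
WEIGHTS = {"frequency": 1.0, "equivalence": 1.0, "inclusion": 0.5}

def boost_scores(answers, relations):
    """answers: candidate strings (repeats count as frequency);
    relations: (type, answer_a, answer_b) tuples found by the rules."""
    scores = Counter()
    for a in answers:                    # Baseline Scorer: exact repeats
        scores[a] += WEIGHTS["frequency"]
    for rel, a, b in relations:          # Rules Scorer: equivalence/inclusion
        scores[a] += WEIGHTS[rel]
        scores[b] += WEIGHTS[rel]
    return scores

scores = boost_scores(["Lee", "Lee", "Lee Myung-bak"],
                      [("equivalence", "Lee", "Lee Myung-bak")])
print(scores["Lee"], scores["Lee Myung-bak"])  # 3.0 2.0
```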
Figure 5: Basic architecture of the relations module. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, to count the frequency, and the Rules Scorer, to count relations of equivalence and inclusion.
4.2 Additional Features
4.2.1 Normalizers
To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
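The conversion normalizer can be sketched as a table of conversions toward a single reference unit per answer type; the factors are the standard ones, but the structure and names here are illustrative, not the actual JustAsk code:

```python
# map each source unit to (target unit, conversion function)
CONVERSIONS = {
    "kilometres": ("miles", lambda v: v * 0.621371),
    "meters": ("feet", lambda v: v * 3.28084),
    "kilograms": ("pounds", lambda v: v * 2.20462),
    "celsius": ("fahrenheit", lambda v: v * 9 / 5 + 32),
}

def normalize_numeric(value: float, unit: str):
    unit = unit.lower()
    if unit in CONVERSIONS:
        target, convert = CONVERSIONS[unit]
        return round(convert(value), 1), target
    return value, unit  # already in the reference unit

print(normalize_numeric(100, "celsius"))    # (212.0, 'fahrenheit')
print(normalize_numeric(10, "kilometres"))  # (6.2, 'miles')
```

With both candidates mapped to the same unit, '100 kilometres' and '62.1 miles' can fall into the same cluster.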
Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably miss: 'Mr. King'. In this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.
A normalizer was also made to solve the formatting issue of point 6, so that 'Grozny' and 'GROZNY' are treated as equal answers.
For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
4.2.2 Proximity Measure
As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
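A sketch of this proximity boost follows; the 20-character threshold and the 5-point boost come from the text, while the position handling is simplified to the first occurrence of each string:

```python
def average_keyword_distance(passage: str, answer: str, keywords) -> float:
    # average character distance between the answer and each query keyword
    pos_answer = passage.find(answer)
    distances = [abs(passage.find(kw) - pos_answer)
                 for kw in keywords
                 if pos_answer >= 0 and passage.find(kw) >= 0]
    return sum(distances) / len(distances) if distances else float("inf")

def boosted_score(base: float, passage: str, answer: str, keywords,
                  threshold: int = 20, boost: float = 5.0) -> float:
    close = average_keyword_distance(passage, answer, keywords) < threshold
    return base + boost if close else base

passage = "the capital of Japan is Tokyo"
print(boosted_score(10.0, passage, "Tokyo", ["capital", "Japan"]))  # 15.0
```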
4.3 Results
                    200 questions               1210 questions
Before features     P 50.6  R 87.0  A 44.0      P 28.0  R 79.5  A 22.2
After features      P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
4.4 Results Analysis
These preliminary results remain underwhelming, taking into account the effort and hope put into the relations module (Section 4.1) in particular. The variation in results is minimal: only one to three correct answers, depending on the test corpus (Multisix or TREC).
4.5 Building an Evaluation Corpus
To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.
As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:
– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.
If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:
QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee
In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
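The annotation loop behind this output can be sketched as follows; the I/O is reduced to a callable so the example is self-contained, and the labels follow the text:

```python
from itertools import combinations

LABELS = {"E": "EQUIVALENCE", "I1": "INCLUSION", "I2": "INCLUSION",
          "A": "COMPLEMENTARY ANSWERS", "C": "CONTRADICTORY ANSWERS",
          "D": "IN DOUBT"}

def annotate(question_id, references, read=input):
    lines = []
    for a, b in combinations(references, 2):   # iterate over reference pairs
        cmd = read(f"{a} AND {b}? [E/I1/I2/A/C/D/skip] ").strip()
        while cmd not in LABELS and cmd != "skip":
            cmd = read("Please repeat the judgment: ").strip()
        if cmd != "skip":
            lines.append(f"QUESTION {question_id}: {LABELS[cmd]} - {a} AND {b}")
    return lines

# simulate the user always answering 'E'
out = annotate(5, ["Lee Myung-bak", "Lee"], read=lambda prompt: "E")
print(out)  # ['QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee']
```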
Figure 6: Execution of the application to create relations' corpora in the command line.
Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. Several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library): 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = |A ∩ B| / |A ∪ B|  (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements in the two sequences.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)  (16)

Dice's Coefficient will basically give double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 x 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, as the union contains seven distinct bigrams.
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 x (m/|s1| + m/|s2| + (m - t)/m)  (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric, given by:

dw = dj + l p (1 - dj)  (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance does. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of Levenshtein's constant 1):

D(i,j) = min { D(i-1, j-1) + G, D(i-1, j) + G, D(i, j-1) + G }  (19)
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
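For illustration, here are two of the metrics above in miniature: bigram-based Dice and Jaccard over the 'Kennedy'/'Keneddy' pair. Note that with the bigram sets listed, the intersection {Ke, en, ne, ed, dy} has five elements. This is a sketch; the actual experiments used the Java simmetrics implementations:

```python
def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))   # eq. (16) over bigram sets

def jaccard(a: str, b: str) -> float:
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)              # eq. (15) over bigram sets

print(round(dice("Kennedy", "Keneddy"), 2))     # 0.83  (10/12)
print(round(jaccard("Kennedy", "Keneddy"), 2))  # 0.71  (5/7)
```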
Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
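The clustering stage these thresholds plug into can be sketched as a greedy single-link pass with an interchangeable distance function. This is illustrative only: a full agglomerative version would also merge clusters that a new answer links together, and the real clusterer and metrics live in the Java distances package:

```python
def single_link_clusters(answers, distance, threshold):
    clusters = []
    for ans in answers:
        # single link: join a cluster if any member is within the threshold
        home = next((c for c in clusters
                     if any(distance(ans, m) <= threshold for m in c)), None)
        if home is None:
            clusters.append([ans])
        else:
            home.append(ans)
    return clusters

def overlap_distance(a: str, b: str) -> float:
    # simple token-overlap distance, used here only for demonstration
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(ta & tb) / min(len(ta), len(tb))

answers = ["JF Kennedy", "John Fitzgerald Kennedy", "Paris"]
print(single_link_clusters(answers, overlap_distance, 0.5))
# [['JF Kennedy', 'John Fitzgerald Kennedy'], ['Paris']]
```

Lowering the threshold to 0.17 would keep the two Kennedy variants apart, which is exactly the trade-off the experiments below measure.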
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results: the threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient      P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunch       P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0
Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens per answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.
The new distance is computed the following way:
distance = 1 - commonTermsScore - firstLetterMatchesScore (20)
in which the scores for the common terms and for the matches of the first letter of each token are given by
commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)
and
firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
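Putting equations (20)–(22) together, a sketch of the metric could look like this (tokenisation by whitespace and the greedy first-letter pairing are our interpretation of the description above, not necessarily the system's exact implementation):

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Distance combining common terms and first-letter matches (sketch)."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))       # eq. (21)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    # greedily pair leftover tokens whose first letters agree
    matches, used = 0, [False] * len(rest2)
    for t in rest1:
        for j, u in enumerate(rest2):
            if not used[j] and t[0].lower() == u[0].lower():
                used[j], matches = True, matches + 1
                break
    non_common = len(rest1) + len(rest2)
    fl_score = (matches / non_common) / 2 if non_common else 0.0  # eq. (22)
    return 1.0 - common_score - fl_score                      # eq. (20)
```

For 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 - 1/3 - (2/4)/2 ≈ 0.42, low enough to cluster the two answers, where Levenshtein and Overlap would keep them apart.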
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                 |   200 questions   |   1210 questions
Before           | P51.1 R87.0 A44.5 | P28.2 R79.8 A22.5
After new metric | P51.7 R87.0 A45.0 | P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
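The category-to-metric dispatch sketched above amounts to a simple lookup table. In the sketch below, `levenshtein` is a standard edit-distance implementation and `logsim` a placeholder for the LogSim metric of Section 3.2.2 (all names are ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance, iterative dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def logsim(a: str, b: str) -> float:
    raise NotImplementedError  # see Section 3.2.2 for the actual formula

# per-category choice of clustering metric, as in the pseudocode above
METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim}

def metric_for(category):
    # fall back to the overall best metric when no category rule exists
    return METRIC_BY_CATEGORY.get(category, levenshtein)
```
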
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, whose answers are obtained by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric / Category | Human | Location | Numeric
Levenshtein       | 50.00 |  53.85   |  4.84
Overlap           | 42.42 |  41.03   |  4.84
Jaccard index     |  7.58 |   0.00   |  0.00
Dice              |  7.58 |   0.00   |  0.00
Jaro              |  9.09 |   0.00   |  0.00
Jaro-Winkler      |  9.09 |   0.00   |  0.00
Needleman-Wunsch  | 42.42 |  46.15   |  4.84
LogSim            | 42.42 |  46.15   |  4.84
Soundex           | 42.42 |  46.15   |  3.23
Smith-Waterman    |  9.09 |   2.56   |  0.00
JFKDistance       | 54.55 |  53.85   |  4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers are easily failable due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
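Fixes b) and c) could be prototyped with regular expressions along these lines (the pattern and the publish-year heuristic are our illustration, not the system's actual filters):

```python
import re

# 4-digit years in the range 1000-2099 (illustrative pattern)
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def date_candidates(snippet, publish_year=None):
    """Extract year candidates, dropping the article's own publication year."""
    return [y for y in YEAR.findall(snippet) if y != str(publish_year)]
```

A spelled-out-number filter for a) could similarly drop tokens matching `one|two|three` before candidate extraction.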
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
ndash We developed new metrics to improve the clustering of answers
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
57
Figure 5 Relationsrsquo module basic architecture Each answer has now a fieldrepresenting the set of relations it shares with other answers The interfaceScorer made to boost the score of the answers sharing a particular relation hastwo components Baseline Scorer to count the frequency and Rules Scorer tocount relations of equivalence and inclusion
40
42 Additional Features
421 Normalizers
To solve the conversion abbreviation and acronym issues described in points2 5 and 7 of Section 34 a few normalizers were developed For the con-version aspect a specific normalizer class was created dealing with answersof the type NumericDistance NumericSize NumericWeight and Nu-merictemperature that is able to convert kilometres to miles meters to feetkilograms to pounds and Celsius degrees to Fahrenheit respectively
Another normalizer was created to remove several common abbreviaturesfrom the candidate answers hoping that (with the help of new measures) wecould improve clustering too The abbreviatures removed were lsquoMrrsquo lsquoMrsrsquorsquo Mslsquo lsquoKingrsquo lsquoQueenrsquo lsquoDamersquo lsquoSirrsquo and lsquoDrrsquo Over here we find a partic-ular issue worthy of noting we could be removing something we should notbe removing Consider the possible answers lsquoMr Trsquo or lsquoQueenrsquo (the band)Curiously an extended corpus of 1134 questions provided us even other exam-ple we would probably be missing lsquoMr Kingrsquo ndash in this case however we caneasily limit King removals only to those occurences of the word lsquoKingrsquo right atthe beggining of the answer We can also add exceptions for the cases aboveStill it remains a curious statement that we must not underestimate under anycircumstances the richness of a natural language
A normalizer was also made to solve the formatting issue of point 6 to treatlsquoGroznyrsquo and lsquoGROZNYrsquo as equal answers
For the point 7 a generic acronym normalizer was used converting acronymssuch as lsquoUSrsquo to lsquoUnited Statesrsquo or lsquoUNrsquo to lsquoUnited Nationsrsquo ndash for categoriessuch as LocationCountry and HumanGroup As with the case of the nor-malizers above it did not produce any noticeable changes in the system resultsbut it also evaluates very particular cases for the moment allowing room forfurther extensions
422 Proximity Measure
As with the problem in 1 we decided to implement the solution of giving extraweight to answers that remain closer to the query keywords that appear throughthe passages For that we calculate the average distance from the answer tothe several keywords and if that average distance is under a certain threshold(we defined it as being 20 characters long) we boost that answerrsquos score (by 5points)
41
43 Results
200 questions 1210 questionsBefore features P506 R870
A440P280 R795A222
After features A511 R870A445
P282 R798A225
Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC
44 Results Analysis
These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular
The variation in results is minimal ndash only one to three correct answersdepending on the corpus test (Multisix or TREC)
45 Building an Evaluation Corpus
To evaluate the lsquorelation modulersquo described over 41 a corpus for evaluatingrelations between answers was designed before dealing with the implementationof the module itself For that effort we took the references we gathered in theprevious corpus and for each pair of references assign one of the possible fourtypes of relation they share (Equivalence Inclusion Contradiction or Alterna-tive) A few of these relations were not entirely clear and in that case a fifthoption (lsquoIn Doubtrsquo) was created This corpus was eventually not used but itremains here for future work purposes
As it happened with the lsquoanswers corpusrsquo in order to create this corporacontaining the relationships between the answers more efficiently a new appli-cation was made It receives the questions and references file as a single inputand then goes to iterate on the references on pairs asking the user to chooseone of the following options corresponding to the relations mentioned in Section21
ndash lsquoErsquo for a relation of equivalence between pairs
ndash lsquoIrsquo if it is a relation of inclusion (this was later expanded to lsquoI1rsquo if thefirst answer includes the second and lsquoI2rsquo if the second answer includes thefirst)
ndash lsquoArsquo for alternative (complementary) answers
ndash lsquoCrsquo for contradictory answers
42
ndash lsquoDrsquo for causes of doubt between relation types
If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo
Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line
Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase
43
5 Clustering Answers
As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline
51 Testing Different Similarity Metrics
511 Clusterer Metrics
Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo
ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)
J(AB) =|A capB||A cupB|
(15)
In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences
ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula
Dice(AB) =2|A capB||A|+ |B|
(16)
Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13
44
ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following
Jaro(s1 s2) =1
3(m
|s1|+
m
|s2|+mminus tm
) (17)
in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order
ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by
dw = dj + (lp(1minus dj)) (18)
with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro
ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)
D(i j) =
D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G
(19)
This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested
ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently
ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo
45
Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric
512 Results
Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way
513 Results Analysis
As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases
46
Metric / Threshold   0.17                    0.33                    0.5
Levenshtein          P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index        P 9.8  R 56.0 A 5.5     P 9.9  R 55.5 A 5.5     P 9.9  R 55.5 A 5.5
Dice Coefficient     P 9.8  R 56.0 A 5.5     P 9.9  R 56.0 A 5.5     P 9.9  R 55.5 A 5.5
Jaro                 P 11.0 R 63.5 A 7.0     P 11.9 R 63.0 A 7.5     P 9.8  R 56.0 A 5.5
Jaro-Winkler         P 11.0 R 63.5 A 7.0     P 11.9 R 63.0 A 7.5     P 9.8  R 56.0 A 5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A 8.0     P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it adds an extra weight for strings that, beyond sharing at least one perfect match, also have a token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a full match between two tokens, for that extra weight.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches on the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
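As an illustration only, equations (20)–(22) can be prototyped as below; whitespace tokenization and the handling of duplicate tokens are assumptions, since the text does not specify them:

```python
def jfk_distance(a1, a2):
    """Sketch of the 'JFKDistance' of equations (20)-(22); assumes the
    answers were already normalized to strings."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    # equation (21): share of common terms over the longer answer
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # equation (22): pairs of unmatched tokens sharing a first letter,
    # weighted at half the value of a perfect match
    matches = sum(1 for u in rest1 for v in rest2
                  if u[0].lower() == v[0].lower())
    fl_score = (matches / non_common) / 2 if non_common else 0.0
    # equation (20)
    return 1 - common_score - fl_score
```

For 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' this gives 1 − 1/3 − 0.25 ≈ 0.42, low enough to cluster where Levenshtein would not.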
5.2.2 Results
Table 9 shows us the results after the implementation of this new distance metric.
                     200 questions            1210 questions
Before               P 51.1 R 87.0 A 44.5     P 28.2 R 79.8 A 22.5
After new metric     P 51.7 R 87.0 A 45.0     P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
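A minimal sketch of such a per-category dispatcher follows; the category labels and the stand-in metric implementations are assumptions for illustration only:

```python
def levenshtein(a, b):
    """Edit distance normalized by the longer string's length."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n, 1)

def logsim_placeholder(a, b):
    """Stand-in for LogSim: a token-overlap distance, for illustration only."""
    t1, t2 = set(a.split()), set(b.split())
    return 1 - len(t1 & t2) / max(len(t1), len(t2), 1)

# hypothetical category -> metric table; unknown categories fall back
METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim_placeholder}

def distance(category, a1, a2):
    return METRIC_BY_CATEGORY.get(category, levenshtein)(a1, a2)
```

The table can be extended one category at a time as the evaluation identifies a winning metric for it.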
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis much), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index        7.58     0.00        0.00
Dice                 7.58     0.00        0.00
Jaro                 9.09     0.00        0.00
Jaro-Winkler         9.09     0.00        0.00
Needleman-Wunsch     42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman       9.09     2.56        0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error detected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.
After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
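The first improvement point above (time-changing answers) could be prototyped as a simple score adjustment; the boost factor, the year regex, and the passage representation below are all assumptions for illustration:

```python
import re

CORPUS_YEAR = 2010  # year the Google passages were collected (from the text)
BOOST = 1.2         # boost factor chosen arbitrarily for this sketch

def boost_recent(passages):
    """passages: list of (text, score) pairs; boost those mentioning a
    year at or after the corpus collection year."""
    out = []
    for text, score in passages:
        years = [int(y) for y in re.findall(r"\b(?:19|20)\d\d\b", text)]
        if years and max(years) >= CORPUS_YEAR:
            score *= BOOST
        out.append((text, score))
    return out
```

Applied to the Japan example, a 2010 passage mentioning Naoto Kan would outrank an older one mentioning Yukio Hatoyama, all else being equal.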
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
ndash Use the rules as a distance metric of its own
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 could help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo
45
Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric
512 Results
Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way
513 Results Analysis
As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases
46
MetricThreshold 017 033 05Levenshtein P511
R870A445
P437R875A385
P373R845A315
Overlap P379R870A330
P330R870A330
P364R865A315
Jaccard index P98R560A55
P99R555A55
P99R555A55
Dice Coefficient P98R560A55
P99R560A55
P99R555A55
Jaro P110R635A70
P119R630A75
P98R560A55
Jaro-Winkler P110R635A70
P119R630A75
P98R560A55
Needleman-Wunch P437R870A380
P437R870A380
P169R590A100
LogSim P437R870A380
P437R870A380
A420R870A365
Soundex P359R835A300
P359R835A300
P287R820A235
Smith-Waterman P173R635A110
P134R595A80
P104R575A60
Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded
52 New Distance Metric - rsquoJFKDistancersquo
521 Description
Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in
47
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight
The new distance is computed in the following way:

    distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

    commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
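As a concrete illustration, the measure can be sketched in a few lines of Python. This is a hypothetical re-implementation based on equations (20)-(22); JustAsk itself is written in Java, and the function and variable names below are our own:

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Sketch of the 'JFKDistance' measure: common terms plus half-weighted
    first-letter matches between the remaining tokens (names illustrative)."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_terms_score = len(common) / max(len(t1), len(t2))
    # tokens left unmatched on each side (e.g. 'J.', 'John', 'F.', 'Fitzgerald')
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    first_letter_matches = 0
    used = [False] * len(rest2)
    for tok in rest1:  # greedily pair leftover tokens sharing a first letter
        for j, other in enumerate(rest2):
            if not used[j] and tok[0].lower() == other[0].lower():
                used[j] = True
                first_letter_matches += 1
                break
    non_common = len(rest1) + len(rest2)
    # half the weight of a perfect token match, as described above
    first_letter_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1.0 - common_terms_score - first_letter_score
```

Under this sketch, 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' scores 1 - 1/3 - 0.25, about 0.42, noticeably below what Levenshtein gives for the same pair.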
A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                     200 questions         1210 questions
Before               P51.1 R87.0 A44.5     P28.2 R79.8 A22.5
After new metric     P51.7 R87.0 A45.0     P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As these results show, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features - combination of distance metrics to enhance overall results
We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
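Operationally, such a decision tree reduces to a simple dispatch table. The sketch below is purely illustrative: the category labels and metric names follow the text, but the mapping itself is hypothetical (and, as shown next, the measured results ended up not justifying it):

```python
def pick_metric(category: str) -> str:
    """Return the name of the distance metric assumed to work best for a
    question category; unlisted categories fall back to an overall default."""
    metric_by_category = {
        "Human": "Levenshtein",
        "Location": "LogSim",
    }
    return metric_by_category.get(category, "Levenshtein")
```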
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for much), Table 10 shows the results for every metric incorporated, by the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).
Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
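The filtering rule suggested for the third point above (removing answers that are mere variations of the question) could be prototyped as follows. This is a sketch under our own assumptions, not the JustAsk implementation, and the example names are illustrative:

```python
import re

def remove_question_echoes(question, answers):
    """Drop candidate answers whose tokens all occur in the question itself,
    i.e. answers that merely echo the question under slight variations."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    kept = []
    for ans in answers:
        a_tokens = re.findall(r"\w+", ans.lower())
        if a_tokens and all(t in q_tokens for t in a_tokens):
            continue  # the answer adds no new content over the question
        kept.append(ans)
    return kept
```

For the Famalaro example, a candidate such as 'John Famalaro' would be discarded because both of its tokens already appear in the question, while a genuinely new answer string would be kept.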
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways to improve JustAsk's results in the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
– We developed new answer normalizers to relate two equal answers in different formats.
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% (in relative terms), using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could tackle, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and open questions of Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology / North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico (ISSN 1870-4069), 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003, companion volume (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2010.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
43 Results
200 questions 1210 questionsBefore features P506 R870
A440P280 R795A222
After features A511 R870A445
P282 R798A225
Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC
44 Results Analysis
These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular
The variation in results is minimal ndash only one to three correct answersdepending on the corpus test (Multisix or TREC)
45 Building an Evaluation Corpus
To evaluate the lsquorelation modulersquo described over 41 a corpus for evaluatingrelations between answers was designed before dealing with the implementationof the module itself For that effort we took the references we gathered in theprevious corpus and for each pair of references assign one of the possible fourtypes of relation they share (Equivalence Inclusion Contradiction or Alterna-tive) A few of these relations were not entirely clear and in that case a fifthoption (lsquoIn Doubtrsquo) was created This corpus was eventually not used but itremains here for future work purposes
As it happened with the lsquoanswers corpusrsquo in order to create this corporacontaining the relationships between the answers more efficiently a new appli-cation was made It receives the questions and references file as a single inputand then goes to iterate on the references on pairs asking the user to chooseone of the following options corresponding to the relations mentioned in Section21
ndash lsquoErsquo for a relation of equivalence between pairs
ndash lsquoIrsquo if it is a relation of inclusion (this was later expanded to lsquoI1rsquo if thefirst answer includes the second and lsquoI2rsquo if the second answer includes thefirst)
ndash lsquoArsquo for alternative (complementary) answers
ndash lsquoCrsquo for contradictory answers
42
ndash lsquoDrsquo for causes of doubt between relation types
If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have
QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee
In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo
Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line
Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase
43
5 Clustering Answers
As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline
51 Testing Different Similarity Metrics
511 Clusterer Metrics
Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo
ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)
J(AB) =|A capB||A cupB|
(15)
In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences
ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula
Dice(AB) =2|A capB||A|+ |B|
(16)
Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13
44
ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following
Jaro(s1 s2) =1
3(m
|s1|+
m
|s2|+mminus tm
) (17)
in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order
ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by
dw = dj + (lp(1minus dj)) (18)
with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro
ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)
D(i j) =
D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G
(19)
This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested
ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently
ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo
45
Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric
512 Results
Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way
513 Results Analysis
As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases
46
MetricThreshold 017 033 05Levenshtein P511
R870A445
P437R875A385
P373R845A315
Overlap P379R870A330
P330R870A330
P364R865A315
Jaccard index P98R560A55
P99R555A55
P99R555A55
Dice Coefficient P98R560A55
P99R560A55
P99R555A55
Jaro P110R635A70
P119R630A75
P98R560A55
Jaro-Winkler P110R635A70
P119R630A75
P98R560A55
Needleman-Wunch P437R870A380
P437R870A380
P169R590A100
LogSim P437R870A380
P437R870A380
A420R870A365
Soundex P359R835A300
P359R835A300
P287R820A235
Smith-Waterman P173R635A110
P134R595A80
P104R575A60
Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded
52 New Distance Metric - rsquoJFKDistancersquo
521 Description
Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in
47
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight
The new distance is computed the following way
distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)
in which the scores for the common terms and matches of the first letter ofeach token are given by
commonTermsScore =commonTerms
max(length(a1) length(a2)) (21)
and
firstLetterMatchesScore =firstLetterMatchesnonCommonTerms
2(22)
with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example
A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see
48
522 Results
Table 9 shows us the results after the implementation of this new distancemetric
200 questions 1210 questionsBefore P511 R870
A445P282 R798A225
After new metric P517 R870A450
P292 R798A233
Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC
523 Results Analysis
As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant
53 Possible new features ndash Combination of distance met-rics to enhance overall results
We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are
49
looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10
Metric \ Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy bolded in the original, except where all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers easily fail, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.
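Points a) and c) could be prototyped along the following lines. This is only an illustrative sketch: the regular expression and the word list are assumptions, not JustAsk's actual filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch of points a) and c) above: drop small spelled-out
// numbers and keep only candidates that look like actual numeric tokens.
public class NumericCandidateFilter {

    private static final Set<String> SMALL_NUMBER_WORDS = Set.of("one", "two", "three");
    // Hypothetical shape for a numeric answer: digits, optional decimal part.
    private static final Pattern NUMERIC = Pattern.compile("\\d{1,4}([.,]\\d+)?");

    static List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            String t = c.trim().toLowerCase();
            if (SMALL_NUMBER_WORDS.contains(t)) continue;   // a) ignore 'one'..'three'
            if (NUMERIC.matcher(t).matches()) kept.add(c.trim()); // c) numeric shape only
        }
        return kept;
    }
}
```

Excluding the publishing date (point b) would have to happen earlier, when the snippet itself is built, so it is not shown here.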
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.
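One possible shape for such a rule is sketched below: reject a candidate whose tokens all occur in the question. The tokenization is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of one possible filtering rule: treat a candidate answer as a
// question echo when every one of its tokens already occurs in the question.
public class QuestionEchoFilter {

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(
                s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+")));
    }

    static boolean echoesQuestion(String question, String answer) {
        Set<String> q = tokens(question);
        for (String t : tokens(answer))
            if (!t.isEmpty() && !q.contains(t)) return false; // has at least one new token
        return true;
    }
}
```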
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents us with all the possible 'correct' answers, which is potentially missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision as high as 91, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 could help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J.J. Jiang and D.W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between rules
– 'D' for causes of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
Figure 6: Execution of the application to create relations' corpora in the command line.
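The judgment-to-output mapping described above could be reconstructed roughly as follows. The single-letter commands other than 'D' are hypothetical, since the text only documents 'D'.

```java
// Hypothetical reconstruction of the annotation commands described above;
// only 'D' (doubt) is documented in the text, the other letters are assumed.
public class RelationJudgment {

    static String label(String command) {
        switch (command.trim().toUpperCase()) {
            case "E":    return "EQUIVALENCE";
            case "I":    return "INCLUSION";
            case "C":    return "COMPLEMENTARY ANSWERS";
            case "X":    return "CONTRADICTORY ANSWERS";
            case "D":    return "IN DOUBT";   // 'D' for causes of doubt
            case "SKIP": return null;         // skip this pair of references
            default:     return "";           // unrecognized: ask the user again
        }
    }

    // One output line per judged pair, in the format shown above, e.g.
    // "QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee"
    static String outputLine(int questionId, String relation, String a1, String a2) {
        return "QUESTION " + questionId + " " + relation + " - " + a1 + " AND " + a2;
    }
}
```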
Over the next chapter we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.
5.1 Testing Different Similarity Metrics
5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = |A ∩ B| / |A ∪ B| (15)
In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters across both.
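Read over character sets, Eq. (15) can be sketched as follows. Whether the comparison is over characters or tokens is an implementation choice, so this is only an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Character-set reading of Eq. (15): |A ∩ B| / |A ∪ B|.
public class JaccardIndex {

    static double jaccard(String a, String b) {
        Set<Character> sa = charSet(a), sb = charSet(b);
        Set<Character> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    private static Set<Character> charSet(String s) {
        Set<Character> out = new HashSet<>();
        for (char c : s.toCharArray()) out.add(c);
        return out;
    }
}
```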
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|) (16)
Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five common bigrams out of seven distinct ones).
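The bigram computation above can be reproduced with a short sketch:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Bigram-based Dice coefficient, reproducing the 'Kennedy' / 'Keneddy'
// computation above (five shared bigrams out of six per set).
public class DiceCoefficient {

    static Set<String> bigrams(String s) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) out.add(s.substring(i, i + 2));
        return out;
    }

    static double dice(String a, String b) {
        Set<String> sa = bigrams(a), sb = bigrams(b);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return 2.0 * inter.size() / (sa.size() + sb.size());
    }
}
```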
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) · ( m/|s1| + m/|s2| + (m − t)/m ) (17)
in which s1 and s2 are the two sequences to be compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj) (18)
with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
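Equation (18) applied on top of a precomputed Jaro score might look as follows. Capping the common-prefix length at four characters follows Winkler's usual formulation, though the text does not state it explicitly.

```java
// Eq. (18) applied on top of a precomputed Jaro similarity dj:
// dw = dj + l * p * (1 - dj), with l the (capped) common prefix length.
public class WinklerBoost {

    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    static double jaroWinkler(double jaro, String a, String b, double p) {
        int l = Math.min(4, commonPrefixLength(a, b)); // cap at 4, per Winkler
        return jaro + l * p * (1 - jaro);
    }
}
```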
ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)
D(i j) =
D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G
(19)
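A direct transcription of recurrence (19), as a sketch: with G = 1 it reduces to the plain Levenshtein distance.

```java
// Needleman-Wunsch distance per recurrence (19), with uniform gap/mismatch
// cost G; with G = 1 this is exactly the Levenshtein distance.
public class NeedlemanWunsch {

    static double distance(String a, String b, double g) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i * g;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j * g;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                double sub = d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0.0 : g);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + g, d[i][j - 1] + g));
            }
        return d[a.length()][b.length()];
    }
}
```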
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:
– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.
– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
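The threshold-based single-link clustering used in these experiments can be sketched as follows; the greedy merge strategy is an assumption about the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Sketch of single-link clustering: an answer joins every cluster containing
// some member within the distance threshold, and all such clusters are merged.
public class SingleLinkClusterer {

    static List<List<String>> cluster(List<String> answers,
            ToDoubleBiFunction<String, String> distance, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<List<String>> linked = new ArrayList<>();
            for (List<String> c : clusters)
                for (String member : c)
                    if (distance.applyAsDouble(answer, member) <= threshold) {
                        linked.add(c);
                        break;
                    }
            List<String> merged = new ArrayList<>();
            merged.add(answer);
            for (List<String> c : linked) {
                merged.addAll(c);
                clusters.remove(c);
            }
            clusters.add(merged);
        }
        return clusters;
    }
}
```

Any of the metrics above can be plugged in as the `distance` argument, which is what makes the threshold comparison straightforward.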
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient      P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded in the original.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is in a way a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common with the new metric, we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)

with a1 and a2 being the two answers converted to strings after normalization, length(·) their lengths in tokens, and nonCommonTerms the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
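Equations (20)-(22) can be put together into a sketch like the following; the tokenizer and the greedy first-letter pairing are simplifying assumptions about the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the JFKDistance of Eqs. (20)-(22); tokenization, normalization
// and the greedy first-letter pairing are assumptions for illustration.
public class JFKDistance {

    static double distance(String a1, String a2) {
        List<String> t1 = tokens(a1), t2 = tokens(a2);
        List<String> rest1 = new ArrayList<>(), rest2 = new ArrayList<>(t2);
        int common = 0;
        for (String t : t1)
            if (rest2.remove(t)) common++; else rest1.add(t);
        double commonTermsScore = (double) common / Math.max(t1.size(), t2.size());
        int nonCommonTerms = rest1.size() + rest2.size();
        int firstLetterMatches = 0;
        for (String u : rest1)
            for (Iterator<String> it = rest2.iterator(); it.hasNext(); )
                if (it.next().charAt(0) == u.charAt(0)) {
                    it.remove();
                    firstLetterMatches++;
                    break;
                }
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    static List<String> tokens(String s) {
        return new ArrayList<>(Arrays.asList(
                s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+")));
    }
}
```

On the running example, 'Kennedy' gives commonTermsScore = 1/3, and the two first-letter pairs over four unmatched tokens give (2/4)/2 = 0.25, so the distance drops to roughly 0.42.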
A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separate measures for each type of answer (Numeric, Entity, Location, etc.) were needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                      200 questions               1210 questions
Before                P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric      P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3
Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
53
[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003
[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004
54
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995
[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal
[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009
[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools
55
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
56
Appendix

Appendix A – Rules for detecting relations between answers
5 Clustering Answers
As we concluded in Section 4.4, results were still not improving as much as one could hope for. However, we still had new clustering metrics that could potentially improve our system, as previously proposed. Therefore, we decided to run new experiments, bringing new similarity metrics into the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could yield better results than the Levenshtein distance used in the baseline.
5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics
Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. Several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library – 'uk.ac.shef.wit.simmetrics.similaritymetrics':
– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.
– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the two strings 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of the two strings [15]. The bigram sets are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five: {Ke, en, ne, ed, dy}. The result is therefore (2 · 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven elements.
– Jaro [13, 12] – this distance is mainly used in the record-linkage (duplicate detection) area. It is calculated as:

Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it gives a much better score than Jaro to pairs such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)
The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – an algorithm focused on matching words that sound alike but differ lexically. With it, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the alignment that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
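To make the behaviour of these measures concrete, here is a minimal Python sketch (ours, for illustration; not the SimMetrics code) of the bigram-based Jaccard and Dice coefficients, and of the Needleman-Wunsch recurrence assuming matches cost 0 and mismatches/gaps cost G; with G = 1 it reduces to the Levenshtein distance:

```python
def bigrams(s):
    """Set of character bigrams, e.g. 'Kennedy' -> {'Ke','en','nn','ne','ed','dy'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard index over bigram sets: |A ∩ B| / |A ∪ B|."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    """Dice's coefficient over bigram sets: 2|A ∩ B| / (|A| + |B|)."""
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

def needleman_wunsch(s1, s2, gap=1.0):
    """DP distance where insertions, deletions and mismatches all cost `gap`."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else gap
            D[i][j] = min(D[i - 1][j - 1] + sub,   # match / mismatch
                          D[i - 1][j] + gap,       # deletion
                          D[i][j - 1] + gap)       # insertion
    return D[m][n]
```

On the 'Kennedy'/'Keneddy' example above, dice returns 5/6 and jaccard 5/7, while needleman_wunsch with a smaller gap cost simply scales the edit distance down.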
Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
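The threshold-based single-link clustering used in these experiments can be sketched as follows (an illustrative sketch, not JustAsk's actual clusterer; any distance function normalized to [0, 1] can be plugged in):

```python
import itertools

def single_link_clusters(answers, distance, threshold):
    """Greedy single-link clustering: merge two clusters whenever any
    cross-pair of answers is closer than `threshold`."""
    clusters = [[a] for a in answers]
    merged = True
    while merged:
        merged = False
        for c1, c2 in itertools.combinations(clusters, 2):
            if any(distance(a, b) < threshold for a in c1 for b in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 + c2)
                merged = True
                break  # restart scan over the updated cluster list
    return clusters
```

Lowering the threshold makes merges harder, which is why each metric was evaluated at 0.17, 0.33 and 0.5.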
5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that gives the system its best results at this point, and we would expect roughly proportional results between the two corpora either way.
5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – LogSim even holds the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear-cut as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference is virtually nonexistent in most of them.
Metric \ Threshold    0.17                 0.33                 0.5
Levenshtein           P51.1 R87.0 A44.5    P43.7 R87.5 A38.5    P37.3 R84.5 A31.5
Overlap               P37.9 R87.0 A33.0    P33.0 R87.0 A33.0    P36.4 R86.5 A31.5
Jaccard index         P9.8  R56.0 A5.5     P9.9  R55.5 A5.5     P9.9  R55.5 A5.5
Dice coefficient      P9.8  R56.0 A5.5     P9.9  R56.0 A5.5     P9.9  R55.5 A5.5
Jaro                  P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Jaro-Winkler          P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Needleman-Wunsch      P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P16.9 R59.0 A10.0
LogSim                P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P42.0 R87.0 A36.5
Soundex               P35.9 R83.5 A30.0    P35.9 R83.5 A30.0    P28.7 R82.0 A23.5
Smith-Waterman        P17.3 R63.5 A11.0    P13.4 R59.5 A8.0     P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in percentage, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are achieved by Levenshtein at threshold 0.17.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description
Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein gives us a very high distance between them, and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for tokens that do not match exactly but share their first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match on the common term 'Kennedy', and two unmatched tokens on each side. Even the apparently intuitive Overlap distance gives us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein is closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight is obviously not as large as the score given to a 'perfect' match, but it should help us cluster more answers of this kind. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.
The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
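A minimal sketch of this metric follows (ours, not the actual JustAsk implementation). It assumes length() counts tokens and that both members of a first-letter pair count towards firstLetterMatches; both are assumptions on our part:

```python
def jfk_distance(a1, a2):
    """Sketch of the JFKDistance described above (token-level lengths assumed)."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_score = len(common) / max(len(t1), len(t2))
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # Pair unmatched tokens across the two answers by shared initial,
    # using each token at most once; both tokens of a pair are counted.
    matches = 0
    for t in rest1:
        for u in rest2:
            if t and u and t[0].lower() == u[0].lower():
                matches += 2
                rest2.remove(u)
                break
    first_letter_score = (matches / non_common) / 2 if non_common else 0.0
    return 1 - common_score - first_letter_score
```

On 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' this yields 1 − 1/3 − 1/2 = 1/6, low enough to cluster the pair, whereas identical answers score 0.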
A few flaws in this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that separate measures for each type of answer (Numeric, Entity, Location, etc.) were needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.
                     200 questions        1210 questions
Before new metric    P51.1 R87.0 A44.5    P28.2 R79.8 A22.5
After new metric     P51.7 R87.0 A45.0    P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results, in percentage, before and after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As these results show, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – combination of distance metrics to enhance overall results
We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gave us the best answers for the category Human, but for the category Location LogSim was the metric that offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
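Such a decision tree amounts to a category-to-metric lookup table. A minimal sketch (the metric functions below are toy stand-ins of ours, not the real implementations):

```python
# Toy stand-ins for the real metrics; any callable (str, str) -> float works.
def levenshtein(a, b):
    """Placeholder: exact-match toy distance."""
    return 0.0 if a == b else 1.0

def logsim(a, b):
    """Placeholder: token-overlap toy distance."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / max(len(sa), len(sb))

# Per-category choice of distance metric, as in the pseudocode above.
CATEGORY_METRIC = {
    "Human": levenshtein,
    "Location": logsim,
}

def metric_for(category):
    """Pick the distance metric configured for a question category."""
    return CATEGORY_METRIC.get(category, levenshtein)  # default fallback
```

Unlisted categories fall back to a default metric, mirroring the fact that one metric (Levenshtein) was the overall baseline winner.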
For that effort, we used the best threshold for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), Table 10 shows the results for every metric over the three main categories that can be easily evaluated – Human, Location and Numeric (answers of categories such as Description, which are retrieved from the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index        7.58     0.00        0.00
Dice coefficient     7.58     0.00        0.00
Jaro                 9.09     0.00        0.00
Jaro-Winkler         9.09     0.00        0.00
Needleman-Wunsch     42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman       9.09     2.56        0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results, in percentage, for every similarity metric, by category, using the test corpus of 200 questions and 32 passages from Google. The best result per category is achieved by JFKDistance, except where several metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always roughly proportional to how the metric performs overall. That stops us from picking a best metric per category, since the new distance implemented equals or surpasses all the others in every category (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – some questions do not have a single, unchanging answer. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules removing such variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers as separate, single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords; and in the case of films and other works, perhaps trace only candidates between quotation marks, when possible.
– Incomplete references – the most easily correctable error detected is in the reference file that lists all the possible 'correct' answers, which may be missing viable references.
After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
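The filtering rule suggested above for question echoes, dropping candidates whose content is entirely contained in the question, could be sketched as follows (an illustrative sketch of ours, not JustAsk's actual filter):

```python
def filter_question_echoes(question, answers):
    """Drop candidate answers that merely repeat (part of) the question,
    e.g. 'John Famalaro' for 'Who is John J. Famalaro accused of having killed?'."""
    q_tokens = {t.strip(".,?!").lower() for t in question.split()}
    kept = []
    for answer in answers:
        a_tokens = {t.strip(".,?!").lower() for t in answer.split()}
        # keep only answers that add content not already in the question
        if not a_tokens <= q_tokens:
            kept.append(answer)
    return kept
```

A real implementation would also handle the morphological variations mentioned above (initials, middle names, inflections) rather than exact token containment alone.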
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision to as high as 91% using a modified version of the Multisix corpus, composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noémie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. (Doklady Akademii Nauk SSSR, 163(4):845–848, 1965.)
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. JustAsk – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
55
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
56
Appendix
Appendix A ndash Rules for detecting relations between rules
57
– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) * ( m/|s1| + m/|s2| + (m - t)/m )   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.
– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l * p * (1 - dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.
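As an illustration, both metrics above can be sketched in a few lines. This is a minimal reimplementation for exposition, assuming the usual matching-window definition (two characters match only if they are no further apart than half the longer string's length, minus one); it is not the package actually used in this work:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity as in Eq. (17): m matches, t transpositions."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matches1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matches1.append(c)
                break
    matches2 = [s2[j] for j, u in enumerate(used) if u]
    m = len(matches1)
    if m == 0:
        return 0.0
    # each pair of matched characters out of order counts as half a transposition
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Eq. (18): boost by common-prefix length l (conventionally capped at 4)."""
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == 4:
            break
        l += 1
    return dj + l * p * (1 - dj)
```

With the classic example, `jaro("MARTHA", "MARHTA")` yields 17/18 ≈ 0.944, and the shared 'MAR' prefix lifts the Jaro-Winkler score higher still.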
– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min( D(i-1, j-1) + G,  D(i-1, j) + G,  D(i, j-1) + G )   (19)
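The recurrence can be sketched as follows, assuming (as in Levenshtein) that exact matches cost 0, so that with G = 1 the result reduces to the Levenshtein distance:

```python
def needleman_wunsch(s1: str, s2: str, G: int = 1) -> int:
    """Edit distance with a configurable gap/substitution cost G (Eq. 19)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * G                      # deletions from s1
    for j in range(1, m + 1):
        D[0][j] = j * G                      # insertions into s1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else G
            D[i][j] = min(D[i - 1][j - 1] + sub,   # match / substitution
                          D[i - 1][j] + G,         # deletion
                          D[i][j - 1] + G)         # insertion
    return D[n][m]
```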
This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, which is focused on matching words that sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the alignment that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
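A simplified Soundex sketch illustrating the phonetic matching described above (the classical algorithm also treats 'h' and 'w' as transparent when collapsing repeated codes; that refinement is omitted here):

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter + up to three consonant codes."""
    table = {c: d for letters, d in
             [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6")]
             for c in letters}
    word = word.lower()
    digits = [table.get(c, "") for c in word]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:        # skip vowels, collapse adjacent repeats
            out.append(d)
        prev = d
    return (word[0].upper() + "".join(out) + "000")[:4]
```

Even this simplified version maps the two misspellings mentioned above to the same code, which is exactly the behaviour we want from a phonetic matcher.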
Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
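For illustration, the thresholded single-link clustering referred to above can be sketched as follows; the distance shown is a stand-in built on Python's difflib, not this work's own implementation, and any of the metrics discussed could be passed in its place:

```python
from difflib import SequenceMatcher

def distance(a: str, b: str) -> float:
    """Toy string distance in [0, 1] (1 - similarity ratio)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def single_link(answers, dist, threshold):
    """Single-link clustering: an answer joins a cluster if it is within
    `threshold` of at least ONE member; clusters linked by the new
    answer are merged together."""
    clusters = []
    for ans in answers:
        linked = [c for c in clusters
                  if any(dist(ans, m) <= threshold for m in c)]
        merged = [ans] + [m for c in linked for m in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters
```

Lowering the threshold makes clustering stricter, which is precisely the effect the three tested values (0.5, 0.33, 0.17) probe.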
5.1.2 Results
Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.
5.1.3 Results Analysis
As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.
Metric \ Threshold     0.17                    0.33                    0.5
Levenshtein            P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap                P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index          P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice Coefficient       P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                   P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler           P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch       P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim                 P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex                P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman         P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight to strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, to obtain that extra weight.
The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches on the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
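Equations (20)-(22) can be sketched as follows; the whitespace tokenization and the greedy pairing of first-letter matches are assumptions made for this sketch, since the text does not fully specify them:

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Sketch of the JFKDistance of Eq. (20):
    1 - commonTermsScore - firstLetterMatchesScore."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)                      # perfect token matches
    rest1 = [t for t in t1 if t not in common]      # nonCommonTerms, answer 1
    rest2 = [t for t in t2 if t not in common]      # nonCommonTerms, answer 2
    # Eq. (21): share of common terms over the longer answer
    common_score = len(common) / max(len(t1), len(t2))
    # pair up remaining tokens whose first letters agree ('J.' ~ 'John')
    first_letter_matches, used = 0, set()
    for t in rest1:
        for j, u in enumerate(rest2):
            if j not in used and t[0].lower() == u[0].lower():
                first_letter_matches += 1
                used.add(j)
                break
    non_common = len(rest1) + len(rest2)
    # Eq. (22): half the weight of a perfect match
    flm_score = ((first_letter_matches / non_common) / 2
                 if non_common else 0.0)
    return 1 - common_score - flm_score
```

On the running example, 'Kennedy' contributes 1/3 as the common-terms score and the two first-letter pairs contribute (2/4)/2 = 0.25, giving a distance low enough to cluster the two spellings together.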
A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions            1210 questions
Before              P 51.1 R 87.0 A 44.5     P 28.2 R 79.8 A 22.5
After new metric    P 51.7 R 87.0 A 45.0     P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (%) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim was the metric offering the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
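Such a decision tree amounts to a lookup from question category to metric. A minimal sketch, with placeholder metric functions standing in for the real implementations:

```python
# Placeholder metrics; the real system would plug in its actual
# Levenshtein and LogSim implementations here.
def levenshtein(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0

def logsim(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0

# Hypothetical mapping from question category to best-performing metric.
METRIC_FOR_CATEGORY = {
    "Human": levenshtein,
    "Location": logsim,
}

def metric_for(category: str):
    """Fall back to Levenshtein for categories without a clear winner."""
    return METRIC_FOR_CATEGORY.get(category, levenshtein)
```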
For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not much affect the analysis we are looking for), the results for every metric, broken down by the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.
Metric \ Category    Human    Location    Numeric
Levenshtein          50.00    53.85        4.84
Overlap              42.42    41.03        4.84
Jaccard index         7.58     0.00        0.00
Dice                  7.58     0.00        0.00
Jaro                  9.09     0.00        0.00
Jaro-Winkler          9.09     0.00        0.00
Needleman-Wunsch     42.42    46.15        4.84
LogSim               42.42    46.15        4.84
Soundex              42.42    46.15        3.23
Smith-Waterman        9.09     2.56        0.00
JFKDistance          54.55    53.85        4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy in bold, except when all metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric per category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers are easily failed, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving it as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords, and, in the case of films and other works, perhaps trace only candidates between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references. After inserting such missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
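Two of the fixes proposed above (ignoring spelled-out small numbers and discarding candidates that merely echo the question) could be sketched as follows; the stop-list, regular expression and function names are illustrative assumptions, not the system's actual code:

```python
import re

# point a) of the numeric bullet: small numbers written out as words
SPELLED_OUT = {"one", "two", "three"}

def acceptable_numeric(candidate: str) -> bool:
    """Reject spelled-out small numbers; accept digit-based expressions."""
    token = candidate.strip().lower()
    if token in SPELLED_OUT:
        return False
    # accept expressions such as '1976' or '48.3'
    return re.fullmatch(r"\d+([.,]\d+)?", token) is not None

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping letters, digits, spaces."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace())

def drop_question_echoes(candidates, question):
    """Remove answers whose tokens all occur in the question itself,
    so 'John Famalaro' is filtered out of 'Who is John J. Famalaro
    accused of having killed?' despite the middle-initial variation."""
    q_tokens = set(normalize(question).split())
    return [c for c in candidates
            if not set(normalize(c).split()) <= q_tokens]
```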
6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results through the Answer Extraction module.
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers Module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could attempt, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix

Appendix A – Rules for detecting relations between answers
Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric
512 Results
Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way
513 Results Analysis
As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases
46
MetricThreshold 017 033 05Levenshtein P511
R870A445
P437R875A385
P373R845A315
Overlap P379R870A330
P330R870A330
P364R865A315
Jaccard index P98R560A55
P99R555A55
P99R555A55
Dice Coefficient P98R560A55
P99R560A55
P99R555A55
Jaro P110R635A70
P119R630A75
P98R560A55
Jaro-Winkler P110R635A70
P119R630A75
P98R560A55
Needleman-Wunch P437R870A380
P437R870A380
P169R590A100
LogSim P437R870A380
P437R870A380
A420R870A365
Soundex P359R835A300
P359R835A300
P287R820A235
Smith-Waterman P173R635A110
P134R595A80
P104R575A60
Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded
52 New Distance Metric - rsquoJFKDistancersquo
521 Description
Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in
47
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight
The new distance is computed the following way
distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)
in which the scores for the common terms and matches of the first letter ofeach token are given by
commonTermsScore =commonTerms
max(length(a1) length(a2)) (21)
and
firstLetterMatchesScore =firstLetterMatchesnonCommonTerms
2(22)
with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example
A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see
48
522 Results
Table 9 shows us the results after the implementation of this new distancemetric
200 questions 1210 questionsBefore P511 R870
A445P282 R798A225
After new metric P517 R870A450
P292 R798A233
Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC
523 Results Analysis
As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant
53 Possible new features ndash Combination of distance met-rics to enhance overall results
We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
()
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis much), Table 10 shows the results for every metric across the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).
Metric \ Category     Human    Location   Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance           54.55    53.85       4.84
Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric per category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily – when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it could still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – for some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules that remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we only search for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers as separate references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references. After filling such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
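As an illustration of the filtering fix suggested above, here is a minimal sketch (function names and the candidate list are ours) that discards candidate answers whose tokens are all already contained in the question:

```python
import re

def normalize(text: str) -> set:
    """Lower-case and keep word tokens only."""
    return set(re.findall(r"\w+", text.lower()))

def filter_question_echoes(question: str, answers: list) -> list:
    """Drop candidates that are mere variations of the question itself."""
    q_tokens = normalize(question)
    return [a for a in answers if not normalize(a) <= q_tokens]

# Hypothetical candidate list for the Famalaro example above.
candidates = ["John Famalaro", "Kathy"]
print(filter_question_echoes(
    "Who is John J. Famalaro accused of having killed?", candidates))
```

A token-subset test like this is deliberately loose; a production rule set would also have to handle reorderings with extra words, honorifics and partial name matches.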
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways to improve JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implement solutions to the problems and open questions of Section 5.4, which can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between rules
Metric \ Threshold      0.17                      0.33                      0.5
Levenshtein             P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap                 P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index           P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient        P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                    P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler            P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch        P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                  P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex                 P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman          P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0
Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.
5.2 New Distance Metric – 'JFKDistance'
5.2.1 Description
Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two tokens left unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, to obtain that extra weight.
The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)
with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
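Under these definitions, the metric can be sketched as follows. This is our reconstruction, not the thesis's actual code: the tokenizer and the exact counting of first-letter matches are assumptions the text does not fully specify.

```python
import re

def jfk_distance(a1: str, a2: str) -> float:
    # Assumed normalization: case-fold and keep word tokens only.
    t1 = re.findall(r"\w+", a1.lower())
    t2 = re.findall(r"\w+", a2.lower())
    common = set(t1) & set(t2)
    # Equation 21: common terms over the longer answer's token count.
    common_score = len(common) / max(len(t1), len(t2), 1)
    # Tokens left unmatched on each side.
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # Count cross-pairs of unmatched tokens sharing the same first letter.
    first_letter_matches = sum(
        1 for x in rest1 for y in rest2 if x[0] == y[0])
    # Equation 22: half the weight of a full token match.
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    # Equation 20.
    return 1.0 - common_score - fl_score

# 'J. F. Kennedy' vs 'John Fitzgerald Kennedy':
# commonTermsScore = 1/3, firstLetterMatchesScore = (2/4)/2 = 0.25,
# so the distance is 1 - 1/3 - 0.25, roughly 0.42.
print(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"))
```

On the worked example the distance comes out around 0.42, low enough to cluster the two answers at a 0.5 threshold, where plain Levenshtein would keep them apart.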
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
56
Appendix
Appendix A ndash Rules for detecting relations between rules
57
order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J', and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, to temper that extra weight.
The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
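The distance described above can be sketched as follows; this is a minimal illustration assuming whitespace tokenization and case-sensitive matching (the function and variable names are ours, not JustAsk's actual code):

```python
def jfk_distance(a1: str, a2: str) -> float:
    """Token-based distance rewarding exact token matches and, with half
    the weight, tokens that only share a first letter (cf. Eqs. 20-22)."""
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)                       # perfect token matches
    rest1 = [t for t in t1 if t not in common]       # unmatched tokens, side 1
    rest2 = [t for t in t2 if t not in common]       # unmatched tokens, side 2
    # unmatched tokens that share a first letter with one on the other side
    first_letter_matches = sum(
        1 for x in rest1 if any(y[0] == x[0] for y in rest2 if y and x)
    )
    non_common = len(rest1) + len(rest2)
    common_score = len(common) / max(len(t1), len(t2))
    # a first-letter match counts half as much as a perfect match (Eq. 22)
    fl_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1.0 - common_score - fl_score
```

On the running example, 'John Fitzgerald Kennedy' vs. 'J F Kennedy' yields 1 - 1/3 - 1/4 ≈ 0.42, below the 2/3 given by the Overlap distance, so the pair becomes clusterable.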
A few flaws in this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
5.2.2 Results
Table 9 shows the results after the implementation of this new distance metric.

                     200 questions               1210 questions
Before               P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric     P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results, in %, after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.
5.2.3 Results Analysis
As we can see from these results, even with the redundancy of features already shared with the rules and with certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.
5.3 Possible new features – Combination of distance metrics to enhance overall results
We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:
if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
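The decision-tree idea above can be prototyped as a simple dispatch table; this is a sketch, where the metric implementations and the table entries are illustrative, not JustAsk's actual code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, one of the candidate metrics."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def overlap(a: str, b: str) -> float:
    """Overlap distance: 1 - |common tokens| / min(|A|, |B|)."""
    ta, tb = set(a.split()), set(b.split())
    return 1 - len(ta & tb) / min(len(ta), len(tb))

# Hypothetical category-to-metric table; in practice it would be filled in
# from per-category evaluation results such as those in Table 10.
METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": overlap}

def distance_for(category: str, a1: str, a2: str, default=levenshtein):
    """Pick the metric assigned to this question category, or a default."""
    return METRIC_BY_CATEGORY.get(category, default)(a1, a2)
```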
For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), here are the results for every metric incorporated, for the three main categories that can be easily evaluated, Human, Location and Numeric (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.
Metric \ Category     Human    Location   Numeric
Levenshtein           50.00      53.85      4.84
Overlap               42.42      41.03      4.84
Jaccard index          7.58       0.00      0.00
Dice                   7.58       0.00      0.00
Jaro                   9.09       0.00      0.00
Jaro-Winkler           9.09       0.00      0.00
Needleman-Wunsch      42.42      46.15      4.84
LogSim                42.42      46.15      4.84
Soundex               42.42      46.15      3.23
Smith-Waterman         9.09       2.56      0.00
JFKDistance           54.55      53.85      4.84

Table 10: Accuracy results, in %, for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across the categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric per category, since we find that the new distance implemented equals or surpasses all the others (as in the case of the Human category).
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:
– Time-changing answers – Some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime minister of Japan?'. We have observed that former prime minister Yukio Hatoyama actually earns a higher score than the then-current prime minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.
– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.
– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we treat the multiple answers, separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.
– Incomplete references – the most easily detectable error to be corrected is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
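Points a) and c) of the numeric-answer item above could be prototyped along the following lines; this is a sketch under assumed patterns and names, not the system's actual filter:

```python
import re

# Spelled-out small numbers that tend to be noise rather than answers (point a).
SPELLED_SMALL = {"one", "two", "three"}

# Crude year pattern for Numeric:Date candidates (point c); assumes years 1000-2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def filter_numeric_candidates(candidates, expect_date=False):
    """Drop spelled-out 'one'..'three'; for date questions keep only year-like strings."""
    kept = []
    for c in candidates:
        if c.strip().lower() in SPELLED_SMALL:
            continue                      # point a): ignore 'one' to 'three'
        if expect_date and not YEAR.search(c):
            continue                      # point c): demand a year-like token
        kept.append(c)
    return kept
```

Point b), excluding the publishing date, would additionally require knowing where in the snippet the date appears, which depends on the search-engine wrapper.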
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways to improve JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implement solutions to the problems and questions left unanswered in Section 5.4, which can help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004
54
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006
[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995
[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006
[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970
[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003
[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002
[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006
[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995
[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal
[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918
[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988
[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009
[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools
55
[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981
[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005
[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001
[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003
[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002
[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990
[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994
[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005
56
Appendix
Appendix A ndash Rules for detecting relations between rules
57
looking for), here are the results for every metric incorporated, over the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input). The results are shown in Table 10.
Metric / Category     Human   Location  Numeric
Levenshtein           50.00   53.85     4.84
Overlap               42.42   41.03     4.84
Jaccard index          7.58    0.00     0.00
Dice                   7.58    0.00     0.00
Jaro                   9.09    0.00     0.00
Jaro-Winkler           9.09    0.00     0.00
Needleman-Wunsch      42.42   46.15     4.84
LogSim                42.42   46.15     4.84
Soundex               42.42   46.15     3.23
Smith-Waterman         9.09    2.56     0.00
JFKDistance           54.55   53.85     4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best result for each category is highlighted, except when all the metrics give the same result.
We can easily notice that all of these metrics behave in similar ways across categories, i.e., the results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric per category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).
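To make the comparison concrete, here is a minimal sketch, in Python, of three of the metrics from Table 10 (Levenshtein over characters, Jaccard and Dice over word sets). The candidate/reference strings are illustrative only and are not taken from the actual test corpus.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Jaccard index over word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dice(a: str, b: str) -> float:
    """Dice coefficient over word sets: 2|A & B| / (|A| + |B|)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0

candidate, reference = "Naoto Kan", "Mr. Naoto Kan"
print(levenshtein(candidate, reference))      # 4 (insert "Mr. ")
print(round(jaccard(candidate, reference), 2))  # 0.67
print(round(dice(candidate, reference), 2))     # 0.8
```

Character-level metrics such as Levenshtein penalize every extra token, while the set-based Jaccard and Dice only look at word overlap, which is one reason they rank differently in Table 10.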
5.4 Posterior analysis
In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.
– Time-changing answers: some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).
– Numeric answers fail easily, because when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.
– The filtering stage does not seem to work for certain questions: in some cases we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules that remove these variations from the final answers.
– Answers in different formats than expected: some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as individual references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?
– Difficulty tracing certain types of answer (category confusion): for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only candidates between quotation marks, if possible.
– Incomplete references: the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.
After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
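The numeric and filtering fixes proposed above could be prototyped along the following lines. This is only a sketch: the function name, the small stop-list of spelled-out numbers, and the example candidates are hypothetical, not JustAsk's actual code.

```python
import re

# Spelled-out small numbers that are rarely genuine numeric answers (assumption).
SMALL_NUMBER_WORDS = {"one", "two", "three"}

def filter_candidates(question: str, candidates: list[str]) -> list[str]:
    """Drop candidates that merely repeat (part of) the question, and
    spelled-out small numbers picked up by overly greedy extraction."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    kept = []
    for cand in candidates:
        c_tokens = set(re.findall(r"\w+", cand.lower()))
        if c_tokens and c_tokens <= q_tokens:
            continue  # candidate is a variation of the question itself
        if cand.lower() in SMALL_NUMBER_WORDS:
            continue  # 'one'..'three' written out as words
        kept.append(cand)
    return kept

print(filter_candidates(
    "Who is John J. Famalaro accused of having killed?",
    ["John Famalaro", "one", "Denise Huber"]))  # ['Denise Huber']
```

A real implementation would also need the date-exclusion step (removing the snippet's publishing date before extraction), which depends on the retrieval pipeline and is omitted here.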
6 Conclusions and Future Work
Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
Here are the main contributions achieved:
– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results within the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
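As an illustration of the normalization and relation-boosting ideas listed above, here is a minimal sketch; the date formats, bonus value and function names are assumptions made for illustration, not the actual JustAsk implementation.

```python
from datetime import datetime

def normalize_date(answer: str) -> str:
    """Map date answers in different formats to one canonical form, so that
    '22 November 1963' and '1963-11-22' land in the same cluster."""
    for fmt in ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d"):  # illustrative formats
        try:
            return datetime.strptime(answer.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return answer.strip().lower()  # fall back to plain lowercasing

def boost_equivalent(scores: dict[str, float], bonus: float = 0.5) -> dict[str, float]:
    """Add a bonus to candidates whose normalized forms coincide: a crude
    stand-in for the equivalence relation between answers."""
    norm = {a: normalize_date(a) for a in scores}
    boosted = dict(scores)
    for a in scores:
        for b in scores:
            if a != b and norm[a] == norm[b]:
                boosted[a] += bonus
    return boosted

print(normalize_date("22 November 1963"))  # '1963-11-22'
print(boost_equivalent({"22 November 1963": 1.0, "1963-11-22": 1.0, "1964": 1.0}))
```

Under this scheme the two equivalent date candidates reinforce each other, while the unrelated '1964' keeps its original score.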
In the end, the results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.
There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implement solutions to the problems and open questions of Section 5.4, which can help improve these results somewhat.
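The second item, using the rules as a distance metric of their own, could be sketched by wiring a rule-based test into a simple single-link clusterer. Here `related` is a toy stand-in (case-insensitive token inclusion) for the actual equivalence/inclusion rules, and all names are illustrative.

```python
def related(a: str, b: str) -> bool:
    """Toy stand-in for the relation rules: equivalence up to case,
    or inclusion of one answer's tokens in the other's."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta

def rule_distance(a: str, b: str) -> float:
    """Rules recast as a binary distance: 0 if related, 1 otherwise."""
    return 0.0 if related(a, b) else 1.0

def cluster(answers: list[str]) -> list[list[str]]:
    """Single-link clustering: a candidate joins the first cluster that
    contains some member at rule-distance 0 from it."""
    clusters: list[list[str]] = []
    for ans in answers:
        for c in clusters:
            if any(rule_distance(ans, member) == 0.0 for member in c):
                c.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

print(cluster(["Naoto Kan", "Kan", "Yukio Hatoyama", "naoto kan"]))
# [['Naoto Kan', 'Kan', 'naoto kan'], ['Yukio Hatoyama']]
```

A graded version could return intermediate distances (e.g. from the similarity metrics of Table 10) instead of a 0/1 decision, which would let the same rules plug into standard agglomerative clustering with a threshold.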
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1,261,167 (1918) and 1,435,663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates
ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers
ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case
ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible
ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references
After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552
51
6 Conclusions and Future Work
Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system
Here are the main contributions achieved
ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module
ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion
ndash We developed new metrics to improve the clustering of answers
ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats
ndash We developed new corpora that we believe will be extremely useful in thefuture
ndash We performed further analysis to detect persistent error patterns
In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions
There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few
ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers
ndash Use the rules as a distance metric of its own
ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results
52
References
[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003
[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003
[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003
[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006
[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007
[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945
[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004
[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008
[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999
[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964
[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901
[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995
[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989
53
[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997
[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003
[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006
[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997
[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998
[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965
[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002
[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993
[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007
[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009
[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003
[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004
54
[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence
6 Conclusions and Future Work
Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.
These are the main contributions achieved:
– We performed a state-of-the-art survey on relations between answers and paraphrase identification, in order to detect new ways of improving JustAsk's results in the Answer Extraction module.
– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.
– We developed new metrics to improve the clustering of answers.
– We developed new answer normalizers to relate two equal answers in different formats.
– We developed new corpora that we believe will be extremely useful in the future.
– We performed further analysis to detect persistent error patterns.
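To illustrate the kind of answer normalization mentioned above (relating two equal answers in different formats), here is a minimal sketch. The helper name `normalize_numeric` is hypothetical and this is not the actual JustAsk implementation, only an example of the idea under the assumption of numeric answers:

```python
def normalize_numeric(answer: str) -> str:
    """Reduce a numeric answer to a canonical string so that
    '1,000,000', '1000000' and '1.0E6' all compare as equal."""
    text = answer.strip().lower()
    # drop thousands separators before parsing
    text = text.replace(",", "")
    try:
        value = float(text)
        # canonical form: integer rendering when the value is whole
        return str(int(value)) if value.is_integer() else str(value)
    except ValueError:
        # non-numeric answers pass through unchanged
        return text

# Two equal answers in different formats map to the same string:
assert normalize_numeric("1,000,000") == normalize_numeric("1.0E6")
```

With such a normalizer, two candidate answers can be clustered together whenever their canonical forms coincide, instead of requiring an exact string match.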
In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision to as high as 91% using a modified version of the Multisix corpus, composed of 200 questions.
There is still much work to be done in the future. We understand that the problem does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could tackle, here are a selected few:
– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.
– Use the rules as a distance metric of their own.
– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 could help improve these results somewhat.
References
[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870–4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – Short Papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.
Appendix
Appendix A – Rules for detecting relations between answers
Relations between answers
1 Rules for detecting relations between answers
This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.
Assuming candidate answers A1 and A2, and considering:
• Aij – the jth token of candidate answer Ai
• Aijk – the kth char of the jth token of candidate answer Ai
• lemma(X) – the lemma of X
• ishypernym(X, Y) – returns true if X is the hypernym of Y
• length(X) – the number of tokens of X
• contains(X, Y) – returns true if X contains Y
• containsic(X, Y) – returns true if X contains Y, case insensitive
• equals(X, Y) – returns true if X is lexicographically equal to Y
• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive
• matches(X, RegExp) – returns true if X matches the regular expression RegExp
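A direct transcription of these predicates into code could look like the following sketch (assuming whitespace tokenization and plain substring containment; the names mirror the definitions above, but this is an illustration, not the actual JustAsk code):

```python
import re

def tokens(a):
    # whitespace tokenization, as assumed by length()
    return a.split()

def length(a):
    return len(tokens(a))

def contains(x, y):
    return y in x

def containsic(x, y):
    return y.lower() in x.lower()

def equals(x, y):
    return x == y

def equalsic(x, y):
    return x.lower() == y.lower()

def matches(x, regexp):
    # the whole string must match the pattern
    return re.fullmatch(regexp, x) is not None

assert contains("D01 M01 Y2001", "M01 Y2001")
assert equalsic("Kennedy", "KENNEDY")
assert matches("200", r"\d+")
```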
1.1 Questions of category Date
Ai includes Aj if
• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: "M01 Y2001" includes "D01 M01 Y2001"
• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: "Y2001" includes "D01 M01 Y2001"
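The two Date rules above can be sketched in a few lines (a minimal example assuming whitespace tokenization and substring containment; the function name is hypothetical):

```python
def date_includes(ai: str, aj: str) -> bool:
    """Ai includes Aj when the more specific date Aj contains Ai
    and carries one or two extra tokens (e.g. an added day, or day+month)."""
    ai_len = len(ai.split())
    aj_len = len(aj.split())
    return ai in aj and (aj_len - ai_len) in (1, 2)

assert date_includes("M01 Y2001", "D01 M01 Y2001")  # one extra token
assert date_includes("Y2001", "D01 M01 Y2001")      # two extra tokens
```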
1.2 Questions of category Numeric
Given:

• number = (\d+([0-9,.E])*)
• lessthan = ((less than)|(almost)|(up to)|(nearly))
• morethan = ((more than)|(over)|(at least))
• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)
• min = 0.9 (or another value)
• numX - the numeric value in X
• numXmax = numX × max
• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  - Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  - Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  - Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  - Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  - Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  - Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  - Example: less than 300 includes 200
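The tolerance-band equivalence and the "more than" inclusion rule can be sketched as follows (function names are illustrative; MAX/MIN mirror the 1.1/0.9 constants above, and a single number per answer is assumed):

```python
import re

MAX, MIN = 1.1, 0.9                      # tolerance band, tunable
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")  # simplified 'number' pattern

def num(x: str) -> float:
    # numX -- the numeric value in X (assumes one number is present)
    return float(NUMBER.search(x).group().replace(",", ""))

def numeric_equivalent(ai: str, aj: str) -> bool:
    # numAi*MAX > numAj and numAi*MIN < numAj,
    # e.g. "210 people" is equivalent to "200 people"
    return num(ai) * MAX > num(aj) and num(ai) * MIN < num(aj)

def morethan_includes(ai: str, aj: str) -> bool:
    # "more than 200" includes "300": Ai states a lower bound below numAj
    return bool(re.search(r"\b(more than|over|at least)\b", ai, re.I)) \
        and num(ai) < num(aj)
```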
1.3 Questions of category Human
Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  - Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  - Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  - Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  - Example: John F Kennedy is equivalent to F Kennedy
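As an illustration, the two three-token-vs-two-token rules above (dropping the middle or the first token of a full name) combine into one hedged sketch:

```python
def human_equivalent(ai: str, aj: str) -> bool:
    # Sketch of two Human rules (whitespace tokenization assumed):
    #   "John F Kennedy" ~ "John Kennedy"  (middle token dropped)
    #   "John F Kennedy" ~ "F Kennedy"     (first token dropped)
    ti, tj = ai.lower().split(), aj.lower().split()
    if len(ti) == 3 and len(tj) == 2:
        return (ti[0] == tj[0] and ti[2] == tj[1]) or \
               (ti[1] == tj[0] and ti[2] == tj[1])
    return False
```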
1.3.1 Questions of category HumanIndividual
Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  - Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  - Example:

• levenshteinDistance(Ai, Aj) ≤ 1
  - Example:
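The levenshteinDistance predicate used above is the standard edit distance; a compact dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # classic DP edit distance: insertions, deletions, substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```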
1.4 Questions of category Location
Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  - Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  - Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  - Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  - Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  - Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  - Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  - Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  - Example: California includes East California
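The last Location inclusion rule (a region includes its directional subdivision) can be sketched as (`location_includes` is an illustrative name; whitespace tokenization assumed):

```python
import re

# directional modifiers, mirroring the rule's pattern
DIRECTION = re.compile(r"(south|west|north|east|central)[a-z]*", re.I)

def location_includes(ai: str, aj: str) -> bool:
    # e.g. "California" includes "East California"
    ti, tj = ai.split(), aj.split()
    return (len(ti) == 1 and len(tj) == 2
            and ti[0].lower() == tj[1].lower()
            and DIRECTION.fullmatch(tj[0]) is not None)
```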
1.5 Questions of category Entity
Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  - Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  - Example: animal includes monkey
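The ishypernym test requires a lexical taxonomy; the sketch below uses a toy hand-built table as a stand-in (the table and function names are illustrative, not the resource the system actually queries):

```python
# toy hypernym table standing in for a real taxonomy such as WordNet
HYPERNYMS = {
    "monkey": {"animal", "primate", "mammal"},
    "paris": {"city", "capital"},
}

def ishypernym(x: str, y: str) -> bool:
    # returns true if X is a hypernym of Y (per the toy table)
    return x.lower() in HYPERNYMS.get(y.lower(), set())

def entity_includes(ai: str, aj: str) -> bool:
    # Entity rule: Ai includes Aj if Ai is a hypernym of Aj
    # and both answers are at most two tokens long
    return (len(ai.split()) <= 2 and len(aj.split()) <= 2
            and ishypernym(ai, aj))
```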